Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

September 21, 2012

The 2012 ACM Computing Classification System toc

Filed under: Classification,Ontology — Patrick Durusau @ 7:39 pm

The 2012 ACM Computing Classification System toc

From the post:

The 2012 ACM Computing Classification System has been developed as a poly-hierarchical ontology that can be utilized in semantic web applications. It replaces the traditional 1998 version of the ACM Computing Classification System (CCS), which has served as the de facto standard classification system for the computing field. It is being integrated into the search capabilities and visual topic displays of the ACM Digital Library. It relies on a semantic vocabulary as the single source of categories and concepts that reflect the state of the art of the computing discipline and is receptive to structural change as it evolves in the future. ACM will provide tools to facilitate the application of 2012 CCS categories to forthcoming papers and a process to ensure that the CCS stays current and relevant. The new classification system will play a key role in the development of a people search interface in the ACM Digital Library to supplement its current traditional bibliographic search.

The full CCS classification tree is freely available for educational and research purposes in these downloadable formats: SKOS (xml), Word, and HTML. In the ACM Digital Library, the CCS is presented in a visual display format that facilitates navigation and feedback.

I will be looking at how the classification has changed since 1998. And since we have so much data online, it should not be all that hard to see how well the 1998 categories work for papers from 1988, or 1977.

All for a classification that is “current and relevant.”

Still, don’t want papers dropping off the edge of the semantic world due to changes in classification.

September 10, 2012

Learning Mahout : Classification

Filed under: Classification,Machine Learning,Mahout — Patrick Durusau @ 10:01 am

Learning Mahout : Classification by Sujit Pal.

From the post:

The final part covered in the MIA book is Classification. The popular algorithms available are Stochastic Gradient Descent (SGD), Naive Bayes and Complementary Naive Bayes, Random Forests and Online Passive Aggressive. There are other algorithms in the pipeline, as seen from the Classification section of the Mahout wiki page.

The MIA book has generic classification information and advice that will be useful for any algorithm, but it specifically covers SGD, Bayes and Naive Bayes (the last two via Mahout scripts). Of these, SGD and Random Forests are good for classification problems involving continuous variables and small to medium datasets, and the Naive Bayes family is good for problems involving text-like variables and medium to large datasets.

In general, a solution to a classification problem involves choosing the appropriate features for classification, choosing the algorithm, generating the feature vectors (vectorization), training the model and evaluating the results in a loop. You continue to tweak stuff in each of these steps until you get the results with the desired accuracy.

Sujit notes that classification is under rapid development. The classification material is likely to become dated.
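The loop Sujit describes is not Mahout-specific. A toy version of the vectorize/train/evaluate cycle, with scikit-learn standing in for Mahout (documents, labels and parameters are invented for illustration):

# Sketch of the vectorize -> train -> evaluate loop; scikit-learn stands in
# for Mahout, and the documents/labels are invented stand-ins.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

docs = ["cheap pills online", "meeting at noon", "win a free prize", "lunch tomorrow?"]
labels = ["spam", "ham", "spam", "ham"]

# Vectorization: turn raw text into feature vectors.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Split, then fit an SGD classifier (the analogue of Mahout's SGD trainer).
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, stratify=labels, random_state=0)
model = SGDClassifier(random_state=0)
model.fit(X_train, y_train)

# Evaluation: inspect the results, then go back and tweak features or parameters.
print(classification_report(y_test, model.predict(X_test)))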

Some additional resources to consider:

Mahout User List (subscribe)

Mahout Developer List (subscribe)

IRC: Mahout’s IRC channel is #mahout.

Mahout QuickStart

August 30, 2012

MyMiner: a web application for computer-assisted biocuration and text annotation

Filed under: Annotation,Bioinformatics,Biomedical,Classification — Patrick Durusau @ 10:35 am

MyMiner: a web application for computer-assisted biocuration and text annotation by David Salgado, Martin Krallinger, Marc Depaule, Elodie Drula, Ashish V. Tendulkar, Florian Leitner, Alfonso Valencia and Christophe Marcelle. ( Bioinformatics (2012) 28 (17): 2285-2287. doi: 10.1093/bioinformatics/bts435 )

Abstract:

Motivation: The exponential growth of scientific literature has resulted in a massive amount of unstructured natural language data that cannot be directly handled by means of bioinformatics tools. Such tools generally require structured data, often generated through a cumbersome process of manual literature curation. Herein, we present MyMiner, a free and user-friendly text annotation tool aimed to assist in carrying out the main biocuration tasks and to provide labelled data for the development of text mining systems. MyMiner allows easy classification and labelling of textual data according to user-specified classes as well as predefined biological entities. The usefulness and efficiency of this application have been tested for a range of real-life annotation scenarios of various research topics.

Availability: http://myminer.armi.monash.edu.au.

Contacts: david.salgado@monash.edu and christophe.marcelle@monash.edu

Supplementary Information: Supplementary data are available at Bioinformatics online.

A useful tool and good tutorial materials.

I could easily see something similar for CS research (unless such already exists).

August 27, 2012

K-Nearest-Neighbors and Handwritten Digit Classification

Filed under: Classification,Clustering,K-Nearest-Neighbors — Patrick Durusau @ 6:36 pm

K-Nearest-Neighbors and Handwritten Digit Classification by Jeremy Kun.

From the post:

The Recipe for Classification

One important task in machine learning is to classify data into one of a fixed number of classes. For instance, one might want to discriminate between useful email and unsolicited spam. Or one might wish to determine the species of a beetle based on its physical attributes, such as weight, color, and mandible length. These “attributes” are often called “features” in the world of machine learning, and they often correspond to dimensions when interpreted in the framework of linear algebra. As an interesting warm-up question for the reader, what would be the features for an email message? There are certainly many correct answers.

The typical way of having a program classify things goes by the name of supervised learning. Specifically, we provide a set of already-classified data as input to a training algorithm, the training algorithm produces an internal representation of the problem (a model, as statisticians like to say), and a separate classification algorithm uses that internal representation to classify new data. The training phase is usually complex and the classification algorithm simple, although that won’t be true for the method we explore in this post.

More often than not, the input data for the training algorithm are converted in some reasonable way to a numerical representation. This is not as easy as it sounds. We’ll investigate one pitfall of the conversion process in this post, but in doing this we separate the data from the application domain in a way that permits mathematical analysis. We may focus our questions on the data and not on the problem. Indeed, this is the basic recipe of applied mathematics: extract from a problem the essence of the question you wish to answer, answer the question in the pure world of mathematics, and then interpret the results.

We’ve investigated data-oriented questions on this blog before, such as, “is the data linearly separable?” In our post on the perceptron algorithm, we derived an algorithm for finding a line which separates all of the points in one class from the points in the other, assuming one exists. In this post, however, we make a different structural assumption. Namely, we assume that data points which are in the same class are also close together with respect to an appropriate metric. Since this is such a key point, it bears repetition and elevation in the typical mathematical fashion. The reader should note the following is not standard terminology, and it is simply a mathematical restatement of what we’ve already said.

Modulo my concerns about assigning non-metric data to metric spaces, this is a very good post on classification.
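To make the "same class means close under a metric" assumption concrete, here is a bare-bones k-nearest-neighbors classifier in plain Python (Euclidean distance assumed; swap in whatever metric fits your data):

# Minimal k-NN: classify a point by majority vote among its k closest
# labelled neighbours under a chosen metric (Euclidean here).
from collections import Counter
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, query, k=3):
    # train is a list of (feature_vector, label) pairs.
    neighbours = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"), ((5.0, 5.1), "B"), ((4.8, 5.3), "B")]
print(knn_classify(train, (1.1, 1.0)))   # -> "A"
print(knn_classify(train, (5.0, 5.0)))   # -> "B"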

July 6, 2012

UCR Time Series Classification/Clustering Page

Filed under: Classification,Clustering,Time Series — Patrick Durusau @ 6:48 pm

UCR Time Series Classification/Clustering Page

I encountered this while hunting down references on the insect identification contest.

How does your thinking about topic maps or other semantic solutions fare against:

Machine learning research has, to a great extent, ignored an important aspect of many real world applications: time. Existing concept learners predominantly operate on a static set of attributes; for example, classifying flowers described by leaf size, petal colour and petal count. The values of these attributes are assumed to be unchanging — the flower never grows or loses leaves.

However, many real datasets are not “static”; they cannot sensibly be represented as a fixed set of attributes. Rather, the examples are expressed as features that vary temporally, and it is the temporal variation itself that is used for classification. Consider a simple gesture recognition domain, in which the temporal features are the position of the hands, finger bends, and so on. Looking at the position of the hand at one point in time is not likely to lead to a successful classification; it is only by analysing changes in position that recognition is possible.

(Temporal Classification: Extending the Classification Paradigm to Multivariate Time Series by Mohammed Waleed Kadous (2002))

A decade old now but still a nice summary of the issue.

Can we substitute “identification” for “machine learning research?”

Are you relying “…on a static set of attributes” for identity purposes?

UCR Insect Classification Contest [Classification by Ear]

Filed under: Classification,Identification — Patrick Durusau @ 5:16 pm

UCR Insect Classification Contest Ends November 16, 2012

As I have said before, subject identity is everywhere! 😉

From the details PDF file:

Phase I: July to November 16th 2012 (this contest)

  • The task is to produce the best distance (similarity) measure for insect flight sounds.
  • The contest will be scored by 1-nearest neighbor classification.
  • The prizes include $500 cash and engraved trophies.
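In that spirit, a sketch of how a candidate distance measure might be scored by leave-one-out 1-nearest-neighbor classification. The magnitude spectrum is only a stand-in feature and the clips below are synthetic; the real recordings and labels come from the contest data:

# Score a candidate distance measure with leave-one-out 1-NN, as the
# contest does; the clips and labels below are synthetic stand-ins.
import numpy as np

def spectrum(signal):
    # Magnitude spectrum as a crude fixed-length representation of a clip.
    return np.abs(np.fft.rfft(signal))

def distance(a, b):
    # Candidate measure: Euclidean distance between magnitude spectra.
    return np.linalg.norm(spectrum(a) - spectrum(b))

def loo_1nn_accuracy(clips, labels):
    correct = 0
    for i, clip in enumerate(clips):
        dists = [distance(clip, other) for j, other in enumerate(clips) if j != i]
        others = [labels[j] for j in range(len(clips)) if j != i]
        correct += (others[int(np.argmin(dists))] == labels[i])
    return correct / len(clips)

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 16000)                      # one second at 16 kHz
clips = [np.sin(2 * np.pi * f * t) + 0.1 * rng.standard_normal(t.size)
         for f in (220, 230, 440, 450)]           # two "species" of wingbeat
labels = ["A", "A", "B", "B"]
print(loo_1nn_accuracy(clips, labels))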

I was amused to read in the FAQ:

Note that the “sound” is measured with an optical sensor, rather than an acoustic one. This is done for various pragmatic reasons, however we don’t believe it makes any difference to the task at hand. The sampling rate is 16000 Hz

If you have a bee keeper nearby, can you do an empirical comparison of optical versus acoustic sensors for capturing the “sound” of insects?

That seems like a first step in establishing computational entomology. BTW, use a range of frequencies, from subsonic to supersonic. (You are aware they have discovered that subsonic sounds from whales can travel thousands of miles? Unlikely with insects, but just because our ears can’t hear something doesn’t mean other ears can’t.)

I first saw this at KDNuggets.

July 2, 2012

The strange case of eugenics:…

Filed under: Classification,Collocative Integrity,Dewey - DDC,Ontogeny — Patrick Durusau @ 4:01 pm

The strange case of eugenics: A subject’s ontogeny in a long-lived classification scheme and the question of collocative integrity by Joseph T. Tennis. (Tennis, J. T. (2012), The strange case of eugenics: A subject’s ontogeny in a long-lived classification scheme and the question of collocative integrity. J. Am. Soc. Inf. Sci., 63: 1350–1359. doi: 10.1002/asi.22686)

Abstract:

This article introduces the problem of collocative integrity present in long-lived classification schemes that undergo several changes. A case study of the subject “eugenics” in the Dewey Decimal Classification is presented to illustrate this phenomenon. Eugenics is strange because of the kinds of changes it undergoes. The article closes with a discussion of subject ontogeny as the name for this phenomenon and describes implications for information searching and browsing.

Tennis writes:

While many theorists have concerned themselves with how to design a scheme that can handle the addition of subjects, very little has been done to study how a subject changes after it is introduced to a scheme. Simply because we add civil engineering to a scheme of classification in 1920 does not signify that it means the same thing today. Almost 100 years have passed, and many things have changed in that subject. We may have subdivided this class in 1950, thereby separating the pre-1950 meaning from the post-1950 meaning and also affecting the collocative power of the class civil engineering. Other classes in the superclass of engineering might be considered too close, and are eliminated over time, affecting the way the classifier does her or his work (cf. Tennis, 2007; Tennis & Sutton, 2008). It is because of these concerns, coupled with the design requirement of collocation in classification, that we need to look at the life of a subject over time—the subject’s scheme history or ontogeny.

Deeply interesting work that has implications for topic map structures and the preservation of “collocative integrity” over time.

One suspects that preservation of “collocative integrity” is an ongoing process that requires more than simple assignments in a scheme.

What factors would you capture to trace the ontogeny of “eugenics” and how would you use them to preserve “collocative integrity” across that history using a topic map? (Remembering that users at any point in that ontogeny may be ignorant of prior (and obviously of subsequent) changes in its classification.)

June 29, 2012

DDC 23 released as linked data at dewey.info

Filed under: Classification,Dewey - DDC,Linked Data — Patrick Durusau @ 3:14 pm

DDC 23 released as linked data at dewey.info

From the post:

As announced on Monday at the seminar “Global Interoperability and Linked Data in Libraries” in beautiful Florence, an exciting new set of linked data has been added to dewey.info. All assignable classes from DDC 23, the current full edition of the Dewey Decimal Classification, have been released as Dewey linked data. As was the case for the Abridged Edition 14 data, we define “assignable” as including every schedule number that is not a span or a centered entry, bracketed or optional, with the hierarchical relationships adjusted accordingly. In short, these are numbers that you find attached to many WorldCat records as standard Dewey numbers (in 082 fields), as additional Dewey numbers (in 083 fields), or as number components (in 085 fields).

The classes are exposed with full number and caption information and semantic relationships expressed in SKOS, which makes the information easily accessible and parsable by a wide variety of semantic web applications.

This recent addition massively expands the data set by over 38,000 Dewey classes (or, for the linked data geeks out there, by over 1 million triples), increasing the number of classes available almost tenfold. If you like, take some time to explore the hierarchies; you might be surprised to find numbers for the Maya calendar or transits of Venus (loyal blog readers will recognize these numbers).

All the old goodies are still there, of course. Depending on which type of user agent is accessing the data (e.g., a browser) a different representation is negotiated (HTML or various flavors of RDF). The HTML pages still include RDFa markup, which can be distilled into RDF by browser plug-ins and other applications without the user ever having to deal with the RDF data directly.

More details follow but that should be enough to capture your interest.

Good thing there is a pointer for the Maya calendar. Would hate for interstellar archaeologists to think we were too slow to invent a classification number for the disaster that is supposed to befall us this coming December.

I have renewed my ACM and various SIG memberships to run beyond December 2012. In the event of an actual disaster refunds will not be an issue. 😉
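If you want to poke at the released data programmatically rather than through the browser, here is a hedged rdflib sketch. The per-class URI pattern is my assumption based on the post, not something I have verified:

# Fetch one Dewey class as RDF and list its SKOS labels and broader classes.
# The dewey.info per-class URL below is a hypothetical example.
from rdflib import Graph, Namespace

SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")

g = Graph()
g.parse("http://dewey.info/class/641/about.rdf", format="xml")

for concept, label in g.subject_objects(SKOS.prefLabel):
    print("label:", label)

for concept, broader in g.subject_objects(SKOS.broader):
    print("broader:", broader)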

May 24, 2012

Visual and semantic interpretability of projections of high dimensional data for classification tasks

Filed under: Classification,High Dimensionality,Semantics,Visualization — Patrick Durusau @ 6:10 pm

Visual and semantic interpretability of projections of high dimensional data for classification tasks by Ilknur Icke and Andrew Rosenberg.

Abstract:

A number of visual quality measures have been introduced in visual analytics literature in order to automatically select the best views of high dimensional data from a large number of candidate data projections. These methods generally concentrate on the interpretability of the visualization and pay little attention to the interpretability of the projection axes. In this paper, we argue that interpretability of the visualizations and the feature transformation functions are both crucial for visual exploration of high dimensional labeled data. We present a two-part user study to examine these two related but orthogonal aspects of interpretability. We first study how humans judge the quality of 2D scatterplots of various datasets with varying number of classes and provide comparisons with ten automated measures, including a number of visual quality measures and related measures from various machine learning fields. We then investigate how user perception of the interpretability of mathematical expressions relates to various automated measures of complexity that can be used to characterize data projection functions. We conclude with a discussion of how automated measures of visual and semantic interpretability of data projections can be used together for exploratory analysis in classification tasks.

A rather small group of test subjects (20), so I don’t think you can say much other than that more work is needed.

Then it occurred to me that I often speak of studies applying to “users” without stopping to remember that for many tasks, I fall into that self-same category. Subject to the same influences, fatigues and even mistakes.

Anyone know of research by researchers being applied to the same researchers?

May 19, 2012

New Mechanical Turk Categorization App

Filed under: Amazon Web Services AWS,Classification,Mechanical Turk — Patrick Durusau @ 10:52 am

New Mechanical Turk Categorization App

From the post:

Categorization is one of the more popular use cases for the Amazon Mechanical Turk. A categorization HIT (Human Intelligence Task) asks the Worker to select from a list of options. Our customers use HITs of this type to assign product categories, match URLs to business listings, and to discriminate between line art and photographs.

Using our new Categorization App, you can start categorizing your own items or data in minutes, eliminating the learning curve that has traditionally accompanied this type of activity. The app includes everything that you need to be successful including:

  1. Predefined HITs (no HTML editing required).
  2. Pre-qualified Master Workers (see Jinesh’s previous blog post on Mechanical Turk Masters).
  3. Price recommendations based on complexity and comparable HITs.
  4. Analysis tools.

The Categorization App guides you through the four simple steps that are needed to create your categorization project.

I thought the contrast between gamers (the GPU post) and MTurkers would be a nice way to close the day. 😉

Although, there are efforts to create games where useful activity happens, whether intended or not. (Would that take some of the joy out of a game?)

If you use this particular app, please blog or post a note about your experience.

Thanks!

May 13, 2012

Are visual dictionaries generalizable?

Filed under: Classification,Dictionary,Image Recognition,Information Retrieval — Patrick Durusau @ 7:54 pm

Are visual dictionaries generalizable? by Otavio A. B. Penatti, Eduardo Valle, and Ricardo da S. Torres.

Abstract:

Mid-level features based on visual dictionaries are today a cornerstone of systems for classification and retrieval of images. Those state-of-the-art representations depend crucially on the choice of a codebook (visual dictionary), which is usually derived from the dataset. In general-purpose, dynamic image collections (e.g., the Web), one cannot have the entire collection in order to extract a representative dictionary. However, based on the hypothesis that the dictionary reflects only the diversity of low-level appearances and does not capture semantics, we argue that a dictionary based on a small subset of the data, or even on an entirely different dataset, is able to produce a good representation, provided that the chosen images span a diverse enough portion of the low-level feature space. Our experiments confirm that hypothesis, opening the opportunity to greatly alleviate the burden in generating the codebook, and confirming the feasibility of employing visual dictionaries in large-scale dynamic environments.

The authors use the Caltech-101 image set because of its “diversity.” Odd because they cite the Caltech-256 image set, which was created to answer concerns about the lack of diversity in the Caltech-101 image set.

Not sure this paper answers the issues it raises about visual dictionaries.

Wanted to bring it to your attention because representative dictionaries (as opposed to comprehensive ones) may be lurking just beyond the semantic horizon.
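For readers new to visual dictionaries, a toy bag-of-visual-words sketch. Random vectors stand in for real local descriptors (SIFT and friends), and the codebook is deliberately learned from only part of the data, which is the paper's hypothesis in miniature:

# Build a "visual dictionary" by clustering local descriptors, then
# represent each image as a histogram over the dictionary's visual words.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Pretend each image yields a set of 128-dimensional local descriptors.
images = [rng.standard_normal((200, 128)) for _ in range(5)]

# Codebook learned from a small subset of the descriptors.
sample = np.vstack(images[:2])
codebook = KMeans(n_clusters=32, n_init=10, random_state=0).fit(sample)

def bag_of_words(descriptors, codebook):
    words = codebook.predict(descriptors)
    hist, _ = np.histogram(words, bins=np.arange(codebook.n_clusters + 1))
    return hist / hist.sum()

vectors = [bag_of_words(d, codebook) for d in images]
print(vectors[0].shape)   # (32,) mid-level feature per image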

May 4, 2012

How do you compare two text classifiers?

Filed under: Classification,Classifier — Patrick Durusau @ 3:43 pm

How do you compare two text classifiers?

Tony Russell-Rose writes:

I need to compare two text classifiers – one human, one machine. They are assigning multiple tags from an ontology. We have an initial corpus of ~700 records tagged by both classifiers. The goal is to measure the ‘value added’ by the human. However, we don’t yet have any ground truth data (i.e. agreed annotations).

Any ideas on how best to approach this problem in a commercial environment (i.e. quickly, simply, with minimum fuss), or indeed what’s possible?

I thought of measuring the absolute delta between the two profiles (regardless of polarity) to give a ceiling on the value added, and/or comparing the profile of tags added by each human coder against the centroid to give a crude measure of inter-coder agreement (and hence difficulty of the task). But neither really measures the ‘value added’ that I’m looking for, so I’m sure there must be better solutions.

Suggestions, anyone? Or is this as far as we can go without ground truth data?

Some useful comments have been made. Do you have others?

PS: I wrote at Tony’s blog in a comment:

Tony,

The ‘value added’ by human taggers concept is unclear. The tagging in both cases is the result of humans adding semantics. Once through the rules for the machine tagger and once via the “human” taggers.

Can you say a bit more about what you see as a separate ‘value added’ by the human taggers?

What do you think? Is Tony’s question clear enough?
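For a purely mechanical starting point (not the ‘value added’ Tony is after), you can at least measure how far the two classifiers diverge. A sketch of per-record tag-set agreement (Jaccard overlap), with invented records:

# Per-record Jaccard overlap between the tag sets assigned by two classifiers.
# Low average overlap flags records worth a closer human look.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

machine_tags = {"rec1": ["economics", "policy"], "rec2": ["biology"]}
human_tags   = {"rec1": ["economics", "policy", "trade"], "rec2": ["genetics"]}

scores = {rec: jaccard(machine_tags[rec], human_tags[rec]) for rec in machine_tags}
print(scores)                                   # rec1 about 0.67, rec2 0.0
print(sum(scores.values()) / len(scores))       # average agreement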

April 18, 2012

Learning Fuzzy β-Certain and β-Possible rules…

Filed under: Classification,Fuzzy Matching,Fuzzy Sets,Rough Sets — Patrick Durusau @ 6:08 pm

Learning Fuzzy β-Certain and β-Possible rules from incomplete quantitative data by rough sets by Ali Soltan Mohammadi, L. Asadzadeh, and D. D. Rezaee.

Abstract:

The rough-set theory proposed by Pawlak has been widely used in dealing with data classification problems. The original rough-set model is, however, quite sensitive to noisy data. Tzung thus proposed a model, which combines the variable precision rough-set model and fuzzy set theory, to deal with the problem of producing a set of fuzzy certain and fuzzy possible rules from quantitative data with a predefined tolerance degree of uncertainty and misclassification. This paper deals with the problem of producing a set of fuzzy certain and fuzzy possible rules from incomplete quantitative data with a predefined tolerance degree of uncertainty and misclassification. A new method, which applies the rough-set model and fuzzy set theory to incomplete quantitative data, is proposed to solve this problem. It first transforms each quantitative value into a fuzzy set of linguistic terms using membership functions, and then calculates the fuzzy β-lower and the fuzzy β-upper approximations for the incomplete quantitative data. The certain and possible rules are then generated based on these fuzzy approximations. These rules can then be used to classify unknown objects.

In part interesting because of its full use of sample data to illustrate the process being advocated.

Unless smooth sets in data are encountered by some mischance, rough sets will remain a mainstay of data mining for the foreseeable future.
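A sketch of the first step the abstract describes, turning a quantitative value into memberships over linguistic terms with triangular membership functions. The term boundaries are invented for illustration:

# Fuzzify a quantitative value into linguistic terms using triangular
# membership functions; the term definitions here are invented.
def triangular(x, a, b, c):
    # Membership rises linearly from a to a peak of 1 at b, then falls to zero at c.
    if x < a or x > c:
        return 0.0
    if x == b:
        return 1.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

terms = {
    "low":    (0, 0, 50),
    "medium": (25, 50, 75),
    "high":   (50, 100, 100),
}

def fuzzify(value):
    return {term: round(triangular(value, *abc), 2) for term, abc in terms.items()}

print(fuzzify(40))   # {'low': 0.2, 'medium': 0.6, 'high': 0.0}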

April 16, 2012

Third Challenge on Large Scale Hierarchical Text Classification

Filed under: Classification,Contest — Patrick Durusau @ 7:12 pm

ECML/PKDD 2012 Discovery Challenge: Third Challenge on Large Scale Hierarchical Text Classification

Important dates:

– March 30, start of the challenge
– April 20, opening of the evaluation
– June 29, closing of evaluation
– July 20, paper submission deadline
– August 3, paper notifications

From the website:

This year’s discovery challenge hosts the third edition of the successful PASCAL challenges on large scale hierarchical text classification. The challenge comprises three tracks and it is based on two large datasets created from the ODP web directory (DMOZ) and Wikipedia. The datasets are multi-class, multi-label and hierarchical. The number of categories ranges between 13,000 and 325,000 roughly and the number of documents between 380,000 and 2,400,000.

The tracks of the challenge are organized as follows:

1. Standard large-scale hierarchical classification
a) On a medium-sized collection from Wikipedia
b) On a large collection from Wikipedia

2. Multi-task learning, based on both DMOZ and Wikipedia category systems

3. Refinement-learning
a) Semi-Supervised approach
b) Unsupervised approach

In order to register for the challenge and gain access to the datasets you must have an account at the challenge Web site.

More fun than repeating someone’s vocabulary. Yes?
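For a feel of the task at toy scale, a multi-label text classification sketch with scikit-learn. The real challenge adds the hierarchy and hundreds of thousands of categories:

# Multi-label text classification: each document may carry several labels,
# as in the DMOZ/Wikipedia challenge data (toy documents below).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

docs = [
    "python machine learning library",
    "ancient roman history and politics",
    "machine learning applied to historical texts",
]
labels = [["computing"], ["history"], ["computing", "history"]]

X = TfidfVectorizer().fit_transform(docs)
Y = MultiLabelBinarizer().fit_transform(labels)

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
print(clf.predict(X))   # one indicator row of labels per document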

April 13, 2012

Classifier Technology and the Illusion of Progress

Filed under: Classification,Classifier — Patrick Durusau @ 4:44 pm

Classifier Technology and the Illusion of Progress by David J. Hand.

Was pointed to in Simply Statistics for 8 April 2012:

Abstract:

A great many tools have been developed for supervised classification, ranging from early methods such as linear discriminant analysis through to modern developments such as neural networks and support vector machines. A large number of comparative studies have been conducted in attempts to establish the relative superiority of these methods. This paper argues that these comparisons often fail to take into account important aspects of real problems, so that the apparent superiority of more sophisticated methods may be something of an illusion. In particular, simple methods typically yield performance almost as good as more sophisticated methods, to the extent that the difference in performance may be swamped by other sources of uncertainty that generally are not considered in the classical supervised classification paradigm.

The original pointer didn’t mention there were four published comments and a formal rejoinder:

Comment: Classifier Technology and the Illusion of Progress by Jerome H. Friedman.

Comment: Classifier Technology and the Illusion of Progress–Credit Scoring by Ross W. Gayler.

Elaboration on Two Points Raised in “Classifier Technology and the Illusion of Progress” by Robert C. Holte.

Comment: Classifier Technology and the Illusion of Progress by Robert A. Stine.

Rejoinder: Classifier Technology and the Illusion of Progress by David J. Hand.

Enjoyable reading, one and all!
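Hand’s claim is easy to test on your own problems. A sketch of the kind of comparison he has in mind, a simple linear model against something fancier under cross-validation; whether the gap matters is the paper’s real question:

# Compare a simple classifier with a more sophisticated one by cross-validation;
# on many real datasets the gap is smaller than the variation across folds.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

simple = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
fancy = RandomForestClassifier(n_estimators=200, random_state=0)

for name, model in [("logistic regression", simple), ("random forest", fancy)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")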

April 4, 2012

Adobe Releases Malware Classifier Tool

Filed under: Classification,Classifier,Malware — Patrick Durusau @ 3:33 pm

Adobe Releases Malware Classifier Tool by Dennis Fisher.

From the post:

Adobe has published a free tool that can help administrators and security researchers classify suspicious files as malicious or benign, using specific machine-learning algorithms. The tool is a command-line utility that Adobe officials hope will make binary classification a little easier.

Adobe researcher Karthik Raman developed the new Malware Classifier tool to help with the company’s internal needs and then decided that it might be useful for external users, as well.

“To make life easier, I wrote a Python tool for quick malware triage for our team. I’ve since decided to make this tool, called “Adobe Malware Classifier,” available to other first responders (malware analysts, IT admins and security researchers of any stripe) as an open-source tool, since you might find it equally helpful,” Raman wrote in a blog post.

“Malware Classifier uses machine learning algorithms to classify Win32 binaries – EXEs and DLLs – into three classes: 0 for “clean,” 1 for “malicious,” or “UNKNOWN.” The tool extracts seven key features from a binary, feeds them to one or all of the four classifiers, and presents its classification results.”

Adobe Malware Classifier (Sourceforge)

Old hat that malware scanners have been using machine learning but new that you can now see it from the inside.

There are lessons to be learned here about machine learning algorithms, for malware and for other uses with software.

Kudos to Adobe!
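Out of curiosity about what “seven key features” might look like in code, a hedged sketch with the pefile library. These are generic PE header fields, not necessarily the features Raman’s tool extracts:

# Pull a few PE header fields from a Win32 binary as candidate features.
# Field choice is illustrative; Adobe's tool documents its own seven features.
import pefile

def pe_features(path):
    pe = pefile.PE(path)
    return {
        "number_of_sections": pe.FILE_HEADER.NumberOfSections,
        "size_of_image": pe.OPTIONAL_HEADER.SizeOfImage,
        "major_image_version": pe.OPTIONAL_HEADER.MajorImageVersion,
        "dll_characteristics": pe.OPTIONAL_HEADER.DllCharacteristics,
        "size_of_code": pe.OPTIONAL_HEADER.SizeOfCode,
    }

# "sample.exe" is a placeholder path; feed the resulting dict to any classifier.
print(pe_features("sample.exe"))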

April 2, 2012

UCR Time Series Classification/Clustering Page

Filed under: Classification,Clustering,Dataset,Time Series — Patrick Durusau @ 5:46 pm

UCR Time Series Classification/Clustering Page

From the webpage:

This webpage has been created as a public service to the data mining/machine learning community, to encourage reproducible research for time series classification and clustering.

While chasing the details on Eamonn Keogh and his time series presentation, I encountered this collection of data sets.

March 20, 2012

From counting citations to measuring usage (help needed!)

Filed under: Citation Indexing,Classification,Data — Patrick Durusau @ 3:52 pm

From counting citations to measuring usage (help needed!)

Daniel Lemire writes:

We sometimes measure the caliber of a researcher by how many research papers he wrote. This is silly. While there is some correlation between quantity and quality — people like Einstein tend to publish a lot — it can be gamed easily. Moreover, several major researchers have published relatively few papers: John Nash has about two dozen papers in Scopus. Even if you don’t know much about science, I am sure you can think of a few writers who have written only a couple of books but are still world famous.

A better measure is the number of citations a researcher has received. Google Scholar profiles display the citation record of researchers prominently. It is a slightly more robust measure, but it is still silly because 90% of citations are shallow: most authors haven’t even read the paper they are citing. We tend to cite famous authors and famous venues in the hope that some of the prestige will get reflected.

But why stop there? We have the technology to measure the usage made of a cited paper. Some citations are more significant: for example it can be an extension of the cited paper. Machine learning techniques can measure the impact of your papers based on how much following papers build on your results. Why isn’t it done?

Daniel wants to distinguish important papers that cite his papers from ho-hum papers that cite him. (my characterization, not his)

That isn’t happening now so Daniel has teamed up with Peter Turney and Andre Vellino to gather data from published authors (that would be you), to use in investigating this problem.

Topic maps of scholarly and other work face the same problem. How do you distinguish the important from the less so? For that matter, what criteria do you use? If an author who cites you wins the Nobel Prize for work that doesn’t cite you, does the importance of your paper go up? Stay the same? Goes down? 😉

It is an important issue so if you are a published author, see Daniel’s post and contribute to the data gathering.

February 29, 2012

Will the Circle Be Unbroken? Interactive Annotation!

I have to agree with Bob Carpenter that the title is a bit much:

Closing the Loop: Fast, Interactive Semi-Supervised Annotation with Queries on Features and Instances

From the post:

Whew, that was a long title. Luckily, the paper’s worth it:

Settles, Burr. 2011. Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances. EMNLP.

It’s a paper that shows you how to use active learning to build a reasonably high-performance classifier with only minutes of user effort. Very cool and right up our alley here at LingPipe.

Both the paper and Bob’s review merit close reading.
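The core of what Settles describes, uncertainty sampling over a pool of unlabeled instances, fits in a few lines. A sketch with scikit-learn and synthetic data (the paper also queries on features, which this omits):

# Pool-based active learning: repeatedly ask the "annotator" to label the
# instance the current model is least certain about.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
labelled = [int(i) for i in
            list(np.where(y == 0)[0][:5]) + list(np.where(y == 1)[0][:5])]
pool = [i for i in range(len(X)) if i not in labelled]

model = LogisticRegression(max_iter=1000)
for _ in range(20):                              # twenty simulated annotations
    model.fit(X[labelled], y[labelled])
    probs = model.predict_proba(X[pool])
    uncertainty = 1 - probs.max(axis=1)          # least-confident sampling
    pick = pool[int(np.argmax(uncertainty))]
    labelled.append(pick)
    pool.remove(pick)

print(model.score(X, y))                         # accuracy after ~30 labels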

February 21, 2012

Making sense of Wikipedia categories

Filed under: Annotation,Classification,Wikipedia — Patrick Durusau @ 8:00 pm

Making sense of Wikipedia categories

Hal Daume III writes:

Wikipedia’s category hierarchy forms a graph. It’s definitely cyclic (Category:Ethology belongs to Category:Behavior, which in turn belongs to Category:Ethology).

At any rate, did you know that “Chicago Stags coaches” are a subcategory of “Natural sciences”? If you don’t believe me, go to the Wikipedia entry for the Natural sciences category, and expand the following list of subcategories:

(subcategories omitted)

I guess it kind of makes sense. There are some other fun ones, like “Rhaeto-Romance languages”, “American World War I flying aces” and “1911 films”. Of course, these are all quite deep in the “hierarchy” (all of those are at depth 15 or higher).

Hal examines several strategies and concludes asking:

Has anyone else tried and succeeded at using the Wikipedia category structure?

Some other questions:

Is Hal right that hand annotation doesn’t “scale?”

I have heard that more times than I can count but never seen any studies cited to support it.

After all, Wikipedia was manually edited and produced. Yes? No automated process created its content. So, what is the barrier to hand annotation?

If you think about it, the same could be said about email but most email (yes?) is written by hand. Not produced by automated processes (well, except for spam), so why can’t it be hand annotated? Or at least why can’t we capture semantics of email at the point of composition and annotate it there by automated means?

Hand annotation may not scale for sensor data or financial data streams but is hand annotation needed for such sources?

Hand annotation may not scale for say twitter posts by non-English speakers. But only for agencies with very short-sighted if not actively bigoted hiring/contracting practices.

Has anyone loaded the Wikipedia categories into a graph database? What sort of interface would you suggest for trial arrangement of the categories?
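As a starting point, a networkx sketch of the kind of exploration Hal describes, with a toy edge list standing in for the real category dump:

# Load category -> subcategory edges into a directed graph, then look for
# cycles and measure depth below a root; edges here are a toy stand-in.
import networkx as nx

edges = [
    ("Behavior", "Ethology"),
    ("Ethology", "Behavior"),            # the cycle Hal mentions
    ("Natural sciences", "Biology"),
    ("Biology", "Behavior"),
]

g = nx.DiGraph(edges)

print(list(nx.simple_cycles(g)))         # e.g. [['Behavior', 'Ethology']]
depths = nx.single_source_shortest_path_length(g, "Natural sciences")
print(depths)                            # how deep each category sits below the root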

PS: If you are interested in discussing how to establish assisted annotation for twitter, email or other data streams, with or without user awareness, send me a note.

February 10, 2012

Dragsters, Drag Cars & Drag Racing Cars

I still remember the cover of Hot Rod magazine that announced (from memory) “The 6’s are here!” Don “The Snake” Prudhomme had broken the 200 mph barrier in a drag race. Other memories follow on from that one but I mention it to explain my interest in a recent Subject Authority Cooperative Program decision not to add cross-references to dragster (the term I would have used) from the more recent terms drag cars and drag racing cars.

The expected search (in this order) due to this decision is:

Cars (Automobiles) -> redirect to Automobiles -> Automobiles -> narrower term -> Automobiles, racing -> narrower term -> Dragsters

Adam L. Schiff, proposer of drag cars & drag racing cars, says below, “This just is not likely to happen.”

Question: Is there a relationship between users “work[ing] their way up and down hierarchies” and relationship display methods? Who chooses which items will be the starting point to lead to other items? How do you integrate a keyword search into such a system?

Question: And what of the full phrase/sentence AI systems where keywords work less well? How does that work with relationship display systems?

Question: I wonder if the relationship display methods are closer to the up and down hierarchies, but with less guidance?

Adam’s Dragster proposal post in full:

Dragsters

Automobiles has a UF Cars (Automobiles). Since the UF already exists on the basic heading, it is not necessary to add it to Dragsters. The proposal was not approved.

Our proposal was to add two additional cross-references to Dragsters: Drag cars, and Drag racing cars. While I understand, in principle, the reasoning behind the rejection of these additional references, I do not see how it serves users. A user coming to a catalog to search for the subject “Drag cars” will now get nothing, no redirection to the established heading. I don’t see how the presence of a reference from Cars (Automobiles) to Automobiles helps any user who starts a search with “Drag cars”. Only if they begin their search with Cars would they get led to Automobiles, and then only if they pursue narrower terms under that heading would they find Automobiles, Racing, which they would then have to follow further down to Dragsters. This just is not likely to happen. Instead they will probably start with a keyword search on “Drag cars” and find nothing, or if lucky, find one or two resources and think they have it all. And if they are astute enough to look at the subject headings on one of the records and see “Dragsters”, perhaps they will then redo their search.

Since the proposed cross-refs do not begin with the word Cars, I do not at all see how a decision like this is in the service of users of our catalogs. I think that LCSH rules for references were developed when it was expected that users would consult the big red books and work their way up and down hierarchies. While some online systems do provide for such navigation, it is doubtful that many users take this approach. Keyword searching is predominant in our catalogs and on the Web. Providing as many cross-refs to established headings as we can would be desirable. If the worry is that the printed red books will grow to too many volumes if we add more variant forms that weren’t made in the card environment, then perhaps there needs to be a way to include some references in authority records but mark them as not suitable for printing in printed products.

PS: According to ODLIS: Online Dictionary for Library and Information Science by Joan M. Reitz, UF has the following definition:

used for (UF)

A phrase indicating a term (or terms) synonymous with an authorized subject heading or descriptor, not used in cataloging or indexing to avoid scatter. In a subject headings list or thesaurus of controlled vocabulary, synonyms are given immediately following the official heading. In the alphabetical list of indexing terms, they are included as lead-in vocabulary followed by a see or USE cross-reference directing the user to the correct heading. See also: syndetic structure.

I did not attempt to reproduce the extremely rich cross-linking in this entry but commend the entire resource to your attention, particularly if you are a library science student.

January 23, 2012

Mining Text Data

Filed under: Classification,Data Mining,Text Extraction — Patrick Durusau @ 7:46 pm

Mining Text Data by Charu Aggarwal and ChengXiang Zhai, Springer, February 2012, approximately 500 pages.

From the publisher’s description:

Text mining applications have experienced tremendous advances because of web 2.0 and social networking applications. Recent advances in hardware and software technology have led to a number of unique scenarios where text mining algorithms are learned.

Mining Text Data introduces an important niche in the text analytics field, and is an edited volume contributed by leading international researchers and practitioners focused on social networks & data mining. This book contains a wide swath in topics across social networks & data mining. Each chapter contains a comprehensive survey including the key research content on the topic, and the future directions of research in the field. There is a special focus on Text Embedded with Heterogeneous and Multimedia Data which makes the mining process much more challenging. A number of methods have been designed such as transfer learning and cross-lingual mining for such cases.

Mining Text Data simplifies the content, so that advanced-level students, practitioners and researchers in computer science can benefit from this book. Academic and corporate libraries, as well as ACM, IEEE, and Management Science focused on information security, electronic commerce, databases, data mining, machine learning, and statistics are the primary buyers for this reference book.

Not at the publisher’s site but you can see the Table of Contents and chapter 4, A SURVEY OF TEXT CLUSTERING ALGORITHMS and chapter 6, A SURVEY OF TEXT CLASSIFICATION ALGORITHMS at: www.charuaggarwal.net/text-content.pdf.

The two chapters you can download from Aggarwal’s website will give you a good idea of what to expect from the text.

While an excellent survey work, with chapters written by experts in various sub-fields, it also suffers from the survey work format.

For example, for the two sample chapters, there are overlaps in the bibliographies for both chapters. Not surprising given the closely related subject matter but as a reader I would be interested in discovering that some works are cited in both chapters. Something that, given the back-of-the-chapter bibliography format, is possible only by repetitive manual inspection.

Although I rail against examples in standards, expanding the survey reference work format to include more details and examples would only increase its usefulness and possibly its life as a valued reference.

Which raises the question of having a print format for survey works at all. The research landscape is changing quickly and a shelf life of 2 to 3 years, if that long, seems a bit brief at the going rate for print editions. Printed versions of chapters, as smaller and more timely works on demand: that is a value-add proposition that Springer is in a unique position to bring to its customers.

January 14, 2012

Extract meta concepts through co-occurrences analysis and graph theory

Filed under: Classification,co-occurrence,Indexing — Patrick Durusau @ 7:36 pm

Extract meta concepts through co-occurrences analysis and graph theory

Cristian Mesiano writes:

During the Christmas period I finally had the chance to read some papers about probabilistic latent semantics and its applications in automatic classification and indexing.

The main concept behind “latent semantics” rests on the assumption that words that occur close together in the text are related to the same semantic construct.

Based on this principle the LSA (and partially also the PLSA) builds a matrix to keep track of the co-occurrences of the words in text, and it assigns a score to these co-occurrences considering the distribution in the corpus as well.

Often TF-IDF score is used to rank the words.

Anyway, I was wondering if these techniques could also be useful for extracting key concepts from the text.

Basically I thought: “in LSA we consider some statistics over the co-occurrences, so: why not consider the link among the co-occurrences as well?”.

Using the first three chapters of “The Media in the Network Society” by Gustavo Cardoso, Cristian creates a series of graphs.

Cristian promises his opinion on classification of texts using this approach.

In the meantime, what’s yours?
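If you want to try the experiment yourself in the meantime, a minimal co-occurrence graph sketch in the spirit of Cristian's approach (sliding-window counts, networkx; the text is whatever you feed it):

# Build a word co-occurrence graph with a sliding window, then rank words
# by a simple graph centrality as candidate "meta concepts".
from collections import Counter
from itertools import combinations
import networkx as nx

text = "the media in the network society shapes the media and the society"
tokens = text.split()

window = 3
pairs = Counter()
for i in range(len(tokens) - window + 1):
    for a, b in combinations(set(tokens[i:i + window]), 2):
        pairs[tuple(sorted((a, b)))] += 1

g = nx.Graph()
for (a, b), weight in pairs.items():
    g.add_edge(a, b, weight=weight)

centrality = nx.degree_centrality(g)
print(sorted(centrality, key=centrality.get, reverse=True)[:5])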

December 3, 2011

Evolutionary Subject Tagging in the Humanities…

Filed under: Classification,Digital Culture,Digital Library,Humanities,Tagging — Patrick Durusau @ 8:22 pm

Evolutionary Subject Tagging in the Humanities; Supporting Discovery and Examination in Digital Cultural Landscapes by Jack Ammerman, Vika Zafrin, Dan Benedetti, and Garth W. Green.

Abstract:

In this paper, the authors attempt to identify problematic issues for subject tagging in the humanities, particularly those associated with information objects in digital formats. In the third major section, the authors identify a number of assumptions that lie behind the current practice of subject classification that we think should be challenged. We move then to propose features of classification systems that could increase their effectiveness. These emerged as recurrent themes in many of the conversations with scholars, consultants, and colleagues. Finally, we suggest next steps that we believe will help scholars and librarians develop better subject classification systems to support research in the humanities.

Truly remarkable piece of work!

Just to entice you into reading the entire paper, the authors challenge the assumption that knowledge is analogue. Successfully in my view but I already held that position so I was an easy sell.

BTW, if you are in my topic maps class, this paper is required reading. Summarize what you think are the strong/weak points of the paper in 2 to 3 pages.

October 3, 2011

Algorithms of the Intelligent Web Review

Algorithms of the Intelligent Web Review by Pearlene McKinley

From the post:

I have always had an interest in AI, machine learning, and data mining but I found the introductory books too mathematical and focused mostly on solving academic problems rather than real-world industrial problems. So, I was curious to see what this book was about.

I have read the book front-to-back (twice!) before writing this report. I started reading the electronic version a couple of months ago and read the paper print again over the weekend. This is the best practical book in machine learning that you can buy today — period. All the examples are written in Java and all algorithms are explained in plain English. The writing style is superb! The book was written by one author (Marmanis) while the other (Babenko) contributed the source code, so there are no gaps in the narrative; it is engaging, pleasant, and fluent. The author leads the reader from the very introductory concepts to some fairly advanced topics. Some of the topics are covered in the book and some are left as an exercise at the end of each chapter (there is a “To Do” section, which was a wonderful idea!). I did not like some of the figures (they were probably made by the authors, not an artist) but this was only a minor aesthetic inconvenience.

The book covers four cornerstones of machine learning and intelligence, i.e. intelligent search, recommendations, clustering, and classification. It also covers a subject that today you can find only in the academic literature, i.e. combination techniques. Combination techniques are very powerful and although the author presents the techniques in the context of classifiers, it is clear that the same can be done for recommendations — as the BellKor team did for the Netflix prize.

Wonder if this will be useful in the Stanford AI course that starts next week with more than 130,000 students? Introduction to Artificial Intelligence – Stanford Class

I am going to order a copy, if for no other reason than to evaluate the reviewer’s claim of explanations “in plain English.” I have seen some fairly clever explanations of AI algorithms and would like to see how these stack up.

September 25, 2011

Modeling Item Difficulty for Annotations of Multinomial Classifications

Filed under: Annotation,Classification,LingPipe,Linguistics — Patrick Durusau @ 7:49 pm

Modeling Item Difficulty for Annotations of Multinomial Classifications by Bob Carpenter

From the post:

We all know from annotating data that some items are harder to annotate than others. We know from the epidemiology literature that the same holds true for medical tests applied to subjects, e.g., some cancers are easier to find than others.

But how do we model item difficulty? I’ll review how I’ve done this before using an IRT-like regression, then move on to Paul Mineiro’s suggestion for flattening multinomials, then consider a generalization of both these approaches.

For your convenience, links for the “…tutorial for LREC with Massimo Poesio” can be found at: LREC 2010 Tutorial: Modeling Data Annotation.

June 23, 2011

Advanced Topics in Machine Learning

Advanced Topics in Machine Learning

Andreas Krause and Daniel Golovin course at CalTech. Lecture notes, readings, this will keep you entertained for some time.

Overview:

How can we gain insights from massive data sets?

Many scientific and commercial applications require us to obtain insights from massive, high-dimensional data sets. In particular, in this course we will study:

  • Online learning: How can we learn when we cannot fit the training data into memory? We will cover no regret online algorithms; bandit algorithms; sketching and dimension reduction.
  • Active learning: How should we choose few expensive labels to best utilize massive unlabeled data? We will cover active learning algorithms, learning theory and label complexity.
  • Nonparametric learning on large data: How can we let complexity of classifiers grow in a principled manner with data set size? We will cover large-scale kernel methods; Gaussian process regression, classification, optimization and active set methods.

Why would a non-strong AI person list so much machine learning stuff?

Two reasons:

1) Machine learning techniques are incredibly useful in appropriate cases.

2) You have to understand machine learning to pick out the appropriate cases.

April 18, 2011

Classify content with XQuery

Filed under: Classification,Text Analytics,XQuery — Patrick Durusau @ 1:40 pm

Classify content with XQuery by James R. Fuller (jim.fuller@webcomposite.com), Technical Director, Webcomposite.

Summary: With the expanding growth of semi-structured and unstructured data (XML) comes the need to categorize and classify content to make querying easier, faster, and more relevant. In this article, try several techniques using XQuery to automatically tag XML documents with content categorization based on the analysis of their content and structure.

Good article on the use of XQuery for basic text analysis and how to invoke web services while using XQuery for more sophisticated text analysis.
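The same basic idea, tagging a document with a category based on term counts in its content, can be sketched outside XQuery as well. A Python stand-in with lxml; the category keywords are invented:

# Tag an XML document with a category by counting keyword hits in its text,
# a rough Python analogue of the XQuery approach; keywords are invented.
from collections import Counter
from lxml import etree

categories = {
    "finance": {"market", "stock", "price"},
    "sports":  {"match", "score", "team"},
}

def classify(xml_string):
    doc = etree.fromstring(xml_string)
    words = " ".join(doc.itertext()).lower().split()
    counts = Counter(words)
    scores = {cat: sum(counts[w] for w in terms) for cat, terms in categories.items()}
    return max(scores, key=scores.get)

sample = "<article><title>Market report</title><body>Stock prices rose.</body></article>"
print(classify(sample))   # -> "finance"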

March 30, 2011

Machine Learning

Filed under: Classification,Clustering,Machine Learning,Regression — Patrick Durusau @ 12:35 pm

Machine Learning

From the site:

This page documents all the machine learning algorithms present in the library. In particular, there are algorithms for performing classification, regression, clustering, anomaly detection, and feature ranking, as well as algorithms for doing more specialized computations.

A good tutorial and introduction to the general concepts used by most of the objects in this part of the library can be found in the svm example program. After reading this example another good one to consult would be the model selection example program. Finally, if you came here looking for a binary classification or regression tool then I would try the krr_trainer first as it is generally the easiest method to use.

The major design goal of this portion of the library is to provide a highly modular and simple architecture for dealing with kernel algorithms….

Update: Dlib – machine learning. Why I left out the library name I cannot say. Sorry!

December 27, 2010

Python Text Processing with NLTK 2.0 Cookbook – Review Forthcoming!

Filed under: Classification,Data Analysis,Data Mining,Natural Language Processing — Patrick Durusau @ 2:25 pm

Just a quick placeholder to say that I am reviewing Python Text Processing with NLTK 2.0 Cookbook.


I should have the review done in the next couple of weeks.

In the longer term I will be developing a set of notes on the construction of topic maps using this toolkit.

While you wait for the review, you might enjoy reading: Chapter No.3 – Creating Custom Corpora (free download).
