Archive for the ‘Classification’ Category

Law Classification Added to Library of Congress Linked Data Service

Saturday, April 13th, 2013

Law Classification Added to Library of Congress Linked Data Service by Kevin Ford.

From the post:

The Library of Congress is pleased to make the K Class – Law Classification – and all its subclasses available as linked data from the LC Linked Data Service, ID.LOC.GOV. K Class joins the B, N, M, and Z Classes, which have been in beta release since June 2012. With about 2.2 million new resources added to ID.LOC.GOV, K Class is nearly eight times larger than the B, M, N, and Z Classes combined. It is four times larger than the Library of Congress Subject Headings (LCSH). If it is not the largest class, it is second only to the P Class (Literature) in the Library of Congress Classification (LCC) system.

We have also taken the opportunity to re-compute and reload the B, M, N, and Z classes in response to a few reported errors. Our gratitude to Caroline Arms for her work crawling through B, M, N, and Z and identifying a number of these issues.

Please explore the K Class for yourself at http://id.loc.gov/authorities/classification/K or all of the classes at http://id.loc.gov/authorities/classification.

The classification section of ID.LOC.GOV remains a beta offering. More work is needed not only to add the additional classes to the system but also to continue to work out issues with the data.

As always, your feedback is important and welcomed. Your contributions directly inform service enhancements. We are interested in all forms of constructive commentary on all topics related to ID. But we are particularly interested in how the data available from ID.LOC.GOV is used and continue to encourage the submission of use cases describing how the community would like to apply or repurpose the LCC data.

You can send comments or report any problems via the ID feedback form or ID listserv.

Not leisure reading for everyone but if you are interested, this is fascinating source material.

And an important source of information for potential associations between subjects.

I first saw this at: Ford: Law Classification Added to Library of Congress Linked Data Service.

Graph Based Classification Methods Using Inaccurate External Classifier Information

Wednesday, January 30th, 2013

Graph Based Classification Methods Using Inaccurate External Classifier Information by Sundararajan Sellamanickam and Sathiya Keerthi Selvaraj.

Abstract:

In this paper we consider the problem of collectively classifying entities where relational information is available across the entities. In practice inaccurate class distribution for each entity is often available from another (external) classifier. For example this distribution could come from a classifier built using content features or a simple dictionary. Given the relational and inaccurate external classifier information, we consider two graph based settings in which the problem of collective classification can be solved. In the first setting the class distribution is used to fix labels to a subset of nodes and the labels for the remaining nodes are obtained like in a transductive setting. In the other setting the class distributions of all nodes are used to define the fitting function part of a graph regularized objective function. We define a generalized objective function that handles both the settings. Methods like harmonic Gaussian field and local-global consistency (LGC) reported in the literature can be seen as special cases. We extend the LGC and weighted vote relational neighbor classification (WvRN) methods to support usage of external classifier information. We also propose an efficient least squares regularization (LSR) based method and relate it to information regularization methods. All the methods are evaluated on several benchmark and real world datasets. Considering together speed, robustness and accuracy, experimental results indicate that the LSR and WvRN-extension methods perform better than other methods.

Doesn’t read like a page-turner, does it? ;-)

An example from the paper will help illustrate why this is an important paper:

In this paper we consider a related relational learning problem where, instead of a subset of labeled nodes, we have inaccurate external label/class distribution information for each node. This problem arises in many web applications. Consider, for example, the problem of identifying pages about Public works, Court, Health, Community development, Library etc. within the web site of a particular city. The link and directory relations contain useful signals for solving such a classification problem. Note that this relational structure will be different for different city web sites. If we are only interested in a small number of cities then we can afford to label a number of pages in each site and then apply transductive learning using the labeled nodes. But, if we want to do the classification on hundreds of thousands of city sites, labeling on all sites is expensive and we need to take a different approach. One possibility is to use a selected set of content dictionary features together with the labeling of a small random sample of pages from a number of sites to learn an inaccurate probabilistic classifier, e.g., logistic regression. Now, for any one city web site, the output of this initial classifier can be used to generate class distributions for the pages in the site, which can then be used together with the relational information in the site to get accurate classification.

In topic map parlance, we would say identity is being established by the associations in which a topic participates, but that is a matter of terminology, not a substantive difference.
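To make the second setting from the abstract more concrete, here is a minimal sketch in Python of local-global consistency (LGC) style propagation in which the inaccurate external class distributions act as the fitting term for every node. The toy graph, the `alpha` value and the function name `propagate` are assumptions for illustration only; this is not the authors' code and it is not their LSR method.

```python
import math

def propagate(edges, priors, alpha=0.8, iterations=50):
    """LGC-style iteration: F <- alpha * S * F + (1 - alpha) * Y.

    edges  : list of (u, v) undirected links between node ids 0..n-1
    priors : per-node class distributions from the external classifier (Y)
    Returns the index of the highest-scoring class for each node.
    """
    n = len(priors)
    k = len(priors[0])
    # Build symmetric adjacency lists and node degrees.
    neighbors = [[] for _ in range(n)]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    degree = [max(len(nb), 1) for nb in neighbors]  # avoid dividing by zero

    scores = [row[:] for row in priors]
    for _ in range(iterations):
        updated = []
        for i in range(n):
            row = []
            for c in range(k):
                # Normalized edge weight S_ij = 1 / sqrt(d_i * d_j).
                spread = sum(scores[j][c] / math.sqrt(degree[i] * degree[j])
                             for j in neighbors[i])
                row.append(alpha * spread + (1 - alpha) * priors[i][c])
            updated.append(row)
        scores = updated
    return [max(range(k), key=lambda c: row[c]) for row in scores]

# Hypothetical site: two tightly linked groups of pages joined by one link.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
# External classifier output; pages 2 and 3 are (weakly) misclassified.
priors = [[0.9, 0.1], [0.9, 0.1], [0.45, 0.55],
          [0.55, 0.45], [0.1, 0.9], [0.1, 0.9]]
labels = propagate(edges, priors)
```

In this toy case the external classifier alone would put page 2 in class 1 and page 3 in class 0; after propagation both are pulled back to the class of the pages they link to most.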

My Intro to Multiple Classification…

Saturday, December 29th, 2012

My Intro to Multiple Classification with Random Forests, Conditional Inference Trees, and Linear Discriminant Analysis

From the post:

After the work I did for my last post, I wanted to practice doing multiple classification. I first thought of using the famous iris dataset, but felt that was a little boring. Ideally, I wanted to look for a practice dataset where I could successfully classify data using both categorical and numeric predictors. Unfortunately it was tough for me to find such a dataset that was easy enough for me to understand.

The dataset I use in this post comes from a textbook called Analyzing Categorical Data by Jeffrey S Simonoff, and lends itself to basically the same kind of analysis done by blogger “Wingfeet” in his post predicting authorship of Wheel of Time books. In this case, the dataset contains counts of stop words (function words in English, such as “as”, “also”, “even”, etc.) in chapters, or scenes, from books or plays written by Jane Austen, Jack London (I’m not sure if “London” in the dataset might actually refer to another author), John Milton, and William Shakespeare. Being a textbook example, you just know there’s something worth analyzing in it!! The following table describes the numerical breakdown of books and chapters from each author:

An introduction to authorship studies as they were known (and may still be) in the academic circles of my youth.

I wonder if the same techniques are as viable today as on the Federalist Papers?

The Wheel of Time example demonstrates the technique remains viable for novel authors.

But what about authorship more broadly?

Can we reliably distinguish between news commentary from multiple sources?

Or between statements by elected officials?

How would your topic map represent purported authorship versus attributed authorship?

Or even a common authorship for multiple purported authors? (speech writers)

LIBOL 0.1.0

Friday, December 28th, 2012

From the webpage:

LIBOL is an open-source library for large-scale online classification, which consists of a large family of efficient and scalable state-of-the-art online learning algorithms for large-scale online classification tasks. We have offered easy-to-use command-line tools and examples for users and developers. We also have made documents available for both beginners and advanced users. LIBOL is not only a machine learning tool, but also a comprehensive experimental platform for conducting online learning research.

In general, the existing online learning algorithms for linear classification tasks can be grouped into two major categories: (i) first order learning (Rosenblatt, 1958; Crammer et al., 2006), and (ii) second order learning (Dredze et al., 2008; Wang et al., 2012; Yang et al., 2009).

Example online learning algorithms in the first order learning category implemented in this library include:

• Perceptron: the classical online learning algorithm (Rosenblatt, 1958);

• ALMA: A New Approximate Maximal Margin Classification Algorithm (Gentile, 2001);

• ROMMA: the relaxed online maximum margin algorithms (Li and Long, 2002);

• OGD: the Online Gradient Descent (OGD) algorithms (Zinkevich, 2003);

• PA: Passive Aggressive (PA) algorithms (Crammer et al., 2006), one of the state-of-the-art first order online learning algorithms.

Example algorithms in the second order online learning category implemented in this library include the following:

• SOP: the Second Order Perceptron (SOP) algorithm (Cesa-Bianchi et al., 2005);

• CW: the Confidence-Weighted (CW) learning algorithm (Dredze et al., 2008);

• IELLIP: online learning algorithms by improved ellipsoid method (Yang et al., 2009);

• AROW: the Adaptive Regularization of Weight Vectors (Crammer et al., 2009);

• NAROW: New variant of Adaptive Regularization (Orabona and Crammer, 2010);

• NHERD: the Normal Herding method via Gaussian Herding (Crammer and Lee, 2010);

• SCW: the recently proposed Soft Confidence-Weighted algorithms (Wang et al., 2012).
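For readers new to online learning, the first order family listed above can be illustrated with the basic Passive Aggressive update: look at one example at a time and change the weights only as much as needed to satisfy the margin on that example. The sketch below is plain Python written for this note; the function name `pa_train` and the toy stream are assumptions, and it is not LIBOL's own interface.

```python
def pa_train(stream, n_features):
    """One pass of the basic Passive Aggressive update.

    stream yields (x, y) pairs with x a list of floats and y in {-1, +1}.
    Returns the learned weight vector and the number of online mistakes.
    """
    w = [0.0] * n_features
    mistakes = 0
    for x, y in stream:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        if margin <= 0:
            mistakes += 1                  # prediction was wrong (or undecided)
        loss = max(0.0, 1.0 - margin)      # hinge loss on this example
        norm_sq = sum(xi * xi for xi in x)
        if loss > 0.0 and norm_sq > 0.0:
            tau = loss / norm_sq           # smallest step that restores the margin
            w = [wi + tau * y * xi for wi, xi in zip(w, x)]
    return w, mistakes

# Hypothetical stream of four labeled examples.
stream = [([1.0, 0.0], 1), ([0.0, 1.0], -1), ([1.0, 0.0], 1), ([0.0, 1.0], -1)]
weights, mistakes = pa_train(stream, n_features=2)
```

The "passive" part is the branch where the loss is zero and nothing changes; the "aggressive" part is the step size `tau`, which is chosen per example rather than fixed in advance.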

LIBOL is still being improved by improvements from practical users and new research results.

More information can be found in our project website: http://libol.stevenhoi.org/

Consider this an early New Year’s present!

EOL Classification Providers [Encyclopedia of Life]

Wednesday, December 26th, 2012

EOL Classification Providers

From the webpage:

The information on EOL is organized using hierarchical classifications of taxa (groups of organisms) from a number of different classification providers. You can explore these hierarchies in the Names tab of EOL taxon pages. Many visitors would expect to see a single classification of life on EOL. However, we are still far from having a classification scheme that is universally accepted.

Biologists all over the world are studying the genetic relationships between organisms in order to determine each species’ place in the hierarchy of life. While this research is underway, there will be differences in opinion on how to best classify each group. Therefore, we present our visitors with a number of alternatives. Each of these hierarchies is supported by a community of scientists, and all of them feature relationships that are controversial or unresolved.

How far from universally accepted?

Consider the sources for classification:

AntWeb
AntWeb is generally recognized as the most advanced biodiversity information system at species level dedicated to ants. Altogether, its acceptance by the ant research community, the number of participating remote curators that maintain the site, number of pictures, simplicity of web interface, and completeness of species, make AntWeb the premier reference for dissemination of data, information, and knowledge on ants. AntWeb is serving information on tens of thousands of ant species through the EOL.

Avibase
Avibase is an extensive database information system about all birds of the world, containing over 6 million records about 10,000 species and 22,000 subspecies of birds, including distribution information, taxonomy, synonyms in several languages and more. This site is managed by Denis Lepage and hosted by Bird Studies Canada, the Canadian copartner of Birdlife International. Avibase has been a work in progress since 1992 and it is offered as a free service to the bird-watching and scientific community. In addition to links, Avibase helped us install Gill, F & D Donsker (Eds). 2012. IOC World Bird Names (v 3.1). Available at http://www.worldbirdnames.org as of 2 May 2012. More bird classifications are likely to follow.

CoL
The Catalogue of Life Partnership (CoLP) is an informal partnership dedicated to creating an index of the world’s organisms, called the Catalogue of Life (CoL). The CoL provides different forms of access to an integrated, quality, maintained, comprehensive consensus species checklist and taxonomic hierarchy, presently covering more than one million species, and intended to cover all known species in the near future. The Annual Checklist EOL uses contains substantial contributions of taxonomic expertise from more than fifty organizations around the world, integrated into a single work by the ongoing work of the CoLP partners.

FishBase
FishBase is a global information system with all you ever wanted to know about fishes. FishBase is a relational database with information to cater to different professionals such as research scientists, fisheries managers, zoologists and many more. The FishBase Website contains data on practically every fish species known to science. The project was developed at the WorldFish Center in collaboration with the Food and Agriculture Organization of the United Nations and many other partners, and with support from the European Commission. FishBase is serving information on more than 30,000 fish species through EOL.

Index Fungorum
The Index Fungorum, the global fungal nomenclator coordinated and supported by the Index Fungorum Partnership (CABI, CBS, Landcare Research-NZ), contains names of fungi (including yeasts, lichens, chromistan fungal analogues, protozoan fungal analogues and fossil forms) at all ranks.

ITIS
The Integrated Taxonomic Information System (ITIS) is a partnership of federal agencies and other organizations from the United States, Canada, and Mexico, with data stewards and experts from around the world (see http://www.itis.gov). The ITIS database is an automated reference of scientific and common names of biota of interest to North America. It contains more than 600,000 scientific and common names in all kingdoms, and is accessible via the World Wide Web in English, French, Spanish, and Portuguese (http://itis.gbif.net). ITIS is part of the US National Biological Information Infrastructure (http://www.nbii.gov).

IUCN
International Union for Conservation of Nature (IUCN) helps the world find pragmatic solutions to our most pressing environment and development challenges. IUCN supports scientific research; manages field projects all over the world; and brings governments, non-government organizations, United Nations agencies, companies and local communities together to develop and implement policy, laws and best practice. EOL partnered with the IUCN to indicate status of each species according to the Red List of Threatened Species.

Metalmark Moths of the World
Metalmark moths (Lepidoptera: Choreutidae) are a poorly known, mostly tropical family of microlepidopterans. The Metalmark Moths of the World LifeDesk provides species pages and an updated classification for the group.

NCBI
As a U.S. national resource for molecular biology information, NCBI’s mission is to develop new information technologies to aid in the understanding of fundamental molecular and genetic processes that control health and disease. The NCBI taxonomy database contains the names of all organisms that are represented in the genetic databases with at least one nucleotide or protein sequence.

The Paleobiology Database
The Paleobiology Database is a public resource for the global scientific community. It has been organized and operated by a multi-disciplinary, multi-institutional, international group of paleobiological researchers. Its purpose is to provide global, collection-based occurrence and taxonomic data for marine and terrestrial animals and plants of any geological age, as well as web-based software for statistical analysis of the data. The project’s wider, long-term goal is to encourage collaborative efforts to answer large-scale paleobiological questions by developing a useful database infrastructure and bringing together large data sets.

The Reptile Database
This database provides information on the classification of all living reptiles by listing all species and their pertinent higher taxa. The database therefore covers all living snakes, lizards, turtles, amphisbaenians, tuataras, and crocodiles. It is a source of taxonomic data, thus providing primarily (scientific) names, synonyms, distributions and related data. The database is currently supported by the Systematics working group of the German Herpetological Society (DGHT).

WoRMS
The aim of a World Register of Marine Species (WoRMS) is to provide an authoritative and comprehensive list of names of marine organisms, including information on synonymy. While highest priority goes to valid names, other names in use are included so that this register can serve as a guide to interpret taxonomic literature.

Those are “current” classifications, which don’t reflect historical classifications (used by our ancestors), nor future classifications.

The four states of matter becoming more than 500 states of matter, for example.

Instead of “universal acceptance,” how does “working agreement for a specific purpose” sound?

Fast rule-based bioactivity prediction using associative classification mining

Sunday, November 25th, 2012

Fast rule-based bioactivity prediction using associative classification mining by Pulan Yu and David J Wild. (Journal of Cheminformatics 2012, 4:29)

Who moved my acronym? continues: ACM = Association for Computing Machinery or associative classification mining.

Abstract:

Relating chemical features to bioactivities is critical in molecular design and is used extensively in lead discovery and optimization process. A variety of techniques from statistics, data mining and machine learning have been applied to this process. In this study, we utilize a collection of methods, called associative classification mining (ACM), which are popular in the data mining community, but so far have not been applied widely in cheminformatics. More specifically, the classification based on predictive association rules (CPAR), classification based on multiple association rules (CMAR) and classification based on association rules (CBA) are employed on three datasets using various descriptor sets. Experimental evaluations on anti-tuberculosis (antiTB), mutagenicity and hERG (the human Ether-a-go-go-Related Gene) blocker datasets show that these three methods are computationally scalable and appropriate for high speed mining. Additionally, they provide comparable accuracy and efficiency to the commonly used Bayesian and support vector machines (SVM) method, and produce highly interpretable models.

An interesting lead on investigation of associations in large data sets. Pass on those meeting a threshold for further evaluation?

Constructing a true LCSH tree of a science and engineering collection

Monday, November 19th, 2012

Constructing a true LCSH tree of a science and engineering collection by Charles-Antoine Julien, Pierre Tirilly, John E. Leide and Catherine Guastavino.

Abstract:

The Library of Congress Subject Headings (LCSH) is a subject structure used to index large library collections throughout the world. Browsing a collection through LCSH is difficult using current online tools in part because users cannot explore the structure using their existing experience navigating file hierarchies on their hard drives. This is due to inconsistencies in the LCSH structure, which does not adhere to the specific rules defining tree structures. This article proposes a method to adapt the LCSH structure to reflect a real-world collection from the domain of science and engineering. This structure is transformed into a valid tree structure using an automatic process. The analysis of the resulting LCSH tree shows a large and complex structure. The analysis of the distribution of information within the LCSH tree reveals a power law distribution where the vast majority of subjects contain few information items and a few subjects contain the vast majority of the collection.

After a detailed analysis of records from the McGill University Libraries (204,430 topical authority records) and 130,940 bibliographic records (Schulich Science and Engineering Library), the authors conclude in part:

This revealed that the structure was large, highly redundant due to multiple inheritances, very deep, and unbalanced. The complexity of the LCSH tree is a likely usability barrier for subject browsing and navigation of the information collection.
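The abstract does not spell out the authors' automatic process, so the following is only an assumed illustration of why multiple inheritance makes a "true tree" large and redundant: one simple way to get a strict tree from a poly-hierarchy is to copy a heading under each of its broader terms. The headings and the function names `unfold` and `count_nodes` are made up for the example, and the sketch assumes the input contains no cycles.

```python
def unfold(children, root):
    """Expand a poly-hierarchy (heading -> narrower headings) into a nested tree.

    A heading reachable through several broader terms is copied once per path,
    so every node in the result has exactly one parent.
    """
    return {"label": root,
            "children": [unfold(children, c) for c in children.get(root, [])]}

def count_nodes(tree):
    """Total number of nodes in the nested tree, copies included."""
    return 1 + sum(count_nodes(child) for child in tree["children"])

# Hypothetical headings: "Physical chemistry" has two broader terms.
children = {
    "Science": ["Physics", "Chemistry"],
    "Physics": ["Physical chemistry"],
    "Chemistry": ["Physical chemistry"],
    "Physical chemistry": ["Thermochemistry"],
}
tree = unfold(children, "Science")
distinct_headings = 5
tree_size = count_nodes(tree)
```

Five distinct headings become seven tree nodes here because the doubly inherited heading drags its whole subtree along with each copy, which is the kind of redundancy the authors report at a much larger scale.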

For me the most compelling part of this research was the focus on LCSH as used and not as it imagines itself. Very interesting reading. A slow walk through the bibliography will interest those researching LCSH or classification more generally.

Demonstration of the power law with the use of LCSH makes one wonder about other classification systems as used.

The 2012 ACM Computing Classification System toc

Friday, September 21st, 2012

From the post:

The 2012 ACM Computing Classification System has been developed as a poly-hierarchical ontology that can be utilized in semantic web applications. It replaces the traditional 1998 version of the ACM Computing Classification System (CCS), which has served as the de facto standard classification system for the computing field. It is being integrated into the search capabilities and visual topic displays of the ACM Digital Library. It relies on a semantic vocabulary as the single source of categories and concepts that reflect the state of the art of the computing discipline and is receptive to structural change as it evolves in the future. ACM will provide tools to facilitate the application of 2012 CCS categories to forthcoming papers and a process to ensure that the CCS stays current and relevant. The new classification system will play a key role in the development of a people search interface in the ACM Digital Library to supplement its current traditional bibliographic search.

The full CCS classification tree is freely available for educational and research purposes in these downloadable formats: SKOS (xml), Word, and HTML. In the ACM Digital Library, the CCS is presented in a visual display format that facilitates navigation and feedback.

Will be looking at how the classification has changed since 1998. And since we have so much data online, should not be all that hard to see how well 1998 categories work for 1988, or 1977?

All for a classification that is “current and relevant.”

Still, don’t want papers dropping off the edge of the semantic world due to changes in classification.

Learning Mahout : Classification

Monday, September 10th, 2012

Learning Mahout : Classification by Sujit Pal.

From the post:

The final part covered in the MIA book is Classification. The popular algorithms available are Stochastic Gradient Descent (SGD), Naive Bayes and Complementary Naive Bayes, Random Forests and Online Passive Aggressive. There are other algorithms in the pipeline, as seen from the Classification section of the Mahout wiki page.

The MIA book has generic classification information and advice that will be useful for any algorithm, but it specifically covers SGD, Bayes and Naive Bayes (the last two via Mahout scripts). Of these, SGD and Random Forest are good for classification problems involving continuous variables and small to medium datasets, and the Naive Bayes family is good for problems involving text-like variables and medium to large datasets.

In general, a solution to a classification problem involves choosing the appropriate features for classification, choosing the algorithm, generating the feature vectors (vectorization), training the model and evaluating the results in a loop. You continue to tweak stuff in each of these steps until you get the results with the desired accuracy.

Sujit notes that classification is under rapid development. The classification material is likely to become dated.

Some additional resources to consider:

Mahout User List (subscribe)

Mahout Developer List (subscribe)

IRC: Mahout’s IRC channel is #mahout.

Mahout QuickStart

MyMiner: a web application for computer-assisted biocuration and text annotation

Thursday, August 30th, 2012

MyMiner: a web application for computer-assisted biocuration and text annotation by David Salgado, Martin Krallinger, Marc Depaule, Elodie Drula, Ashish V. Tendulkar, Florian Leitner, Alfonso Valencia and Christophe Marcelle. (Bioinformatics (2012) 28 (17): 2285-2287. doi: 10.1093/bioinformatics/bts435)

Abstract:

Motivation: The exponential growth of scientific literature has resulted in a massive amount of unstructured natural language data that cannot be directly handled by means of bioinformatics tools. Such tools generally require structured data, often generated through a cumbersome process of manual literature curation. Herein, we present MyMiner, a free and user-friendly text annotation tool aimed to assist in carrying out the main biocuration tasks and to provide labelled data for the development of text mining systems. MyMiner allows easy classification and labelling of textual data according to user-specified classes as well as predefined biological entities. The usefulness and efficiency of this application have been tested for a range of real-life annotation scenarios of various research topics.

Availability: http://myminer.armi.monash.edu.au.

Contacts: david.salgado@monash.edu and christophe.marcelle@monash.edu

Supplementary Information: Supplementary data are available at Bioinformatics online.

A useful tool and good tutorial materials.

I could easily see something similar for CS research (unless such already exists).

K-Nearest-Neighbors and Handwritten Digit Classification

Monday, August 27th, 2012

K-Nearest-Neighbors and Handwritten Digit Classification by Jeremy Kun.

From the post:

The Recipe for Classification

One important task in machine learning is to classify data into one of a fixed number of classes. For instance, one might want to discriminate between useful email and unsolicited spam. Or one might wish to determine the species of a beetle based on its physical attributes, such as weight, color, and mandible length. These “attributes” are often called “features” in the world of machine learning, and they often correspond to dimensions when interpreted in the framework of linear algebra. As an interesting warm-up question for the reader, what would be the features for an email message? There are certainly many correct answers.

The typical way of having a program classify things goes by the name of supervised learning. Specifically, we provide a set of already-classified data as input to a training algorithm, the training algorithm produces an internal representation of the problem (a model, as statisticians like to say), and a separate classification algorithm uses that internal representation to classify new data. The training phase is usually complex and the classification algorithm simple, although that won’t be true for the method we explore in this post.

More often than not, the input data for the training algorithm are converted in some reasonable way to a numerical representation. This is not as easy as it sounds. We’ll investigate one pitfall of the conversion process in this post, but in doing this we separate the data from the application domain in a way that permits mathematical analysis. We may focus our questions on the data and not on the problem. Indeed, this is the basic recipe of applied mathematics: extract from a problem the essence of the question you wish to answer, answer the question in the pure world of mathematics, and then interpret the results.

We’ve investigated data-oriented questions on this blog before, such as, “is the data linearly separable?” In our post on the perceptron algorithm, we derived an algorithm for finding a line which separates all of the points in one class from the points in the other, assuming one exists. In this post, however, we make a different structural assumption. Namely, we assume that data points which are in the same class are also close together with respect to an appropriate metric. Since this is such a key point, it bears repetition and elevation in the typical mathematical fashion. The reader should note the following is not standard terminology, and it is simply a mathematical restatement of what we’ve already said.

Modulo my concerns about assigning non-metric data to metric spaces, this is a very good post on classification.
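The structural assumption in the excerpt (points in the same class are close together under some metric) translates almost directly into code. Here is a minimal k-nearest-neighbors sketch in Python; it is not Jeremy Kun's implementation, and the Euclidean metric, the function names and the toy points are assumptions chosen for illustration. Note that the choice of `distance` is exactly where the concern about forcing non-metric data into a metric space shows up.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(training, point, k=3, distance=euclidean):
    """Label a point by majority vote among its k nearest training examples.

    training is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(training, key=lambda pair: distance(pair[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical two-class training set.
training = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
            ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
near_origin = knn_classify(training, (0.5, 0.5))
far_corner = knn_classify(training, (5.5, 5.5))
```

There is no training phase to speak of: all the work happens at classification time, which is the reversal of the usual "complex training, simple classification" pattern that the excerpt mentions.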

UCR Time Series Classification/Clustering Page

Friday, July 6th, 2012

I encountered this while hunting down references on the insect identification contest.

How does your thinking about topic maps or other semantic solutions fare against:

Machine learning research has, to a great extent, ignored an important aspect of many real world applications: time. Existing concept learners predominantly operate on a static set of attributes; for example, classifying flowers described by leaf size, petal colour and petal count. The values of these attributes is assumed to be unchanging — the flower never grows or loses leaves.

However, many real datasets are not “static”; they cannot sensibly be represented as a fixed set of attributes. Rather, the examples are expressed as features that vary temporally, and it is the temporal variation itself that is used for classification. Consider a simple gesture recognition domain, in which the temporal features are the position of the hands, finger bends, and so on. Looking at the position of the hand at one point in time is not likely to lead to a successful classification; it is only by analysing changes in position that recognition is possible.

(Temporal Classification: Extending the Classification Paradigm to Multivariate Time Series by Mohammed Waleed Kadous (2002))

A decade old now but still a nice summary of the issue.

Can we substitute “identification” for “machine learning research?”

Are you relying “…on a static set of attributes” for identity purposes?

UCR Insect Classification Contest [Classification by Ear]

Friday, July 6th, 2012

UCR Insect Classification Contest Ends November 16, 2012

As I have said before, subject identity is everywhere! ;-)

From the details PDF file:

Phase I: July to November 16th 2012 (this contest)

  • The task is to produce the best distance (similarity) measure for insect flight sounds.
  • The contest will be scored by 1-nearest neighbor classification.
  • The prizes include $500 cash and engraved trophies.
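Since entries are judged by 1-nearest neighbor classification, a contestant's whole contribution is the distance function. The sketch below shows how one might compare candidate distance measures locally. The leave-one-out protocol, the function names and the toy "recordings" are assumptions for illustration; the contest's actual scoring procedure and data are described in its own documents.

```python
def loo_1nn_accuracy(samples, distance):
    """Leave-one-out 1-nearest-neighbor accuracy for a candidate distance.

    samples is a list of (series, label) pairs; each series is classified
    by the label of its nearest neighbor among all the other series.
    """
    correct = 0
    for i, (series, label) in enumerate(samples):
        others = samples[:i] + samples[i + 1:]
        nearest = min(others, key=lambda other: distance(series, other[0]))
        if nearest[1] == label:
            correct += 1
    return correct / len(samples)

def manhattan(a, b):
    """Sum of absolute pointwise differences between two series."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical, very short "flight sound" series for two insect classes.
samples = [([1, 2, 1, 2], "bee"), ([1, 2, 1, 3], "bee"),
           ([5, 5, 5, 5], "fly"), ([5, 6, 5, 5], "fly")]
good_score = loo_1nn_accuracy(samples, manhattan)
useless_score = loo_1nn_accuracy(samples, lambda a, b: 0)
```

A distance that ignores its inputs does no better than chance on this toy set, while the pointwise distance separates the two classes cleanly; real recordings would of course need something far more robust to shifts in time and frequency.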

I was amused to read in the FAQ:

Note that the “sound” is measured with an optical sensor, rather than an acoustic one. This is done for various pragmatic reasons, however we don’t believe it makes any difference to the task at hand. The sampling rate is 16000 Hz.

If you have a beekeeper nearby, can you do an empirical comparison of optical versus acoustic sensors for capturing the “sound” of insects?

That seems like a first step in establishing computational entomology. BTW, use a range of frequencies, from infrasonic to ultrasonic. (You are aware that infrasonic sounds from whales have been found to travel thousands of miles? Unlikely with insects, but just because our ears can’t hear something doesn’t mean other ears cannot.)

I first saw this at KDNuggets.

The strange case of eugenics:…

Monday, July 2nd, 2012

The strange case of eugenics: A subject’s ontogeny in a long-lived classification scheme and the question of collocative integrity by Joseph T. Tennis. (Tennis, J. T. (2012), The strange case of eugenics: A subject’s ontogeny in a long-lived classification scheme and the question of collocative integrity. J. Am. Soc. Inf. Sci., 63: 1350–1359. doi: 10.1002/asi.22686)

Abstract:

This article introduces the problem of collocative integrity present in long-lived classification schemes that undergo several changes. A case study of the subject “eugenics” in the Dewey Decimal Classification is presented to illustrate this phenomenon. Eugenics is strange because of the kinds of changes it undergoes. The article closes with a discussion of subject ontogeny as the name for this phenomenon and describes implications for information searching and browsing.

Tennis writes:

While many theorists have concerned themselves with how to design a scheme that can handle the addition of subjects, very little has been done to study how a subject changes after it is introduced to a scheme. Simply because we add civil engineering to a scheme of classification in 1920 does not signify that it means the same thing today. Almost 100 years have passed, and many things have changed in that subject. We may have subdivided this class in 1950, thereby separating the pre-1950 meaning from the post-1950 meaning and also affecting the collocative power of the class civil engineering. Other classes in the superclass of engineering might be considered too close, and are eliminated over time, affecting the way the classifier does her or his work (cf. Tennis, 2007; Tennis & Sutton, 2008). It is because of these concerns, coupled with the design requirement of collocation in classification, that we need to look at the life of a subject over time—the subject’s scheme history or ontogeny.

Deeply interesting work that has implications for topic map structures and the preservation of “collocative integrity” over time.

One suspects that preservation of “collocative integrity” is an ongoing process that requires more than simple assignments in a scheme.

What factors would you capture to trace the ontogeny of “eugenics” and how would you use them to preserve “collocative integrity” across that history using a topic map? (Remember that users at any point in that ontogeny may be ignorant of prior, and obviously of subsequent, changes in its classification.)

DDC 23 released as linked data at dewey.info

Friday, June 29th, 2012

DDC 23 released as linked data at dewey.info

From the post:

As announced on Monday at the seminar “Global Interoperability and Linked Data in Libraries” in beautiful Florence, an exciting new set of linked data has been added to dewey.info. All assignable classes from DDC 23, the current full edition of the Dewey Decimal Classification, have been released as Dewey linked data. As was the case for the Abridged Edition 14 data, we define “assignable” as including every schedule number that is not a span or a centered entry, bracketed or optional, with the hierarchical relationships adjusted accordingly. In short, these are numbers that you find attached to many WorldCat records as standard Dewey numbers (in 082 fields), as additional Dewey numbers (in 083 fields), or as number components (in 085 fields).

The classes are exposed with full number and caption information and semantic relationships expressed in SKOS, which makes the information easily accessible and parsable by a wide variety of semantic web applications.

This recent addition massively expands the data set by over 38.000 Dewey classes (or, for the linked data geeks out there, by over 1 million triples), increasing the number of classes available almost tenfold. If you like, take some time to explore the hierarchies; you might be surprised to find numbers for Maya calendar or transits of Venus (loyal blog readers will recognize these numbers).

All the old goodies are still there, of course. Depending on which type of user agent is accessing the data (e.g., a browser) a different representation is negotiated (HTML or various flavors of RDF). The HTML pages still include RDFa markup, which can be distilled into RDF by browser plug-ins and other applications without the user ever having to deal with the RDF data directly.
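The negotiation step described above can be sketched in a few lines. This is a simplified illustration, not how dewey.info is implemented: the function names and the list of offered media types are assumptions, and the sketch ignores wildcards such as `*/*`.

```python
def parse_accept(header):
    """Split an HTTP Accept header into (media_type, q) pairs."""
    prefs = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0]
        if not media_type:
            continue
        q = 1.0  # default quality when no q parameter is given
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    q = float(field[2:])
                except ValueError:
                    q = 0.0
        prefs.append((media_type, q))
    return prefs


def negotiate(header, available):
    """Return the offered media type the client prefers most, or None."""
    best, best_q = None, 0.0
    for media_type, q in parse_accept(header):
        if media_type in available and q > best_q:
            best, best_q = media_type, q
    return best


# Hypothetical set of representations a server might offer.
offered = ["text/html", "application/rdf+xml", "text/turtle"]
browser_choice = negotiate("text/html,application/xhtml+xml;q=0.9", offered)
rdf_choice = negotiate("application/rdf+xml;q=0.8, text/turtle", offered)
```

A browser-style header ends up with HTML, while a client that asks for RDF gets whichever RDF flavor it ranked highest.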

More details follow but that should be enough to capture your interest.

Good thing there is a pointer for the Maya calendar. Would hate for interstellar archaeologists to think we were too slow to invent a classification number for the disaster that is supposed to befall us this coming December.

I have renewed my ACM and various SIG memberships to run beyond December 2012. In the event of an actual disaster refunds will not be an issue. ;-)

Visual and semantic interpretability of projections of high dimensional data for classification tasks

Thursday, May 24th, 2012

Visual and semantic interpretability of projections of high dimensional data for classification tasks by Ilknur Icke and Andrew Rosenberg.

A number of visual quality measures have been introduced in visual analytics literature in order to automatically select the best views of high dimensional data from a large number of candidate data projections. These methods generally concentrate on the interpretability of the visualization and pay little attention to the interpretability of the projection axes. In this paper, we argue that interpretability of the visualizations and the feature transformation functions are both crucial for visual exploration of high dimensional labeled data. We present a two-part user study to examine these two related but orthogonal aspects of interpretability. We first study how humans judge the quality of 2D scatterplots of various datasets with varying number of classes and provide comparisons with ten automated measures, including a number of visual quality measures and related measures from various machine learning fields. We then investigate how the user perception on interpretability of mathematical expressions relate to various automated measures of complexity that can be used to characterize data projection functions. We conclude with a discussion of how automated measures of visual and semantic interpretability of data projections can be used together for exploratory analysis in classification tasks.

Rather small group of test subjects (20) so I don’t think you can say much other than more work is needed.

Then it occurred to me that I often speak of studies applying to “users” without stopping to remember that for many tasks, I fall into that self-same category. Subject to the same influences, fatigues and even mistakes.

Anyone know of research by researchers being applied to the same researchers?

New Mechanical Turk Categorization App

Saturday, May 19th, 2012

New Mechanical Turk Categorization App

Categorization is one of the more popular use cases for the Amazon Mechanical Turk. A categorization HIT (Human Intelligence Task) asks the Worker to select from a list of options. Our customers use HITs of this type to assign product categories, match URLs to business listings, and to discriminate between line art and photographs.

Using our new Categorization App, you can start categorizing your own items or data in minutes, eliminating the learning curve that has traditionally accompanied this type of activity. The app includes everything that you need to be successful including:

  1. Predefined HITs (no HTML editing required).
  2. Pre-qualified Master Workers (see Jinesh’s previous blog post on Mechanical Turk Masters).
  3. Price recommendations based on complexity and comparable HITs.
  4. Analysis tools.

The Categorization App guides you through the four simple steps that are needed to create your categorization project.

I thought the contrast between gamers (the GPU post) and MTurkers would be a nice way to close the day. ;-)

Although, there are efforts to create games where useful activity happens, whether intended or not. (Would that take some of the joy out of a game?)

If you use this particular app, please blog or post a note about your experience.

Thanks!

Are visual dictionaries generalizable?

Sunday, May 13th, 2012

Are visual dictionaries generalizable? by Otavio A. B. Penatti, Eduardo Valle, and Ricardo da S. Torres

Abstract:

Mid-level features based on visual dictionaries are today a cornerstone of systems for classification and retrieval of images. Those state-of-the-art representations depend crucially on the choice of a codebook (visual dictionary), which is usually derived from the dataset. In general-purpose, dynamic image collections (e.g., the Web), one cannot have the entire collection in order to extract a representative dictionary. However, based on the hypothesis that the dictionary reflects only the diversity of low-level appearances and does not capture semantics, we argue that a dictionary based on a small subset of the data, or even on an entirely different dataset, is able to produce a good representation, provided that the chosen images span a diverse enough portion of the low-level feature space. Our experiments confirm that hypothesis, opening the opportunity to greatly alleviate the burden in generating the codebook, and confirming the feasibility of employing visual dictionaries in large-scale dynamic environments.

The authors use the Caltech-101 image set because of its “diversity.” Odd because they cite the Caltech-256 image set, which was created to answer concerns about the lack of diversity in the Caltech-101 image set.

Not sure this paper answers the issues it raises about visual dictionaries.

Wanted to bring it to your attention because representative dictionaries (as opposed to comprehensive ones) may be lurking just beyond the semantic horizon.

How do you compare two text classifiers?

Friday, May 4th, 2012

How do you compare two text classifiers?

Tony Russell-Rose writes:

I need to compare two text classifiers – one human, one machine. They are assigning multiple tags from an ontology. We have an initial corpus of ~700 records tagged by both classifiers. The goal is to measure the ‘value added’ by the human. However, we don’t yet have any ground truth data (i.e. agreed annotations).

Any ideas on how best to approach this problem in a commercial environment (i.e. quickly, simply, with minimum fuss), or indeed what’s possible?

I thought of measuring the absolute delta between the two profiles (regardless of polarity) to give a ceiling on the value added, and/or comparing the profile of tags added by each human coder against the centroid to give a crude measure of inter-coder agreement (and hence difficulty of the task). But neither really measures the ‘value added’ that I’m looking for, so I’m sure there must better solutions.

Suggestions, anyone? Or is this as far as we can go without ground truth data?
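One reading of the two measures Tony mentions can be sketched without any ground truth. This is my interpretation, not Tony's code: the per-record Jaccard overlap stands in for agreement, `profile_delta` is an assumed reading of "absolute delta between the two profiles," and the record ids and tags are made up.

```python
from collections import Counter


def jaccard(a, b):
    """Overlap of two tag sets: 1.0 when identical, 0.0 when disjoint."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_agreement(human, machine):
    """Average per-record Jaccard overlap over records tagged by both."""
    ids = human.keys() & machine.keys()
    if not ids:
        raise ValueError("no records tagged by both classifiers")
    return sum(jaccard(human[i], machine[i]) for i in ids) / len(ids)


def profile_delta(human, machine):
    """Sum of absolute differences in how often each tag was assigned overall."""
    h = Counter(tag for tags in human.values() for tag in tags)
    m = Counter(tag for tags in machine.values() for tag in tags)
    return sum(abs(h[tag] - m[tag]) for tag in h.keys() | m.keys())


# Hypothetical records: each id maps to the set of tags assigned.
human = {"r1": {"law", "tax"}, "r2": {"health"}}
machine = {"r1": {"law"}, "r2": {"health", "tax"}}
agreement = mean_agreement(human, machine)
delta = profile_delta(human, machine)
```

In this toy case the corpus-level profiles are identical even though the two taggers disagree on every record, which is one reason the profile delta alone says little about ‘value added.’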

Some useful comments have been made. Do you have others?

PS: I wrote at Tony’s blog in a comment:

Tony,

The ‘value added’ by human taggers concept is unclear. The tagging in both cases is the result of human adding of semantics. Once through the rules for the machine tagger and once via the “human” taggers.

Can you say a bit more about what you see as a separate ‘value added’ by the human taggers?

What do you think? Is Tony’s question clear enough?

Learning Fuzzy β-Certain and β-Possible rules…

Wednesday, April 18th, 2012

Learning Fuzzy β-Certain and β-Possible rules from incomplete quantitative data by rough sets by Ali Soltan Mohammadi, L. Asadzadeh, and D. D. Rezaee.

Abstract:

The rough-set theory proposed by Pawlak, has been widely used in dealing with data classification problems. The original rough-set model is, however, quite sensitive to noisy data. Tzung thus proposed deals with the problem of producing a set of fuzzy certain and fuzzy possible rules from quantitative data with a predefined tolerance degree of uncertainty and misclassification. This model allowed, which combines the variable precision rough-set model and the fuzzy set theory, is thus proposed to solve this problem. This paper thus deals with the problem of producing a set of fuzzy certain and fuzzy possible rules from incomplete quantitative data with a predefined tolerance degree of uncertainty and misclassification. A new method, incomplete quantitative data for rough-set model and the fuzzy set theory, is thus proposed to solve this problem. It first transforms each quantitative value into a fuzzy set of linguistic terms using membership functions and then finding incomplete quantitative data with lower and the fuzzy upper approximations. It second calculates the fuzzy {\beta}-lower and the fuzzy {\beta}-upper approximations. The certain and possible rules are then generated based on these fuzzy approximations. These rules can then be used to classify unknown objects.
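To make the β-lower and β-upper approximations concrete, here is a sketch of the crisp variable precision rough-set version only. It is not the fuzzy, incomplete-data method of the paper: the function name, the inclusion-threshold reading of β (a value in (0.5, 1]), and the toy partition are assumptions for illustration.

```python
def beta_approximations(blocks, target, beta):
    """Return (beta-lower, beta-upper) approximations of `target`.

    blocks: sets that partition the universe (the equivalence classes)
    target: the set of objects being approximated
    beta:   inclusion threshold, assumed here to lie in (0.5, 1]
    """
    if not 0.5 < beta <= 1.0:
        raise ValueError("beta must be in (0.5, 1]")
    lower, upper = set(), set()
    for block in blocks:
        inclusion = len(block & target) / len(block)
        if inclusion >= beta:        # mostly inside the target: "certain"
            lower |= block
        if inclusion > 1.0 - beta:   # not mostly outside: "possible"
            upper |= block
    return lower, upper


# Toy universe of ten objects in three equivalence classes (hypothetical).
blocks = [{1, 2, 3, 4}, {5, 6}, {7, 8, 9, 10}]
target = {1, 2, 3, 5, 7}
lower, upper = beta_approximations(blocks, target, beta=0.75)
```

With `beta=1.0` the same function falls back to the classical lower and upper approximations, which is where the sensitivity to noisy data comes from.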

In part interesting because of its full use of sample data to illustrate the process being advocated.

Unless smooth sets in data are encountered by some mis-chance, rough sets will remain a mainstay of data mining for the foreseeable future.

Third Challenge on Large Scale Hierarchical Text Classification

Monday, April 16th, 2012

ECML/PKDD 2012 Discovery Challenge: Third Challenge on Large Scale Hierarchical Text Classification

Important dates:

- March 30, start of the challenge
- April 20, opening of the evaluation
- June 29, closing of evaluation
- July 20, paper submission deadline
- August 3, paper notifications

From the website:

This year’s discovery challenge hosts the third edition of the successful PASCAL challenges on large scale hierarchical text classification. The challenge comprises three tracks and it is based on two large datasets created from the ODP web directory (DMOZ) and Wikipedia. The datasets are multi-class, multi-label and hierarchical. The number of categories ranges between 13,000 and 325,000 roughly and the number of documents between 380,000 and 2,400,000.

The tracks of the challenge are organized as follows:

1. Standard large-scale hierarchical classification
a) On collection of medium size from Wikipedia
b) On a large collection from Wikipedia

2. Multi-task learning, based on both DMOZ and Wikipedia category systems

3. Refinement-learning
a) Semi-Supervised approach
b) Unsupervised approach

In order to register for the challenge and gain access to the datasets you must have an account at the challenge Web site.

More fun than repeating someone’s vocabulary. Yes?

Classifier Technology and the Illusion of Progress

Friday, April 13th, 2012

Classifier Technology and the Illusion of Progress by David J. Hand.

Was pointed to in Simply Statistics for 8 April 2012:

Abstract:

A great many tools have been developed for supervised classification, ranging from early methods such as linear discriminant analysis through to modern developments such as neural networks and support vector machines. A large number of comparative studies have been conducted in attempts to establish the relative superiority of these methods. This paper argues that these comparisons often fail to take into account important aspects of real problems, so that the apparent superiority of more sophisticated methods may be something of an illusion. In particular, simple methods typically yield performance almost as good as more sophisticated methods, to the extent that the difference in performance may be swamped by other sources of uncertainty that generally are not considered in the classical supervised classification paradigm.

The original pointer didn’t mention there were four published comments and a formal rejoinder:

Comment: Classifier Technology and the Illusion of Progress by Jerome H. Friedman.

Comment: Classifier Technology and the Illusion of Progress–Credit Scoring by Ross W. Gayler.

Elaboration on Two Points Raised in “Classifier Technology and the Illusion of Progress” by Robert C. Holte.

Comment: Classifier Technology and the Illusion of Progress by Robert A. Stine.

Rejoinder: Classifier Technology and the Illusion of Progress by David J. Hand.

Enjoyable reading, one and all!

Adobe Releases Malware Classifier Tool

Wednesday, April 4th, 2012

Adobe Releases Malware Classifier Tool by Dennis Fisher.

From the post:

Adobe has published a free tool that can help administrators and security researchers classify suspicious files as malicious or benign, using specific machine-learning algorithms. The tool is a command-line utility that Adobe officials hope will make binary classification a little easier.

Adobe researcher Karthik Raman developed the new Malware Classifier tool to help with the company’s internal needs and then decided that it might be useful for external users, as well.

” To make life easier, I wrote a Python tool for quick malware triage for our team. I’ve since decided to make this tool, called “Adobe Malware Classifier,” available to other first responders (malware analysts, IT admins and security researchers of any stripe) as an open-source tool, since you might find it equally helpful,” Raman wrote in a blog post.

“Malware Classifier uses machine learning algorithms to classify Win32 binaries – EXEs and DLLs – into three classes: 0 for “clean,” 1 for “malicious,” or “UNKNOWN.” The tool extracts seven key features from a binary, feeds them to one or all of the four classifiers, and presents its classification results.”
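How several classifiers might be folded into the three outputs can be sketched as follows. I have not checked this against the tool's source: the unanimity rule, the function names, and the placeholder classifiers and feature names are assumptions, not Adobe's code.

```python
def combine_votes(votes):
    """Return 0 or 1 when every classifier agrees, otherwise "UNKNOWN"."""
    distinct = set(votes)
    if len(distinct) == 1 and distinct <= {0, 1}:
        return distinct.pop()
    return "UNKNOWN"


def classify(features, classifiers):
    """Run each classifier on the extracted features and combine the votes."""
    return combine_votes([clf(features) for clf in classifiers])


# Hypothetical stand-ins for trained classifiers over made-up feature names.
classifiers = [
    lambda f: 1 if f["feature_a"] > 10 else 0,
    lambda f: 1 if f["feature_b"] > 5 else 0,
]
agreed = classify({"feature_a": 20, "feature_b": 9}, classifiers)
split = classify({"feature_a": 20, "feature_b": 1}, classifiers)
```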

Adobe Malware Classifier (Sourceforge)

Old hat that malware scanners have been using machine learning but new that you can now see it from the inside.

Lessons to be learned about machine learning algorithms for malware and other uses with software.

Kudos to Adobe!

UCR Time Series Classification/Clustering Page

Monday, April 2nd, 2012

UCR Time Series Classification/Clustering Page

From the webpage:

This webpage has been created as a public service to the data mining/machine learning community, to encourage reproducible research for time series classification and clustering.

While chasing the details on Eamonn Keogh and his time series presentation, I encountered this collection of data sets.

From counting citations to measuring usage (help needed!)

Tuesday, March 20th, 2012

From counting citations to measuring usage (help needed!)

Daniel Lemire writes:

We sometimes measure the caliber of a researcher by how many research papers he wrote. This is silly. While there is some correlation between quantity and quality — people like Einstein tend to publish a lot — it can be gamed easily. Moreover, several major researchers have published relatively few papers: John Nash has about two dozens papers in Scopus. Even if you don’t know much about science, I am sure you can think of a few writers who have written only a couple of books but are still world famous.

A better measure is the number of citations a researcher has received. Google Scholar profiles display the citation record of researchers prominently. It is a slightly more robust measure, but it is still silly because 90% of citations are shallow: most authors haven’t even read the paper they are citing. We tend to cite famous authors and famous venues in the hope that some of the prestige will get reflected.

But why stop there? We have the technology to measure the usage made of a cited paper. Some citations are more significant: for example it can be an extension of the cited paper. Machine learning techniques can measure the impact of your papers based on how much following papers build on your results. Why isn’t it done?

Daniel wants to distinguish important papers that cite his papers from ho-hum papers that cite him. (my characterization, not his)

That isn’t happening now so Daniel has teamed up with Peter Turney and Andre Vellino to gather data from published authors (that would be you), to use in investigating this problem.

Topic maps of scholarly and other work face the same problem. How do you distinguish the important from the less so? For that matter, what criteria do you use? If an author who cites you wins the Nobel Prize for work that doesn’t cite you, does the importance of your paper go up? Stay the same? Go down? ;-)

It is an important issue so if you are a published author, see Daniel’s post and contribute to the data gathering.

Will the Circle Be Unbroken? Interactive Annotation!

Wednesday, February 29th, 2012

I have to agree with Bob Carpenter, the title is a bit much:

Closing the Loop: Fast, Interactive Semi-Supervised Annotation with Queries on Features and Instances

From the post:

Whew, that was a long title. Luckily, the paper’s worth it:

Settles, Burr. 2011. Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances. EMNLP.

It’s a paper that shows you how to use active learning to build reasonably high-performance classifier with only minutes of user effort. Very cool and right up our alley here at LingPipe.

Both the paper and Bob’s review merit close reading.

Making sense of Wikipedia categories

Tuesday, February 21st, 2012

Making sense of Wikipedia categories

Hal Daume III writes:

Wikipedia’s category hierarchy forms a graph. It’s definitely cyclic (Category:Ethology belongs to Category:Behavior, which in turn belongs to Category:Ethology).

At any rate, did you know that “Chicago Stags coaches” are a subcategory of “Natural sciences”? If you don’t believe me, go to the Wikipedia entry for the Natural sciences category, and expand the following list of subcategories:

(subcategories omitted)

I guess it kind of makes sense. There are some other fun ones, like “Rhaeto-Romance languages”, “American World War I flying aces” and “1911 films”. Of course, these are all quite deep in the “hierarchy” (all of those are at depth 15 or higher).
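The cycle Hal mentions is easy to find mechanically. Below is a minimal depth-first search sketch over a "belongs to" graph; the function name and the two-node graph are assumptions built from his Ethology/Behavior example, and the recursive form would need to be made iterative for anything the size of the real category graph.

```python
def find_cycle(graph):
    """Return one cycle as a list of nodes, or None if the graph is acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited, on the current path, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for parent in graph.get(node, ()):
            state = color.get(parent, WHITE)
            if state == GRAY:
                # Reached a node already on the current path: a cycle.
                return path[path.index(parent):] + [parent]
            if state == WHITE:
                found = visit(parent)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(graph):
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Each category maps to the categories it belongs to.
belongs_to = {
    "Category:Ethology": ["Category:Behavior"],
    "Category:Behavior": ["Category:Ethology"],
}
cycle = find_cycle(belongs_to)
```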

Hal examines several strategies and concludes asking:

Has anyone else tried and succeed at using the Wikipedia category structure?

Some other questions:

Is Hal right that hand annotation doesn’t “scale?”

I have heard that more times than I can count but never seen any studies cited to support it.

After all, Wikipedia was manually edited and produced. Yes? No automated process created its content. So, what is the barrier to hand annotation?

If you think about it, the same could be said about email but most email (yes?) is written by hand. Not produced by automated processes (well, except for spam), so why can’t it be hand annotated? Or at least why can’t we capture semantics of email at the point of composition and annotate it there by automated means?

Hand annotation may not scale for sensor data or financial data streams but is hand annotation needed for such sources?

Hand annotation may not scale for say twitter posts by non-English speakers. But only for agencies with very short-sighted if not actively bigoted hiring/contracting practices.

Has anyone loaded the Wikipedia categories into a graph database? What sort of interface would you suggest for trial arrangement of the categories?

PS: If you are interested in discussing how to establish assisted annotation for twitter, email or other data streams, with or without user awareness, send me a post.

Dragsters, Drag Cars & Drag Racing Cars

Friday, February 10th, 2012

I still remember the cover of Hot Rod magazine that announced (from memory) “The 6’s are here!” Don “The Snake” Prudhomme had broken the 200 mph barrier in a drag race. Other memories follow on from that one but I mention it to explain my interest in a recent Subject Authority Cooperative Program decision to not have a cross-reference from dragster (the term I would have used) to more recent terms, drag cars or drag racing cars.

The expected search (in this order) due to this decision is:

Cars (Automobiles) -> redirect to Automobiles -> Automobiles -> narrower term -> Automobiles, racing -> narrower term -> Dragsters

Adam L. Schiff, proposer of drag cars & drag racing cars, says below: “This just is not likely to happen.”

Question: Is there a relationship between users “work[ing] their way up and down hierarchies” and display of relationships methods? Who chooses which items will be the starting point to lead to other items? How do you integrate a keyword search into such a system?

Question: And what of the full phrase/sentence AI systems where keywords work less well? How does that work with relationship display systems?

Question: I wonder if the relationship display methods are closer to the up and down hierarchies, but with less guidance?

Adam’s Dragster proposal post in full:

Dragsters

Automobiles has a UF Cars (Automobiles). Since the UF already exists on the basic heading, it is not necessary to add it to Dragsters. The proposal was not approved.

Our proposal was to add two additional cross-references to Dragsters: Drag cars, and Drag racing cars. While I understand, in principle, the reasoning behind the rejection of these additional references, I do not see how it serves users. A user coming to a catalog to search for the subject “Drag cars” will now get nothing, no redirection to the established heading. I don’t see how the presence of a reference from Cars (Automobiles) to Automobiles helps any user who starts a search with “Drag cars”. Only if they begin their search with Cars would they get led to Automobiles, and then only if they pursue narrower terms under that heading would they find Automobiles, Racing, which they would then have to follow further down to Dragsters. This just is not likely to happen. Instead they will probably start with a keyword search on “Drag cars” and find nothing, or if lucky, find one or two resources and think they have it all. And if they are astute enough to look at the subject headings on one of the records and see “Dragsters”, perhaps they will then redo their search.

Since the proposed cross-refs do not begin with the word Cars, I do not at all see how a decision like this is in the service of users of our catalogs. I think that LCSH rules for references were developed when it was expected that users would consult the big red books and work their way up and down hierarchies. While some online systems do provide for such navigation, it is doubtful that many users take this approach. Keyword searching is predominant in our catalogs and on the Web. Providing as many cross-refs to established headings as we can would be desirable. If the worry is that the printed red books will grow to too many volumes if we add more variant forms that weren’t made in the card environment, then perhaps there needs to be a way to include some references in authority records but mark them as not suitable for printing in printed products.
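Adam's point about keyword searches can be shown with a toy lookup table. This is only a sketch: the two dictionaries are simplified stand-ins for authority records, and `build_lookup` is a hypothetical helper, not how any catalog system works.

```python
def build_lookup(headings):
    """Map every authorized heading and each of its UF variants to the heading."""
    lookup = {}
    for heading, used_for in headings.items():
        lookup[heading.lower()] = heading
        for variant in used_for:
            lookup[variant.lower()] = heading
    return lookup


# As decided: Dragsters carries no UF references of its own.
approved = {
    "Automobiles": ["Cars (Automobiles)"],
    "Dragsters": [],
}
# As proposed: two lead-in terms point at Dragsters.
proposed = {
    "Automobiles": ["Cars (Automobiles)"],
    "Dragsters": ["Drag cars", "Drag racing cars"],
}

without_refs = build_lookup(approved).get("drag cars")
with_refs = build_lookup(proposed).get("drag cars")
```

A search on “Drag cars” comes back empty under the decision as made, and lands on Dragsters only when the proposed references exist.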

PS: According to ODLIS: Online Dictionary for Library and Information Science by Joan M. Reitz, UF, has the following definition:

used for (UF)

A phrase indicating a term (or terms) synonymous with an authorized subject heading or descriptor, not used in cataloging or indexing to avoid scatter. In a subject headings list or thesaurus of controlled vocabulary, synonyms are given immediately following the official heading. In the alphabetical list of indexing terms, they are included as lead-in vocabulary followed by a see or USE cross-reference directing the user to the correct heading. See also: syndetic structure.

I did not attempt to reproduce the extremely rich cross-linking in this entry but commend the entire resource to your attention, particularly if you are a library science student.

Mining Text Data

Monday, January 23rd, 2012

Mining Text Data Charu Aggarwal and ChengXiang Zhai, Springer, February 2012, Approximately 500 pages.

From the publisher’s description:

Text mining applications have experienced tremendous advances because of web 2.0 and social networking applications. Recent advances in hardware and software technology have lead to a number of unique scenarios where text mining algorithms are learned.

Mining Text Data introduces an important niche in the text analytics field, and is an edited volume contributed by leading international researchers and practitioners focused on social networks & data mining. This book contains a wide swath in topics across social networks & data mining. Each chapter contains a comprehensive survey including the key research content on the topic, and the future directions of research in the field. There is a special focus on Text Embedded with Heterogeneous and Multimedia Data which makes the mining process much more challenging. A number of methods have been designed such as transfer learning and cross-lingual mining for such cases.

Mining Text Data simplifies the content, so that advanced-level students, practitioners and researchers in computer science can benefit from this book. Academic and corporate libraries, as well as ACM, IEEE, and Management Science focused on information security, electronic commerce, databases, data mining, machine learning, and statistics are the primary buyers for this reference book.

Not at the publisher’s site but you can see the Table of Contents and chapter 4, A SURVEY OF TEXT CLUSTERING ALGORITHMS and chapter 6, A SURVEY OF TEXT CLASSIFICATION ALGORITHMS at: www.charuaggarwal.net/text-content.pdf.

The two chapters you can download from Aggarwal’s website will give you a good idea of what to expect from the text.

While an excellent survey work, with chapters written by experts in various sub-fields, it also suffers from the survey work format.

For example, the bibliographies of the two sample chapters overlap. Not surprising given the closely related subject matter, but as a reader I would be interested in discovering which works are cited in both chapters. With the back-of-the-chapter bibliography format, that is possible only by repetitive manual inspection.

Although I rail against examples in standards, expanding the survey reference work format to include more details and examples would only increase its usefulness and possibly its life as a valued reference.

Which raises the question of having a print format for survey works at all. The research landscape is changing quickly and a shelf life of 2 to 3 years, if that long, seems a bit brief for the going rate for print editions. Printed versions of chapters as smaller and more timely works on demand, that is a value-add proposition that Springer is in a unique position to bring to its customers.

Extract meta concepts through co-occurrences analysis and graph theory

Saturday, January 14th, 2012

Extract meta concepts through co-occurrences analysis and graph theory

Cristian Mesiano writes:

During The Christmas period I had finally the chance to read some papers about probabilistic latent semantic and its applications in auto classification and indexing.

The main concept behind “latent semantic” lays on the assumption that words that occurs close in the text are related to the same semantic construct.

Based on this principle the LSA (and partially also the PLSA ) builds a matrix to keep track of the co-occurrences of the words in text, and it assign a score to these co-occurrences considering the distribution in the corpus as well.

Often TF-IDF score is used to rank the words.

Anyway, I was wondering if this techniques could be useful also to extract key concepts from the text.

Basically I thought: “in LSA we consider some statistics over the co-occurrences, so: why not consider the link among the co-occurrences as well?”.
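The idea of treating co-occurrences as links can be sketched as a small graph. This is not Cristian's method: counting co-occurrence per sentence, the function names, and the toy sentences are assumptions for illustration, and no TF-IDF weighting is applied.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(sentences):
    """Count how often each unordered word pair appears in the same sentence."""
    edges = Counter()
    for sentence in sentences:
        words = sorted(set(sentence.lower().split()))
        for pair in combinations(words, 2):
            edges[pair] += 1
    return edges


def degree(edges):
    """Number of distinct co-occurring neighbors for each word."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return {word: len(others) for word, others in neighbors.items()}


# Toy sentences (hypothetical, not taken from the book).
sentences = ["media shape society", "network society", "media network"]
edges = cooccurrence_edges(sentences)
degrees = degree(edges)
```

Words with many distinct neighbors are one crude candidate for “key concepts”; the edge counts keep the statistics, the degrees use the links.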

Using the first three chapters of “The Media in the Network Society, author: Gustavo Cardoso,” Cristian creates a series of graphs.

Cristian promises his opinion on classification of texts using this approach.

In the meantime, what’s yours?