Archive for the ‘Concept Drift’ Category

Incremental Classification, concept drift and Novelty detection (IClaNov)

Wednesday, October 8th, 2014

Incremental Classification, concept drift and Novelty detection (IClaNov)

From the post:

The development of dynamic information analysis methods, such as incremental clustering, concept drift management and novelty detection techniques, is becoming a central concern in many applications whose main goal is to deal with information that varies over time. These applications relate to varied and highly strategic domains, including web mining, social network analysis, adaptive information retrieval, anomaly or intrusion detection, process control and management, recommender systems, technological and scientific survey, and even genomic information analysis in bioinformatics. The term “incremental” is often associated with the terms dynamic, adaptive, interactive, on-line, or batch. The majority of learning methods were initially defined in a non-incremental way. However, in each of these families, incremental methods have been introduced that make it possible to take into account the temporal component of a data stream. More generally, incremental clustering algorithms and novelty detection approaches are subject to the following constraints:

  • They can be applied without knowing in advance all the data to be analyzed;
  • Taking new data into account must not require intensive use of the already considered data;
  • Result must be available after insertion of all new data;
  • Potential changes in the data description space must be taken into consideration.
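As an aside to the quoted call, the first three constraints can be illustrated with a minimal sketch. It assumes a one-dimensional numeric stream and uses Welford's running mean and variance to flag values far from what has been seen so far. The class name, the threshold of three standard deviations and the warm-up length are all hypothetical choices made for illustration; nothing here comes from the workshop. The fourth constraint (changes in the description space) is not addressed.

```python
import math


class RunningNoveltyDetector:
    """Single-pass novelty detector for a numeric stream (illustrative only)."""

    def __init__(self, threshold=3.0, warmup=10):
        self.threshold = threshold  # assumed cutoff, in standard deviations
        self.warmup = warmup        # observations required before flagging
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations

    def update(self, x):
        """Score x against the current summary, then absorb it.

        Returns True if x looks novel. Past observations are never revisited,
        and a result is available after every insertion.
        """
        novel = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            novel = std > 0 and abs(x - self.mean) > self.threshold * std
        # Welford's update: constant memory, no stored history.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return novel


# Twenty readings near 10, followed by one reading far away.
readings = [10 + 0.1 * ((i % 5) - 2) for i in range(20)] + [25.0]
detector = RunningNoveltyDetector()
flags = [detector.update(x) for x in readings]
```

Only the final reading is flagged. Note that the detector absorbs the novel value after scoring it, so a sustained change slowly becomes the new normal; whether that is desirable depends on the application.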

This workshop aims to offer a meeting opportunity for academics and industry-related researchers, belonging to the various communities of Computational Intelligence, Machine Learning, Experimental Design and Data Mining, to discuss new areas of incremental clustering, concept drift management and novelty detection and their application to the analysis of time-varying information of various kinds. Another important aim of the workshop is to bridge the gap between data acquisition or experimentation and model building.

ICDM 2014 Conference: December 14, 2014

The agenda for this workshop has been posted.

Does your ontology support incremental classification, concept drift and novelty detection? All of those exist in the ongoing data stream of experience if not within some more limited data stream from a source.

You can work from a dated snapshot of the world as it was, but over time will that best serve your needs?

Remember that for less than $250,000 (est.), the attacks on 9/11 provoked the United States into spending trillions of dollars based on a Cold War snapshot of the world. Probably the highest return on investment for an attack in history.

The world is constantly changing and your data view of it should be changing as well.

TCP Text Creation Partnership

Monday, September 19th, 2011

TCP Text Creation Partnership

From the “mission” page:

The Text Creation Partnership’s primary objective is to produce standardized, digitally-encoded editions of early print books. This process involves a labor-intensive combination of manual keyboard entry (from digital images of the books’ original pages), the addition of digital markup (conforming to guidelines set by a text encoding standard-setting body known as the TEI), and editorial review.

The chief sources of the TCP’s digital images are database products marketed by commercial publishers. These include Proquest’s Early English Books Online (EEBO), Gale’s Eighteenth Century Collections Online (ECCO), and Readex’s Evans Early American Imprints. Idiosyncrasies in early modern typography make these collections very difficult to convert into searchable, machine-readable text using common scanning techniques (i.e., Optical Character Recognition). Through the TCP, commercial publishers and over 150 different libraries have come together to fund the conversion of these cultural heritage materials into enduring, digitally dynamic editions.

To date, the EEBO-TCP project has converted over 25,000 books. ECCO- and EVANS-TCP have converted another 7,000+ books. A second phase of EEBO-TCP production aims to convert the remaining 44,000 unique monograph titles in the EEBO corpus by 2015, and all of the TCP texts are scheduled to enter the public domain by 2020.

Several thousand titles from the 18th century collection are already available to the general public.

I mention this as a source of texts for testing search software against semantic drift. The sort of drift that occurs in any living language. To say nothing of the changing mores of our interpretation of languages with no native speakers remaining to defend them.

Probabilistic User Modeling in the Presence of Drifting Concepts

Saturday, December 4th, 2010

Probabilistic User Modeling in the Presence of Drifting Concepts
Author(s): Vikas Bhardwaj, Ramaswamy Devarajan

Abstract:

We investigate supervised prediction tasks which involve multiple agents over time, in the presence of drifting concepts. The motivation behind choosing the topic is that such tasks arise in many domains which require predicting human actions. An example of such a task is recommender systems, where it is required to predict the future ratings, given features describing items and context along with the previous ratings assigned by the users. In such a system, the relationships among the features and the class values can vary over time. A common challenge to learners in such a setting is that this variation can occur both across time for a given agent, and also across different agents, (i.e. each agent behaves differently). Furthermore, the factors causing this variation are often hidden. We explore probabilistic models suitable for this setting, along with efficient algorithms to learn the model structure. Our experiments use the Netflix Prize dataset, a real world dataset which shows the presence of time variant concepts. The results show that the approaches we describe are more accurate than alternative approaches, especially when there is a large variation among agents. All the data and source code would be made open-source under the GNU GPL.

Interesting because not only do concepts drift from user to user but modeling users as existing in neighborhoods of other users was more accurate than purely homogeneous or heterogeneous models.

Questions:

  1. If there is a “neighborhood” effect on users, what, if anything, does that imply for co-occurrence of terms? (3-5 pages, no citations)
  2. How would you determine “neighborhood” boundaries for terms? (3-5 pages, citations)
  3. Do “neighborhoods” for terms vary by semantic domain? (3-5 pages, citations)

*****
Be aware that the Netflix dataset is no longer available. Possibly in response to privacy concerns. A demonstration of the utility of such concerns and their advocates.

The AQ Methods for Concept Drift

Saturday, November 6th, 2010

The AQ Methods for Concept Drift
Author: Marcus A. Maloof
Keywords: online learning, concept drift, aq algorithm, ensemble methods

Abstract:

Since the mid-1990’s, we have developed, implemented, and evaluated a number of learning methods that cope with concept drift. Drift occurs when the target concept that a learner must acquire changes over time. It is present in applications involving user preferences (e.g., calendar scheduling) and adversaries (e.g., spam detection). We based early efforts on Michalski’s aq algorithm, and our more recent work has investigated ensemble methods. We have also implemented several methods that other researchers have proposed. In this chapter, we survey results that we have obtained since the mid-1990’s using the Stagger concepts and learning methods for concept drift. We examine our methods based on the aq algorithm, our ensemble methods, and the methods of other researchers. Dynamic weighted majority with an incremental algorithm for producing decision trees as the base learner achieved the best overall performance on this problem with an area under the performance curve after the first drift point of .882. Systems based on the aq11 algorithm, which incrementally induces rules, performed comparably, achieving areas of .875. Indeed, an aq11 system with partial instance memory and Widmer and Kubat’s window adjustment heuristic achieved the best performance with an overall area under the performance curve, with an area of .898.
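To make the abstract's setting concrete, here is a minimal sketch of a drift from the first Stagger concept to the second. It compares a learner that is trained once and then frozen with one that remembers only a fixed window of recent examples. Several things are assumptions on my part: the concept definitions are written from the form in which they usually appear in the literature, not from this chapter, and the lookup-table learner, the window of 50 examples and the 200 examples per concept are hypothetical. This is neither the aq11 system nor Widmer and Kubat's window adjustment heuristic, which adapts the window size; the window here is fixed.

```python
import random
from collections import deque

SIZES = ("small", "medium", "large")
COLORS = ("red", "green", "blue")
SHAPES = ("square", "circle", "triangle")

# The three Stagger target concepts as usually stated (assumed, see lead-in).
CONCEPTS = (
    lambda size, color, shape: size == "small" and color == "red",
    lambda size, color, shape: color == "green" or shape == "circle",
    lambda size, color, shape: size in ("medium", "large"),
)


def stream(rng, concept, n):
    """Yield n random instances labeled by the given target concept."""
    for _ in range(n):
        x = (rng.choice(SIZES), rng.choice(COLORS), rng.choice(SHAPES))
        yield x, concept(*x)


def predict(memory, x):
    """Return the most recent label stored for x, or False if x is unseen."""
    label = False
    for seen_x, seen_y in memory:
        if seen_x == x:
            label = seen_y
    return label


def run(window_size=50, n_per_concept=200, seed=0):
    """Return (frozen accuracy, windowed accuracy) after the drift point."""
    rng = random.Random(seed)
    before = list(stream(rng, CONCEPTS[0], n_per_concept))
    after = list(stream(rng, CONCEPTS[1], n_per_concept))

    frozen = list(before)                       # never updated after the drift
    window = deque(before, maxlen=window_size)  # keeps only the newest examples

    frozen_hits = window_hits = 0
    for x, y in after:
        # Test-then-train: predict first, then let the window see the label.
        frozen_hits += predict(frozen, x) == y
        window_hits += predict(window, x) == y
        window.append((x, y))
    return frozen_hits / len(after), window_hits / len(after)


frozen_acc, window_acc = run()
```

The frozen learner keeps answering with labels that were legitimate before the drift, while the windowed learner forgets them as newer examples push the old ones out.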

The author offers this definition of concept drift:

Concept drift [19, 30] is a phenomenon in which examples have legitimate labels at one time and have different legitimate labels at another time. Geometrically, if we view a target concept as a cloud of points in a feature space, concept drift may entail the cloud changing its position, shape, and size. From the perspective of Bayesian decision theory, these transformations equate to changes to the form or parameters of the prior and class-conditional distributions.

Hmmm, “legitimate labels,” sounds like a job for topic maps doesn’t it?
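The Bayesian reading in the quoted definition can be written out explicitly. The time subscript t is notation I am assuming for illustration; it does not appear in the quoted text.

```latex
% Posterior over classes c for an example x at time t (t is assumed notation).
P_t(c \mid x) = \frac{p_t(x \mid c)\, P_t(c)}{\sum_{c'} p_t(x \mid c')\, P_t(c')}

% Drift between times t_0 and t_1 means that, for some class c,
P_{t_0}(c) \neq P_{t_1}(c)
\quad \text{or} \quad
p_{t_0}(x \mid c) \neq p_{t_1}(x \mid c) \ \text{for some } x .
```

When either the prior or a class-conditional distribution changes, the posterior can change with it, which is how the same example can carry different legitimate labels at different times.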

Questions:

  1. Has concept drift been used in library classification? (research question)
  2. How would you use concept drift concepts in library classification? (3-5 pages, no citations)
  3. Demonstrate use of concept drift techniques to augment topic map authoring. (project)

On Classifying Drifting Concepts in P2P Networks

Saturday, November 6th, 2010

On Classifying Drifting Concepts in P2P Networks
Authors: Hock Hee Ang, Vivekanand Gopalkrishnan, Wee Keong Ng and Steven Hoi
Keywords: Concept drift, classification, peer-to-peer (P2P) networks, distributed classification

Abstract:

Concept drift is a common challenge for many real-world data mining and knowledge discovery applications. Most of the existing studies for concept drift are based on centralized settings, and are often hard to adapt in a distributed computing environment. In this paper, we investigate a new research problem, P2P concept drift detection, which aims to effectively classify drifting concepts in P2P networks. We propose a novel P2P learning framework for concept drift classification, which includes both reactive and proactive approaches to classify the drifting concepts in a distributed manner. Our empirical study shows that the proposed technique is able to effectively detect the drifting concepts and improve the classification performance.

The authors define the problem as:

Concept drift refers to the learning problem where the target concept to be predicted, changes over time in some unforeseen behaviors. It is commonly found in many dynamic environments, such as data streams, P2P systems, etc. Real-world examples include network intrusion detection, spam detection, fraud detection, epidemiological, and climate or demographic data, etc.

The authors may well have been the first to formulate this problem among mechanical peers, but any humanist could have pointed out examples of concept drift between people, both in the literature and in real life.

Questions:

  1. What are the implications of concept drift for Linked Data? (3-5 pages, no citations)
  2. What are the implications of concept drift for static ontologies? (3-5 pages, no citations)
  3. Is concept development (over time) another form of concept drift? (3-5 pages, citations, illustrations, presentation)

*****
PS: Finding this paper is an illustration of ambiguity leading to serendipitous discovery. I searched for one of the authors instead of the exact title of another paper. While scanning the search results I found this paper.