Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

November 11, 2010

Google Refine 2.0 – Announcement

Filed under: Uncategorized — Patrick Durusau @ 2:02 pm

Google Refine 2.0 has been released.

From the website:

Google Refine is a power tool for working with messy data sets, including cleaning up inconsistencies, transforming them from one format into another, and extending them with new data from external web services or other databases. Version 2.0 introduces a new extensions architecture, a reconciliation framework for linking records to other databases (like Freebase), and a ton of new transformation commands and expressions.

Freebase Gridworks 1.0 has already been well received by the data journalism and open government data communities (you can read how the Chicago Tribune, ProPublica and data.gov.uk have used it) and we are very excited by what they and others will be able to do with this new release. To learn more about what you can do with Google Refine 2.0, watch…[screencasts]

If you watch only one video this month, make it http://www.youtube.com/watch?v=m5ER2qRH1OQ!

Google uses the term reconciliation, but what is being demonstrated is mapping information to a subject representative.

Note that unlike topic maps, the basis (read: properties) for that mapping is not disclosed, so neither a program nor a person can be sure of repeating the same mapping.
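
The cleanup half of that workflow is easy to approximate outside Refine. Here is a minimal Python sketch of the key-collision (fingerprint) idea behind Refine-style clustering of inconsistent values; the sample values are invented, and this is an illustration, not Refine's implementation:

    import re
    from collections import defaultdict

    def fingerprint(value):
        # Lowercase, strip punctuation, sort and dedupe tokens: variant
        # spellings of the same name collapse to the same key.
        tokens = re.sub(r"[^\w\s]", "", value.lower()).split()
        return " ".join(sorted(set(tokens)))

    def cluster(values):
        groups = defaultdict(list)
        for v in values:
            groups[fingerprint(v)].append(v)
        # Keys with more than one distinct spelling are cleanup candidates.
        return [vs for vs in groups.values() if len(set(vs)) > 1]

    print(cluster(["Acme, Inc.", "ACME Inc", "acme inc.", "Beta Corp"]))
    # [['Acme, Inc.', 'ACME Inc', 'acme inc.']]

Refine adds nearest-neighbor methods and interactive review on top of keying like this.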

SAMT 2010 – Conference

Filed under: Conferences,Multimedia,Semantics — Patrick Durusau @ 5:39 am

SAMT 2010 – Semantic and Digital Media Technologies

Saarbrücken, Germany, 1-3 December 2010

From the announcement:

Large amounts of multimedia material, such as images, audio, video, and 3D/4D material, as well as computer generated 2D, 3D, and 4D content, already exist and are growing at increasing rates. While these amounts are growing, managing distribution of and access to multimedia material is becoming ever harder, both for lay and professional users.

The SAMT conference series tackles these problems by investigating the semantics and pragmatics of multimedia generation, management, and user access. The conference targets scientifically valuable research tackling the semantic gap between the low-level signal data representation of multimedia material and the high-level meaning that providers, consumers, and prosumers associate with the content.

I won’t be in Germany in early December but would appreciate a note from anyone who can attend this conference.

This is an opportunity to see a very strong program of speakers and to mingle with others working in the field. If you are in Germany on the conference dates, it would be time well spent.

November 10, 2010

MongoDB Indexes and Indexing – Post

Filed under: Indexing,MongoDB,NoSQL — Patrick Durusau @ 2:25 pm

MongoDB Indexes and Indexing and MongoDB Indexing: An Optimization Primer, both from Alex Popescu, provide great coverage of indexing and indexing issues.

Funny how topic maps started with indexing, revolve around the semantic issues of indexes and indexing, and rely on indexing for reasonable performance.

Will have to see what other indexing resources I can dig up.
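
If you want to poke at MongoDB indexing directly while reading, here is a short pymongo sketch (database, collection, and field names are invented):

    from pymongo import ASCENDING, DESCENDING, MongoClient

    client = MongoClient("localhost", 27017)
    posts = client.blog.posts  # database/collection names are invented

    # A single-field and a compound index; the compound index also
    # serves queries on author alone (leftmost prefix).
    posts.create_index([("author", ASCENDING)])
    posts.create_index([("author", ASCENDING), ("date", DESCENDING)])

    # explain() reports whether a query used an index or scanned.
    plan = posts.find({"author": "durusau"}).explain()
    print(plan)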

Enjoy the videos!

Semantic-Distance Based Clustering for XML Keyword Search

Filed under: Clustering,Uncategorized — Patrick Durusau @ 1:45 pm

Semantic-Distance Based Clustering for XML Keyword Search
Authors: Weidong Yang, Hao Zhu
Keywords: XML, Keyword Search, Clustering

Abstract:

XML Keyword Search is a user-friendly information discovery technique, which is well-suited to schema-free XML documents. We propose a novel scheme for XML keyword search called XKLUSTER, in which a novel semantic-distance model is proposed to specify the set of nodes contained in a result. Based on this model, we use clustering approaches to generate all meaningful results in XML keyword search. A ranking mechanism is also presented to sort the results.

The authors develop an interesting notion of “semantic distance” and then say:

Strictly speaking, the searching intentions of users can never be confirmed accurately; so different than existing researches, we suggest that all keyword nodes are useful more or less and should be included in results. Based on the semantic distance model, we divide the set of keyword nodes X into a group of smaller sets, and each of them is called a “cluster”.

Well…, but the goal is to present the user with results relevant to their query, not results relevant to some query.

Still, an interesting paper and one that XML types will enjoy reading.
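
For readers who want to experiment, here is a toy version of the clustering step. It substitutes plain tree distance (hops through the deepest common ancestor) for the authors' semantic-distance model, and the document and threshold are invented:

    import xml.etree.ElementTree as ET

    DOC = """<bib><book><title>XML Search</title><author>Yang</author></book>
    <article><title>Keyword Search</title></article></bib>"""

    root = ET.fromstring(DOC)
    parent = {child: p for p in root.iter() for child in p}

    def path(node):                      # ancestors, root first
        out = [node]
        while out[-1] in parent:
            out.append(parent[out[-1]])
        return list(reversed(out))

    def tree_distance(a, b):             # hops through common ancestor
        pa, pb = path(a), path(b)
        shared = sum(1 for x, y in zip(pa, pb) if x is y)
        return (len(pa) - shared) + (len(pb) - shared)

    # "Keyword nodes": elements whose text matches the query term.
    hits = [e for e in root.iter() if e.text and "Search" in e.text]

    # Greedy single-linkage: join a cluster if within MAX_DIST of
    # any member, else start a new one.
    MAX_DIST, clusters = 2, []
    for h in hits:
        for c in clusters:
            if any(tree_distance(h, m) <= MAX_DIST for m in c):
                c.append(h)
                break
        else:
            clusters.append([h])

    print(len(clusters))  # 2: the two titles sit in different subtrees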

OpenTSDB

Filed under: HBase,NoSQL — Patrick Durusau @ 12:28 pm

OpenTSDB

From the website:

OpenTSDB is a distributed, scalable Time Series Database (TSDB) written on top of HBase. OpenTSDB was written to address a common need: store, index and serve metrics collected from computer systems (network gear, operating systems, applications) at a large scale, and make this data easily accessible and graphable.

Thanks to HBase’s scalability, OpenTSDB allows you to collect many thousands of metrics from thousands of hosts and applications, at a high rate (every few seconds). OpenTSDB will never delete or downsample data and can easily store billions of data points. As a matter of fact, StumbleUpon uses it to keep track of hundreds of thousands of time series and collects over 100 million data points per day in their main production cluster.

Imagine having the ability to quickly plot a graph showing the number of active worker threads in your web servers, the number of threads used by your database, and correlate this with your service’s latency. OpenTSDB makes generating such graphs on the fly a trivial operation, while manipulating millions of data points for very fine-grained, real-time monitoring.
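
Getting metrics in really is that simple. Here is a sketch of OpenTSDB's plain-text put command over TCP (host, metric, and tag names are invented; 4242 is the default port):

    import socket
    import time

    def send_metric(metric, value, tags, host="tsdb.example.com", port=4242):
        # OpenTSDB's text protocol: put <metric> <unix-ts> <value> <tag=val>...
        tag_str = " ".join(f"{k}={v}" for k, v in tags.items())
        line = f"put {metric} {int(time.time())} {value} {tag_str}\n"
        with socket.create_connection((host, port)) as sock:
            sock.sendall(line.encode("ascii"))

    send_metric("web.worker.threads", 42, {"host": "web01", "pool": "main"})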

Imagine how a busy sysadmin would react if those metrics were endowed with subject identity and participated in associations with system documentation.

Or if the metrics of a power distribution center had subject identity, so they could tie into multiple emergency/maintenance networks?

Subjects are cheap, subject identity is useful.
(maybe I should make that my tag line, comments?)

***
I first saw this at OpenTSDB: A HBase Scalable Time Series Database by Alex Popescu

November 9, 2010

ONTOLOGIES AND SOCIAL SEMANTIC WEB FOR INTELLIGENT EDUCATIONAL SYSTEMS (SWEL)

Filed under: Conferences,Ontology,Semantic Web — Patrick Durusau @ 8:04 pm

ONTOLOGIES AND SOCIAL SEMANTIC WEB FOR INTELLIGENT EDUCATIONAL SYSTEMS (SWEL)

Paper deadline: 22 November 2010

Announcement:

Ontologies, the Semantic Web, and the Social Semantic Web offer a new perspective on intelligent educational systems by providing intelligent access to and management of Web information and semantically richer modeling of the applications and their users. This allows for supporting more adequate and accurate representations of learners, their learning goals, learning material and contexts of its use, as well as more efficient access and navigation through learning resources. The goal is to advance intelligent educational systems, so as to achieve improved e-learning efficiency, flexibility and adaptation for single users and communities of users (learners, instructors, courseware authors, etc). This special track follows the workshop series “Ontologies and Semantic Web for e-Learning”- SWEL which was conducted successfully from 2002-2009 at different hosting conferences (http://compsci.wssu.edu/iis/swel/).

BTW, I stole this from a post by Darina Dicheva to the topicmapmail list. CFP: SWEL Special Track at FLAIRS-24 – two weeks to the deadline!

Summarizing Multidimensional Data Streams: A Hierarchy-Graph-Based Approach

Filed under: Authoring Topic Maps,Data Mining — Patrick Durusau @ 7:44 pm

Summarizing Multidimensional Data Streams: A Hierarchy-Graph-Based Approach
Authors: Yoann Pitarch, Anne Laurent, Pascal Poncelet

Abstract:

When dealing with potentially infinite data streams, storing the whole data stream history is unfeasible and providing a high-quality summary is required. In this paper, we propose a summarization method for multidimensional data streams based on a graph structure and taking advantage of the data hierarchies. The summarization method considers the data distribution and thus overcomes a major drawback of the Tilted Time Window common framework. We adapt this structure for synthesizing frequent itemsets extracted on temporal windows. Thanks to our approach, as users do not analyze any more numerous extraction results, the result processing is improved.

As a text scholar, I would presume that all occurrences are stored.

For high-speed data streams too large to store, which must be read in a single pass, that isn’t an option.

If terabytes of high-speed data are on your topic mapping horizon, start here.
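
For orientation, here is a toy sketch of the Tilted Time Window idea the paper improves on: fine-grained recent history, coarser rollups for older data. Capacities and counts are invented:

    from collections import deque

    class TiltedTimeWindow:
        # Fine-grained counts up front, coarser rollups behind; at the
        # coarsest level the oldest data is finally discarded.
        def __init__(self, capacities=(4, 4, 4)):
            self.capacities = capacities
            self.levels = [deque() for _ in capacities]

        def add(self, count):
            self.levels[0].appendleft(count)
            for i, cap in enumerate(self.capacities):
                if len(self.levels[i]) > cap:
                    # Merge the two oldest buckets and promote the sum
                    # to the next (coarser) level.
                    merged = self.levels[i].pop() + self.levels[i].pop()
                    if i + 1 < len(self.levels):
                        self.levels[i + 1].appendleft(merged)

    ttw = TiltedTimeWindow()
    for _ in range(20):
        ttw.add(1)
    print([list(level) for level in ttw.levels])
    # [[1, 1, 1, 1], [2, 2, 2, 2], [4, 4]]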

****
PS: Posts on temporal modeling with proxies to follow (but not real soon).

Rule Synthesizing from Multiple Related Databases

Filed under: Clustering,Data Mining,Heterogeneous Data,Uncategorized — Patrick Durusau @ 7:33 pm

Rule Synthesizing from Multiple Related Databases
Authors: Dan He, Xindong Wu, Xingquan Zhu
Keywords: Association rule mining, rule synthesizing, multiple databases, clustering

Abstract:

In this paper, we study the problem of rule synthesizing from multiple related databases where items representing the databases may be different, and the databases may not be relevant, or similar to each other. We argue that, for such multi-related databases, simple rule synthesizing without a detailed understanding of the databases is not able to reveal meaningful patterns inside the data collections. Consequently, we propose a two-step clustering on the databases at both item and rule levels such that the databases in the final clusters contain both similar items and similar rules. A weighted rule synthesizing method is then applied on each such cluster to generate final rules. Experimental results demonstrate that the new rule synthesizing method is able to discover important rules which can not be synthesized by other methods.
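
The synthesizing step is, at heart, a weighted combination. Here is a toy sketch with invented databases, weights, and rule statistics; the paper's two-step clustering, which decides what may be combined at all, is omitted:

    # Per-database (support, confidence) for one candidate rule, e.g.
    # {bread} -> {butter}; weights could reflect database sizes.
    reports = {"store_a": (0.12, 0.60),
               "store_b": (0.08, 0.55),
               "store_c": (0.20, 0.70)}
    weights = {"store_a": 0.5, "store_b": 0.2, "store_c": 0.3}

    support = sum(weights[db] * s for db, (s, _) in reports.items())
    confidence = sum(weights[db] * c for db, (_, c) in reports.items())
    print(f"support={support:.3f} confidence={confidence:.3f}")
    # support=0.136 confidence=0.620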

The authors observe:

…existing rule synthesizing methods for distributed mining commonly assumes that related databases are relevant, share similar data distributions, and have identical items. This is equivalent to the assumption that all stores have the same type of business with identical meta-data structures, which is hardly the case in practice.

I should start collecting quotes that recognize semantic diversity as the rule rather than the exception.

More on that later. Enjoy the article.

XML Data Repository

Filed under: Authoring Topic Maps,Dataset — Patrick Durusau @ 4:00 pm

XML Data Repository.

Data in XML format for testing augmented authoring or search tools.

Whose Logic Binds A Topic Map?

Filed under: Authoring Topic Maps,Semantic Web,TMDM,TMRM,Topic Maps — Patrick Durusau @ 7:15 am

An exchange with Lars Heuer over what the TMRM should say about “ako” and “isa” (see: A Guide to Publishing Linked Data Without Redirects) brings up an important but often unspoken issue.

The current draft of the Topic Maps Reference Model (TMRM) says that subclass-superclass relationships are reflexive and transitive. Moreover, “isa” relationships are non-reflexive and transitive.

Which is all well and good, assuming that accords with your definition of subclass-superclass and isa. The Topic Maps Data Model (TMDM), on the other hand, defines “isa” as non-transitive.

Either one is a legitimate choice and I will cover the resolution of that difference elsewhere.
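
The practical difference between the two readings is easy to see in code. A sketch with an invented vocabulary:

    SUPERCLASS = {"poodle": "dog", "dog": "mammal", "mammal": "animal"}
    ISA = {"rex": "poodle"}

    def types_transitive(instance):
        # isa composed with the subclass hierarchy: rex is a poodle,
        # a dog, a mammal and an animal.
        t = ISA[instance]
        types = [t]
        while t in SUPERCLASS:
            t = SUPERCLASS[t]
            types.append(t)
        return types

    def types_direct(instance):
        # Non-transitive isa: rex is a poodle, full stop, unless a
        # query chooses to walk the hierarchy itself.
        return [ISA[instance]]

    print(types_transitive("rex"))  # ['poodle', 'dog', 'mammal', 'animal']
    print(types_direct("rex"))      # ['poodle']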

My point here is to ask: “Whose logic binds a topic map?”

My impression is that here and in the Semantic Web, logical frameworks are being created, into which users are supposed to fit their data.

As a user I would take serious exception to fitting my data into someone else’s world view (read logic).

That’s the real question, isn’t it?

Does IT/SW dictate to users the logic that will bind their data, or do users get to define their own “logics”?

Given the popularity of tagging and folksonomies, user “logics” look like the better bet.

November 8, 2010

BibBase and Beyond

Filed under: BibTeX,OWL,RDF,Semantic Web — Patrick Durusau @ 8:38 am

BibBase is an effort to store BibTeX information as RDF triples. For the data, see: BibBase data.

As of 8 November 2010, there are 6178 publications.

Interesting, I suppose, but the real question is how to enable researchers using BibTeX to disambiguate their terminology as part of their BibTeX entries.

It has to be as easy as BibTeX and consistent with usage patterns in the communities that use it, if you hope for adoption.

It is not hard to imagine a helper application that runs through a set of BibTeX entries and suggests 1998 ACM Computing Classification System or 2010 Mathematics Subject Classification entries, which the author could accept or reject.
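
A hypothetical sketch of such a helper follows; the lookup table, BibTeX entry, and substring matching are all invented stand-ins for a real classification schedule and real term matching:

    # Tiny invented stand-in for a classification schedule.
    SUGGESTIONS = {
        "information retrieval": "H.3.3 Information Search and Retrieval",
        "pattern recognition": "I.5 Pattern Recognition",
        "database": "H.2 Database Management",
    }

    ENTRY = """@article{yang2010,
      title = {Semantic-Distance Based Clustering for XML Keyword Search},
      keywords = {XML, keyword search, information retrieval, clustering}
    }"""

    def suggest(bibtex):
        text = bibtex.lower()
        return [code for phrase, code in SUGGESTIONS.items() if phrase in text]

    for code in suggest(ENTRY):
        print("suggest:", code)  # the author accepts or rejects each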

Not the fine-grained, concept-by-concept (read: subject-by-subject) analysis of a document that I would like to see, but it’s a start.

Combining the Missing Link: An Incremental Topic Model of Document Content and Hyperlink

Filed under: Classification,Data Mining,Link-IPLSI — Patrick Durusau @ 7:59 am

Combining the Missing Link: An Incremental Topic Model of Document Content and Hyperlink
Authors: Huifang Ma, Zhixin Li and Zhongzhi Shi
Keywords: Topic model, Link-IPLSI, Incremental Learning, Adaptive Asymmetric Learning

Abstract:

The content and structure of linked information such as sets of web pages or research paper archives are dynamic and keep on changing. Even though different methods are proposed to exploit both the link structure and the content information, no existing approach can effectively deal with this evolution. We propose a novel joint model, called Link-IPLSI, to combine texts and links in a topic modeling framework incrementally. The model takes advantage of a novel link updating technique that can cope with dynamic changes of online document streams in a faster and scalable way. Furthermore, an adaptive asymmetric learning method is adopted to freely control the assignment of weights to terms and citations. Experimental results on two different sources of online information demonstrate the time saving strength of our method and indicate that our model leads to systematic improvements in the quality of classification.

Questions:

  1. Timed expiration of documents and terms? Appropriate for library settings? (discussion)
  2. Citations treated same as hyperlinks? (Aren’t citations more granular?) (3-5 pages, citations)
  3. What do we lose by citation to documents and not concepts/locations in documents? (3-5 pages, citations)

PS: The updating aspects of this paper are very important. Static data exists but isn’t very common in enterprise applications.

ISWC 2010 Data and Demos

Filed under: Linked Data,RDF,Semantic Web,SPARQL — Patrick Durusau @ 6:27 am

ISWC 2010 Data and Demos.

Data and demos from the International Semantic Web Conference 2010. Includes links to prior data sets and browsers that work with the data sets.

Data sets are always important, as is being able to gauge the current state of semantic software.

Ambiguity and Linked Data URIs

Filed under: Ambiguity,Linked Data,Marketing,RDF,Semantic Web,Topic Maps — Patrick Durusau @ 6:14 am

I like the proposal by Ian Davis to avoid the 303 cloud while trying to fix the mistake of confusing identifiers with addresses in an address space.

Linked data URIs are already known to be subject to the same issues of ambiguity as any other naming convention.

All naming conventions are subject to ambiguity and “expanded” naming conventions, such as a list of properties in a topic map, may make the ambiguity a bit more manageable.

That depends on the presumption that if more information is added and a user is advised of it, the risk of ambiguity will be reduced.

But the user needs to be able to use the additional information. What if the additional information is to distinguish two concepts in calculus and the reader is innocent of even basic algebra?

That is to say, ambiguity can be overcome only in particular contexts.

But overcoming ambiguity in a particular context may be enough. Such as:

  • Interchange between intelligence agencies
  • Interchange between audited entities and their auditors (GAO, SEC, Federal Reserve (or their foreign equivalents))
  • Interchange between manufacturers and distributors

None of those are golden-age applications of seamless knowledge sharing, universal democratization of decision making, or even tennis-match scheduling.

They are applications that can reduce incremental costs, improve overall efficiency and perhaps contribute to achievement of organizational goals.

Perhaps that is enough.

A Guide to Publishing Linked Data Without Redirects – Post

Filed under: Linked Data,RDF,Semantic Web — Patrick Durusau @ 5:34 am

A Guide to Publishing Linked Data Without Redirects is a proposal by Ian Davis to avoid the 303 redirect while distinguishing between “things” and their descriptions.

A step in the right direction.

November 7, 2010

Parallel Implementation of Classification Algorithms Based on MapReduce

Filed under: Classification,Data Mining,Hadoop,MapReduce — Patrick Durusau @ 8:31 pm

Parallel Implementation of Classification Algorithms Based on MapReduce
Authors: Qing He, Fuzhen Zhuang, Jincheng Li and Zhongzhi Shi
Keywords: Data Mining, Classification, Parallel Implementation, Large Dataset, MapReduce

Abstract:

Data mining has attracted extensive research for several decades. As an important task of data mining, classification plays an important role in information retrieval, web searching, CRM, etc. Most of the present classification techniques are serial, which become impractical for large dataset. The computing resource is under-utilized and the executing time is not waitable. Provided the program mode of MapReduce, we propose the parallel implementation methods of several classification algorithms, such as k-nearest neighbors, naive bayesian model and decision tree, etc. Preparatory experiments show that the proposed parallel methods can not only process large dataset, but also can be extended to execute on a cluster, which can significantly improve the efficiency.

From the paper:

In this paper, we introduced the parallel implementation of several classification algorithms based on MapReduce, which make them be applicable to mine large dataset. The key is to design the proper key/value pairs. (emphasis in original)
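
That point about key/value design is worth making concrete. Below is a sketch of the counting phase of a parallel naive Bayes, with the shuffle simulated locally in plain Python rather than an actual Hadoop job:

    from collections import defaultdict

    def mapper(record):
        label, text = record
        # Key design: (class, word), so the shuffle delivers to each
        # reducer exactly the counts naive Bayes needs.
        for word in text.split():
            yield (label, word), 1

    def reducer(key, values):
        return key, sum(values)

    records = [("spam", "win money now"), ("ham", "meeting at noon"),
               ("spam", "win a prize now")]

    # Simulate map -> shuffle (group by key) -> reduce.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    counts = dict(reducer(k, vs) for k, vs in groups.items())
    print(counts[("spam", "now")])  # 2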

Questions:

  1. Annotated bibliography of parallel classification algorithms (newer than this paper, 3-5 pages, citations)
  2. Report for class on application of parallel classification algorithms (report + paper)
  3. Application of parallel classification algorithm to a library dataset (project)
  4. Can the key/value pairs be interchanged with others? Yes/no, why? (3-5 pages, no citations.)

Orthogonal Nonnegative Matrix Tri-factorization for Semi-supervised Document Co-clustering

Filed under: Clustering,Indexing — Patrick Durusau @ 8:26 pm

Orthogonal Nonnegative Matrix Tri-factorization for Semi-supervised Document Co-clustering
Authors: Huifang Ma, Weizhong Zhao, Qing Tan and Zhongzhi Shi
Keywords: Semi-supervised Clustering, Pairwise Constraints, Word-Level Constraints, Nonnegative Matrix Tri-factorization

Abstract:

Semi-supervised clustering is often viewed as using labeled data to aid the clustering process. However, existing algorithms fail to consider dual constraints between data points (e.g. documents) and features (e.g. words). To address this problem, in this paper, we propose a novel semi-supervised document co-clustering model OSS-NMF via orthogonal nonnegative matrix tri-factorization. Our model incorporates prior knowledge both on document and word side to aid the new word-category and document-cluster matrices construction. Besides, we prove the correctness and convergence of our model to demonstrate its mathematical rigorous. Our experimental evaluations show that the proposed document clustering model presents remarkable performance improvements with certain constraints.
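
For a feel of the machinery, here is a numpy sketch of plain tri-factorization X ≈ F S Gᵀ by multiplicative updates. The orthogonality and document/word constraints that define OSS-NMF are omitted, so this is the unconstrained baseline, not the paper's model:

    import numpy as np

    def tri_factorize(X, kd, kw, iters=200, eps=1e-9):
        # Multiplicative updates for X ~ F S G^T (Frobenius loss),
        # without the paper's orthogonality or pairwise constraints.
        rng = np.random.default_rng(0)
        n, m = X.shape
        F = rng.random((n, kd))
        S = rng.random((kd, kw))
        G = rng.random((m, kw))
        for _ in range(iters):
            F *= (X @ G @ S.T) / (F @ S @ G.T @ G @ S.T + eps)
            S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
            G *= (X.T @ F @ S) / (G @ S.T @ F.T @ F @ S + eps)
        return F, S, G

    # Toy document-word counts with two obvious blocks.
    X = np.array([[3, 2, 0, 0], [4, 1, 0, 0],
                  [0, 0, 2, 3], [0, 0, 1, 4]], float)
    F, S, G = tri_factorize(X, 2, 2)
    print(F.argmax(axis=1))  # document cluster labels, rows 0-1 vs 2-3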

Questions:

  1. Relies on user input, but is the user input transferable? Or is it document/collection specific? (3-5 pages, no citations)
  2. Is document level retrieval too coarse? (discussion)
  3. Subset selection, understandable for testing/development. Doesn’t it seem odd no tests were done against entire collections? (discussion)
  4. What of the exclusion of words that occur less than 3 times? Aren’t infrequent terms more likely to be significant? (3-5 pages, no citations)

2nd International Conference on Computational & Mathematical Biomedical Engineering – Conference – 2011

Filed under: Biomedical,Conferences — Patrick Durusau @ 8:25 pm

2nd International Conference on Computational & Mathematical Biomedical Engineering

30th March – 1st April 2011

George Mason University, Washington D.C., USA

Abstract/Expression of interest: 15 November 2010
(see site for other details)

Subjects abound in the imaging, analysis, and management of data.

Are you ahead of the curve?

NoSQL Solution: Evaluation Guide [CHART]

Filed under: NoSQL — Patrick Durusau @ 8:24 pm

NoSQL Solution: Evaluation Guide [CHART]

As the post says, this is hype, but it may be useful hype to read.

What caught my eye was one of the contenders being described as “extremely fast on small data sets (below 20 million rows)….”

OK, that’s not suitable for enterprise purposes but there are a lot of applications that can fit in under a 20 million row limit.

It’s a fun read so let me know what you think about it.

November 6, 2010

The University of Amsterdam’s Concept Detection System at ImageCLEF 2009

Filed under: Concept Detection,Data Mining,Multimedia — Patrick Durusau @ 6:40 am

The University of Amsterdam’s Concept Detection System at ImageCLEF 2009
Authors: Koen E. A. van de Sande, Theo Gevers and Arnold W. M. Smeulders
Keywords: Color, Invariance, Concept Detection, Object and Scene Recognition, Bag-of-Words, Photo Annotation, Spatial Pyramid

Abstract:

Our group within the University of Amsterdam participated in the large-scale visual concept detection task of ImageCLEF 2009. Our experiments focus on increasing the robustness of the individual concept detectors based on the bag-of-words approach, and less on the hierarchical nature of the concept set used. To increase the robustness of individual concept detectors, our experiments emphasize in particular the role of visual sampling, the value of color invariant features, the influence of codebook construction, and the effectiveness of kernel-based learning parameters. The participation in ImageCLEF 2009 has been successful, resulting in the top ranking for the large-scale visual concept detection task in terms of both EER and AUC. For 40 out of 53 individual concepts, we obtain the best performance of all submissions to this task. For the hierarchical evaluation, which considers the whole hierarchy of concepts instead of single detectors, using the concept likelihoods estimated by our detectors directly works better than scaling these likelihoods based on the class priors.
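
The bag-of-words backbone of such detectors fits in a few lines. Here is a sketch using random stand-ins for local descriptors (real systems first extract color-invariant, SIFT-style features):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    train_desc = rng.random((1000, 64))  # stand-in local descriptors
    image_desc = rng.random((120, 64))   # descriptors from one image

    # 1. Codebook: cluster training descriptors into "visual words".
    codebook = KMeans(n_clusters=32, n_init=10, random_state=0).fit(train_desc)

    # 2. Encode the image as a normalized histogram of visual words;
    #    this fixed-length vector feeds a per-concept classifier (SVM).
    words = codebook.predict(image_desc)
    hist = np.bincount(words, minlength=32).astype(float)
    hist /= hist.sum()
    print(hist.shape)  # (32,)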

Good example of the content to expect from ImageCLEF papers.

This is a very important area of rapidly developing research.

ImageCLEF – The CLEF Cross Language Image Retrieval Track

Filed under: Concept Detection,Data Mining,Multimedia — Patrick Durusau @ 6:16 am

ImageCLEF – The CLEF Cross Language Image Retrieval Track.

The European side of working with digital video.

From the 2009 event website:

ImageCLEF is the cross-language image retrieval track run as part of the Cross Language Evaluation Forum (CLEF) campaign. This track evaluates retrieval of images described by text captions based on queries in a different language; both text and image matching techniques are potentially exploitable.

TREC Video Retrieval Evaluation

Filed under: Concept Detection,Data Mining,Multimedia — Patrick Durusau @ 5:59 am

TREC Video Retrieval Evaluation.

Since I have posted several resources on digital video and concept discovery today, listing the TREC track on the same seemed appropriate.

From the website:

The TREC conference series is sponsored by the National Institute of Standards and Technology (NIST) with additional support from other U.S. government agencies. The goal of the conference series is to encourage research in information retrieval by providing a large test collection, uniform scoring procedures, and a forum for organizations interested in comparing their results. In 2001 and 2002 the TREC series sponsored a video “track” devoted to research in automatic segmentation, indexing, and content-based retrieval of digital video. Beginning in 2003, this track became an independent evaluation (TRECVID) with a workshop taking place just before TREC.

You will find publications, tools, bibliographies, data sets, etc. A first-class resource site.

Internet Multimedia Search and Mining

Filed under: Concept Detection,Data Mining,Domain Change,Multimedia — Patrick Durusau @ 5:51 am

Internet Multimedia Search and Mining
Authors: Xian-Sheng Hua, Marcel Worring, and Tat-Seng Chua

Abstract:

In this chapter, we address the visual learning of automatic concept detectors from web video as available from services like YouTube. While allowing a much more efficient, flexible, and scalable concept learning compared to expert labels, web-based detectors perform poorly when applied to different domains (such as specific TV channels). We address this domain change problem using a novel approach, which – after an initial training on web content – performs a highly efficient online adaptation on the target domain.

In quantitative experiments on data from YouTube and from the TRECVID campaign, we first validate that domain change appears to be the key problem for web-based concept learning, with much more significant impact than other phenomena like label noise. Second, the proposed adaptation is shown to improve the accuracy of web-based detectors significantly, even over SVMs trained on the target domain. Finally, we extend our approach with active learning such that adaptation can be interleaved with manual annotation for an efficient exploration of novel domains.

The authors cite authority for the proposition that by 2013, 91% of all Internet traffic will be digital video.

Perhaps, perhaps not, but in any event, “concept detection” is an important aid to topic map authors working with digital video.

Questions:

  1. Later research on “concept detection” in digital video? (annotated bibliography)
  2. Use in library contexts? (3-5 pages, citations)
  3. How would you design human augmentation of automated detection? (project)

The AQ Methods for Concept Drift

Filed under: Authoring Topic Maps,Classification,Concept Drift,Topic Maps — Patrick Durusau @ 4:51 am

The AQ Methods for Concept Drift
Authors: Marcus A. Maloof
Keywords: online learning, concept drift, aq algorithm, ensemble methods

Abstract:

Since the mid-1990’s, we have developed, implemented, and evaluated a number of learning methods that cope with concept drift. Drift occurs when the target concept that a learner must acquire changes over time. It is present in applications involving user preferences (e.g., calendar scheduling) and adversaries (e.g., spam detection). We based early efforts on Michalski’s aq algorithm, and our more recent work has investigated ensemble methods. We have also implemented several methods that other researchers have proposed. In this chapter, we survey results that we have obtained since the mid-1990’s using the Stagger concepts and learning methods for concept drift. We examine our methods based on the aq algorithm, our ensemble methods, and the methods of other researchers. Dynamic weighted majority with an incremental algorithm for producing decision trees as the base learner achieved the best overall performance on this problem with an area under the performance curve after the first drift point of .882. Systems based on the aq11 algorithm, which incrementally induces rules, performed comparably, achieving areas of .875. Indeed, an aq11 system with partial instance memory and Widmer and Kubat’s window adjustment heuristic achieved the best performance with an overall area under the performance curve, with an area of .898.

The author offers this definition of concept drift:

Concept drift [19, 30] is a phenomenon in which examples have legitimate labels at one time and have different legitimate labels at another time. Geometrically, if we view a target concept as a cloud of points in a feature space, concept drift may entail the cloud changing its position, shape, and size. From the perspective of Bayesian decision theory, these transformations equate to changes to the form or parameters of the prior and class-conditional distributions.

Hmmm, “legitimate labels” sounds like a job for topic maps, doesn’t it?
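
The ensemble idea at the core of the surveyed methods is compact. Here is a stripped-down weighted-majority sketch with a fixed expert pool and an invented stream; full dynamic weighted majority also creates and removes experts as drift is detected:

    def weighted_majority(experts, stream, beta=0.5):
        # On each mistake an expert's weight is multiplied by beta, so
        # experts tracking the current concept come to dominate the vote.
        weights = [1.0] * len(experts)
        correct = 0
        for x, y in stream:
            votes = [e(x) for e in experts]
            yes = sum(w for w, v in zip(weights, votes) if v == 1)
            no = sum(w for w, v in zip(weights, votes) if v == 0)
            prediction = 1 if yes >= no else 0
            correct += prediction == y
            weights = [w * beta if v != y else w
                       for w, v in zip(weights, votes)]
        return correct / len(stream)

    # The concept drifts mid-stream: the label flips from x>5 to x<5.
    stream = [(x, int(x > 5)) for x in range(10)] * 5 \
           + [(x, int(x < 5)) for x in range(10)] * 5
    experts = [lambda x: int(x > 5), lambda x: int(x < 5)]
    print(f"accuracy under drift: {weighted_majority(experts, stream):.2f}")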

Questions:

  1. Has concept drift been used in library classification? (research question)
  2. How would you use concept drift concepts in library classification? (3-5 pages, no citations)
  3. Demonstrate use of concept drift techniques to augment topic map authoring. (project)

On Classifying Drifting Concepts in P2P Networks

Filed under: Ambiguity,Authoring Topic Maps,Classification,Concept Drift — Patrick Durusau @ 4:07 am

On Classifying Drifting Concepts in P2P Networks
Authors: Hock Hee Ang, Vivekanand Gopalkrishnan, Wee Keong Ng and Steven Hoi
Keywords: Concept drift, classification, peer-to-peer (P2P) networks, distributed classification

Abstract:

Concept drift is a common challenge for many real-world data mining and knowledge discovery applications. Most of the existing studies for concept drift are based on centralized settings, and are often hard to adapt in a distributed computing environment. In this paper, we investigate a new research problem, P2P concept drift detection, which aims to effectively classify drifting concepts in P2P networks. We propose a novel P2P learning framework for concept drift classification, which includes both reactive and proactive approaches to classify the drifting concepts in a distributed manner. Our empirical study shows that the proposed technique is able to effectively detect the drifting concepts and improve the classification performance.

The authors define the problem as:

Concept drift refers to the learning problem where the target concept to be predicted, changes over time in some unforeseen behaviors. It is commonly found in many dynamic environments, such as data streams, P2P systems, etc. Real-world examples include network intrusion detection, spam detection, fraud detection, epidemiological, and climate or demographic data, etc.

The authors may well have been the first to formulate this problem among mechanical peers, but any humanist could have pointed out examples of concept drift between people, both in literature and in real life.

Questions:

  1. What are the implications of concept drift for Linked Data? (3-5 pages, no citations)
  2. What are the implications of concept drift for static ontologies? (3-5 pages, no citations)
  3. Is concept development (over time) another form of concept drift? (3-5 pages, citations, illustrations, presentation)

*****
PS: Finding this paper is an illustration of ambiguity leading to serendipitous discovery. I searched for one of the authors instead of the exact title of another paper. While scanning the search results, I found this paper.

November 5, 2010

Ambiguity and Serendipity

Filed under: Ambiguity,Authoring Topic Maps,Topic Maps — Patrick Durusau @ 8:35 pm

There was an email discussion recently where ambiguity was discussed as something to be avoided.

It occurred to me, if there were no ambiguity, there would be no serendipity.

Think about the last time you searched for a particular paper. If you remembered enough to go directly to it, you did not see any similar or closely resembling papers along the way.

Now imagine every information request you make results in exactly what you were searching for.

What a terribly dull search experience that would be!

Topic maps can produce the circumstances where serendipity occurs because a subject can be identified any number of ways. Quite possibly several that you are unaware of. And seeing those other ways may spark a memory of another paper, perhaps another line of thought, etc.

I think my list of “other names” for record linkage now exceeds 25 and I really need to cast those into a topic map fragment along with citations to the places they can be found.

I don’t think of topic maps as a means to avoid ambiguity but rather as a means to make ambiguity a manageable part of an information seeking experience.

TMDM-NG – Overloading Occurrence

Filed under: Authoring Topic Maps,TMDM,Topic Maps,XTM — Patrick Durusau @ 4:23 pm

“Occurrence” in topic maps is currently overloaded. Seriously overloaded.

In one sense, “occurrence” is used as it is in a bibliographic reference. That is, subject X “occurs” at volume Y, page Z. A reader expects to find the subject in question at that location.

In the overloaded sense, “occurrence” is used to mean some additional property of a subject.

To me the semantics of “occurrence” weigh against using it for any property associated with a subject.

That has been the definition used in topic maps for a very long time, but to me that simply ripens it for correction.

Occurrence should be used only for instances of a subject that are located outside of a topic map.

A property element should be allowed for any topic, name, occurrence or association. Every property should have a type attribute.

It is a property of the subject represented by the construct where it appears.
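
In code, the proposed split might look like this hypothetical sketch (Python stand-ins, not ISO 13250 syntax):

    from dataclasses import dataclass

    @dataclass
    class Occurrence:
        # Bibliographic sense only: the subject occurs *outside* the
        # map, at this location.
        locator: str  # an IRI, or a volume/page style reference

    @dataclass
    class Property:
        # A typed property of the subject itself; attachable to a
        # topic, name, occurrence or association alike.
        type: str
        value: str

    first_folio = Occurrence(locator="http://example.org/first-folio")
    birth_year = Property(type="birth-year", value="1564")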

Previously authored topic maps will continue to be valid since, as yet, there are no processors that could validate the use of “occurrence” in either the new or the old sense of the term.

Older topic map software will not be able to process newer topic maps but unless topic maps change and evolve (even COBOL has), they will die.

Revisiting the TAO of Topic Maps

Filed under: TMDM,Topic Maps — Patrick Durusau @ 4:07 pm

One of the readings for my course on topic maps is the TAO of Topic Maps.

I was re-reading it the other day while writing a lecture.

Topics can represent anything. That much we all know.

Associations represent “a relationship between two or more topics.”

Isn’t an association an “anything?”

Occurrences are “information resources that are deemed to be relevant to the topic in some way.”

Isn’t an occurrence an “anything?”

Which would mean that both associations and occurrences could be represented by topics, but they’re not.

They have special constructs in ISO 13250. And defined sets of properties.

I thought about that for a while and it occurred to me that topic, association and occurrence are just convenient handles for bundles of properties.

When I say “association,” you know we are about to talk about a relationship between two subjects (topics), their roles, role players, etc.

Same goes for occurrence.

Or to put it differently, “topic,” “association” and “occurrence” facilitate talking about particular subjects and their properties.

November 4, 2010

The Complexity and Application of Syntactic Pattern Recognition Using Finite Inductive Strings

Filed under: Bioinformatics,Biomedical,Pattern Recognition — Patrick Durusau @ 12:26 pm

The Complexity and Application of Syntactic Pattern Recognition Using Finite Inductive Strings
Authors: Elijah Myers, Paul S. Fisher, Keith Irwin, Jinsuk Baek, Joao Setubal
Keywords: Pattern Recognition, finite induction, syntactic pattern recognition, algorithm complexity

Abstract:

We describe herein the results of implementing an algorithm for syntactic pattern recognition using the concept of Finite Inductive Sequences (FI). We discuss this idea, and then provide a big O estimate of the time to execute for the algorithms. We then provide some empirical data to support the analysis of the timing. This timing is critical if one wants to process millions of symbols from multiple sequences simultaneously. Lastly, we provide an example of the two FI algorithms applied to actual data taken from a gene and then describe some results as well as the associated data derived from this example.

Pattern matching is of obvious importance for bioinformatics and, in topic map terms, for recognizing subjects.

Questions:

  1. What “new problems continue to emerge” that you would use pattern matching to solve? (discussion)
  2. What about those problems makes them suitable for the application of pattern matching? (3-5 pages, no citations)
  3. What about those problems makes them suitable for the particular techniques described in this paper? (3-5 pages, no citations)
