Data Wrangling reports a data set of 320 GB sample of Wikipedia traffic.
Thoughts on similar sample data sets for topic maps?
Sizes, subjects, complexity?
Matthew Hurst provides a quick listing of data engines (including his own, which merits a close look).
Watzlawick [1] recounts the following experiment:
That there is no necessary connection between fact and explanation was illustrated in a recent experiment by Bavelas (20): Each subject was told he was participating in an experimental investigation of “concept formation” and was given the same gray, pebbly card about which he was to “formulate concepts.” Of every pair of subjects (seen separately but concurrently) one was told eight out of ten times at random that what he said about the card was correct; the other was told five out of ten times at random that what he said about the card was correct. The ideas of the subject who was “rewarded” with a frequency of 80 per cent remained on a simple level, while the subject who was “rewarded” only at a frequency of 50 per cent evolved complex, subtle, and abstruse theories about the card, taking into consideration the tiniest detail of the card’s composition. When the two subjects were brought together and asked to discuss their findings, the subject with the simpler ideas immediately succumbed to the “brilliance” of the other’s concepts and agreed the other had analyzed the card correctly.
I repeat this account because it illustrates the impact that “reward” systems can have on results, whether the “rewards” come from members of a crowd or from experts.
This study does raise the interesting question of whether conferences should track and randomly reject authors to encourage innovation.
1. Watzlawick, Paul, Janet Beavin Bavelas, and Don D. Jackson. 1967. Pragmatics of human communication; a study of interactional patterns, pathologies, and paradoxes. New York: Norton.
What are you using to act as the placeholder for an unknown player of a role?
That is, in say a news, crime, or accident investigation, there is an association with specified roles, but only some facts are known, not the identities of all the players.
For example, in the recent cablegate case, when the story of the leaks broke, there was clearly an association between the leaked documents and the leaker.
The leaker had a number of known characteristics, not the least of which was ready access to a wide range of documents. I am sure there were others.
To investigate that leak with a topic map, I would want to have a representative for the player of that role, to which I can assign properties.
I started to publish a subject identifier for the subject idk (I Don’t Know) to act as that placeholder but then thought it needs more discussion.
This has been in my blog queue for a couple of weeks so another week or so before creating a subject identifier won’t hurt.
The problem, which you already spotted, is that TMDM-governed topic maps are going to merge topics with the idk (I Don’t Know) subject identifier, which would be incorrect in many cases.
Interestingly, it would not be wrong in all cases. That is, I could have two associations, both of which have idk (I Don’t Know) subject identifiers, and I want them to merge on the basis of other properties. In that case the topics should merge.
I am leaning towards simply defining the semantics as non-merger, unless merger is warranted on some other specified basis.
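The problem can be made concrete with a small sketch. The code below is illustrative only: the `IDK` identifier, topic structure, and simplified merge rule are my own stand-ins, not any existing topic map API. It shows how a naive "merge on any shared subject identifier" rule collapses two distinct unknown players into one topic.

```python
# Hypothetical sketch of TMDM-style merging by subject identifier.
# The IDK URI, topic structure, and merge rule are illustrative stand-ins.

IDK = "http://example.org/idk"  # placeholder identifier for an unknown player

def merge_topics(topics):
    """Merge topics that share any subject identifier (simplified TMDM rule)."""
    merged = []
    for topic in topics:
        ids = set(topic["subject_identifiers"])
        target = next((m for m in merged if m["subject_identifiers"] & ids), None)
        if target:
            target["subject_identifiers"] |= ids
            target["names"] |= set(topic["names"])
        else:
            merged.append({"subject_identifiers": ids, "names": set(topic["names"])})
    return merged

# Two distinct unknown players, both flagged with the idk identifier:
leaker_a = {"subject_identifiers": [IDK], "names": ["cablegate leaker"]}
leaker_b = {"subject_identifiers": [IDK], "names": ["unrelated leaker"]}

result = merge_topics([leaker_a, leaker_b])
# The naive rule collapses both into one topic -- the incorrect merge
# described above, and the reason idk needs non-merger semantics.
```

Under the proposed non-merger semantics, the idk identifier would simply be excluded from this intersection test.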
PS: I kept writing the expansion idk (I Don’t Know) because a popular search engine suggested Insane Dutch Killers as the expansion. Wanted to avoid any ambiguity.
Classification and Novel Class Detection in Data Streams with Active Mining Author(s): Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, Bhavani Thuraisingham
We present ActMiner, which addresses four major challenges to data stream classification, namely, infinite length, concept-drift, concept-evolution, and limited labeled data. Most of the existing data stream classification techniques address only the infinite length and concept-drift problems. Our previous work, MineClass, addresses the concept-evolution problem in addition to addressing the infinite length and concept-drift problems. Concept-evolution occurs in the stream when novel classes arrive. However, most of the existing data stream classification techniques, including MineClass, require that all the instances in a data stream be labeled by human experts and become available for training. This assumption is impractical, since data labeling is both time consuming and costly. Therefore, it is impossible to label a majority of the data points in a high-speed data stream. This scarcity of labeled data naturally leads to poorly trained classifiers. ActMiner actively selects only those data points for labeling for which the expected classification error is high. Therefore, ActMiner extends MineClass, and addresses the limited labeled data problem in addition to addressing the other three problems. It outperforms the state-of-the-art data stream classification techniques that use ten times or more labeled data than ActMiner.
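The selective-labeling idea, choosing only the points the current classifier is most likely to get wrong, can be sketched generically. The code below is a plain uncertainty-sampling illustration, not ActMiner's actual selection criterion; the toy classifier and threshold are my own assumptions.

```python
# Generic uncertainty-sampling sketch: ask a human to label only the points
# the current classifier is least sure about. Illustrates the idea behind
# selective labeling, not ActMiner's actual algorithm.

def predict_proba(point):
    # Stand-in for a trained classifier's class-probability estimate.
    # Here: a toy 1-D model whose decision boundary sits at 0.5.
    p = min(max(point, 0.0), 1.0)
    return {"pos": p, "neg": 1.0 - p}

def select_for_labeling(points, budget):
    """Pick the `budget` points with the smallest margin between top classes."""
    def margin(point):
        probs = sorted(predict_proba(point).values(), reverse=True)
        return probs[0] - probs[1]  # small margin = high uncertainty
    return sorted(points, key=margin)[:budget]

stream_chunk = [0.05, 0.48, 0.93, 0.55, 0.10]
to_label = select_for_labeling(stream_chunk, budget=2)
# 0.48 and 0.55 sit closest to the decision boundary, so they are selected;
# the confident predictions (0.05, 0.93, 0.10) go unlabeled.
```

With a labeling budget of two out of five points, the classifier's training data concentrates where it matters most, which is how such methods get away with ten times less labeled data.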
I would have liked this article better had it not said that the details of the test data could be found in another article.
Specifically: Masud, M.M., Gao, J., Khan, L., Han, J., Thuraisingham, B.M.: “Integrating novel class detection with classification for concept-drifting data streams.” In: Buntine, W., Grobelnik, M., Mladenić, D., Shawe-Taylor, J. (eds.) ECML PKDD 2009. LNCS, vol. 5782, pp. 79–94. Springer, Heidelberg (2009)
Which directed me to: “Integrating Novel Class Detection with Classification for Concept-Drifting Data Streams,” http://www.utdallas.edu/~mmm058000/reports/UTDCS-13-09.pdf
I leave it as an exercise for the readers to guess the names of the authors of the last paper.
Otherwise interesting research marred by presentation in dribs and drabs.
Now that I have all three papers I will have to see what questions arise, other than questionable publishing practices.
Searching with Tags: Do Tags Help Users Find Things? Authors: Margaret E.I. Kipp and D. Grant Campbell
This study examines the question of whether tags can be useful in the process of information retrieval. Participants searched a social bookmarking tool specialising in academic articles (CiteULike) and an online journal database (Pubmed). Participant actions were captured using screen capture software and they were asked to describe their search process. Users did make use of tags in their search process, as a guide to searching and as hyperlinks to potentially useful articles. However, users also made use of controlled vocabularies in the journal database to locate useful search terms and of links to related articles supplied by the database.
Good review of the literature, such as it is, on use of user supplied tagging for searching.
Worth reading on the question raised about the use of tags but there is another question lurking in the background.
The authors say in various forms:
The ability to discover useful resources is of increasing importance where web searches return 300 000 (or more) sites of unknown relevance and is equally important in the realm of digital libraries and article databases. The question of the ability to locate information is an old one and led directly to the creation of cataloguing and classification systems for the organisation of knowledge. However, such systems have not proven to be truly scalable when dealing with digital information and especially information on the web.
Since at least 1/3 of the web is pornography and that is not usually relevant to scientific, technical or medical searching, we can reduce the searching problem by 1/3 right there. I don’t know the percentage for shopping, email archives, etc., but when you come down to the “core” literature for a field, it really isn’t all that large, is it?
First seen at: ResourceBlog
Google Refine 2.0 has been released.
From the website:
Google Refine is a power tool for working with messy data sets, including cleaning up inconsistencies, transforming them from one format into another, and extending them with new data from external web services or other databases. Version 2.0 introduces a new extensions architecture, a reconciliation framework for linking records to other databases (like Freebase), and a ton of new transformation commands and expressions.
Freebase Gridworks 1.0 has already been well received by the data journalism and open government data communities (you can read how the Chicago Tribune, ProPublica and data.gov.uk have used it) and we are very excited by what they and others will be able to do with this new release. To learn more about what you can do with Google Refine 2.0, watch…[screencasts]
If you don’t watch any videos this month, you have to watch http://www.youtube.com/watch?v=m5ER2qRH1OQ!
Google uses the term reconciliation but what is being demonstrated is mapping information to a subject representative.
Note that, unlike topic maps, the basis (read: properties) for that mapping is not disclosed, so it isn’t possible for a program or person to be sure of repeating the same mapping.
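The difference can be sketched in a few lines. The code below is a toy reconciler, not Refine's actual reconciliation API; the canonical records, identifiers, and matching properties are all illustrative. The point is that it returns the properties it matched on, so the mapping can be audited and repeated.

```python
# Sketch of reconciliation that *discloses* the properties it matched on.
# All records, ids, and property names here are illustrative.

canonical = [
    {"id": "fb:en.new_york", "name": "New York", "type": "city"},
    {"id": "fb:en.new_york_state", "name": "New York", "type": "state"},
]

def reconcile(record, candidates):
    """Return the matching canonical id plus the properties that justified it."""
    required = ("name", "type")
    for cand in candidates:
        matched = {k for k in required if record.get(k) == cand.get(k)}
        if matched == set(required):
            return {"id": cand["id"], "basis": sorted(matched)}
    return None

match = reconcile({"name": "New York", "type": "state"}, canonical)
# The "basis" field records *why* the record mapped to that id, so a
# program or person can repeat (or dispute) the mapping later.
```

An undisclosed reconciler would return only the id, leaving the next user to guess which properties drove the match.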
Semantic-Distance Based Clustering for XML Keyword Search Author(s): Weidong Yang, Hao Zhu Keywords: XML, Keyword Search, Clustering
XML Keyword Search is a user-friendly information discovery technique, which is well-suited to schema-free XML documents. We propose a novel scheme for XML keyword search called XKLUSTER, in which a novel semantic-distance model is proposed to specify the set of nodes contained in a result. Based on this model, we use clustering approaches to generate all meaningful results in XML keyword search. A ranking mechanism is also presented to sort the results.
The authors develop an interesting notion of “semantic distance” and then say:
Strictly speaking, the searching intentions of users can never be confirmed accurately; so different than existing researches, we suggest that all keyword nodes are useful more or less and should be included in results. Based on the semantic distance model, we divide the set of keyword nodes X into a group of smaller sets, and each of them is called a “cluster”.
Well… but the goal is to present the user with results relevant to their query, not results relevant to some query.
Still, an interesting paper and one that XML types will enjoy reading.
Rule Synthesizing from Multiple Related Databases Author(s): Dan He, Xindong Wu, Xingquan Zhu Keywords: Association rule mining, rule synthesizing, multiple databases, clustering
In this paper, we study the problem of rule synthesizing from multiple related databases where items representing the databases may be different, and the databases may not be relevant, or similar to each other. We argue that, for such multi-related databases, simple rule synthesizing without a detailed understanding of the databases is not able to reveal meaningful patterns inside the data collections. Consequently, we propose a two-step clustering on the databases at both item and rule levels such that the databases in the final clusters contain both similar items and similar rules. A weighted rule synthesizing method is then applied on each such cluster to generate final rules. Experimental results demonstrate that the new rule synthesizing method is able to discover important rules which can not be synthesized by other methods.
The authors observe:
…existing rule synthesizing methods for distributed mining commonly assumes that related databases are relevant, share similar data distributions, and have identical items. This is equivalent to the assumption that all stores have the same type of business with identical meta-data structures, which is hardly the case in practice.
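The weighted synthesis step from the abstract can be sketched generically. The code below is a plain weighted combination of per-database supports, my own illustration of the idea rather than the authors' actual method; the store names, supports, and weights are invented.

```python
# Generic weighted rule synthesis sketch: combine the support each database
# in a cluster reports for the same rule, using per-database weights.
# Illustrates the weighted-synthesis step, not the authors' exact method.

def synthesize(rule_supports, weights):
    """rule_supports and weights are {db_name: value}; weights sum to 1."""
    return sum(weights[db] * rule_supports.get(db, 0.0) for db in weights)

# One cluster of three similar stores reporting support for the same rule:
supports = {"store_a": 0.40, "store_b": 0.50, "store_c": 0.30}
weights  = {"store_a": 0.50, "store_b": 0.30, "store_c": 0.20}

synthesized = synthesize(supports, weights)
# 0.5*0.40 + 0.3*0.50 + 0.2*0.30 = 0.41
```

The authors' point is that this step only makes sense *after* clustering has grouped databases with similar items and rules; averaging supports across unrelated stores would produce the meaningless patterns they warn against.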
I should start collecting quotes that recognize semantic diversity as the rule rather than the exception.
More on that later. Enjoy the article.
Our apparent inability to imagine other audiences keeps nagging at me.
If that is true for hierarchical arrangements, then it must be true for indexes as well.
So far, standard topic maps sort of thinking.
What if that applies to explanations as well?
That is I create better explanations when I imagine the audience to be like me.
And don’t try to guess what others will find a good explanation?
Why not test explanations with audiences?
Make explanation, even of topic maps, a matter of empirical investigation rather than formal correctness.
Using Tag Clouds to Promote Community Awareness in Research Environments Authors: Alexandre Spindler, Stefania Leone, Matthias Geel, Moira C. Norrie Keywords: Tag Clouds – Ambient Information – Community Awareness
Tag clouds have become a popular visualisation scheme for presenting an overview of the content of document collections. We describe how we have adapted tag clouds to provide visual summaries of researchers’ activities and use these to promote awareness within a research group. Each user is associated with a tag cloud that is generated automatically based on the documents that they read and write and is integrated into an ambient information system that we have implemented.
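The core mechanism is simple enough to sketch: count terms across a user's documents and scale counts to font sizes. The code below is my own minimal illustration (the size range and scaling are assumptions); the paper's system additionally feeds the result into an ambient display.

```python
# Minimal tag-cloud sketch: count terms in a user's documents and map
# counts linearly onto a font-size range. Illustrative only.
from collections import Counter

def tag_cloud(documents, min_size=10, max_size=32):
    counts = Counter(word.lower() for doc in documents for word in doc.split())
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1  # avoid division by zero when all counts are equal
    return {
        word: min_size + round((n - lo) * (max_size - min_size) / span)
        for word, n in counts.items()
    }

docs = ["topic maps merging", "topic maps streaming", "merging topic maps"]
cloud = tag_cloud(docs)
# The most frequent terms ("topic", "maps") get the largest font size,
# one-off terms ("streaming") get the smallest.
```

Generating one such cloud per researcher, from what they read and write, is what turns a visualization gimmick into the awareness mechanism the paper describes.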
One of the selling points of topic maps has been the serendipitous discovery of new information. Discovery is predicated on awareness and this is an interesting approach to that problem.
The Linking Open Data cloud diagram is maintained by Richard Cyganiak and Anja Jentzsch.
I suppose having DBpedia at the center of linked data is better than the CIA Factbook. 😉
I find large visualizations like this one useful as marketing tools or “that’s cool” examples, but not terribly useful for actual analysis.
Has your experience been different?
Destined to be a deeply influential resource.
Read the paper, use the Chem2Bio2RDF application for a week, then answer these questions:
Extra credit: What one thing would you change about any of the identifications in this system? Why?
Experience in Extending Query Engine for Continuous Analytics by Qiming Chen and Meichun Hsu has this problem statement:
Streaming analytics is a data-intensive computation chain from event streams to analysis results. In response to the rapidly growing data volume and the increasing need for lower latency, Data Stream Management Systems (DSMSs) provide a paradigm shift from the load-first analyze-later mode of data warehousing….
Moving from load-first analyze-later has implications for topic maps over data warehouses. Particularly when events that are subjects may only have a transient existence in a data stream.
This is on my reading list to prepare to discuss TMQL in Leipzig.
PS: Only five days left to register for TMRA 2010. It is a don’t miss event.
Apologies for the lack of a post for August 18, 2010.
I was working on a post late yesterday evening when my ISP lost connectivity to the Net. 🙁
I could not stay up late enough to see if it would be repaired before the end of the day.
Hence, no post for August 18, 2010.
Have a lot of stuff in the queue so will try to get an early post out most days.
Our starting premise: Users want to say things of interest to them, as simply as possible, for them.
Note the focus on users. Not on description logic. Not on formal ontologies. Not on reasoning, artificial or otherwise. Not even on complex mappings between identifications. But on users.
All of those other things are worthwhile enterprises, some of them anyway, which you can pursue at your leisure.
The question is how to empower users to say things about what interests them? And if possible, how to do so without re-writing the WWW to deal with 303 clouds, etc.?
Our answer to those questions: PGS – Pretty Good Semantics. It asks very little of users yet can annotate any identifier on the WWW to say whatever a user likes.
It uses existing HTML techniques and works with existing web servers and search engines.
In an exchange over a MapReduce resource, Robert Barta observed how large that ecosystem has grown in just a year, and suggested there is a lesson for the TM community in that growth. But what lesson is that? (He didn’t say, but I have written to ask.)
“MapReduce” isn’t a cooler name than “Topic Maps,” so that’s not the lesson.
MapReduce isn’t less complex than topic maps, so that’s not the lesson either.
Two issues topic maps face that MapReduce does not: requiring explicit identification of subjects, and exposing insider vocabularies.
MapReduce doesn’t face the first issue because users can create whatever mapping they wish, without ever saying explicitly what subjects are involved. It also preserves the special nature of insider vocabularies since it has no explicit mechanism for identifying subjects.
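For readers who have not met the model, a minimal in-memory sketch makes the point: the user-supplied map and reduce functions carry whatever mapping the user wishes, and nothing in the framework asks what subjects the keys identify. The helper names below are my own; real MapReduce frameworks distribute this over many machines.

```python
# Minimal in-memory sketch of the MapReduce model. The user-supplied
# mapper and reducer encode whatever mapping the user wishes -- the
# framework never asks what subjects the keys identify.
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map phase: emit (key, value) pairs
            groups[key].append(value)
    # reduce phase: one call per distinct key
    return {key: reducer(key, values) for key, values in groups.items()}

lines = ["topic maps", "map reduce", "topic maps again"]
counts = map_reduce(
    lines,
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda key, values: sum(values),
)
# Word counts fall out of the grouping, with no subject identifiers in sight.
```

The keys here are just strings; whether "maps" means cartography or topic maps is left entirely to the user, which is exactly the contrast with explicit subject identification.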
Are those the lessons? If they are, are there work arounds? Are there other lessons?
Pragmatic Topic Map Streaming by Jan Schreiber raises some interesting questions about how to construct a data stream for a topic map.
I particularly like the idea of creating mini-topic maps as it were. See his post for the details.
What he did not touch on was how topic map stream software would recognize subjects. A topic map stream creator with configurable subject recognition would be really useful. Most of us could use the “topic maps subjects” recognition filter, while others, interested in dull subjects like the World Cup (just teasing), could have a subject filter for it. Some of us could have both, feeding different topic maps.
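The configurable-filter idea might look something like the sketch below. Everything here is hypothetical: the keyword-based recognition predicates are the crudest possible stand-in for real subject recognition, and the routing structure is my own invention.

```python
# Hypothetical sketch of configurable subject recognition for a topic map
# stream: pluggable predicates route stream items to per-map queues.
# Keyword matching is a crude stand-in for real subject recognition.

def make_keyword_filter(keywords):
    keywords = {k.lower() for k in keywords}
    return lambda text: any(k in text.lower() for k in keywords)

def route(stream, filters):
    """filters: {map_name: predicate}. One item may feed several maps."""
    routed = {name: [] for name in filters}
    for item in stream:
        for name, accepts in filters.items():
            if accepts(item):
                routed[name].append(item)
    return routed

stream = [
    "TMDM merging rules updated",
    "World Cup final tonight",
    "Topic maps at the World Cup",
]
routed = route(stream, {
    "tm": make_keyword_filter(["topic map", "tmdm"]),
    "cup": make_keyword_filter(["world cup"]),
})
# The third item matches both filters, so it feeds both topic maps.
```

Swapping in better predicates (named-entity recognition, controlled vocabularies) changes nothing about the routing, which is what makes the recognition step configurable.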
The You Say God Is Dead? There’s an App for That story in the New York Times, July 2, 2010, looks like an opportunity for topic maps.
For publishers, it would be possible to map responses on the basis of topics and let the topic map handle the details of matching each response to the appropriate argument in an “opposing” app. It should shorten the update/production cycle as new material is added to counter new arguments or variations of old ones.
On the product side, publishers could use topic maps to enable users to respond to a variety of ways of naming or phrasing particular issues. In debates over religion, as in all other areas, differences in terminology can make it difficult to come to grips with the opposing side.
Depending on how it was implemented, a topic map app could integrate other resources, ranging from study materials to personal contacts as they relate to this application. Think of a topic map as being able to bridge between data held in mini-silos on an iPhone. So users could add in information into the app that was useful to them in such debates.
Any other critical points I should make as I contact publishers of these apps to recommend topic maps?
A graph is a data structure composed of dots (i.e. vertices) and lines (i.e. edges). The dots and lines of a graph can be organized into intricate arrangements. The ability for a graph to denote objects and their relationships to one another allow for a surprisingly large number of things to be modeled as a graph. From the dependencies that link software packages to the wood beams that provide the framing to a house, most anything has a corresponding graph representation. However, just because it is possible to represent something as a graph does not necessarily mean that its graph representation will be useful. If a modeler can leverage the plethora of tools and algorithms that store and process graphs, then such a mapping is worthwhile. This article explores the world of graphs in computing and exposes situations in which graphical models are beneficial.
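The package-dependency example from the quoted paragraph is easy to make concrete. The sketch below uses an adjacency list, one common way to store a graph, with a reachability check; the package names are illustrative.

```python
# The package-dependency example as an adjacency-list graph with a
# reachability check. Package names are illustrative.

dependencies = {            # edge u -> v means "u depends on v"
    "app": ["web", "db"],
    "web": ["http"],
    "db": ["http"],
    "http": [],
}

def depends_on(graph, start, target):
    """True if `target` is reachable from `start` by following edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return False

# "app" reaches "http" transitively through either "web" or "db",
# even though no direct edge connects them.
```

This is the "leverage" the paragraph mentions: once the dependencies are a graph, standard traversal algorithms answer questions the flat data could not.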
The value of having a distributed architecture (did I hear “Internet?”) has been lost on the Semantic Web advocates. With topic maps you can have multiple locations that “resolve” identifiers to other identifiers and pass on information about something that has been identified.
Most existing topic maps look like data silos but that is more a matter of habit than architectural limitation.
I should put in a plug for the Springer Alert Service, which brought the article with the same title, JErlang: Erlang with Joins to my attention. Highly recommended as a way to stay current on the latest CS research. Remember articles don’t have to say “topic map” in the title or abstract to be relevant.
PS: Topic map observations: The final report and article have the same name. In topic maps, the different locations for the items would be treated as subject locators, allowing them to retain the same name while being distinguished from one another. Note that the roles differ for the two subjects as well: Susan Eisenbach is the supervisor of the final report and a co-author of the article reported by Springer.
Jack Park points to The Fourth Paradigm: Data-Intensive Scientific Discovery as a book that merits our attention.
Indeed it does! Lurking just beneath the surface of data-intensive research are questions of semantics. Diverse semantics. How does data-intensive research occur in a multi-semantic world?
Paul Ginsparg (Cornell University), in Text in a Data-centric World, has the usual genuflection towards “linked data” without stopping to consider the cost of evaluating every URI to decide if it is an identifier or a resource, nor why adding one more name to the welter of names we have now (that is, the semantic diversity problem) is going to make our lives any better.
Such an articulated semantic structure [linked data] facilitates simpler algorithms acting on World Wide Web text and data and is more feasible in the near term than building a layer of complex artificial intelligence to interpret free-form human ideas using some probabilistic approach.
Solving the “perfect language” problem, which has never been solved, is more feasible than “…building a layer of complex artificial intelligence to interpret free-form human ideas using some probabilistic approach” to solve it for us?
Perhaps so but one wonders why that is a useful observation?
On the “perfect language” problem, see The Search for the Perfect Language by Umberto Eco.
The Future of the Journal is another slide deck by Anita de Waard that reads like a promotional piece for topic maps, sans any mention of topic maps.
While Anita makes a strong case for annotation of data in science publishing, the same is true for government, legal, environmental, business, finance, etc., publications. All publications are as complex as depicted on these slides. It isn’t as obvious in the humanities because that “data” has been locked away so long that we have forgotten it is there.
The more complex the information we record, via “annotations” or some other mechanism, the greater the need for librarians to organize it and help us find it. Self-help in research is like the guy about to do a self-appendectomy with his doctor’s advice over the phone. Doable, maybe, but the results are pretty much what you would expect.
Rather than future of the journal, I would say: Future of Information.