Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 26, 2011

Hypertable 0.9.5.0

Filed under: Hypertable,NoSQL — Patrick Durusau @ 5:19 pm

Hypertable 0.9.5.0

My first encounter with this project led me to: http://www.hypertable.com, which is a commercial venture offering support for open source software.

Except that that wasn’t really clear from the .com homepage.

I finally tracked links back to: http://code.google.com/p/hypertable/ to discover its GNU GPL v2 license.

The list of ventures using Hypertable is an impressive one.

Linking to the documentation at the .org site from the .com site would be a real plus.

A bit more attention to the .com site might attract more business, use cases, that sort of thing.

Scala Quick Reference

Filed under: Scala — Patrick Durusau @ 5:18 pm

Scala Quick Reference

If you are exploring Scala, this will be handy.

It isn’t often that I wish for a color printer but this is one of those times.

March 25, 2011

Deep Knowledge Representation Challenge Workshop

Filed under: Conferences,Knowledge Representation — Patrick Durusau @ 4:33 pm

Deep Knowledge Representation Challenge Workshop

From the website:

This workshop will provide a forum to discuss difficult problems in representing complex knowledge needed to support deep reasoning, question answering, explanation and justification systems. The goals of the workshop are: (1) to create a comprehensive set of knowledge representation (KR) challenge problems suitable for a recurring competition, and (2) begin to develop KR techniques to meet those challenges. A set of difficult to represent sentences from a biology textbook are included as an initial set of KR challenges. Cash prizes will be awarded for the most creative and comprehensive solutions to the selected challenges.

The workshop will be a highly interactive event with brief presentations of problems and solutions followed by group discussion. To submit a paper to the workshops, the participants should select a subset of the challenge sentences and present approaches for representing them along with an approach to use that representation in a problem solving task (question answering or decision support). Participants are free to add to the list of challenge sentences, for example, from other chapters of the textbook, or within the spirit of their own projects and experience but should base their suggestions on concrete examples, if possible, from real applications.

Important Dates:

  • 7 May: Submissions due
  • 16 May: Notification of participants
  • 13 June: Final camera ready material for workshop web site and all material for discussion
  • 25 June: Initial workshop, with report back and further discussion during KCAP. Details to be announced.

I mention this because deep knowledge, in the sense of identification of and navigation to subjects, is part and parcel of topic maps.

It seems to me that any “…deep reasoning, question answering, explanation and justification” system is going to succeed or fail based on its identification of subjects.

Or to put it differently, it is difficult to reason effectively if you don’t know what you are talking about. (I could mention several examples from recent newscasts but I will forgo the opportunity.)

TMQL Slides for Prague (was incorrectly Leipzig)!

Filed under: TMQL — Patrick Durusau @ 4:32 pm

TMQL Slides for Prague are now available!

Rani Pinchuk prepared slides for discussion in Prague next week.

We should all be appreciative and use this opportunity to provide useful feedback to the editors.

That does not imply that anyone will agree with any particular point but it is possible to express disagreement in a polite way.

I will try to remind myself of that as much as anyone else. 😉

******
Apologies for the incorrect title! Thanks Benjamin!

Open-source Data Science Toolkit

Filed under: Dataset,Geographic Data,Geographic Information Retrieval,Software — Patrick Durusau @ 4:32 pm

Open-source Data Science Toolkit

From Flowingdata.com:

Pete Warden does the data community a solid and wraps up a collection of open-source tools in the Data Science Toolkit to parse, geocode, and process data.

Mostly geographic material but some other interesting tools, such as extracting the “main” story from a document. (It has never encountered one of my longer email exchanges with Newcomb. 😉 )

It is interesting to me that so many tools and data sets related to geography appear so regularly.

GIS (geographic information systems) can be very hard, but perhaps it is easier than the semantic challenges of, say, medical or legal literature.

That is, it is easier to say “here you are” within a geographic system than to locate a subject in a conceptual space that has been only partially captured by a document.

Suspect the difference in hardness could only be illustrated by example and not by some test. Will have to give that some thought.

Elastic Lists Celebrates Five Years of Information Aesthetics

Filed under: Graphics,Interface Research/Design — Patrick Durusau @ 4:31 pm

Elastic Lists Celebrates Five Years of Information Aesthetics

From the website:

In celebration of Information Aesthetics’ birthday, Moritz Stefaner of Well-formed Data adapted his elastic lists concept to all five years of infosthetics posts. Each white-bordered rectangle represents a post, and colors within rectangles indicate post categories.

Select categories on the right, and the list updates to show related categories. Similarly, filter posts by year, author, and/or number of categories. Select a rectangle to draw up the actual post.

Go on, give it a try for yourself. Excellent work.

And then head over to infosthetics and wish it a happy birthday.

From the “new to me” corner.

Very interesting presentation of data. Suspect there are any number of data sets where this would be appropriate.

Oh, btw, like the post says: check out Infosthetics.

Open Data is not Transparency

Filed under: Dataset — Patrick Durusau @ 4:31 pm

Open Data is not Transparency

From the blog:

There are many encouraging signs of late in the general area of open data. However, one thing that has to be kept in mind with this movement is that open data is only part of transparency – it is necessary but not sufficient. If the data is not understandable by the intended audience (and open data suggests a very broad audience) then there is no transparency. The information and knowledge locked in the data will be hiding in plain sight.

This thought suggests that any open data movement has to be combined with a ‘plain English’ (or local-language equivalent) programme and an investment in data literacy. In addition, to take the whole movement to its obvious conclusion, there should be some well-defined success criteria. What is the answer to the question: what happens to whom when citizens experience open data?

For a similar take, see my: Baltimore – Semi-Transparent or Semi-Opaque?

What surprises me is that, in response to the Cablegate scandal, the State Department did not simply start dumping all of its daily output to the web. In all its inconsistent formats, vocabularies, etc.

The ensuing flood of data would effectively hide any secrets they may have far more effectively than any security protocol.

I can imagine the news conference now: “You found a document that said what? Imagine that!”

With no attribution it could be anyone from the janitor writing a novel on their lunch break to Hillary finally saying good-bye to Bill.

And that would allow the reservation of “top-secret” for things like launch codes, where is the red button, stuff like that.

Journal of Digital Information

Filed under: Digital Library — Patrick Durusau @ 4:30 pm

Journal of Digital Information

Publishing papers on the management, presentation and uses of information in digital environments, JoDI is a peer-reviewed Web journal supported by Texas A&M University Libraries.

First publishing papers in 1997, the Journal of Digital Information is an electronic-only, peer-reviewed journal covering the broad topics related to digital libraries, hypertext and hypermedia systems, and the issues of digital information. JoDI is supported by the Texas A&M University Libraries through the Digital Initiatives, Research and Technology group, and hosted by the Texas Digital Library.

Looks like an interesting venue to explore for material on digital libraries.

March 24, 2011

Twice in the Same Semantic Stream?

Filed under: Semantics,Subject Identity — Patrick Durusau @ 7:54 pm

I don’t think anyone disagrees with the proposition that the meaning, semantics of words, changes over time. And across social groups and settings.

It is like the stream described by Heraclitus, in which we can never step twice.

What meanings we assign to words, one medium of communication, are chosen from that stream at various points.

Note that I said chosen and not caught.

The words continue downstream where they may be chosen by other people with other meanings.

The notion that we can somehow fix the meaning of words is contrary to our common and universal experience.

I wonder then why anyone would think that data structures, which after all are composed of words and liable to the same shifting semantics as any other words, could have a fixed semantic.

That somehow data structures reside outside what we know to be the ebb and flow of semantics.

Both the words that we think of as being “data” and the words that we assign structures to hold or describe data (metadata if you like), are all part and parcel of the same stream.

The well-known shifting semantics of owl:sameAs is a case in point.

But you could as well pick terminology from any other vocabulary, semantic or not, to illustrate the same point.

That isn’t to say that RDF or OWL aren’t useful. They are. For any number of purposes.

But, like any vocabulary, whether for data or structure, they should be used with two cautions:

1) Any term in a vocabulary stands for a subject that is also represented by other terms in other vocabularies.

That is to say that a term that is used for a subject is a matter of convenience and custom, not some underlying truth.

2) Any term in a vocabulary exists in the context of other terms that represent other subjects.

A term can be best understood and communicated to others if it is documented or explained in the context of other subjects.

To say nothing of mapping terms for a subject to other terms for the same subject.
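
Just to make that last point concrete, here is a tiny Python sketch (entirely invented names, no particular vocabulary or syntax) of mapping several terms from different vocabularies to one subject, with the context of use recorded alongside:

```python
# Tiny, hypothetical sketch: several terms, from different vocabularies,
# mapped to one subject, with context recorded alongside. All names invented.
subject_map = {
    "subject:stream-example": {
        "terms": {
            "vocabulary-A": "watercourse",
            "vocabulary-B": "stream",
            "vocabulary-C": "creek",
        },
        "context": "flowing body of fresh water; sense as used in hydrology texts",
    }
}

def terms_for(subject_id):
    """All the terms, by vocabulary, currently mapped to a subject."""
    return subject_map[subject_id]["terms"]

print(terms_for("subject:stream-example"))
```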

To act otherwise, as though semantics are fixed, is an attempt to step twice in the same location in a semantic stream.

Wasn’t possible for Heraclitus, isn’t possible now.

March 23, 2011

RDF and Semantic Web

Filed under: RDF,Semantic Web,Topic Maps — Patrick Durusau @ 6:03 am

RDF and Semantic Web: can we reach escape velocity?

Jeni Tennison’s slides from TPAC 2010 are an interesting insight into how an “insider” views the current state of RDF and the Semantic Web.

I disagree with her on a couple of crucial points:

RDF’s only revolution, but the key one, is using URIs to name things, including properties and classes

identifying things with URIs does two really useful things

  • disambiguates, enabling joins with other data using same URI
    • mash-ups beyond mapping things on a Google Map
  • provides something at the end of the URI
    • extra information, explanation, context
    • in a basic entity-attribute-value model that enables combination without either up-front agreement or end-user jiggerypokery

First, the “identifying things with URIs” is re-use of a very old idea, the perfect language, which has a universal and unbroken record of failure. (see my Blast from the Past and citations therein.)

Second, how is combination possible without either up-front agreement or end-user jiggerypokery?

Combining information without either up-front agreement or end-user jiggerypokery is why we get such odd search results now.

Let’s take a simple example. Search for “democracy” and see what results you get.

Now, do you really think that “democracy” (limiting my remarks to the US at the moment) from documents in the 18th and 19th centuries means the same thing as “democracy” after the fall of slavery but prior to women getting the right to vote? Or does it mean the same thing as it does today? Or does it mean the same thing as its use in Egypt, where classes other than the moneyed ones may be favored?

No doubt you will say that someone could create URIs for all those senses of democracy, which is true, but the question is will we use them consistently? The answer to that has been no up to this point.

People are inconsistent, semantically speaking, and there is no sign that this is going to change.

Which brings me to the second major area of my disagreement.

RDF and the Semantic Web are failing (present tense) because they are the answer to a problem RDF and Semantic Web followers are interested in solving.

But not the answer to problems that interest anyone else.

At least not enough to pay the price of RDF and the Semantic Web.

To be fair, topic maps face the same issue.

But at least topic maps started off with a particular problem (combining indexes) and then expanded to be a general solution.

The Semantic Web started off as a general solution in search of problems that would justify the cost of adoption. Not the best strategy.

I do like Jeni’s emphasis on assisting governments to make their data usefully available. That is a good thing and one that we agree on.

Both topic maps and RDF/SW need to analyze the problems of governments (and others) in making such data available.

Then, understanding the issues they face, derive as low cost a solution as possible within their paradigms to solve that problem.

That could involve URIs, for example, assuming there were a protocol under which a URI plus N properties serve to identify a subject.

Not that such a protocol would make us any more semantically consistent, but having more than one property to be inconsistent about may (emphasis on may) reduce the range of semantic inconsistency.

Take my democracy example. If I had http://NotRealURI/democracy with a date-range property of 1800-1850, and matching my sense of democracy required matching both the URI and the date range, that would be a step towards reducing semantic inconsistency.
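
For the curious, here is a minimal Python sketch of what matching on a URI plus an additional property might look like. The URI, property names and values are hypothetical, carried over from the democracy example; this is not part of RDF or any existing protocol, just the shape of the idea:

```python
# Hypothetical sketch: identity requires matching a URI *and* additional properties,
# not the URI alone. Names and values are invented for the democracy example above.

def same_subject(a, b, required_keys=("uri", "date_range")):
    """Two identity records refer to the same subject only if every
    required property matches. Matching the URI alone is not enough."""
    return all(a.get(key) == b.get(key) for key in required_keys)

# "Democracy" as used in US documents of 1800-1850 ...
democracy_antebellum = {"uri": "http://NotRealURI/democracy", "date_range": (1800, 1850)}

# ... versus "democracy" as used today, behind the very same URI.
democracy_today = {"uri": "http://NotRealURI/democracy", "date_range": (2011, 2011)}

print(same_subject(democracy_antebellum, democracy_today))   # False: same URI, different senses
print(same_subject(democracy_antebellum,
                   {"uri": "http://NotRealURI/democracy",
                    "date_range": (1800, 1850)}))            # True: URI and date range both match
```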

It is the lack of a requirement that more than one property be matched for identity that underlies the technical failure of RDF/Semantic Web.

Its social failure is in not answering questions that are of interest to developers and ultimately users.

Providing useful answers to problems, seen by users as problems, is the way forward for both topic maps and RDF/Semantic Web.

Microsoft Research Watch: AI, NoSQL and Microsoft’s Big Data Future

Filed under: Artificial Intelligence,Graphs,NoSQL — Patrick Durusau @ 6:01 am

Microsoft Research Watch: AI, NoSQL and Microsoft’s Big Data Future

From ReadWriteCloud channel:

Probase is a Microsoft Research project described as an “ongoing project that focuses on knowledge acquisition and knowledge serving.” Its primary goal is to “enable machines to understand human behavior and human communication.” It can be compared to Cyc, DBpedia or Freebase in that it is attempting to compile a massive collection of structured data that can be used to power artificial intelligence applications.

It’s powered by a new graph database called Trinity, which is also a Microsoft Research project. Trinity was spotted today by MyNoSQL blogger Alex Popescu, and that led us to Probase. Neither project seems to be available to the public yet.

Err, did they say graph database?

Now, if they can just avoid the one-world-semantic trap, this could prove to be very interesting.

Well, it will be interesting in any case, but avoiding that particular dead end would give MS a robustness that would be hard to match.

Linked Data: Evolving the Web into a Global Data Space (The Online Book)

Filed under: Linked Data,RDF,Topic Maps — Patrick Durusau @ 6:01 am

Linked Data: Evolving the Web into a Global Data Space (The Online Book)

The Principles of Linked Data:

1. Use URIs as names for things.
2. Use HTTP URIs, so that people can look up those names.
3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
4. Include links to other URIs, so that they can discover more things.
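
For a sense of what principle 3 looks like in practice, here is a minimal Python sketch of dereferencing a URI with HTTP content negotiation, asking for RDF rather than HTML. The DBpedia URI is just a convenient, well-known example; whether a given server honours these media types is something you have to check for yourself:

```python
# Minimal sketch of principle 3: dereference a URI and ask for RDF via content
# negotiation. Whether a given server honours these media types is an assumption.
import requests

uri = "http://dbpedia.org/resource/Berlin"
response = requests.get(
    uri,
    headers={"Accept": "application/rdf+xml, text/turtle;q=0.9"},
    allow_redirects=True,  # Linked Data servers often 303-redirect to a data document
    timeout=30,
)

print(response.status_code)
print(response.headers.get("Content-Type"))
print(response.text[:500])  # the first few hundred bytes of the returned RDF
```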

First observation/question

What made the original WWW proposal different from all hypertext systems before it?

There had been a number of hypertext systems before the WWW, some with capabilities that the WWW continues to lack.

But what made it different or, perhaps better, successful?

That links could fail?

Oh, but if we are going to have a global data space that identifies stuff, it can’t fail. Yes?

So, we are taking a flexible, fault tolerant system (the World Wide Web) and making it into an inflexible, brittle system (the Semantic Web).

That sounds like a very, very bad plan.

Second observation/question

Global data space?

Even allowing for marketing puff, that is a bit of a stretch. Well, more than that, it is an outright lie.

Consider all the data now being collected by the Large Hadron Collider at CERN. So much data that some of it has to be discarded. Simply can’t keep it all.

Or all the data from previous space missions and astronomical observations, both visible and in other bands.

Or all the legal (and one assumes illegal) records of government activity.

Or all the other information, records, data from human activity.

And not just the documents, but the stuff people talk about in them and the relationships between the things they talk about.

Some of that can be addressed or obtained over the web, but that isn’t the same thing as identifying all the stuff talked about in that material on the WWW.

Now, if Linked Data wanted to claim that the WWW was a global data space for information of interest to a particular group, well, that comes closer to being believable at least.

*****

However silly a single, unifying data model may sound, it is true that making data more accessible, by any means, makes it easier to make sensible use of it.

Despite having a drank-the-Kool-Aid perspective on linked data, this book is a useful introduction to the technology.

Ignore the “…put your hand on the radio and feel the power…” type stuff.

Keep saying to yourself: “it’s just another format, it’s just another format…,” and you will be fine.

The TEDS Framework…

Filed under: Interface Research/Design — Patrick Durusau @ 6:00 am

The TEDS Framework for Assessing Information Systems From a Human Actors’ Perspective: Extending and Repurposing Taylor’s Value-Added Model

Scholl, H. J., Eisenberg, M. B., Dirks, L. and Carlson, T. S. (2011), The TEDS framework for assessing information systems from a human actors’ perspective: Extending and repurposing Taylor’s Value-Added Model. Journal of the American Society for Information Science and Technology, 62: 789–804.

Abstract:

Developed in the early 1980s—well before Internet and web-based technologies had arrived—Taylor’s Value-Added Model introduced what is now better known as the human-actors’ needs perspective on information systems/information technology (IS/IT) artifacts. Taylor distinguished six top-level criteria that mattered most to human actors when using IS/IT artifacts. We develop this approach further and present the TEDS framework as an analytical instrument for actor- and utilization-specific evaluation of IS/IT artifacts as well as a practical tool for moderating and formulating design specifications. We use the empirical case of a comprehensive comparative professional sports team web site evaluation project to illustrate the power and versatility of the extended analytical framework.

Interesting article for a couple of reasons.

First and foremost, to reinforce the notion that interface design is an interactive exercise with users and not a “train the user to do it right” exercise.

Second, advancing models for understanding the interaction of users with interfaces is another step towards making good interface design less of a hit-and-miss proposition.

NoSQL: Guides, Tutorials, Books, Papers

Filed under: NoSQL — Patrick Durusau @ 5:59 am

NoSQL: Guides, Tutorials, Books, Papers

If you are new to NoSQL or just want to see what beginner material is available (to avoid writing your own), Alex Popescu has a growing collection of materials on NoSQL.

Bookmark it to send to your co-workers and even clients.

WebGL and simulations on GPU

Filed under: Graphics,Interface Research/Design,Visualization — Patrick Durusau @ 5:59 am

WebGL and simulations on GPU

Another interesting visualization resource.

Although I must confess to being a text person, I do appreciate the need for and utility of graphical interfaces for some information sets.

How to apply Naive Bayes Classifiers to document classification problems.

Filed under: Bayesian Models,Classifier — Patrick Durusau @ 5:59 am

How to apply Naive Bayes Classifiers to document classification problems.

Nils Haldenwang does a good job of illustrating the actual application of a naive Bayes classifier to document classification.

A good introduction to an important topic for the construction of topic maps.
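
If you want to experiment alongside his post, here is a minimal scikit-learn sketch of the same idea (not his code): bag-of-words counts feeding a multinomial naive Bayes classifier, with toy documents and labels invented for illustration:

```python
# Minimal naive Bayes document classification sketch (toy data, scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = [
    "the striker scored a late goal in the cup final",
    "the keeper saved a penalty in extra time",
    "the court ruled on the new privacy statute",
    "the judge dismissed the appeal on procedural grounds",
]
train_labels = ["sports", "sports", "law", "law"]

vectorizer = CountVectorizer()                 # bag-of-words term counts
X_train = vectorizer.fit_transform(train_docs)

classifier = MultinomialNB()                   # P(class | words) via Bayes' rule
classifier.fit(X_train, train_labels)

test_docs = ["the appeal over the penalty ruling"]
X_test = vectorizer.transform(test_docs)
print(classifier.predict(X_test))              # predicted class for the new document
print(classifier.predict_proba(X_test))        # class probabilities
```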

Computational Category Theory

Filed under: Category Theory — Patrick Durusau @ 5:58 am

Computational Category Theory (404 as of 21 August 2015)

Published in 1990 by D.E. Rydeheard and R.M. Burstall, Computational Category Theory uses ML to illustrate the relevance of category theory to CS.


Updated link: Computational Category Theory by D.E. Rydeheard and R.M. Burstall.

BTW, ML code is available as well.

Enjoy!

March 22, 2011

…competent villain down the hall?

Filed under: Searching,Topic Maps — Patrick Durusau @ 7:02 pm

Shall I Google it or ask the competent villain down the hall? The moderating role of information need in information source selection.

Lu, L., & Yuan, Y. (2011). Shall I Google it or ask the competent villain down the hall? The moderating role of information need in information source selection. Journal of the American Society for Information Science & Technology, 62(1), 133-145.

Abstract:

Previous studies have found that both (a) the characteristics (e.g., quality and accessibility) (e.g., Fidel & Green, 2004) and (b) the types of sources (e.g., relational and nonrelational sources) (e.g., Zimmer, Henry, & Butler, 2007) influence information source selection. Different from earlier studies that have prioritized one source attribute over the other, this research uses information need as a contingency factor to examine information seekers’ simultaneous consideration of different attributes. An empirical test from 149 employees’ evaluations of eight information sources revealed that (a) low- and high-information-need individuals favored information source quality over accessibility while medium-information-need individuals favored accessibility over quality; and (b) individuals are more likely to choose relational over nonrelational sources as information need increases.

OK, I confess. I started reading this article because of the title. I don’t have a “competent villain down the hall” so I was interested in the experience of others.

That’s my story and I am sticking to it. 😉

Seriously, in addition to being a good study of search behavior, I think the break point of being above medium need may be useful for further investigation with topic maps.

After all, topic maps do require more effort than simply pounding out an ASCII text, even more than setting yourself up as a semantic authority and ignoring the efforts of others.

So the very real question becomes: When do topic maps pay a return for the effort of their creation?

I would suspect the verdict would be yes for medical research where finding all the references can be critical, but only a maybe if the question concerned the location of a particular brand of canned beans at the local grocery. Substitute your own important/unimportant examples there.

Suggestions on what areas look fruitful for testing users’ perceived need for topic maps?

Disease Named Entity Recognition

Filed under: Entity Extraction,Machine Learning — Patrick Durusau @ 7:02 pm

Disease named entity recognition using semisupervised learning and conditional random fields.

Suakkaphong, N., Zhang, Z., & Chen, H. (2011). Disease named entity recognition using semisupervised learning and conditional random fields. Journal of the American Society for Information Science & Technology, 62(4), 727-737.

Abstract:

Information extraction is an important text-mining task that aims at extracting prespecified types of information from large text collections and making them available in structured representations such as databases. In the biomedical domain, information extraction can be applied to help biologists make the most use of their digital-literature archives. Currently, there are large amounts of biomedical literature that contain rich information about biomedical substances. Extracting such knowledge requires a good named entity recognition technique. In this article, we combine conditional random fields (CRFs), a state-of-the-art sequence-labeling algorithm, with two semisupervised learning techniques, bootstrapping and feature sampling, to recognize disease names from biomedical literature. Two data-processing strategies for each technique also were analyzed: one sequentially processing unlabeled data partitions and another one processing unlabeled data partitions in a round-robin fashion. The experimental results showed the advantage of semisupervised learning techniques given limited labeled training data. Specifically, CRFs with bootstrapping implemented in sequential fashion outperformed strictly supervised CRFs for disease name recognition.
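
The paper pairs CRFs with bootstrapping. As a rough sketch of the bootstrapping half only, here is a generic self-training loop in Python; a plain scikit-learn classifier stands in for the paper’s CRF sequence labeler, and the threshold and toy data are invented, so read it as the shape of the idea rather than the authors’ method:

```python
# Sketch of the bootstrapping (self-training) idea only: train on labeled data,
# label the unlabeled pool, promote high-confidence predictions into the training
# set, repeat. A plain classifier stands in for the paper's CRF sequence labeler.
import numpy as np
from sklearn.linear_model import LogisticRegression

def bootstrap(X_labeled, y_labeled, X_unlabeled, rounds=3, threshold=0.9):
    X_train, y_train = X_labeled.copy(), y_labeled.copy()
    pool = X_unlabeled.copy()
    for _ in range(rounds):
        if len(pool) == 0:
            break
        model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        probs = model.predict_proba(pool)
        confident = probs.max(axis=1) >= threshold      # only trust confident guesses
        if not confident.any():
            break
        X_train = np.vstack([X_train, pool[confident]])
        y_train = np.concatenate([y_train, model.predict(pool[confident])])
        pool = pool[~confident]                         # remove promoted examples
    return LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Toy usage with random features: two labeled classes plus an unlabeled pool.
rng = np.random.default_rng(0)
X_lab = np.vstack([rng.normal(0, 1, (10, 5)), rng.normal(3, 1, (10, 5))])
y_lab = np.array([0] * 10 + [1] * 10)
X_unlab = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(3, 1, (50, 5))])
model = bootstrap(X_lab, y_lab, X_unlab)
print(model.predict(X_unlab[:5]))
```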

Not to take anything away from this sort of technique, which would stand topic map construction in good stead, but I am left feeling like it stops short of the mark.

In other words, say that I am happy with the result of its recognition, how do I share that with someone else, who has another set of identified subjects, perhaps from the same data?

Or for that matter, how do I combine it with data that I myself have extracted from the same data?

Can’t very well ask the software why it “recognized” one name or another can I?

Thinking I would have to add what seemed to me to be useful information to the name, in order to re-use it with other data.

Starting to sound like a topic map isn’t it?

March 21, 2011

Providing Recommendations in Social Networks Using Python: AtePassar Study Case

Filed under: Python,Social Networks — Patrick Durusau @ 8:55 am

Providing Recommendations in Social Networks Using Python: AtePassar Study Case

From post:

Recently I’ve been working on recommendations, specially related to social networks. One of my tasks is to investigate, create and analyze a recommendation engine capable of generating suggestions of friends, study groups, videos and related content to a registered user in a social network.

The social network that I am working on is called AtePassar, a Brazilian social network for people who want to apply for positions in the Brazilian civil (government) service. One of the great features of this social network is that people can share their interests about their studies and meet people from all around Brazil with the same interests, or someone who will take the same exam. Can you imagine the possibilities?

Applications that assist in the authoring of topic maps (to say nothing of recommending topics from topic maps) are going to make “recommendations.”

Homonymous Authors

Filed under: Homonymous,Indexing — Patrick Durusau @ 8:53 am

A method for eliminating articles by homonymous authors from the large number of articles retrieved by author search.

Onodera, Natsuo, Mariko Iwasawa, Nobuyuki Midorikawa, Fuyuki Yoshikane, Kou Amano, Yutaka Ootani, Tadashi Kodama, Yasuhiko Kiyama, Hiroyuki Tsunoda, and Shizuka Yamazaki. 2011. “A method for eliminating articles by homonymous authors from the large number of articles retrieved by author search.” Journal of the American Society for Information Science & Technology 62, no. 4: 677-690.

Abstract:

This paper proposes a methodology which discriminates the articles by the target authors (‘true’ articles) from those by other homonymous authors (‘false’ articles). Author name searches for 2,595 ‘source’ authors in six subject fields retrieved about 629,000 articles. In order to extract true articles from the large amount of the retrieved articles, including many false ones, two filtering stages were applied. At the first stage any retrieved article was eliminated as false if either its affiliation addresses had little similarity to those of its source article or there was no citation relationship between the journal of the retrieved article and that of its source article. At the second stage, a sample of retrieved articles was subjected to manual judgment, and utilizing the judgment results, discrimination functions based on logistic regression were defined. These discrimination functions demonstrated both the recall ratio and the precision of about 95% and the accuracy (correct answer ratio) of 90-95%. Existence of common coauthor(s), address similarity, title words similarity, and interjournal citation relationships between the retrieved and source articles were found to be the effective discrimination predictors. Whether or not the source author was from a specific country was also one of the important predictors. Furthermore, it was shown that a retrieved article is almost certainly true if it was cited by, or cocited with, its source article. The method proposed in this study would be effective when dealing with a large number of articles whose subject fields and affiliation addresses vary widely.
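
Here is a minimal sketch of the kind of discrimination function the abstract describes: a logistic regression over pairwise features such as shared coauthors, address similarity, title-word similarity and inter-journal citation. The feature values and labels below are invented; in the study they came from manually judged samples:

```python
# Sketch of a logistic-regression discrimination function for homonymous-author
# filtering. Each row compares a retrieved article with the source article; the
# feature values and labels below are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features: [shared_coauthors, address_similarity, title_word_similarity, interjournal_citation]
X_train = np.array([
    [1, 0.9, 0.6, 1],   # retrieved article judged "true" (same author)
    [2, 0.8, 0.4, 1],
    [0, 0.1, 0.0, 0],   # judged "false" (homonymous other author)
    [0, 0.2, 0.1, 0],
    [1, 0.7, 0.3, 1],
    [0, 0.0, 0.2, 0],
])
y_train = np.array([1, 1, 0, 0, 1, 0])  # 1 = true article, 0 = false article

model = LogisticRegression().fit(X_train, y_train)

candidate = np.array([[1, 0.85, 0.5, 1]])        # a newly retrieved article
print(model.predict_proba(candidate)[0, 1])      # probability it is by the target author
```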

Interesting study of heuristics that may be of assistance in creating topic maps from academic literature.

I suspect there are other “patterns” as it were in other forms of information that await discovery.

MG4J – Managing Gigabytes for Java

Filed under: Indexing,Search Engines,Searching — Patrick Durusau @ 8:52 am

MG4J – Managing Gigabytes for Java

From the website:

The main points of MG4J are:

  • Powerful indexing. Support for document collections and factories makes it possible to analyse, index and query consistently large document collections, providing easy-to-understand snippets that highlight relevant passages in the retrieved documents.
  • Efficiency. We do not provide meaningless data such as “we index x GiB per second” (with which configuration? which language? which data source?)—we invite you to try it. MG4J can index without effort the TREC GOV2 collection (document factories are provided to this purpose) and scales to hundreds of millions of documents.
  • Multi-index interval semantics. When you submit a query, MG4J returns, for each index, a list of intervals satisfying the query. This provides the base for several high-precision scorers and for very efficient implementation of sophisticated operators. The intervals are built in linear time using new research algorithms.
  • Expressive operators. MG4J goes far beyond the bag-of-words model, providing efficient implementation of phrase queries, proximity restrictions, ordered conjunction, and combined multiple-index queries. Each operator is represented internally by an abstract object, so you can easily plug in your favourite syntax.
  • Virtual fields. MG4J supports virtual fields—fields containing text for a different, virtual document; the typical example is anchor text, which must be attributed to the target document.
  • Flexibility. You can build much smaller indices by dropping term positions, or even term counts. It’s up to you. Several different types of codes can be chosen to balance efficiency and index size. Documents coming from a collection can be renumbered (e.g., to match a static rank or experiment with indexing techniques).
  • Openness. The document collection/factory interfaces provide an easy way to present your own data representation to MG4J, making it a breeze to set up a web-based search engine accessing directly your data. Every element along the path of query resolution (parsers, document-iterator builders, query engines, etc.) can be substituted with your own versions.
  • Distributed processing. Indices can be built for a collection split in several parts, and combined later. Combination of indices allows non-contiguous indices and even the same document can be split across different collections (e.g., when indexing anchor text).
  • Multithreading. Indices can be queried and scored concurrently.
  • Clustering. Indices can be clustered both lexically and documentally (possibly after a partitioning). The clustering system is completely open, and user-defined strategies decide how to combine documents from different sources. This architecture makes it possible, for instance, to load in RAM the part of an index that contains terms appearing more frequently in user queries.

EuroHCIR 2011: The 1st European Workshop on Human-Computer Interaction and Information Retrieval

Filed under: Conferences,Information Retrieval,Interface Research/Design — Patrick Durusau @ 8:49 am

EuroHCIR 2011: The 1st European Workshop on Human-Computer Interaction and Information Retrieval

From the website:

HCIR, or Human-Computer Information Retrieval, was a phrase coined by Gary Marchionini in 2005 and is representative of the growing interest in uniting both those who are interested in how information systems are built (the Information Retrieval community) and those who are interested in how humans search for information (the Human-Computer Interaction and Information Seeking communities). Four increasingly popular workshops and an NSF-funded event have brought focus to this multi-disciplinary issue in the USA, and the aim of EuroHCIR 2011 is to focus the European community in the same way.

Consequently, the EuroHCIR workshop has four main goals:

  • Present and discuss novel HCIR designs, systems, and findings.
  • Identify and unite European researchers and industry professionals working in this area.
  • Facilitate and encourage collaboration and joint academic and industry ventures.
  • Define and coordinate a vision for the community for future EuroHCIR events.

The topics for the workshop look quite interesting:

  • Novel interaction techniques for information retrieval.
  • Modelling and evaluation of interactive information retrieval.
  • Exploratory search and information discovery.
  • Information visualization and visual analytics.
  • Applications of HCI techniques to information retrieval needs in specific domains.
  • Ethnography and user studies relevant to information retrieval and access.
  • Scale and efficiency considerations for interactive information retrieval systems.
  • Relevance feedback and active learning approaches for information retrieval.

Important dates:

Submissions: 1st May 2011

Notifications: 20th May 2011

Camera Ready: 2nd June 2011

Workshop: 4th July 2011

KDD Cup

Filed under: Dataset,Examples,Music Retrieval — Patrick Durusau @ 8:48 am

KDD Cup

From the website:

People have been fascinated by music since the dawn of humanity. A wide variety of music genres and styles has evolved, reflecting diversity in personalities, cultures and age groups. It comes as no surprise that human tastes in music are remarkably diverse, as nicely exhibited by the famous quotation: “We don’t like their sound, and guitar music is on the way out” (Decca Recording Co. rejecting the Beatles, 1962).

Yahoo! Music has amassed billions of user ratings for musical pieces. When properly analyzed, the raw ratings encode information on how songs are grouped, which hidden patterns link various albums, which artists complement each other, and above all, which songs users would like to listen to.

Such an exciting analysis introduces new scientific challenges. The KDD Cup contest releases over 300 million ratings performed by over 1 million anonymized users. The ratings are given to different types of items (songs, albums, artists, genres), all tied together within a known taxonomy.

Important dates:

March 15, 2011 Competition begins

June 30, 2011 Competition ends

July 3, 2011 Winners notified

August 21, 2011 Workshop

An interesting data set that focuses on machine learning and prediction.

Equally interesting would be merging this data set with other music data sets.
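
As a toy illustration of that merging idea, here is a pandas sketch joining ratings to a second data set keyed on artist name. Every column and value is invented, and since the KDD Cup data is anonymized and tied to its own taxonomy, a real merge would need an identifier mapping rather than a naive string join:

```python
# Toy sketch of merging two music data sets on artist name (all values invented).
# The KDD Cup data is anonymized, so in practice you would need an identifier
# mapping, not a naive string join.
import pandas as pd

ratings = pd.DataFrame({
    "artist": ["The Beatles", "Miles Davis", "Radiohead"],
    "mean_rating": [92.1, 88.4, 85.7],
    "num_ratings": [120000, 45000, 98000],
})

metadata = pd.DataFrame({
    "artist": ["The Beatles", "Radiohead", "Miles Davis"],
    "genre": ["rock", "alternative", "jazz"],
    "active_from": [1960, 1985, 1944],
})

merged = ratings.merge(metadata, on="artist", how="left")  # left join on the shared key
print(merged)
```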

March 20, 2011

a practical guide to noSQL

Filed under: Marketing,NoSQL — Patrick Durusau @ 1:26 pm

a practical guide to noSQL by Denise Mura strikes me as deeply problematic.

First, realize that Denise is describing the requirements that a MarkLogic server is said to meet.

That may or may not be the same as your requirements.

The starting point for evaluating any software, MarkLogic (which I happen to like) or not, must be with your requirements.

I mention this in part because I can think of several organizations and more than one government agency that have bought software that met a vendor’s requirements, but not their own.

The result was a sale for the vendor and a large software dog that everyone kept tripping over; pride and unwillingness to admit error kept it around for a very long time.

Take, for example, her claim that MarkLogic “deliver[s] real-time updates, search, and retrieval results….” Well, OK, but if I run weekly reports on data that is uploaded daily, then real-time updates, search, and retrieval may not be among my requirements.

You need to start with your requirements (you do have written requirements, yes?) and not those of a vendor or what “everyone else” requires.

The same lesson holds true for construction of a topic map. It is your world view that it needs to reflect.

Second, it can also be used as a lesson in reading closely.

For example, of Lucene, Solr, and Sphinx, Denise says:

Search engines lie to you all the time in ways that are not always obvious because they need to take shortcuts to make performance targets. In other words, they don’t provide for a way to guarantee accuracy.

It isn’t clear from the context what lies Denise thinks we are being told, or what it would mean to “guarantee accuracy.”

I can’t think of any obvious ways that a search engine has ever lied to me, much less any non-obvious ones. (That may be because they are non-obvious.)

There are situations where noSQL, SQL, MarkLogic and topic map solutions are entirely appropriate. But as a consumer you will need to cut through promotional rhetoric to make the choice that is right for you.

99 Problems, But The Search Ain’t One

Filed under: ElasticSearch,Search Engines,Searching — Patrick Durusau @ 1:25 pm

99 Problems, But The Search Ain’t One

Slides and video from UK PHP presentation by Andrei Zmievski on ElasticSearch.

From the webpage:

ElasticSearch is the new kid on the search block. Built on top of Lucene and adhering to the best concepts of so-called NoSQL movement, ElasticSearch is a distributed, highly available, fast RESTful search engine, ready to be plugged into Web applications. Come to this session and learn how to set up, index, search, and tune ElasticSearch in less time than it takes to order a latte (disclaimer: at sufficiently busy central Starbucks locations. Side effects may include euphoria, stuff getting done, and extra time to spend with girlfriend).
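
If you want to try it before watching the talk, here is a minimal Python sketch of indexing one document and running a query over ElasticSearch’s REST API. It assumes a default node at localhost:9200 and the index/type URL layout current at the time; the index, type and field names are invented:

```python
# Minimal sketch: index one document and run a query via ElasticSearch's REST API.
# Assumes a default node at localhost:9200; index, type and field names are invented.
import json
import requests

base = "http://localhost:9200"

# Index a document (the "talks" index is created on the fly with default settings).
doc = {"title": "99 Problems, But The Search Ain't One", "speaker": "Andrei Zmievski"}
resp = requests.put(f"{base}/talks/talk/1", data=json.dumps(doc),
                    headers={"Content-Type": "application/json"})
print(resp.status_code, resp.json())

# Simple URI search against the title field.
resp = requests.get(f"{base}/talks/_search", params={"q": "title:search"})
for hit in resp.json().get("hits", {}).get("hits", []):
    print(hit.get("_score"), hit.get("_source", {}).get("title"))
```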

While I appreciate an optimistic (enthusiastic?) presentation and I like ElasticSearch, predictions of the end of searching problems are a bit premature. 😉

I commend the article to you but would note that the search problems addressed by topic maps, such as:

  1. Different identifications of the same subject
  2. Re-use of the same identifiers for different subjects
  3. Inability to reliably merge indexes from more than one source

All remain with ElasticSearch.

The Ideal Large Scale Learning Class

Filed under: Machine Learning — Patrick Durusau @ 1:24 pm

The Ideal Large Scale Learning Class

Interesting collection of topics with pointers to resources on different types of scaling.

Overview of Text Extraction Algorithms

Filed under: Data Mining,Text Extraction — Patrick Durusau @ 1:24 pm

Overview of Text Extraction Algorithms

Short review and pointers to posts by computer science student Tomaž Kovačič listing resources for text extraction.

If you are building topic maps based on text extraction from web pages in particular, well worth the time to take a look.

Next Generation of Apache Hadoop MapReduce – The Scheduler

Filed under: Hadoop,MapReduce — Patrick Durusau @ 1:23 pm

Next Generation of Apache Hadoop MapReduce – The Scheduler

From the post:

The previous post in this series covered the next generation of Apache Hadoop MapReduce in a broad sense, particularly its motivation, high-level architecture, goals, requirements, and aspects of its implementation.

In the second post in a series unpacking details of the implementation, we’d like to present the protocol for resource allocation and scheduling that drives application execution on a Next Generation Apache Hadoop MapReduce cluster.

See also: The Next Generation of Apache Hadoop MapReduce

The Next Generation of Apache Hadoop MapReduce

Filed under: Hadoop,MapReduce — Patrick Durusau @ 1:22 pm

The Next Generation of Apache Hadoop MapReduce

From the post:

In the Big Data business running fewer larger clusters is cheaper than running more small clusters. Larger clusters also process larger data sets and support more jobs and users.

The Apache Hadoop MapReduce framework has hit a scalability limit around 4,000 machines. We are developing the next generation of Apache Hadoop MapReduce that factors the framework into a generic resource scheduler and a per-job, user-defined component that manages the application execution. Since downtime is more expensive at scale high-availability is built-in from the beginning; as are security and multi-tenancy to support many users on the larger clusters. The new architecture will also increase innovation, agility and hardware utilization.

Start of an important series of posts on the next generation of Apache Hadoop MapReduce.
