Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

March 19, 2011

Bio4j

Filed under: Bioinformatics,Biomedical,Graphs,Neo4j — Patrick Durusau @ 6:10 pm

Bio4j

From the website:

Bio4j is a bioinformatics graph based DB including most data available in UniProt (SwissProt + Trembl), Gene Ontology (GO) and UniRef (50,90,100).

Bio4j provides a completely new and powerful framework for protein related information querying and management. Since it relies on a high-performance graph engine, data is stored in a way that semantically represents its own structure. On the contrary, traditional relational databases must flatten the data they represent into tables, creating “artificial” ids in order to connect the different tuples; which can in some cases eventually lead to domain models that have almost nothing to do with the actual structure of data.

I am particularly interested in the incorporate your own data feature:

New data sources and features will be added from time to time and what it’s more important, the Java API allows you to easily incorporate your own data to Bio4j so you can make the best out of it.
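A toy illustration of the graph point in the first quote, assuming nothing about Bio4j's actual Java API: relationships are stored as first-class edges, so traversal reads like the domain itself rather than a chain of join tables.

```python
# A toy sketch (not Bio4j's API) of the point above: in a graph model a
# protein's GO annotations are direct edges, where a relational model would
# route through join tables of artificial ids.

graph = {}  # node -> list of (relationship, node) edges

def relate(source, rel, target):
    graph.setdefault(source, []).append((rel, target))

relate("protein:P53_HUMAN", "ANNOTATED_WITH", "GO:0006915")  # apoptotic process
relate("protein:P53_HUMAN", "MEMBER_OF", "UniRef90_P04637")

# Traversal follows edges from the protein directly.
annotations = [t for rel, t in graph["protein:P53_HUMAN"]
               if rel == "ANNOTATED_WITH"]
print(annotations)
```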

Domain-Specific Languages:
An Annotated Bibliography

Filed under: Domain-Specific Languages,Java — Patrick Durusau @ 6:04 pm

Domain-Specific Languages: An Annotated Bibliography

Interesting but a decade old.

Anyone have a suggestion for a more recent bibliography of DSLs?

For Java I have seen: DSLs in Java from Pure Danger Tech, Alex Miller’s technical blog.

DSLs are important for two reasons:

  1. Their use in creating topic map authoring languages for particular domains.
  2. The use of topic maps to map between DSLs, topic map kind and otherwise.

From the topic map side of the house, any suggestion for subjects that would merit a DSL for authoring topic maps?

PS: To what extent would you include defaulting subjects for later insertion in the construction of a DSL for topic maps?

Such as entry of say baseball players in a baseball DSL defaults a player association with their last known team unless otherwise specified?
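A minimal sketch of what such defaulting might look like, with an entirely hypothetical Python API (a real authoring DSL would need much more than this):

```python
# Hypothetical baseball authoring DSL: entering a player defaults the
# player-team association to the last known team unless otherwise specified.

class BaseballMap:
    def __init__(self):
        self.associations = []  # (player, team) association pairs
        self.last_team = {}     # player -> last known team

    def player(self, name, team=None):
        if team is None:  # defaulted subject, inserted for the author
            team = self.last_team.get(name, "unknown")
        self.last_team[name] = team
        self.associations.append((name, team))
        return self  # allow chained entry

tm = BaseballMap()
tm.player("Hank Aaron", team="Braves")
tm.player("Hank Aaron")  # no team given: defaults to Braves
print(tm.associations)
```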

Think Stats: Probability and Statistics for Programmers

Filed under: Python,Statistics — Patrick Durusau @ 5:57 pm

Think Stats: Probability and Statistics for Programmers by Allen B. Downey

From the website:

Think Stats is an introduction to Probability and Statistics for Python programmers.

If you have basic skills in Python, you can use them to learn concepts in probability and statistics. This new book emphasizes simple techniques you can use to explore real data sets and answer interesting statistical questions.

Important not only for data set exploration, preparatory to building topic maps, but also for evaluating the statistics offered by others.

Recalling that figures don’t lie but all liars can figure.
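Not code from the book, just a small example in its spirit: compute your own summaries before trusting someone else's.

```python
# Exploring a (made-up) sample directly: when the mean and median disagree
# this much, a reported "average" deserves a closer look.
data = [2.1, 2.4, 2.4, 3.0, 3.1, 3.7, 9.9]  # one outlier

n = len(data)
mean = sum(data) / n
variance = sum((x - mean) ** 2 for x in data) / n
median = sorted(data)[n // 2]

print(f"mean={mean:.2f} median={median:.2f} var={variance:.2f}")
# The outlier pulls the mean well above the median, the kind of
# discrepancy worth checking before repeating a figure.
```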

HeyStaks launches: Social and Collaborative Web Search App

Filed under: Collaboration,Interface Research/Design,Searching — Patrick Durusau @ 5:54 pm

HeyStaks launches: Social and Collaborative Web Search App

Jeff Dalton’s preliminary notes on a new collaborative web application.

Implementing Replica Set Priorities

Filed under: MongoDB,NoSQL — Patrick Durusau @ 5:54 pm

Implementing Replica Set Priorities, Kristina Chodorow explains replica sets:

Replica set priorities will, very shortly, be allowed to vary between 0.0 and 100.0. The member with the highest priority that can reach a majority of the set will be elected master. (The change is done and works, but is being held up by 1.8.0… look for it after that release.) Implementing priorities was kind of an interesting problem, so I thought people might be interested in how it works. Following in the grand distributed system lit tradition I’m using the island nation of Replicos to demonstrate.

Should be of interest for anyone planning distributed topic map stores/distributions.
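The election rule she describes, highest priority that can reach a majority wins, can be sketched like this (an illustrative toy, not MongoDB's implementation):

```python
# Toy version of the rule: among members that can reach a majority of the
# set, the one with the highest priority is elected master.

def elect_master(priorities, reachable):
    """priorities: {member: float}; reachable: {member: set of other members}."""
    majority = len(priorities) // 2 + 1
    candidates = [m for m in priorities
                  if len(reachable[m] | {m}) >= majority]
    if not candidates:
        return None  # no member can reach a majority
    return max(candidates, key=lambda m: priorities[m])

priorities = {"a": 100.0, "b": 50.0, "c": 1.0}
reachable = {"a": set(), "b": {"c"}, "c": {"b"}}  # "a" is partitioned off
print(elect_master(priorities, reachable))  # "b" despite "a"'s higher priority
```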

Busses Come In Threes, Why Do Proofs Come In Two’s? – Post

Filed under: Dataset,Mathematics Indexing — Patrick Durusau @ 5:53 pm

Busses Come In Threes, Why Do Proofs Come In Two’s?

Dick Lipton, at Gödel’s Lost Letter explores:

Why do theorems get proved independently at the same time?

Jacques Hadamard and Charles-Jean de la Vallée Poussin, Neil Immerman and Robert Szelepcsenyi, Steve Cook and Leonid Levin, Georgy Egorychev and Dmitry Falikman, Sanjeev Arora and Joseph Mitchell, are pairs of great researchers. Each pair proved some wonderful theorem, yet they did this in each case independently and at almost the same time.

Interesting in its own right, but I mention it here to raise the issue of the use of topic maps to bridge the use of different nomenclatures.

Would bridging different nomenclatures increase the incidence of independent proofs of theorems?

Not all such proofs need be of famous theorems; there could be independent proofs of lesser theorems as well.

The 2010 Mathematics Subject Classification is no doubt very useful, but it is too crude to assist in the discovery of duplicate proofs, beyond suggesting general areas to look.

March 18, 2011

Complex Indexing?

Filed under: Indexing,Subject Identity,Topic Maps — Patrick Durusau @ 6:52 pm

The post The Joy of Indexing made me think about the original use case for topic maps, the merging of indexes prepared by different authors.

Indexing that relies on a token in the text (simple indexing), or even on a contextual clue (the compound indexing mentioned in the Joy of Indexing post), falls short in terms of enabling the merging of indexes.

Why?

In my comments on the Joy of Indexing I mentioned that what we need is a subject indexing engine.

That is, an engine that indexes the subjects that appear in a text and not merely the manner of their appearance.

(Jack Park, topic map advocate and my friend would say I am hand waving at this point so perhaps an example will help.)

Say that I have a text where I use the words George Washington.

That could be a reference to the first president of the United States or it could be a reference to George Washington rabbit (my wife is a children’s librarian).

A simple indexing engine could not distinguish one from the other.

A compound indexing engine might list one under Presidents and the other under Characters but without more in the example we don’t know for sure.

A complex indexing engine, that is one that took into account more than simply the token in the text, say that it created its entry from that token plus other attributes of the subject it represents, would not mistake a president for a rabbit or vice versa.

Take Lucene, for example. For any word in a text, it records:

The position increment, start, and end offsets and payload are the only additional metadata associated with the token that is recorded in the index.

That pretty much states the problem in a nutshell. If that is all the metadata we get, which isn’t much, the likelihood that we are going to do any reliable subject matching is pretty low.

Not to single Lucene out, I think all the search engines operate pretty much the same way.

To return to our example, what if while indexing, when we encounter George Washington, instead of the bare token we record, respectively:

George Washington – Class = Mammalia

George Washington – Class = Mammalia

Hmmm, that didn’t help much did it?

How about:

George Washington – Class = Mammalia Order = Primate

George Washington – Class = Mammalia Order = Lagomorpha

So that I can distinguish these two cases but can also ask for all instances of class = Mammalia.
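A toy version of such a complex indexing engine (attribute names and locations invented for the example):

```python
# Index entries carry subject attributes alongside the token, so the two
# George Washingtons stay distinct yet both answer a Class=Mammalia query.
from collections import defaultdict

index = defaultdict(list)  # (attribute, value) -> entries

def index_subject(token, location, **attributes):
    entry = {"token": token, "location": location, **attributes}
    for pair in attributes.items():
        index[pair].append(entry)

index_subject("George Washington", "page 1",
              Class="Mammalia", Order="Primates")
index_subject("George Washington", "page 9",
              Class="Mammalia", Order="Lagomorpha")

rabbits = index[("Order", "Lagomorpha")]  # just the rabbit
mammals = index[("Class", "Mammalia")]    # president and rabbit alike
print(len(rabbits), len(mammals))  # 1 2
```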

Of course the trick is that no automated system is likely to make that sort of judgement reliably, at least left to its own devices.

But it doesn’t have to does it?

Imagine that I am interested in U.S. history and want to prepare an index of the Continental Congress proceedings. I could simply create an index by tokens, but that will encounter all the problems we know come from merging indexes, or from searching across tokens as seen by such indexes. See Google for example.

But, what if I indexed the Continental Congress proceedings using more complex tokens? Ones that had multiple properties that could be indexed for one subject and that could exist in relationship to other subjects?

That is, for some body of material, I declare the subjects that will be identified and what will be known about them post-identification.

A declarative model of subject identity. (There are other, equally legitimate, models of identity, that I will be covering separately.)

More on the declarative model anon.

Topic Maps with Django – Post

Filed under: Django,TMAPI — Patrick Durusau @ 6:51 pm

Topic Maps with Django

From the post:

As part of a rewrite of the Entity Authority Tool Set, I have written an implementation of the Topic Maps API in Django, cunningly titled TMAPI in Django.

As it stands, it passes its 288 unit tests, but has no UI, since I only need it for the internal use of EATS. It would of course be useful to have both a visualisation and an editing interface added to it, but I won’t be doing it any time soon without some inducement.

Anyone willing to assist?

Learning Data Science Skills

Filed under: Data Mining — Patrick Durusau @ 6:51 pm

Learning Data Science Skills

Christopher Bare has a useful collection of links to resources for wannabe data scientists.

Interested to know what tools, tutorials, etc. that you have found to be the most helpful.

MADLib

Filed under: Analytics,Machine Learning — Patrick Durusau @ 6:50 pm

MADLib

From the website:

MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine learning methods for structured and unstructured data.

The MADlib mission: to foster widespread development of scalable analytic skills, by harnessing efforts from commercial practice, academic research, and open-source development.

Targeted at PostgreSQL and Greenplum.

A Markov Chain in TiKz – Post

Filed under: Graphs — Patrick Durusau @ 6:50 pm

A Markov Chain in TiKz

From the post:

When you want to draw a Markov Chain (or any other graph-theoretical structure) you can toil around with various drawing programs, or you can simply turn to PGF/TiKz, the LaTeX drawing environment. Since true scientists publish their work in LaTeX, this immediately eases the work required for your next paper.

(You should also consider Graphviz.)

March 17, 2011

The Joy of Indexing

Filed under: Indexing,MongoDB,NoSQL,Subject Identity — Patrick Durusau @ 6:52 pm

The Joy of Indexing

Interesting blog post on indexing by Kyle Banker of MongoDB.

Recommended in part for understanding the limits of traditional indexing.

Ask yourself, what is the index in Kyle’s examples indexing?

Kyle says the examples are indexing recipes but is that really true?

Or is it the case that the index is indexing the occurrence of a string at a location in the text?

Not exactly the same thing.

That is to say there is a difference between a token that appears in a text and a subject we think about when we see that token.

It is what enables us to say that two or more words that are spelled differently are synonyms.

Something other than the two words as strings is what we are relying on to make the claim they are synonyms.

A traditional indexing engine, of the sort described here, can only index the strings it encounters in the text.

What would be more useful would be an indexing engine that indexed the subjects in a text.

I think we would call such a subject-indexing engine a topic map engine. Yes?

Questions:

  1. Do you agree/disagree that a word indexing engine is not a subject indexing engine? (3-5 pages, no citations)
  2. What would you change about a word indexing engine (if anything) to make it a subject indexing engine? (3-5 pages, no citations)
  3. What texts/subjects would you use as test cases for your engine? (3-5 pages, citations of the test documents)

Graph-based Clustering for Computational Linguistics: A Survey

Filed under: Computational Linguistics,Graphs — Patrick Durusau @ 6:51 pm

Graph-based Clustering for Computational Linguistics: A Survey

Slides by Zheng Chen and Heng Ji, City University of New York, July 2010.

A very concise summary of graph methods with citations to the literature.

You won’t be able to run off and become a hairy-chested graph warrior with these slides but you will have a better idea of why graphs are important.

MySQL 5.5 Released

Filed under: MySQL,SQL — Patrick Durusau @ 6:49 pm

MySQL 5.5 Released

Performance gains for MySQL 5.5, from the release:

In recent benchmarks, the MySQL 5.5 release candidate delivered significant performance improvements compared to MySQL 5.1. Results included:

  • On Windows: Up to 1,500 percent performance gains for Read/Write operations and up to 500 percent gain for Read Only.(1)
  • On Linux: Up to 360 percent performance gain in Read/Write operations and up to 200 percent improvement in Read Only.(2)

If you are using MySQL as a backend for your topic map application, these and other improvements will be welcome news.

MongoDB 1.8 Released

Filed under: MongoDB,NoSQL — Patrick Durusau @ 6:49 pm

MongoDB 1.8 Released

Highlights from the announcement:

Cassandra – London Podcasts

Filed under: Cassandra,NoSQL — Patrick Durusau @ 6:48 pm

Cassandra – London Podcasts

Podcasts from the London Cassandra User Group.

  • Cassandra – Thrift Application, Jools Enticknap: 21 February 2011

  • Cassandra in TWEETMEME, Nick Telford: 21 February 2011

  • Cassandra Meetup: 17th Jan 2011

  • Cassandra London Meetup, Jake Luciani: 8th Dec 2010

March 16, 2011

Topic Map Server Language?

Filed under: Erlang,Topic Map Software — Patrick Durusau @ 3:21 pm

I ran across a comparison of the Apache web server and Yaws the other day.

Apache vs. Yaws

I haven’t given web servers much thought and had someone asked (apologies to friends at MS) I would have said Apache, just by default.

It is what I remember from web work when I was concerned about that sort of thing.

Anyway, I am looking at this comparison and Apache falls over on its side at about 8,000 concurrent sessions and Yaws is humming along at 80,000.

That’s quite a difference.

Enough of a difference that Erlang, the language in which Yaws was written, should be seriously considered for topic map engines.

Legendary Plots

Filed under: Marketing,Topic Maps — Patrick Durusau @ 3:19 pm

Legendary Plots

Mostly focused on R but I have included it here because of the discussion of the legend.

Particularly improving the focus of the information presented.

I rarely want all the information about a subject.

I just want the information that is helpful in my particular context.

A topic map may hold far more information than it ever displays to me.

And only display some small part of that information.

Otherwise it is like drinking from a news sewer (insert your favorite example) during a disaster.

Lots of information, little of it makes any sense.

Hank: A Fast, Open-Source, Batch-Updatable, Distributed Key-Value Store

Filed under: Hank,Key-Value Stores,NoSQL — Patrick Durusau @ 3:17 pm

Hank: A Fast, Open-Source, Batch-Updatable, Distributed Key-Value Store

From the RapLeaf blog:

We’re really excited to announce the open-source debut of a cool piece of Rapleaf’s internal infrastructure, a distributed database project we call Hank.

Our use case is very particular: we have tons of data that needs to get processed, producing a lot of data points for individual people, which then need to be made randomly accessible so they can be served through our API. You can think of it as the “process and publish” pattern.

For the processing component, Hadoop and Cascading were an obvious choice. However, making our results randomly accessible for the API was more challenging. We couldn’t find an existing solution that was fast, scalable, and perhaps most importantly, wouldn’t degrade performance during updates. Our API needs to have lightning-fast responses so that our customers can use it in realtime to personalize their users’ experiences, and it’s just not acceptable for us to have periods where reads contend with writes while we’re updating.

Requirements:

  1. Random reads need to be fast – reliably on the order of a few milliseconds.
  2. Datastores need to scale to terabytes, with keys and values on the order of kilobytes.
  3. We need to be able to push out hundreds of millions of updates a day, but they don’t have to happen in realtime. Most will come from our Hadoop cluster.
  4. Read performance should not suffer while updates are in progress.

Non-requirements

  1. During the update process, it doesn’t matter if there is more than one version of our datastores available. Our application is tolerant of this inconsistency.
  2. We have no need for random writes.
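The requirements and non-requirements above suggest a batch-swap design; here is a sketch of that pattern (an assumed design for illustration, not Hank's actual code):

```python
# "Process and publish": batch updates build a fresh version, then readers
# are switched atomically, so reads never contend with in-progress writes.

class BatchSwapStore:
    def __init__(self):
        self._live = {}  # version currently served to readers

    def get(self, key):
        return self._live.get(key)  # random reads touch the live version only

    def publish(self, batch):
        next_version = dict(self._live)  # build the next version on the side
        next_version.update(batch)       # e.g. output of a Hadoop job
        self._live = next_version        # single swap; no partial update seen

store = BatchSwapStore()
store.publish({"user:1": "profile-v1"})
store.publish({"user:1": "profile-v2", "user:2": "profile"})
print(store.get("user:1"))  # profile-v2
```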

If you read updates as merges then the relevance of this posting to topic maps becomes a bit clearer. 😉

Not all topic map systems will have the same requirements and non-requirements.

(This resource pointed out to me by Jack Park.)

Data Integration: Moving Beyond ETL

Filed under: Data Governance,Data Integration,Marketing — Patrick Durusau @ 3:16 pm

Data Integration: Moving Beyond ETL

A sponsored white-paper by DataFlux, www.dataflux.com.

Where ETL = Extract Transform Load

Many of the arguments made in this paper fit quite easily with topic map solutions.

DataFlux appears to be selling data governance based solutions, although it appears to take an evolutionary approach to implementing such solutions.

It occurs to me that topic maps could be one stage in the documentation and evolution of data governance solutions.

High marks for a white paper that doesn’t claim IT salvation from a particular approach.

KNIME – 4th Annual User Group Meeting

Filed under: Data Analysis,Heterogeneous Data,Mapping,Subject Identity — Patrick Durusau @ 3:14 pm

KNIME – 4th Annual User Group Meeting

From the website:

The 4th KNIME Workshop and Users Meeting at Technopark in Zurich, Switzerland took place between February 28th and March 4th, 2011 and was a huge success.

The meeting was very well attended by more than 130 participants. The presentations ranged from customer intelligence and applications of KNIME in soil and fuel research through to high performance data analytics and KNIME applications in the Life Science industry. The second meeting of the special interest group attracted more than 50 attendees and was filled with talks about how KNIME can be put to use in this fast growing research area.

Presentations are available.

A new version of KNIME is available for download with the features listed in ChangeLog 2.3.3.

Focused on data analytics and work flow, another software package that could benefit from an interchangeable subject-oriented approach.

Latent Dirichlet Allocation in C

Filed under: Latent Dirichlet Allocation (LDA) — Patrick Durusau @ 3:13 pm

Latent Dirichlet Allocation in C

From the website:

This is a C implementation of variational EM for latent Dirichlet allocation (LDA), a topic model for text or other discrete data. LDA allows you to analyze of corpus, and extract the topics that combined to form its documents. For example, click here to see the topics estimated from a small corpus of Associated Press documents. LDA is fully described in Blei et al. (2003) .

This code contains:

  • an implementation of variational inference for the per-document topic proportions and per-word topic assignments
  • a variational EM procedure for estimating the topics and exchangeable Dirichlet hyperparameter

Do be aware that the use of topic in this technique and papers discussing it is not the same thing as topic as defined by ISO 13250-2.

It comes closer to the notion of subject as defined in ISO 13250-2.
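To make the vocabulary point concrete, here is a minimal collapsed Gibbs sampler for LDA (an illustrative Python sketch, not the C implementation above; the corpus and parameters are invented). The "topics" it recovers are word distributions, themes rather than topics in the ISO 13250-2 sense.

```python
# Minimal collapsed Gibbs sampling for LDA on a four-document toy corpus.
import random
random.seed(0)

docs = [["ball", "game", "team"], ["vote", "law", "senate"],
        ["game", "team", "win"], ["law", "vote", "court"]]
vocab = sorted({w for d in docs for w in d})
K, alpha, beta = 2, 0.1, 0.01
V = len(vocab)

# Random initial topic assignment for every token, plus the count tables.
z = [[random.randrange(K) for _ in d] for d in docs]
ndk = [[0] * K for _ in docs]                  # doc -> topic counts
nkw = [{w: 0 for w in vocab} for _ in range(K)]  # topic -> word counts
nk = [0] * K                                   # topic totals
for di, d in enumerate(docs):
    for wi, w in enumerate(d):
        t = z[di][wi]
        ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1

for _ in range(200):  # Gibbs sweeps
    for di, d in enumerate(docs):
        for wi, w in enumerate(d):
            t = z[di][wi]
            ndk[di][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
            weights = [(ndk[di][k] + alpha) * (nkw[k][w] + beta) / (nk[k] + V * beta)
                       for k in range(K)]
            t = random.choices(range(K), weights=weights)[0]
            z[di][wi] = t
            ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1

for k in range(K):
    top = sorted(vocab, key=lambda w: -nkw[k][w])[:3]
    print(k, top)  # each "topic" is just a ranking of words
```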

Update:

I was sent a pointer to David M. Blei’s
http://www.cs.princeton.edu/~blei/topicmodeling.html, which has more code and other goodies.

March 15, 2011

WeatherSpark

Filed under: Dataset,Mashups — Patrick Durusau @ 5:33 am

WeatherSpark

Courtesy of Flowingdata.com, the WeatherSpark site is a graphic and historical representation of weather conditions.

From the site:

WeatherSpark is a new type of weather website, with interactive weather graphs that allow you to pan and zoom through the entire history of any weather station on earth.

Get multiple forecasts for the current location, overlaid on records and averages to put it all in context.

Unlike some mashups, it is fairly apparent what is being used as a binding point, which would make re-use of this data easier.

For example, if I were looking for weak points in a transportation system, I would take the traffic accident/delay records and then map them against the weather records from this site.

Thereby enabling predictions of when and where disruptive activity would have the greatest multiplier effect from natural weather conditions, time of day, etc.
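The join itself is simple once the binding point is explicit; a sketch with made-up data:

```python
# Join accident records with historical weather on the shared binding point
# (location, date); all data invented for illustration.

weather = {("ATL", "2011-01-10"): "ice storm",
           ("ATL", "2011-01-11"): "clear"}
accidents = [("ATL", "2011-01-10", "I-285 pileup"),
             ("ATL", "2011-01-11", "fender bender")]

joined = [(loc, date, event, weather.get((loc, date)))
          for loc, date, event in accidents]
for row in joined:
    print(row)  # each disruption paired with its weather conditions
```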

2010 Data Miner Survey

Filed under: Data Mining,News — Patrick Durusau @ 5:29 am

2010 Data Miner Survey

R comes out as the #1 tool which is how I heard about the survey.

I have requested a copy, mostly so I can see what other tools are reported as used by data miners.

Topic maps start with the discovery of data that becomes part of or subject to a topic map and end with the delivery of data to a user.

They are data, end to end.

Couchbase Techzone

Filed under: Couchbase,CouchDB,Membase — Patrick Durusau @ 5:16 am

Couchbase Techzone

Along with the launch of Couchbase, the technical zone was also unveiled.

It has all the usual things one expects, albeit with a cleaner design than I am accustomed to seeing for such projects.

This is going to sound silly but I read a lot of documentation and my favorite part of the documentation pages was:

Any questions or issues with the documentation should be directed to the Techzone Editor.

Where Techzone editor was a mailto: link.

Not that I had a problem but if I did, I would not have to hunt for 20 minutes for a buried link or form for submission of a comment.

I haven’t started playing with the software but that sort of consideration for users/developers is likely to take Couchbase a long way.

Now, I need to find an issue to see if they answer email sent to that address. 😉 (Just teasing.)

Couchbase

Filed under: CouchDB,Membase,NoSQL — Patrick Durusau @ 5:15 am

Couchbase

From the website:

Couchbase Server is powered by Apache CouchDB, the industry’s most advanced and widely deployed document database technology. It boasts many advanced NoSQL capabilities, such as the ability to execute complex queries, to maintain indices and to store data with ACID transaction semantics. Plus it incorporates geospatial indexing so developers can easily create location-aware applications. Couchbase Server provides an exceptionally flexible data management platform, offering the rich data management operations that developers expect from their database.

Couchbase Server is simple.

  • Flexible Views and Querying. Built-in javascript-based map/reduce indexing engine is a powerful way to analyze and query your data.
  • Schemaless Data Repository. Couchbase document model is a perfect fit for web applications, providing significant data flexibility.
  • Geo-spatial Indexing. Built-in GeoCouch lets developers easily create location-aware apps.

Couchbase Server is fast.

  • Durable Speed Without Compromising Safety. You get safety and speed with our architecture, no compromises.
  • Indexing. Rapidly retrieve data in any format you demand, across clusters.

Couchbase Server is elastic.

  • Peer-to-Peer Replication. Unmatched peer-based replication capabilities, each replica allowing full queries, updates and additions.
  • Mobile Synchronization. Couchbase is ported to popular mobile devices and because it doesn’t depend on a constant Internet connection, users can access their data anytime, anywhere.

Hammurabi

Filed under: Domain-Specific Languages,Scala — Patrick Durusau @ 5:14 am

Hammurabi

From the website:

Hammurabi is a rule engine written in Scala that tries to leverage the features of this language making it particularly suitable to implement extremely readable internal Domain Specific Languages. Indeed, what actually makes Hammurabi different from all other rule engines is that it is possible to write and compile its rules directly in the host language. Anyway the Hammurabi’s rules also have the important property of being readable even by non technical person. As usual a practical example worth more than a thousand words.
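The internal-DSL idea translates to any host language; here is a minimal Python analogue (nothing like Hammurabi's actual Scala API, just the shape of the idea):

```python
# Minimal rule engine sketch: rules are written directly in the host
# language and fired until no rule changes the facts (a fixed point).

class Rule:
    def __init__(self, name, when, then):
        self.name, self.when, self.then = name, when, then

def run(rules, facts):
    changed = True
    while changed:
        changed = False
        for rule in rules:
            if rule.when(facts) and rule.then(facts):
                changed = True  # a rule fired; re-evaluate all rules
    return facts

rules = [
    Rule("adult",
         when=lambda f: f["age"] >= 18 and not f.get("adult"),
         then=lambda f: f.update(adult=True) or True),
]
print(run(rules, {"age": 30}))  # {'age': 30, 'adult': True}
```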

I have to admit that my heart leaped at seeing a name from Ancient Near Eastern studies!

Then to discover it was for a rule engine written in Scala.

Well, it still looks quite interesting, even if not a ready-for-prime-time project.

Not for any time soon, but it would be interesting to write a set of rules in Akkadian for use in constructing a topic map of Akkadian grammar.

That would be way cool.

And a nice way to brush up on my Akkadian.

Which I must admit has gotten rusty as I have worked on technical standards far afield from ancient language studies.

Era of the Interest Graph

Filed under: Graphs,Time,Topic Maps,Versioning — Patrick Durusau @ 5:11 am

Era of the Interest Graph

From the blog:

Social media is maturing as are the people embracing its most engaging tools and networks. Perhaps most notably, is the maturation of relationships and how we are expanding our horizons when it comes to connecting to one another. What started as the social graph, the network of people we knew and connected to in social networks, is now spawning new branches that resemble how we interact in real life.

This is the era of the interest graph – the expansion and contraction of social networks around common interests and events. Interest graphs represent a potential goldmine for brands seeking insight and inspiration to design more meaningful products and services as well as new marketing campaigns that better target potential stakeholders.

While many companies are learning to listen to the conversations related to their brands and competitors, many are simply documenting activity and mentions as a reporting function and in some cases, as part of conversational workflow. However, there’s more to Twitter intelligence than tracking conversations.

We’re now looking beyond the social graph as we move into focused networks that share more than just a relationship.

What struck me about this post was the sense that the graph was a non-stable construct.

Whereas most of the topic maps I have seen are not only stable, but their subjects are as well.

Which is fine for some areas of information, but not all.

A dynamic topic map seems to have different requirements than one that is a fixed editorial product, or at least it seems so to me.

Rather than versioning, for example, a dynamic topic map should have a tracking mechanism to show what information was available at any point in time.

So that say a physician relying upon a dynamic topic map for drug warning information can establish that a warning was or was not available at the time he prescribed a medication.
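Such a tracking mechanism might look like this sketch: record when each piece of information became available, then ask what was knowable at a given time (the drug warning is invented for the example).

```python
# Track when each piece of information entered the map, so "what was
# available at time T" is answerable after the fact.
import bisect

class TrackedValue:
    def __init__(self):
        self._history = []  # (recorded_at, value), kept sorted by time

    def record(self, recorded_at, value):
        bisect.insort(self._history, (recorded_at, value))

    def as_of(self, when):
        # Latest entry recorded on or before `when`; None if nothing yet.
        i = bisect.bisect_right(self._history, (when, chr(0x10FFFF)))
        return self._history[i - 1][1] if i else None

warning = TrackedValue()
warning.record("2010-06-01", "no known interactions")
warning.record("2011-02-15", "warning: interacts with warfarin")

print(warning.as_of("2010-12-01"))  # no known interactions
print(warning.as_of("2011-03-01"))  # warning: interacts with warfarin
```

The physician's question then becomes a lookup: what did the map say on the date the prescription was written?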

Oh, that’s not commonly possible even with static topic maps is it?

Hmmm, will have to give some thought to that issue.

It may just be the maps I have looked at but there is a timeless nature to them.

Much like governments, whatever is the case has always been the case. And if you remember differently, well, you are just wrong. If not subversive.

RDF – Gravity

Filed under: Graphs,RDF — Patrick Durusau @ 5:10 am

RDF – Gravity

From the website:

RDF Gravity is a tool for visualising RDF/OWL Graphs/ ontologies.

Its main features are:

  • Graph Visualization
  • Global and Local Filters (enabling specific views on a graph)
  • Full text Search
  • Generating views from RDQL Queries
  • Visualising multiple RDF files

RDF Gravity is implemented by using the JUNG Graph API and Jena semantic web toolkit.

Truly stunning work.

Too bad that RDF will never progress beyond simple indexing to complex and interchangeable indexing.

I say that because, so long as Tim Berners-Lee clings to the notion of new names (URLs) as overcoming the problems with old names (anything else), RDF is unlikely to improve.

If an RDF could identify a subject using multiple properties and then inform others of that complex identification, then at least there would be an opportunity to either agree or disagree with an identification.

As it is now, who knows what anyone is identifying with a URL?

Your guess is as good as mine.

So if I were to say that “http://semweb.salzburgresearch.at/apps/rdf-gravity/index.html” is truly stunning work, do I mean the software? The website? Something I saw at the website?

If that sounds trivial, imagine the same situation and the URL is a pointer to a procedure for the coolant system on a nuclear reactor. Not quite so trivial is it?

Best to know what we are talking about in most situations.
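What identification by multiple properties might look like, sketched in Python (the property names are invented): the basis for saying two references are the same subject is explicit, so others can agree or disagree with it.

```python
# Identify subjects by declared property sets rather than a bare URL: two
# references merge only when their identifying properties agree, and the
# basis for the identification is open to inspection.

def same_subject(a, b, identifying_keys):
    """a, b: property dicts; compare only the declared identifying keys."""
    return all(a.get(k) is not None and a.get(k) == b.get(k)
               for k in identifying_keys)

ref1 = {"name": "RDF Gravity",
        "homepage": "http://semweb.salzburgresearch.at/apps/rdf-gravity/index.html",
        "kind": "software"}
ref2 = {"name": "RDF Gravity", "kind": "software"}
ref3 = {"name": "RDF Gravity", "kind": "website"}

print(same_subject(ref1, ref2, ["name", "kind"]))  # True: same subject
print(same_subject(ref1, ref3, ["name", "kind"]))  # False: software vs website
```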
