Another Word For It – Patrick Durusau on Topic Maps and Semantic Diversity

February 12, 2011

Pattern recognition and machine learning

Filed under: Machine Learning,Pattern Recognition — Patrick Durusau @ 5:26 pm

Pattern recognition and machine learning by Christopher M. Bishop was mentioned in Which Automatic Differentiation Tool for C/C++?, a post by Bob Carpenter.

I ran across another reference to it today that took me to a page with exercise solutions, corrections and other materials that will be of interest if you are using the book for a class or self-study.

See: PRML: Pattern Recognition and Machine Learning

I was impressed enough by the materials to go ahead and order a copy of it.

It is fairly long and I have to start up a blog on ODF (Open Document Format), so don’t expect a detailed summary any time soon.

BigCouch

Filed under: BigCouch,Clustering (servers),CouchDB,NoSQL — Patrick Durusau @ 5:25 pm

BigCouch 0.3 release.

From the website:

BigCouch is our open-source flavor of CouchDB with built-in clustering capability.

The main difference between BigCouch and standalone Couch is the inclusion of an OTP application that ‘clusters’ CouchDB across multiple servers.

For now, BigCouch is a stand-alone fork of CouchDB. In the future, we believe (and hope!) that many of the upgrades we’ve made will be incorporated back into CouchDB proper.

Many worthwhile topic map applications can be written without clustering, but “clustering” is one of those buzz words to include in your response to an RFP, grant proposal, etc.

It is good to have some background on what clustering means/requires in general, and beating on several of the clustering solutions will develop that background.

Not to mention that you will know when it makes sense to actually implement clustering.

Inconsistency?

Filed under: Conferences,Heterogeneous Data,Semantic Diversity,Semantics — Patrick Durusau @ 5:23 pm

Managing and Reasoning in the Presence of Inconsistency

The International Journal of Semantic Computing describes this Call for Papers as follows:

Inconsistency is ubiquitous in the real world, in human behaviors, and in the computing systems we build. Inconsistency manifests itself in a plethora of phenomena at different levels in the depth of knowledge, ranging from data, information, knowledge, meta-knowledge, to expertise. Data inconsistency arises when patterns in data do not conform to an established range, distribution or interpretation. The exponentially growing volumes of data stemming from almost all types of data being created in digital form, a proliferation of sensors and sensor networks, and other sources such as social networks, complex computer simulations, space explorations, and high-resolution imagery and video, have made data inconsistency an inevitability. Information inconsistency occurs when meanings of the same data values become conflicting or when the same attribute for an entity has different data values. Knowledge inconsistency happens when propositions of either declarative or procedural beliefs, in either explicit or tacit form, yield antagonistic outcomes for the same circumstance. Inconsistency can also emerge from meta-knowledge or from expertise. How to manage and reason in the presence of inconsistency in computing systems is a very important issue in semantic computing, social computing, and other data-rich or knowledge-rich computing paradigms. It requires that we understand the causes and circumstances of inconsistency, establish proper metrics for inconsistency, adopt formalisms to represent inconsistency, develop ways to recognize and analyze different types of inconsistency, and devise mechanisms and methodologies to manage and handle inconsistency.

Refreshing in that inconsistency is recognized as an omnipresent and everlasting fact of our environments. Including computing environments.

The phrase, “…establish proper metrics for inconsistency,…” betrays a world view that we can stand outside of our inconsistencies and those of others.

For all the useful work that will appear in this volume (and others like it), there is no place to stand outside of our environments and their inconsistencies.

Important Dates
Submission deadline: May 20, 2011
Review result notification: July 20, 2011
Revision due: August 20, 2011
Final version due: August 31, 2011
Tentative date of publication: September, 2011 (Vol.5, No.3)

February 11, 2011

MILK: Machine Learning in Python

Filed under: Natural Language Processing,Software — Patrick Durusau @ 1:12 pm

MILK: Machine Learning in Python

From the website:

Milk is a machine learning toolkit in Python.

Its focus is on supervised classification with several classifiers available: SVMs (based on libsvm), k-NN, random forests, decision trees. It also performs feature selection. These classifiers can be combined in many ways to form different classification systems.

For unsupervised learning, milk supports k-means clustering and affinity propagation.

Milk is flexible about its inputs. It is optimised for numpy arrays, but can often handle anything (for example, for SVMs, you can use any datatype and any kernel and it does the right thing).

There is a strong emphasis on speed and low memory usage. Therefore, most of the performance sensitive code is in C++. This is behind Python-based interfaces for convenience.
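Here is a minimal sketch of what supervised classification with milk looks like. The data below is invented and the calls follow milk’s documented defaultclassifier interface, so treat it as a sketch rather than a tested recipe:

    # A minimal sketch of supervised classification with milk.
    # Assumes milk and numpy are installed; the data here is invented.
    import numpy as np
    import milk

    # Toy data: 100 samples, 10 features each, two classes.
    features = np.random.rand(100, 10)
    labels = features[:, 0] > 0.5  # class depends on the first feature

    learner = milk.defaultclassifier()   # default pipeline (SVM-based)
    model = learner.train(features, labels)

    # Classify a new example.
    example = np.random.rand(10)
    print(model.apply(example))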

Another machine learning tool for your topic map construction toolkit.

I need to work on creating a listing for such tools by features and capacity, to make it easier to find the tool necessary for some particular project.

Million Song Dataset

Filed under: Dataset — Patrick Durusau @ 1:01 pm

Million Song Dataset.

Yes, one million song dataset.

A 280 GB dataset. Site suggests you ask someone you know if they already have a copy. Not your average music download.

Amendment: There is no music included in this download. My reference to music download was sarcasm.

From the website:

The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks.

Its purposes are:

  • To encourage research on algorithms that scale to commercial sizes
  • To provide a reference dataset for evaluating research
  • As a shortcut alternative to creating a large dataset with The Echo Nest’s API
  • To help new researchers get started in the MIR field

The core of the dataset is the feature analysis and metadata for one million songs, provided by The Echo Nest. The dataset does not include any audio, only the derived features. Note, however, that sample audio can be fetched from services like 7digital, using code we provide.

The Million Song Dataset is a collaborative project between The Echo Nest and LabROSA. It is supported in part by the NSF.

Two things to notice:

  1. Not a small data set (remember the post about dealing with data?)
  2. National Science Foundation funding on #1.

Note the combination: big data + funding. Nuff said?

Dealing with Data

Filed under: Data Analysis,Data Mining,Marketing — Patrick Durusau @ 12:45 pm

Dealing with Data

From the website:

In the 11 February 2011 issue, Science joins with colleagues from Science Signaling, Science Translational Medicine, and Science Careers to provide a broad look at the issues surrounding the increasingly huge influx of research data. This collection of articles highlights both the challenges posed by the data deluge and the opportunities that can be realized if we can better organize and access the data.

Science is making access to this entire collection FREE (simple registration is required for non-subscribers).

The growing concern over the influx of data represents a golden marketing opportunity for topic maps!

First, the predictions about increasing amounts of data are coming true.

That means impressive numbers to cite and even more impressive predictions about the future.

Second, the coming data deluge represents a range of commercial opportunities.

Opportunities for reuse, comparison, and mining of such data abound. And they only increase as more data comes online.

Are you going to be the Facebook of some data area?

Third, and the reason unique to topic maps:

The format that contains data is recognized as composed of subjects.

Subjects that can be identified, placed in associations, and have properties added to them.

That one insight is critical to re-use, combination and comparison of data in the data deluge.

If you identify the subjects that compose those structures, as well as the subjects thought to be recognized by those data structures, you can then create maps between diverse data sets.

It is the identification of subjects that enables the creation and interchange of maps of where to swim in this vast sea of data.

*****
PS: I am going to take a slow walk through these articles and will be posting about opportunities that I see for topic maps. Your comments/feedback welcome!

Bayesian identity resolution – Post

Filed under: Bayesian Models,Duplicates — Patrick Durusau @ 9:05 am

Bayesian identity resolution

Lars Marius Garshol walks through finding duplicate records in a data set.

As Lars notes, there are commercial products for this same task but I think this is a useful exercise.
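The core idea is easy to sketch. What follows is my own toy version of the Bayesian combination step, not Lars’s code: each field comparison yields a probability that two records denote the same entity, and the per-field probabilities are combined:

    # Toy Bayesian record comparison (my sketch, not Lars's code).
    def combine(probs):
        """Combine per-field match probabilities, naive Bayes style."""
        p_same, p_diff = 1.0, 1.0
        for p in probs:
            p_same *= p
            p_diff *= (1.0 - p)
        return p_same / (p_same + p_diff)

    def compare(rec1, rec2, weights):
        """weights maps field name -> probability of identity given a match."""
        probs = []
        for field, w in weights.items():
            if rec1.get(field) and rec1.get(field) == rec2.get(field):
                probs.append(w)        # exact match: evidence for identity
            else:
                probs.append(1.0 - w)  # mismatch: evidence against
        return combine(probs)

    weights = {"name": 0.9, "email": 0.95, "city": 0.6}
    a = {"name": "J. Smith", "email": "js@example.com", "city": "Oslo"}
    b = {"name": "J. Smith", "email": "js@example.com", "city": "Bergen"}
    print(compare(a, b, weights))  # high probability despite the city mismatch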

It isn’t that hard to imagine creating test data sets with a variety of conditions to underscore lessons about detecting duplicate records.

I suspect such training data may already be available.

Will have to see what I can find and post about it.

*****
PS: Lars is primary editor of the TMDM, working on TMCL and several other parts of the topic maps standard.

Sowa on Watson

Filed under: Cyc,Ontology,Semantic Web,Subject Identifiers,Subject Identity — Patrick Durusau @ 6:43 am

John Sowa’s posting on Watson merits reproduction in its entirety (lite editing to make it format for easy reading):

Peter,

Thanks for the reminder:

Dave Ferrucci gave a talk on UIMA (the Unstructured Information Management Architecture) back in May-2006, entitled: “Putting the Semantics in the Semantic Web: An overview of UIMA and its role in Accelerating the Semantic Revolution”

I recommend that readers compare Ferrucci’s talk about UIMA in 2006 with his talk about the Watson system and Jeopardy in 2011. In less than 5 years, they built Watson on the UIMA foundation, which contained a reasonable amount of NLP tools, a modest ontology, and some useful tools for knowledge acquisition. During that time, they added quite a bit of machine learning, reasoning, statistics, and heuristics. But most of all, they added terabytes of documents.

For the record, following are Ferrucci’s slides from 2006:

http://ontolog.cim3.net/file/resource/presentation/DavidFerrucci_20060511/UIMA-SemanticWeb–DavidFerrucci_20060511.pdf

Following is the talk that explains the slides:

http://ontolog.cim3.net/file/resource/presentation/DavidFerrucci_20060511/UIMA-SemanticWeb–DavidFerrucci_20060511_Recording-2914992-460237.mp3

And following is his recent talk about the DeepQA project for building and extending that foundation for Jeopardy:

http://www-943.ibm.com/innovation/us/watson/watson-for-a-smarter-planet/building-a-jeopardy-champion/how-watson-works.html

Compared to Ferrucci’s talks, the PBS Nova program was a disappointment. It didn’t get into any technical detail, but it did have a few cameo appearances from AI researchers. Terry Winograd and Pat Winston, for example, said that the problem of language understanding is hard.

But I thought that Marvin Minsky and Doug Lenat said more with their tone of voice than with their words. My interpretation (which could, of course, be wrong) is that both of them were seething with jealousy that IBM built a system that was competing with Jeopardy champions on national TV — and without their help.

In any case, the Watson project shows that terabytes of documents are far more important for commonsense reasoning than the millions of formal axioms in Cyc. That does not mean that the Cyc ontology is useless, but it undermines the original assumptions for the Cyc project: commonsense reasoning requires a huge knowledge base of hand-coded axioms together with a powerful inference engine.

An important observation by Ferrucci: The URIs of the Semantic Web are *not* useful for processing natural languages — not for ordinary documents, not for scientific documents, and especially not for Jeopardy questions:

1. For scientific documents, words like ‘H2O’ are excellent URIs. Adding an http address in front of them is pointless.

2. A word like ‘water’, which is sometimes a synonym for ‘H2O’, has an open-ended number of senses and microsenses.

3. Even if every microsense could be precisely defined and cataloged on the WWW, that wouldn’t help determine which one is appropriate for any particular context.

4. Any attempt to force human being(s) to specify or select a precise sense cannot succeed unless *every* human understands and consistently selects the correct sense at *every* possible occasion.

5. Given that point #4 is impossible to enforce and dangerous to assume, any software that uses URIs will have to verify that the selected sense is appropriate to the context.

6. Therefore, URIs found “in the wild” on the WWW can never be assumed to be correct unless they have been guaranteed to be correct by a trusted source.

These points taken together imply that annotations on documents can’t be trusted unless (a) they have been generated by your own system or (b) they were generated by a system which is at least as trustworthy as your own and which has been verified to be 100% compatible with yours.

In summary, the underlying assumptions for both Cyc and the Semantic Web need to be reconsidered.

You can see the post at: http://ontolog.cim3.net/forum/ontolog-forum/2011-02/msg00114.html

I don’t always agree with Sowa but he has written extensively on conceptual graphs, knowledge representation and ontological matters. See http://www.jfsowa.com/

I missed the local showing but found the video at: Smartest Machine on Earth.

You will find a link to an interview with Minsky at that same location.

I don’t know that I would describe Minsky as “…seething with jealousy….”

While I enjoy Jeopardy and it is certainly more cerebral than say American Idol, I think Minsky is right in seeing the Watson effort as something other than artificial intelligence.

Q: In 2011, who was the only non-sentient contestant on the TV show Jeopardy?

A: What is IBM’s Watson?

Data talks and keynotes from O’Reilly Strata conference

Filed under: Conferences,Data Mining — Patrick Durusau @ 6:30 am

Data talks and keynotes from O’Reilly Strata conference highlighted by FlowingData.com.

Embedded at FlowingData:

  • Hilary Mason, “What Data Tells Us”
  • Mark Madsen, “The Mythology of Big Data”
  • Werner Vogels, “Data Without Limits”

Other presentations and interviews are on YouTube.

February 10, 2011

Topic Maps, Google and the Billion Fact Parade

Filed under: Freebase,Subject Identity,TMRM,Topic Maps — Patrick Durusau @ 2:54 pm

Andrew Hogue (Google) actually titled his presentation on Google’s plan for Freebase: The Structured Search Engine.

Several minutes into the presentation Hogue points out that to answer the question, “when was Martin Luther King, Jr. born?” that date of birth, date born, appeared, dob were all considered synonyms that expect the date type.

Hmmm, he must mean keys that represent the same subject and so are subject to merging, and possibly, depending on their role in a subject representative, further merging of those subject representatives. Can you say Steve Newcomb and the TMRM?

Yes, attribute names represent subjects just like collections of attributes are thought to represent subjects. And benefit from rules specifying subject identity, other properties and merging rules. (Some of those rules can be derived from mechanical analysis, others probably not.)
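As a toy illustration (mine, not Google’s), treating attribute names as subjects with declared synonyms is enough to merge differently labeled keys onto a single property:

    # Toy illustration: attribute names as subjects with synonyms.
    # The synonym table is invented for the example.
    SYNONYMS = {
        "date of birth": "dob",
        "date born": "dob",
        "dob": "dob",
    }

    def normalize(record):
        """Merge synonymous attribute names onto a canonical key."""
        return {SYNONYMS.get(k.lower(), k.lower()): v
                for k, v in record.items()}

    r1 = {"Date of Birth": "1929-01-15"}
    r2 = {"dob": "1929-01-15"}
    assert normalize(r1) == normalize(r2)  # same subject, same property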

Second, Hogue points out that Freebase had 13 million entities when purchased by Google. He speculates on taking that to 1 billion entities.

Let’s cut to the chase, I will see Hogue’s 1 billion entities and raise him 9 billion entities for a total pot of 10 billion entities.

Now what?

Let’s take a simple question that Hogue’s 10 billion entity Google/Freebase cannot usefully answer.

What is democracy?

Seems simple enough. (viewers at home can try this with their favorite search engine.)

1) United States State Department: Democracy means a state that supports Israel, keeps the Suez canal open and opposes people we don’t like in the U.S. Oh, and that protects the rights and social status of the wealthy, almost forgot that one. Sorry.

2) Protesters in Egypt (my view): Democracy probably does not include some or all of the points I mention for #1.

3) Turn of the century U.S.: Effectively only the white male population participates.

4) Early U.S. history: Land ownership is a requirement.

I am sure examples can be supplied from other “democracies” and their histories around the world.

This is a very important term and its differing use by different people in different contexts is going to make discussion and negotiations more difficult.

There are lots of terms for which no single “entity” or “fact” is going to work for everyone.

Subject identity is a tough question and the identification of a subject changes over time, social context, etc. Not to mention that the subjects identified by particular identifications change as well.

Consider that at one time cab was not used to refer to a method of transportation but to a brothel. You may object that was “slang” usage, but if I am searching an index of police reports from that time period for raids on brothels, your objection isn’t helpful. Doesn’t matter if the usage is “slang” or not, I need to obtain accurate results.

User expectations and needs cannot (or at least should not in my opinion) be adapted to the limitations of a particular approach or technology.

Particularly when we already know of strategies that can help with, not solve, the issues surrounding subject identity.

The first step that Hogue and Google have taken, recognizing that attribute names can have synonyms, is a good start. In topic map terms, recognizing that information structures are composed of subjects as well. So that we can map between information structures, rather than replacing one with another. (Or having religious discussions about which one is better, etc.)

Hogue and Google are already on the way to treating some subjects as worthy of more effort than others, but for those that merit the attention, solving the issue of reliable, repeatable subject identification is non-trivial.

Topic maps can make a number of suggestions that can help with that task.

Building Interfaces for Data Engines – Post

Building Interfaces for Data Engines is a summary by Matthew Hurst of six data engines that provide access to data released by others.

If you are a data user, definitely worth a visit to learn about current data engines.

If you are a data developer, definitely worth a visit to glean where we might be going next.

If it is any consolation, the art of book design, that is, the layout of text and images on a page, remains more art than science.

Research on that topic, layout of print and images, has been underway for approximately 2,000 years, with no signs of slacking off now.

User interfaces face a similar path in my estimation.

Hadoop and MapReduce: Big Data Analytics (Gartner Report)

Filed under: Hadoop,MapReduce — Patrick Durusau @ 1:58 pm

Hadoop and MapReduce: Big Data Analytics (Gartner Report)

I sacrificed my email address to view a copy of this “free” report from Gartner. Sponsored by Cloudera.

Care to guess what the second bulleted take away said?

Enterprises should consider adopting a packaged Hadoop distribution (e.g., Cloudera’s Distribution for Hadoop) to reduce the technical risk and increase speed of implementation of the Hadoop initiative.

The rest of it was “don’t use a hair dryer while sitting in a bathtub full of water” sort of advice.

Except tailored to Hadoop and MapReduce.

Save your email address for another day.

Spend your time at Cloudera, where you will find useful information about Hadoop and MapReduce.

The unreasonable effectiveness of simplicity

Filed under: Authoring Topic Maps,Crowd Sourcing,Data Analysis,Subject Identity — Patrick Durusau @ 1:50 pm

The unreasonable effectiveness of simplicity from Panos Ipeirotis suggests that simplicity should be considered in the construction of information resources.

The simplest aggregation technique: Use the majority vote as the correct answer.
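In code the baseline really is that simple, which is rather the point:

    # Majority-vote aggregation over crowd answers (a minimal sketch).
    from collections import Counter

    def majority_vote(answers):
        """Return the most common answer among the workers."""
        return Counter(answers).most_common(1)[0][0]

    print(majority_vote(["cat", "dog", "cat", "cat", "bird"]))  # -> "cat"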

I am mindful of the discussion several years ago about visual topic maps, a proposal to use images as identifiers. Certainly doable now, but the simplicity angle suggests an interesting possibility.

Would not work for highly abstract subjects, but what if users were presented with images when called upon to make identification choices for a topic map?

For example, marking entities in a newspaper account, the user is presented with images near each marked entity and chooses yes/no.

Or in legal discovery or research, a similar mechanism, along with the ability to annotate any string with an image/marker and that image/marker appears with that string in the rest of the corpus.

Unknown to the user is further information about the subject they have identified that forms the basis for merging identifications, linking into associations, etc.

A must read!

February 9, 2011

Erjang – A JVM-based Erlang VM

Filed under: Erjang,Erlang — Patrick Durusau @ 5:48 pm

Erjang – A JVM-based Erlang VM by Kresten Krab Thorup

Presentation, slides and mp3 on Erjang. Kresten blogs about this project at: http://www.javalimit.com/.

Erlang and Erjang are not the basis for universal answers to all topic map needs.

However, robust message passing systems, such as Erlang and Erjang support, can figure into some topic map based solutions.

Just as there is no one method of subject identity that fits all purposes, there is no one topic map solution that fits all needs.

Discussions of solutions should always start with exploration and documentation of your requirements, not the capabilities of particular software, whether commercial, open source or home grown.

There will be time for you to learn about various software packages and why their capabilities matter more than your requirements later in the process.

Oyster: A Configurable ER Engine

Filed under: Entity Resolution,Record Linkage,Semantic Web,Subject Identity — Patrick Durusau @ 4:55 pm

Oyster: A Configurable ER Engine

John Talburt writes a very enticing overview of an entity resolution engine he calls Oyster.

From the post:

OYSTER will be unique among freely available systems in that it supports identity management and identity capture. This allows the user to configure OYSTER to not only run as a typical merge-purge/record linking system, but also as an identity capture and identity resolution system. (Emphasis added)

Yes, record linking we have had since the late 1950s in a variety of guises and under more than twenty (20) different names that I know of.

Adding identity management and identity capture (FYI, SW uses universal identifier assignment) will be something truly different.

As in topic map different.

Will be keeping a close watch on this project and suggest that you do the same.

The Silent “a” In Mashup

Filed under: Associations,Mashups,Topic Maps — Patrick Durusau @ 4:16 pm

The “a” in mashup is silent because mashups are missing information that is represented in a topic map by associations.

That isn’t necessarily a criticism of mashups. How much or how little information you represent in any data set or application is up to you.

It is helpful to have a framework for understanding what information you have included or excluded by explicit choice. Why you made those choices or on what basis is entirely up to you.

As of 08-02-2010, there are fifteen definitions of mashup in English reported by define:Mashup in Google.

Most of the definitions of mashup do not exclude (necessarily) what is defined as an association in a topic map, but the general theme is one of juxtaposition of data from different resources.

That juxtaposition leaves the following subjects undefined (at least explicitly):

  1. role players in an association (play #2)
  2. roles in an association
  3. type of an association

Not to mention any TMCL (Topic Maps Constraint Language) constraints on those associations. (Something we will cover on another day.)
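To make the contrast concrete, here is a toy sketch (my own minimal model, not a TMDM implementation) of what an association makes explicit and what a mashup’s bare juxtaposition leaves implicit:

    # Toy model of a topic map association (not a TMDM implementation).
    from dataclasses import dataclass

    @dataclass
    class Association:
        assoc_type: str   # the type of the association (#3 above)
        roles: dict       # role (#2) -> role player (#1), both subjects

    # A mashup juxtaposes values with no explicit roles or type:
    mashup_row = ("Ada Lovelace", "Analytical Engine notes")

    # The association names the type, the roles and the players:
    assoc = Association(
        assoc_type="authorship",
        roles={"author": "Ada Lovelace",
               "work": "Analytical Engine notes"},
    )

    # With roles explicit, we can ask in what role a subject appears.
    print([r for r, p in assoc.roles.items() if p == "Ada Lovelace"])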

You can choose to leave subjects undefined, which is easier than defining them (witness the popularity of mashups), but there is a cost to leaving them undefined.

Defining or leaving subjects undefined is a decision that needs to take into account factors such as ease of authoring versus the very real cost of leaving subjects undefined, as well as other factors. Such as your particular project’s requirements for maintenance, semantic integration and interchange.

For example, if the role players (#1 above) are left undefined in a mashup, what are the consequences?

From a topic map perspective, that means the role player subjects are not represented by topics, which means you cannot:

  1. attach other information about those subjects, such as variant names
  2. judge whether those are the same subjects as found in other associations
  3. find all the associations where those subjects are role players (since they are not explicitly identified)
  4. …among other things.

As I said, you can make that choice, but while it is easier, that is, less work, you also get less return from your mashup.

Another choice in a mashup, assuming that you identified the role players as topics, would be to simply not identify the roles they play in the mashup (read association).

If you don’t identify the roles as subjects (represented by topics), you can’t:

  1. compare those roles to roles in other mashups
  2. compare the roles being played by role players to roles they play in other associations
  3. discover associations with the same role players playing the same roles, but identified differently
  4. …among other things.

Assuming you have defined the role players and the roles they play, there remains the type of the association (read mashup), which could help you locate other associations (mashups) that would be of interest to you.

Even if you defined a type for a mashup, I am not real sure where you would put it. That’s not an issue with a topic map association. It has an explicit type.

Mashups are easier to author than associations because they carry less information.

Which is a legitimate choice on your part.

What if after creating mashups we decide that it would be nice to add some more information?

Can topic maps help with that task?

We will take up the answer to that question tomorrow.

Another Word For It – #1,000

Filed under: Marketing,Topic Maps — Patrick Durusau @ 2:17 pm

This makes my 1,000th post to Another Word For It.

I wanted to take a moment to think about where I would like to go next with the next 1,000 posts.

First, I want to become more systematic with the academic literature that is relevant to topic maps. As you have seen, it is spread from bioinformatics and computer science to library science and semiotics.

One of the things that makes articles/books/presentations slow to add is that I read/view all of them before actually posting them to the blog.

I suppose I could go just on titles/abstracts but then you would have to duplicate my wading through stuff that never makes it onto the blog. That doesn’t seem like a value-add.

Second, along with that, I want to provide more in the way of assistance in that jump between, “ok, topic maps sound great,” and having an operational topic map that provides a meaningful result.

Being more systematic about the literature isn’t going to be easy and providing generalized assistance for topic map authoring is going to be even harder.

My proposal, subject to your comments and suggestions, is to create what I am calling starter maps that have a lot of the basic infrastructure topics, types, etc., plus topics, associations, etc., for a particular domain.

For example, I might want to offer a starter map for say NASCAR racing, that has all the usual structural topics but also all the racetracks, races and competitors for the last decade. Plus relevant associations. Not everything someone would want but enough that getting visible results would not be all that hard.

A boost over the topic map fence as it were.

At least initially, those are mostly going to be data resources in topic map format. That doesn’t really scale for semantically diverse resources, but everyone has to start using topic maps somewhere.

Third, in addition to bare bones starter maps, I would like to create outlines of data sources that look particularly interesting.

Not nearly as easy to use as the starter maps but easier for me to author.

The sort of thing that points out subjects and relationships with subjects in other data sets that may not be readily apparent.

Fourth, I want to continue to discover interesting approaches and resources to bring to your attention.

Those will range from new technologies, such as NoSQL and graph databases, to new algorithms for data processing, to new ways to think about subjects and their identifications, etc. Some of them will prove to be very useful in connection with topic maps and others will prove to be less so, if useful at all.

The key criteria for that last item being that it is interesting material. It is impossible (IMHO) to say what information will or will not spark the next great idea about topic maps in my readers.

Finally, in order to devote the cycles necessary to make all of the foregoing happen, I need donations/sponsorships for these activities.

If you like what you see here on a daily basis and this sounds like a good plan, please use the donate button.

Sponsors welcome as well, please inquire. patrick@durusau.net

PS: Topic map consulting/teaching/training also available.

First, you need to Get the Data – Post

Filed under: Authoring Topic Maps,Data Source — Patrick Durusau @ 5:01 am

First, you need to Get the Data is a post by Matthew Hurst about a site for asking questions about data sets (and getting answers).

A couple of the questions just to give you an idea about the site:

  • How can I compile a log of Wikipedia articles by date of creation?
  • Are there any indexes of available data sets?

There are useful answers to both of those questions.

Before starting off to build a data set, this is one site to check first.

A listing of sites to check for existing data sets would make a useful chapter in a book on topic maps.

February 8, 2011

Stochastic Modelling for Systems Biology

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 11:07 am

Stochastic Modelling for Systems Biology

I stumbled across this while running down material on Monte Carlo models.

From the website:

Although stochastic kinetic models are increasingly accepted as the best way to represent and simulate genetic and biochemical networks, most researchers in the field have limited knowledge of stochastic process theory. Stochastic Modeling for Systems Biology provides an accessible introduction to this theory using examples that are familiar to systems biology researchers. Focusing on computer simulation, the author examines the use of stochastic processes for modeling biological systems. Along with the latest simulation techniques and research material, such as parameter inference, the text includes many examples and figures as well as software code in R for various applications.

The art of constructing or at least reading models is an important skill for topic map authors.

Systems biology has been, is and will continue to be a hot property.

Bringing the advantages of topic maps to models in systems biology would be a win-win situation.

Which Automatic Differentiation Tool for C/C++?

Which Automatic Differentiation Tool for C/C++?

OK, not immediately obvious why this is relevant to topic maps.

Nor are Bob Carpenter’s references:

I’ve been playing with all sorts of fun new toys at the new job at Columbia and learning lots of new algorithms. In particular, I’m coming to grips with Hamiltonian (or hybrid) Monte Carlo, which isn’t as complicated as the physics-based motivations may suggest (see the discussion in David MacKay’s book and then move to the more detailed explanation in Christopher Bishop’s book).

particularly useful.

I suspect the two book references are:

  • David MacKay, Information Theory, Inference, and Learning Algorithms
  • Christopher M. Bishop, Pattern Recognition and Machine Learning

but I haven’t asked. In part to illustrate the problem of resolving any entity reference: both authors have written other books touching on the same subjects, so my guesses may or may not be correct.

Oh, relevance to topic maps. The technique of automatic differentiation is used in Hamiltonian Monte Carlo methods to generate gradients. Still not helpful? Isn’t to me either.

Ah, what about Bayesian models in IR? That made the light go on!
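For the curious, the core technique is small enough to sketch. Here is a minimal forward-mode automatic differentiation example using dual numbers; the C/C++ tools Carpenter surveys are far more capable, of course:

    # Minimal forward-mode automatic differentiation with dual numbers.
    class Dual:
        def __init__(self, value, deriv=0.0):
            self.value, self.deriv = value, deriv
        def _coerce(self, other):
            return other if isinstance(other, Dual) else Dual(other)
        def __add__(self, other):
            other = self._coerce(other)
            return Dual(self.value + other.value, self.deriv + other.deriv)
        def __mul__(self, other):
            other = self._coerce(other)
            return Dual(self.value * other.value,
                        self.deriv * other.value + self.value * other.deriv)

    x = Dual(3.0, 1.0)       # seed with dx/dx = 1
    f = x * x + x            # f(x) = x^2 + x
    print(f.value, f.deriv)  # 12.0 and f'(3) = 7.0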

I will be discussing ways to show more immediate relevance to topic maps, at least for some posts, in post #1000.

It isn’t as far away as you might think.

The Matrix Cookbook – Post

Filed under: Matrix — Patrick Durusau @ 9:50 am

The Matrix Cookbook by Bob Carpenter (LingPipe Blog) points to a “matrix cheatsheet” by K. B. Petersen and M. S. Pedersen.

Good for those of you starting to use matrix methods for IR and hence building topic maps.

Topic Mapping BoingBoing Data?

Filed under: Dataset,Examples,Marketing — Patrick Durusau @ 6:15 am

A recent entry on Simon Willison’s blog, How we made an API for BoingBoing in an evening caught my eye.

It was based on the release of eleven years’ worth of posts from BoingBoing, which you can download at: Eleven years’ worth of Boing Boing posts in one file!

Curious what subjects you would choose first for creating a topic map of this data set?

And having chosen them, how would you manage their identity to make it easier on yourself to incorporate other blog content?

I am mindful of Robert Barta’s approach of no data before its time for incorporation into a topic map.

Would that make a difference in your design of the topic map or only in the infrastructure that supports it?

Redis Tutorial – April 2010

Filed under: NoSQL,Redis — Patrick Durusau @ 5:52 am

Redis Tutorial – April 2010

If you are just getting started with Redis or simply want to explore it a bit, Simon Willison’s tutorial is a good place to start.

Try Redis – Try Topic Maps?

Filed under: Examples,Marketing,NoSQL,Redis — Patrick Durusau @ 5:36 am

Try Redis is a clever introduction to Redis.

I recommend it to you as an introduction to Redis and NoSQL in general.

It also makes me wonder if it would be possible to create a similar resource for topic maps?

Granting that it would have to make prior choices about subject matter, data sets, etc. but still, it could be an effective marketing tool for topic maps.

I suspect so even if the range of choices to be made to effect merging were limited.

If I were a left-wing political blogger in the US, I would create a topic map that includes donations to Republican PACs and white collar crime convictions by family members.

Or for the right-wing, a mapping between the provisions of ObamaCare and various specific groups and agencies.

Such that users could choose additional information and have it show up in some visually pleasing way, making the case the user already believes to be true.

Will have to give this some thought in terms of a framework with a smallish number of framework topics and the ability to quickly add in additional topics for a particular demonstration.

Such that it would be possible to quickly create a topic map demo for some breaking news story.

Could provide useful additional content but the main purpose being a demonstration of the technology.

Useful content is fairly rare so no need to tax a topic map demo with always providing useful content. Sometimes, content is simply content. 😉

Visuals for Topic Maps

Filed under: Graphics,Visualization — Patrick Durusau @ 5:19 am

20 Fresh JavaScript Data Visualization Libraries

Jacob Gube, jacob@sixrevisions.com of Six Revisions, has assembled this collection of 20 JavaScript libraries for data visualization.

Be sure to see the comments and further references.

Visualization of your topic map data depends upon the nature of your topic map and its users.

Consider the traditional text display as a starting place for further development.

Digital Diplomatics 2011 – Conference

Filed under: Conferences,Examples,Marketing,Topic Maps — Patrick Durusau @ 4:41 am

Digital Diplomatics 2011: Tools for the Digital Diplomatist

From the website:

Scholars of diplomatics never had a fundamental opposition to using modern technology to support their research. Nevertheless no technology since the introduction of photography has had such an impact on the questions and methods of diplomatics as the computer: digital imaging gives us cheap reproductions at high quality, so nowadays large corpora of documents are to be found online. Digital imaging allows manipulations that make apparently invisible traces visible. Modern information technology gives us access to huge text corpora in which single words and phrases can be found, thus helping to indicate relationships, to retrieve parallel texts for comparison, or to plot geographical and temporal distributions.

The conference aims at presenting projects which are working to enlarge the digitised charter corpus on the one hand, and on the other will put a particular focus on research applying information technology to medieval and early modern charters, aiming at pure diplomatic questions as well as historic or philologic research.

An excellent opportunity for topic maps to illustrate how all the fruits of modern and ancient commentary can be brought to bear, using a text (or at least the idea of a text) as the focal or binding point for information.

Biblical scholarship, for example, becomes less sweat of the brow in terms of travel/access and more a question of seeking answers to interesting questions.

Proposals due: May 15, 2011

Conference: Naples, 29th September – 1st October 2011

February 7, 2011

Client-side Metaservices?

Filed under: Marketing,Metaservices — Patrick Durusau @ 8:14 am

File Under: Metaservices, Rise of?

John Battelle writes:

Let me step back and describe the problem. In short, heavy users of the web depend on scores – sometimes hundreds – of services, all of which work wonderfully for their particular purpose (eBay for auctions, Google for search, OpenTable for restaurant reservations, etc). But these services simply don’t communicate with each other, nor collaborate in a fashion that creates a robust or evolving ecosystem.

The rise of the app economy exacerbates the problem – most apps live in their own closed world, sharing data sparingly, if at all. And while many have suggested that Facebook’s open social graph can help untangle the problem, in fact it only makes it worse, as Fred put it in a recent post (which sparked this Thinking Out Loud session for me):

The people I want to follow on Etsy are not the same people I want to follow on Twitter. The people I want to follow on Svpply are not my Facebook friends. I don’t want to share my Foursquare checkins with everyone on Twitter and Facebook.

It is a very interesting take but I disagree with the implication that metaservices need to be server side.

With a client side metaservice, say one based upon a topic map, I could share (or not) information as I choose.

Granted that puts more of the burden for maintaining privacy on the user, but anyone who trusts others to manage privacy for them is already living in a fish bowl, they just don’t know it.

I think breaching silos with metaservices on the client-side is an excellent opportunity for demonstrating the information management capabilities of topic maps.

Not to mention being an opportunity for commercialization of a client-side metaservice, which should include mapping for the various online silos and their changing arrangements on a subscription basis.

KEA: keyphrase extraction algorithm

Filed under: Entity Extraction,Natural Language Processing — Patrick Durusau @ 7:59 am

KEA: keyphrase extraction algorithm

From the website:

Keywords and keyphrases (multi-word units) are widely used in large document collections. They describe the content of single documents and provide a kind of semantic metadata that is useful for a wide variety of purposes. The task of assigning keyphrases to a document is called keyphrase indexing. For example, academic papers are often accompanied by a set of keyphrases freely chosen by the author. In libraries professional indexers select keyphrases from a controlled vocabulary (also called Subject Headings) according to defined cataloguing rules. On the Internet, digital libraries, or any depositories of data (flickr, del.icio.us, blog articles etc.) also use keyphrases (or here called content tags or content labels) to organize and provide a thematic access to their data.

KEA is an algorithm for extracting keyphrases from text documents. It can be either used for free indexing or for indexing with a controlled vocabulary.

Given the indexing roots of topic maps, this software is definitely a contender for use in topic map construction.
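As a rough illustration of the kind of features KEA relies on (term frequency, inverse document frequency, and position of first occurrence), here is a toy scorer. KEA itself is a Java tool with a trained Naive Bayes model over such features, so treat this purely as a sketch:

    # Toy keyphrase scorer in the spirit of KEA's features (a sketch only).
    import math
    import re
    from collections import Counter

    def candidates(text, max_len=3):
        """Yield every 1- to max_len-word sequence as a candidate keyphrase."""
        words = re.findall(r"[a-z0-9]+", text.lower())
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                yield " ".join(words[i:i + n])

    def keyphrases(doc, doc_freq, n_docs, top=5):
        """Rank candidates by TF-IDF, discounted by late first occurrence."""
        text = doc.lower()
        scores = {}
        for phrase, tf in Counter(candidates(doc)).items():
            idf = math.log(n_docs / (1.0 + doc_freq.get(phrase, 0)))
            first = max(text.find(phrase), 0) / max(len(text), 1)
            scores[phrase] = tf * idf * (1.0 - first)
        return sorted(scores, key=scores.get, reverse=True)[:top]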

Weka – Data Mining

Filed under: Mahout,Natural Language Processing — Patrick Durusau @ 7:10 am

Weka

From the website:

Weka 3: Data Mining Software in Java

Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.

I would say it is under active development/use since the mailing list archives have an average of about 315 posts per month.

Yes, approximately 315 posts per month.

Another tool for your topic map toolbox!

GATE: General Architecture for Text Engineering

Filed under: Entity Extraction,Natural Language Processing — Patrick Durusau @ 7:06 am

GATE: General Architecture for Text Engineering

From the website:

GATE is…

  • open source software capable of solving almost any text processing problem
  • a mature and extensive community of developers, users, educators, students and scientists
  • a defined and repeatable process for creating robust and maintainable text processing workflows
  • in active use for all sorts of language processing tasks and applications, including: voice of the customer; cancer research; drug research; decision support; recruitment; web mining; information extraction; semantic annotation
  • the result of a €multi-million R&D programme running since 1995, funded by commercial users, the EC, BBSRC, EPSRC, AHRC, JISC, etc.
  • used by corporations, SMEs, research labs and Universities worldwide
  • the Eclipse of Natural Language Engineering, the Lucene of Information Extraction, the ISO 9001 of Text Mining
  • a world-class team of language processing developers

If you need to solve a problem with text analysis or human language processing you’re in the right place.

I suppose there is something to be said for an abundance of confidence. 😉

Seriously, this is a very complex and impressive effort.

I will be covering specific tools and aspects of this effort as they relate to topic maps.

