Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

April 25, 2011

Gene Ontology

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 3:35 pm

Gene Ontology

From the website:

The Gene Ontology project is a major bioinformatics initiative with the aim of standardizing the representation of gene and gene product attributes across species and databases. The project provides a controlled vocabulary of terms for describing gene product characteristics and gene product annotation data from GO Consortium members, as well as tools to access and process this data.

I was encouraged by the following in the description of the project:

GO is not a way to unify biological databases (i.e. GO is not a ‘federated solution’). Sharing vocabulary is a step towards unification, but is not, in itself, sufficient. Reasons for this include the following:

  • Knowledge changes and updates lag behind.

  • Individual curators evaluate data differently. While we can agree to use the word ‘kinase’, we must also agree to support this by stating how and why we use ‘kinase’, and consistently apply it. Only in this way can we hope to compare gene products and determine whether they are related.

  • GO does not attempt to describe every aspect of biology; its scope is limited to the domains described above. (emphasis added)

It is refreshing to see a project that acknowledges that sharing vocabulary is a major and worthwhile step.

One that falls short of universal unification.
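For concreteness, GO annotations are distributed as flat GAF files. Here is a minimal reader, sketched in Python from the GAF 2.0 column layout; the filename is hypothetical and a production parser would handle more of the spec:

    # Minimal GAF 2.0 reader: 17 tab-separated columns, "!" comment lines.
    # Column positions follow the published GAF 2.0 spec; treat this as a
    # sketch, not a full parser.
    def read_gaf(path):
        """Yield (gene_symbol, go_id, evidence_code) from a GAF file."""
        with open(path) as f:
            for line in f:
                if line.startswith("!"):          # header/comment lines
                    continue
                cols = line.rstrip("\n").split("\t")
                # col 3: object symbol, col 5: GO ID, col 7: evidence code
                yield cols[2], cols[4], cols[6]

    # E.g., compare uncurated electronic (IEA) annotations against curated
    # ones -- the curation differences the GO project warns about above:
    # from collections import Counter
    # evidence = Counter(ev for _, _, ev in read_gaf("goa_human.gaf"))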

GO Annotation graph ….

Filed under: Bioinformatics,Biomedical,Graphs — Patrick Durusau @ 3:34 pm

GO Annotation graph visualizations with Bio4j Go Tools + Gephi Toolkit + SiGMa project

Interactive graph visualization for protein GO annotations. (GO = Gene Ontology)

From the post:

Bio4j Go Tools now includes a new feature providing you with an interactive graph visualization for protein GO annotations.
The URL of the app is still the same old one.

On the server side, we’re using the Gephi Toolkit for applying layout algorithms, while the corresponding Gexf file is generated with the class GephiExporter from the BioinfoUtil project. The service is included in the project Bio4jTestServer, specifically the servlet GetGoAnnotationGexfServlet.

On the client side, we’re using the open-source project SiGMa for graph visualization.

Interesting for the visualization aspects as well as the subject matter.
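To make the pipeline concrete: the servlet’s job is to emit a GEXF document that SiGMa can load. This is not the Bio4j code, just a sketch (in Python, with invented protein and term values) of the shape of the document that moves between server and client:

    # Build a toy GEXF document with the standard library. The namespace is
    # the GEXF 1.2 draft schema; the node IDs are made up for illustration.
    import xml.etree.ElementTree as ET

    gexf = ET.Element("gexf", xmlns="http://www.gexf.net/1.2draft", version="1.2")
    graph = ET.SubElement(gexf, "graph", defaultedgetype="directed")
    nodes = ET.SubElement(graph, "nodes")
    edges = ET.SubElement(graph, "edges")

    # One protein annotated with two GO terms: three nodes, two edges.
    for node_id, label in [("P12345", "example protein"),
                           ("GO:0016301", "kinase activity"),
                           ("GO:0005737", "cytoplasm")]:
        ET.SubElement(nodes, "node", id=node_id, label=label)
    for i, (src, dst) in enumerate([("P12345", "GO:0016301"),
                                    ("P12345", "GO:0005737")]):
        ET.SubElement(edges, "edge", id=str(i), source=src, target=dst)

    print(ET.tostring(gexf, encoding="unicode"))  # what the client renders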

April 24, 2011

The Review of Metaphysics

Filed under: Bibliography — Patrick Durusau @ 5:37 pm

The Review of Metaphysics is a journal I first encountered as an undergraduate.

The CURRENT PERIODICAL ARTICLES section offers abstracts of current articles from a large number of philosophical journals. Far more than I could acquire or review personally.

This is where I discovered the Music, Essential Metaphor, And Private Language paper.

If you are interested in the theory side of knowledge/topic maps and/or something harder than representing spreadsheets as topic maps, this is a good source of starting points.

It’s All Semantic With the New Text-Processing API

Filed under: Natural Language Processing,Text Analytics — Patrick Durusau @ 5:34 pm

It’s All Semantic With the New Text-Processing API

From the post at ProgrammableWeb:

Now I don’t have a master’s degree in Natural language processing, and you just might need one to get your hands dirty with this API. I see the text-processing.com API as offering a mid-level utility for incorporation in a web app. You might take text samples from your source, feed them through the Text-Processing API and analyze those results a bit further before presenting anything to your user.

This offering appears to be the result of a one man effort. Jacob Perkins designed his API as RESTful with JSON responses. It’s free and open for the meantime, but it sounds like Perkins may polish the service a bit and start charging for access. There could be a real market here since only a handful of the 58 semantic APIs in our directory offer results at the technical level.

Interesting and not all that surprising.

Could be a good way to see if you are interested in going further with natural language processing.
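If you want to poke at it before committing, a few lines of Python will do. The endpoint and response fields below are as documented at text-processing.com at the time of writing, so treat them as assumptions that may change (especially if charging starts):

    # Query the sentiment endpoint; requires the requests package.
    import requests

    resp = requests.post(
        "http://text-processing.com/api/sentiment/",
        data={"text": "Topic maps make semantic diversity manageable."},
    )
    result = resp.json()
    print(result["label"])        # "pos", "neg", or "neutral"
    print(result["probability"])  # per-label probabilities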

Hadoop2010: Hadoop and Pig at Twitter

Filed under: Hadoop,Pig — Patrick Durusau @ 5:33 pm

Hadoop2010: Hadoop and Pig at Twitter video of presentation by Kevin Weil.

From the description:

Apache Pig is a high-level framework built on top of Hadoop that offers a powerful yet vastly simplified way to analyze data in Hadoop. It allows businesses to leverage the power of Hadoop in a simple language readily learnable by anyone that understands SQL. In this presentation, Twitter’s Kevin Weil introduces Pig and shows how it has been used at Twitter to solve numerous analytics challenges that had become intractable with a former MySQL-based architecture.

I started to simply do the listing for the Hadoop Summit 2010 but the longer I watched Kevin’s presentation, the more I thought it needed to be singled out.

If you don’t already know Pig, you will be motivated to learn Pig after this presentation.

Hadoop Summit 2010

Filed under: Conferences,Hadoop — Patrick Durusau @ 5:33 pm

Hadoop Summit 2010

Slides and some videos from the Hadoop Summit 2010 meeting.

Pig and Hive at Yahoo!

Filed under: Hive,Pig — Patrick Durusau @ 5:32 pm

Pig and Hive at Yahoo!

Observations on how and why to use Pig and/or Hive.

Write Good Papers

Filed under: Authoring Topic Maps,Marketing — Patrick Durusau @ 5:31 pm

Write Good Papers by Daniel Lemire merits bookmarking and frequent review.

Authoring, whether of a blog post, a formal paper, program documentation, or a topic map, is authoring.

Review of these rules will improve the result of any authoring task.

April 23, 2011

Music, Essential Metaphor, And Private Language

Filed under: Metaphors,Topic Maps — Patrick Durusau @ 8:22 pm

Music, Essential Metaphor, And Private Language by Nick Zangwill, American Philosophical Quarterly, Volume 48, Number 1, January 2011.

Abstract:

Music is elusive. Describing it is problematic. In particular its aesthetic properties cannot be captured in literal description. Beyond very simple terms, they cannot be literally described. In this sense, the aesthetic description of music is essentially nonliteral. An adequate aesthetic description of music must have resort to metaphor or other nonliteral devices. I maintain that this is because of the nature of the aesthetic properties being described. I defend this view against an apparently simple objection put by Malcolm Budd. Dealing with this objection will take us into some surprising terrain. We are led to consider issues concerning privacy and the language for describing sensations. In the light of these considerations, I develop the essentially nonliteralist thesis and explore some of its consequences. (emphasis in original)

Zangwill’s article is a good reminder that there are very large areas of human experience that are not amenable to the “Just the facts, Ma’am” type approach. Music being one. Would you believe medicine is another? Zangwill says:

It might seem strange to hold that there is part of reality that cannot be literally described. Is that not an obscure and mystical view? If aesthetic properties are there in the world, surely we should be able to describe them in literal terms, at least in principle. But the idea of a literally indescribable reality is not unfamiliar. If we want to describe tastes, smells, and inner sensations, we will, beyond very simple descriptions, be forced to describe them nonliterally. Indeed, part of the training of doctors is to elicit and interpret metaphorical descriptions of pain, with a view to diagnosis. Nonliteral description is inescapable and irreplaceable in such cases. The same is true in the description of music.

I would add physical sensations, relationships with others, our experiencing of events, etc.

The interesting bits of our lives aren’t describable other than by metaphor.

If you think about it, literal descriptions offer an impoverished view on the world.

One that excludes what is unique to us, our metaphors.


PS: You may also enjoy other papers at Nick Zangwill’s homepage, particularly the one on Negative Properties.

NoSQL, NewSQL and Beyond:…

Filed under: Marketing,NoSQL — Patrick Durusau @ 8:21 pm

NoSQL, NewSQL and Beyond: The answer to SPRAINed relational databases

From the post:

The 451 Group’s new long format report on emerging database alternatives, NoSQL, NewSQL and Beyond, is now available.

The report examines the changing database landscape, investigating how the failure of existing suppliers to meet the performance, scalability and flexibility needs of large-scale data processing has led to the development and adoption of alternative data management technologies.

There is one point that I think presents an opportunity for topic maps:

Polyglot persistence, and the associated trend toward polyglot programming, is driving developers toward making use of multiple database products depending on which might be suitable for a particular task.

I don’t know if the report covers the reasons for polyglot persistence as I don’t have access to the “full” version of the report. Maybe someone who does can say if the report covers why the polyglot nature of IT resources is immune to attempts at its reduction.

Structure and Interpretation of Computer Programs

Filed under: Lisp — Patrick Durusau @ 8:21 pm

Structure and Interpretation of Computer Programs

From the website:

Structure and Interpretation of Computer Programs has been MIT’s introductory pre-professional computer science subject since 1981. It emphasizes the role of computer languages as vehicles for expressing knowledge and it presents basic principles of abstraction and modularity, together with essential techniques for designing and implementing computer languages. This course has had a worldwide impact on computer science curricula over the past two decades. The accompanying textbook by Hal Abelson, Gerald Jay Sussman, and Julie Sussman is available for purchase from the MIT Press, which also provides a freely available on-line version of the complete textbook.

These twenty video lectures by Hal Abelson and Gerald Jay Sussman are a complete presentation of the course, given in July 1986 for Hewlett-Packard employees, and professionally produced by Hewlett-Packard Television. The videos have been used extensively in corporate training at Hewlett-Packard and other companies, as well as at several universities and in MIT short courses for industry.

An introduction to computer programming, Lisp and the art of teaching the same.

Learn You Some Erlang For Great Good!

Filed under: Erlang — Patrick Durusau @ 8:21 pm

Learn You Some Erlang For Great Good!

A free online resource for learning Erlang.

April 22, 2011

Square Pegs and Round Holes in the NOSQL World

Filed under: Graphs,Key-Value Stores,Marketing,Neo4j,NoSQL — Patrick Durusau @ 1:06 pm

Square Pegs and Round Holes in the NOSQL World

Jim Webber reviews why graph databases (such as Neo4J) are better for storing graphs than Key-Value, Document or relational datastores.

He concludes:

In these kind of situations, choosing a non-graph store for storing graphs is a gamble. You may find that you’ve designed your graph topology far too early in the system lifecycle and lose the ability to evolve the structure and perform business intelligence on your data. That’s why Neo4j is cool – it keeps graph and application concerns separate, and allows you to defer data modelling decisions to more responsible points throughout the lifetime of your application.

You know, we could say the same thing about topic maps, that you don’t have to commit to all modeling decisions up front.

Something to think about.
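Webber’s point fits in a toy example. When relationships are first-class, a traversal is just pointer chasing; in a key-value store you must commit early to how the topology is packed into values, and each new query shape risks a redesign. A sketch, with invented data:

    # Friend-of-friend over an explicit adjacency structure.
    from collections import defaultdict

    edges = [("alice", "bob"), ("bob", "carol"), ("carol", "dave")]
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    def friends_of_friends(person):
        """Two-hop neighbours, excluding the person and direct friends."""
        direct = adjacency[person]
        two_hop = set().union(*(adjacency[f] for f in direct)) if direct else set()
        return two_hop - direct - {person}

    print(friends_of_friends("alice"))  # {'carol'}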

Introduction to Algorithms and Computational Complexity

Filed under: Algorithms — Patrick Durusau @ 1:06 pm

Three lectures on algorithms and computational complexity from Yuri Gurevich, courtesy of Channel 9.

Yuri works at Microsoft Research and is the inventor of abstract state machines.

Introduction to Algorithms and Computational Complexity, 1 of n

Introduction to Algorithms and Computational Complexity, 2 of n

Introduction to Algorithms and Computational Complexity, 3 of 3

5 Graph Databases to Consider

Filed under: AllegroGraph,FlockDB,GraphDB,InfiniteGraph,Neo4j — Patrick Durusau @ 1:05 pm

5 Graph Databases to Consider

General overview of Neo4J, FlockDB, AllegroGraph, GraphDB, InfiniteGraph.

Intuition = …because I said so!

Filed under: Data Analysis,Machine Learning — Patrick Durusau @ 1:05 pm

Intuition & Data-Driven Machine Learning

From the post:

Clever algorithms and pages of mathematical formulas filled with probability and optimization theory are usually the associations that get invoked when you ask someone to describe the fields of AI and Machine Learning. Granted, there is definitely an abundance of both, but this mental picture also tends to obscure some of the more interesting and recent developments in these fields: data driven learning, and the fact that you are often better off developing simple intuitive insights instead of complicated domain models which are meant to represent every attribute of the problem.

I wonder about the closing observation:

you are often better off developing simple intuitive insights instead of complicated domain models which are meant to represent every attribute of the problem.

Does that apply to identifications of subjects as well?

May we not be better off capturing the conclusion of an analyst that “X” is a fact, from some large body of data, rather than finding a clever way in the data to map their conclusion to that of other analysts?

Both said “X.” What more do we need? True enough, we need to identify “X” in some way, but that is simpler than trying to justify the conclusion in the data.

I suppose I am arguing there should be room in subject identification for human intuition, that is, “…because I said so!” 😉

Full-Text Indexing for Optimizing Selection Operations in Large-Scale Data Analytics

Filed under: Data Analysis,Hadoop,Indexing,MapReduce — Patrick Durusau @ 1:04 pm

Full-Text Indexing for Optimizing Selection Operations in Large-Scale Data Analytics by Jimmy Lin, Dmitriy Ryaboy, and Kevin Weil.

Abstract:

MapReduce, especially the Hadoop open-source implementation, has recently emerged as a popular framework for large-scale data analytics. Given the explosion of unstructured data begotten by social media and other web-based applications, we take the position that any modern analytics platform must support operations on free-text fields as first-class citizens. Toward this end, this paper addresses one inefficient aspect of Hadoop-based processing: the need to perform a full scan of the entire dataset, even in cases where it is clearly not necessary to do so. We show that it is possible to leverage a full-text index to optimize selection operations on text fields within records. The idea is simple and intuitive: the full-text index informs the Hadoop execution engine which compressed data blocks contain query terms of interest, and only those data blocks are decompressed and scanned. Experiments with a proof of concept show moderate improvements in end-to-end query running times and substantial savings in terms of cumulative processing time at the worker nodes. We present an analytical model and discuss a number of interesting challenges: some operational, others research in nature.

I always hope, when I see first-class citizen(s) in a CS paper, that it is going to be talking about data structures and/or metadata (hopefully both).

Alas, I was disappointed once again, but the paper is an interesting one and will repay close study.
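The paper’s core idea reduces to a small skeleton: build a term-to-block index once, then scan only the blocks the index nominates. A sketch with invented records:

    # Block-skipping selection guided by a full-text (inverted) index.
    blocks = {
        0: ["error in kinase assay", "run complete"],
        1: ["all systems nominal"],
        2: ["kinase inhibitor results pending"],
    }

    # Index construction: one pass at load time.
    term_to_blocks = {}
    for block_id, records in blocks.items():
        for record in records:
            for term in record.split():
                term_to_blocks.setdefault(term, set()).add(block_id)

    def select(term):
        """Scan only blocks the index says may contain the term."""
        hits = []
        for block_id in sorted(term_to_blocks.get(term, ())):
            # in the real system, this is where a block gets decompressed
            hits.extend(r for r in blocks[block_id] if term in r.split())
        return hits

    print(select("kinase"))  # touches blocks 0 and 2, skips block 1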

Oh, the reason I mention treating data structures and metadata as first-class citizens is that I can then avoid the my-way, your-way, or the-highway sort of choices when it comes to metadata and formats.

Granted, some formats may be easier to use in some contexts, such as HDF5 (for space data), FITS (astronomical images), XML (for data and documents) or COBOL (for financial transactions), but if I can see formats as first-class citizens, then I can map between them.

Not in a conversion sense: I can treat them as though they were the format I prefer, extracting data from them, writing data to them, etc.

Geo Analytics Tutorial – Where 2.0 2011

Filed under: Geo Analytics,Hadoop,Mechanical Turk,Pig — Patrick Durusau @ 1:04 pm

Geo Analytics Tutorial – Where 2.0 2011

Very cool set of slides on geo analytics from Pete Skomoroch.

Includes use of Hadoop, Pig, Mechanical Turk.

April 21, 2011

IC Bias: If it’s measurable, it’s meaningful

Filed under: Data Models,Intelligence,Marketing — Patrick Durusau @ 12:37 pm

Drew Conway writes in Data Science in the U.S. Intelligence Community [1] about modeling assumptions:

For example, it is common for an intelligence analyst to measure the relationship between two data sets as they pertain to some ongoing global event. Consider, therefore, in the recent case of the democratic revolution in Egypt that an analyst had been asked to determine the relationship between the volume of Twitter traffic related to the protests and the size of the crowds in Tahrir Square. Assuming the analyst had the data hacking skills to acquire the Twitter data, and some measure of crowd density in the square, the next step would be to decide how to model the relationship statistically.

One approach would be to use a simple linear regression to estimate how Tweets affect the number of protests, but would this be reasonable? Linear regression assumes an independent distribution of observations, which is violated by the nature of mining Twitter. Also, these events happen in both time (over the course of several hours) and space (the square), meaning there would be considerable time- and spatial-dependent bias in the sample. Understanding how modeling assumptions impact the interpretations of analytical results is critical to data science, and this is particularly true in the IC.

His central point cannot be overemphasized: understanding how modeling assumptions impact the interpretations of analytical results is critical to data science, and this is particularly true in the IC.
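A few lines of simulated data make the point visible. With independent noise, neighbouring regression residuals are uncorrelated; with time-dependent data like hourly Twitter volume they are not, and the usual error bars become fiction:

    # Simulate a trending, autocorrelated series; requires numpy.
    import numpy as np

    rng = np.random.default_rng(0)
    t = np.arange(200.0)
    noise = np.zeros(200)
    for i in range(1, 200):               # AR(1) noise, far from independent
        noise[i] = 0.9 * noise[i - 1] + rng.normal()
    y = 0.5 * t + noise                   # stand-in for tweet volume

    slope, intercept = np.polyfit(t, y, 1)
    residuals = y - (slope * t + intercept)
    lag1 = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]
    print(f"lag-1 residual autocorrelation: {lag1:.2f}")  # far from 0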

The example of Twitter traffic reveals a deeper bias in the intelligence community: if it’s measurable, it’s meaningful.

No doubt Twitter facilitated communication within communities that already existed, but that does not make it an enabling technology.

The revolution was made possible by community organizers working over decades (http://english.aljazeera.net/news/middleeast/2011/02/2011212152337359115.html) and trade unions (http://www.guardian.co.uk/commentisfree/2011/feb/10/trade-unions-egypt-tunisia).

And the revolution continued after Twitter and then cell phones were turned off.

Understanding such events requires investment in human intel and analysis, not over-reliance on SIGINT. [2]


[1] Spring (2011) issue of In-Q-Tel’s quarterly journal, IQT Quarterly

[2] That a source is technical or has lights and bells does not make it reliable or even useful.

PS: The Twitter traffic, such as it was, may have primarily been from news media: “Twitter, I think, is being used by news media people with computer connections, through those kind of means.” See Facebook, Twitter, and the Middle East, IEEE Spectrum, where Steve Cherry interviews Ben Zhao, an expert on social networking performance.

Are we really interested in how news people use Twitter, even in a social movement context?

Graph Data Management (GDM 2011)

Filed under: Conferences,Graphs — Patrick Durusau @ 12:36 pm

The 2nd International Workshop on Graph Data Management: Techniques and Applications (GDM 2011)

The Path-o-Logical Gremlin presentation I mentioned the other day was at this conference.

I am looking for slides and/or papers from the other presentations.

HBase Do’s and Don’ts

Filed under: HBase,NoSQL — Patrick Durusau @ 12:36 pm

HBase Do’s and Don’ts

From the post:

We at Cloudera are big fans of HBase. We love the technology, we love the community and we’ve found that it’s a great fit for many applications. Successful uses of HBase have been well documented and as a result, many organizations are considering whether HBase is a good fit for some of their applications. The impetus for my talk and this follow up blog post is to clarify some of the good applications for HBase, warn against some poor applications and highlight important steps to a successful HBase deployment.

Helpful review of HBase.

CouchTM Released

Filed under: CouchDB,CouchTM — Patrick Durusau @ 12:35 pm

CouchTM Released

From the announcement:

CouchTM – the topic maps engine with CouchDB as backend – is released.

Apache CouchDB is a document-oriented database that can be queried and indexed in a MapReduce fashion. According to the website, CouchDB is:

  • A document database server, accessible via a RESTful JSON API.
  • Ad-hoc and schema-free with a flat address space.
  • Distributed, featuring robust, incremental replication with bi-directional conflict detection and management.
  • Query-able and index-able, featuring a table oriented reporting engine that uses JavaScript as a query language.

Inspired by the increasing impact of NoSQL approaches like CouchDB, we developed CouchTM. This is a TMAPI topic maps engine which uses CouchDB as a backend. CouchTM can be used like any other Java based topic maps engine. Hans-Henning Koch’s Bachelor thesis provides more information about CouchTM (in German). CouchTM as well as CouchDB are available as open source.

BTW, don’t look for the files under Downloads but rather under Source (as of 4/21/2011).
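If you have never touched CouchDB, a minimal sketch of the HTTP API that CouchTM builds on may help. It assumes a local CouchDB on the default port; the database and documents are invented for illustration:

    # Everything in CouchDB is JSON over HTTP; requires the requests package.
    import requests

    base = "http://localhost:5984/topics"
    requests.put(base)  # create the database (idempotent enough for a demo)

    # store a document
    requests.put(base + "/t1", json={"type": "topic", "name": "Gene Ontology"})

    # a design document holding a JavaScript map function (the MapReduce view)
    requests.put(base + "/_design/tm", json={
        "views": {"by_type": {"map": "function(doc){ emit(doc.type, doc.name); }"}}
    })

    # query the view; view keys are JSON-encoded, hence the quoted string
    rows = requests.get(base + "/_design/tm/_view/by_type",
                        params={"key": '"topic"'}).json()["rows"]
    print(rows)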

Kuria 1.0.1 Released

Filed under: Interface Research/Design,Kuria — Patrick Durusau @ 12:35 pm

Kuria 1.0.1 Released

From the announcement:

Kuria is a frontend generator based on Java POJO annotations. The new version 1.0.1 is mainly a bug-fix release with some new features.

Kuria provides a set of java annotations and a parser to generate bindings for widget classes. These bindings are used to create input masks, tables and trees for the annotated domain model. It is strongly advised to use Kuria inside Eclipse plug-ins or maven projects.

Kuria is completely independent from any Topic Maps technology. But the “Ontology-based automatic generation of applications in Onotoa” uses Aranuka in conjunction with Kuria.

The new Kuria version 1.0.1 is mainly a bug-fix release and provides the following new features:

  • added weight attribute to field annotations to set order of widgets
  • create widgets for annotated super classes in InputMask
  • added Annox support (only in non OSGi environments)

You will get the sources and the change log for Kuria at Google Code.

schemafit

Filed under: TMCL — Patrick Durusau @ 12:33 pm

schemafit

From the webpage:

Schemafit is a generic tool which extracts a fitting TMCL schema from an “unknown” topic map.

April 20, 2011

Local and Distributed Traversal Engines

Filed under: Graphs,Gremlin,Neo4j,NoSQL,TinkerPop — Patrick Durusau @ 2:19 pm

Local and Distributed Traversal Engines

Marko Rodriguez on graph traversal engines:

In the graph database space, there are two types of traversal engines: local and distributed. Local traversal engines are typically for single-machine graph databases and are used for real-time production applications. Distributed traversal engines are typically for multi-machine graph databases and are used for batch processing applications. This divide is quite sharp in the community, but there is nothing that prevents the unification of both models. A discussion of this divide and its unification is presented in this post.

If you are interested in graphs and topic maps, definitely an effort to watch.
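The divide fits in miniature over an in-memory adjacency map (invented data). A local engine chases pointers from one vertex; a distributed engine expresses the same step as a superstep in which every frontier vertex messages its neighbours, which is what makes it batch-friendly and partitionable:

    graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}

    def local_bfs(start):
        """Local engine: breadth-first pointer chasing from one vertex."""
        seen, frontier, order = {start}, [start], []
        while frontier:
            v = frontier.pop(0)
            order.append(v)
            for w in graph[v]:
                if w not in seen:
                    seen.add(w)
                    frontier.append(w)
        return order

    def superstep_reach(start):
        """Distributed engine in spirit: one wave of 'messages' per superstep."""
        reached, frontier = {start}, {start}
        while frontier:
            messages = {w for v in frontier for w in graph[v]}  # exchange phase
            frontier = messages - reached
            reached |= frontier
        return reached

    print(local_bfs("a"))        # ['a', 'b', 'c', 'd']
    print(superstep_reach("a"))  # {'a', 'b', 'c', 'd'}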

CUFP: Commercial Users of Functional Programming

Filed under: Conferences,Functional Programming — Patrick Durusau @ 2:19 pm

CUFP: Commercial Users of Functional Programming

The How Twitter Scales post led me to the CUFP site that has free videos from previous conferences!

The call for participation for CUFP 2011 has gone out and proposals are due by 15 June 2011.

Co-located with the ACM International Conference on Functional Programming (ICFP) in Tokyo, September 19-21, 2011.

Video: How Twitter Scales with Scala

Filed under: Functional Programming,Scala — Patrick Durusau @ 2:18 pm

Video: How Twitter Scales with Scala

From the post:

Last week we told you about how Twitter is migrating its search stack from Ruby to Java. But Twitter is also known for being an early adopter of Scala. This presentation by Marius Eriksen at the Commercial Users of Functional Programming 2010 conference explains how Twitter uses Scala to scale.

Adopting Apache Hadoop in the Federal Government

Filed under: Hadoop,Lucene,NoSQL,Solr — Patrick Durusau @ 2:17 pm

Adopting Apache Hadoop in the Federal Government

Background:

The United States federal government’s USASearch program provides hosted search services for government affiliate organizations, shares APIs and web services, and operates the government’s official search engine at Search.USA.gov. The USASearch affiliate program offers free search services to any federal, state, local, tribal, or territorial government agency. Several hundred websites make use of this service, ranging from the smallest municipality to larger federal sites like weather.gov and usa.gov. The USASearch program leverages the Bing API as the basis for its web results and then augments the user search experience by providing a variety of government-centric information such as related search topics and highlighted editorial content. The entire system is comprised of a suite of open-source tools and resources, including Apache Solr/Lucene, OpenCalais, and Apache Hadoop. Of these, our usage of Hadoop is the most recent. We began using Cloudera’s Distribution including Apache Hadoop (CDH3) for the first time in the Fall, and since then we’ve seen our usage grow every month, not just in scale, but in scope as well. But before highlighting everything the USASearch program is doing with Hadoop today, I should explain why we began using it in the first place.

Thoughts on how to relate topic maps to technologies that already have their foot in the door?

5 Reasons Why Product Data Integration is Like Chasing Roadrunners

Filed under: Data Integration,Marketing — Patrick Durusau @ 2:16 pm

5 Reasons Why Product Data Integration is Like Chasing Roadrunners

Abstract:

Integrating product data carries a tremendous value, but cleanly integrating that data across multiple applications, data stores, countries and businesses can be as elusive a goal as catching the famed Looney Tunes character.

So why do it?

As a report from the Automotive Aftermarket Industry Association pointed out, assuming $100 billion in transactions between suppliers and direct customers in the aftermarket each year, the shared savings potential tops $1.7 billion annually by eliminating product data errors in the supply chain. That’s just potential savings in one industry, in one year.

Note to self: The 1.7% savings on transaction errors requires a flexible and accurate mapping from one party’s information system to another. Something topic maps excel at.

You know what they say, a few $billion here, a few $billion there, and pretty soon you are talking about real money.

Giving a Single Name a Single Identity

Filed under: Marketing,Subject Identity — Patrick Durusau @ 2:15 pm

Giving a Single Name a Single Identity

This was just too precious to pass up.

The securities industry, parts of it anyway, would like to identify what is being traded in a reliable way.

Answer: Well, we’ll just pick a single identifier, etc. Read the article for the details but see near the end:

If you are running a worldwide trading desk in search of buyers or sellers in every corner of the world, you’re going to have a hard time finding them, in a single universal manner, says Robin Strong, Director of Buy-side Market Strategy, at Fidessa Group, a supplier of trading systems.

That is “primarily because the parties involved at the end of the various bits of wire from a single buy-side dealing desk don’t tend to cooperate. They’re all competitors. They want to own their piece of the value chain,’’ whether it’s a coding system or an order management system. “They’ve built a market that they own” and want to protect, he said.

With a topic map you could create a mapping into other markets.

Topic maps: Enhance the market you own with a part of someone else’s.

How is that for a marketing slogan?

Should by some mischance a single identifier come about, topic maps can help preserve insider semantics and maintain the unevenness of the playing field.
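The mapping itself is mundane to sketch. Below, one security known under two identifier schemes collapses into a single subject; the identifiers and class names are invented for illustration:

    # A toy identifier registry that merges records sharing any identifier.
    class Registry:
        def __init__(self):
            self._by_id = {}  # (scheme, value) -> subject dict

        def merge(self, *ids, **properties):
            """Merge everything known under any of `ids` into one subject."""
            subject = {"ids": set(), "props": {}}
            for key in ids:
                if key in self._by_id:            # already known: fold it in
                    existing = self._by_id[key]
                    subject["ids"] |= existing["ids"]
                    subject["props"].update(existing["props"])
            subject["ids"].update(ids)
            subject["props"].update(properties)
            for key in subject["ids"]:            # repoint every identifier
                self._by_id[key] = subject
            return subject

        def lookup(self, scheme, value):
            return self._by_id.get((scheme, value))

    r = Registry()
    r.merge(("cusip", "12345X104"), name="Example Corp common stock")
    r.merge(("isin", "US12345X1046"), ("cusip", "12345X104"))  # same security
    print(r.lookup("isin", "US12345X1046")["props"]["name"])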
