Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

February 16, 2011

PYGR: Python Graph Database for Bioinformatics

Filed under: Bioinformatics,Graphs,NoSQL — Patrick Durusau @ 12:57 pm

PYGR: Python Graph Database for Bioinformatics

From the website:

Pygr is open source software designed to make it easy to do powerful sequence and comparative genomics analyses, even with extremely large multi-genome alignments.

  • Bioinformatics tools for sequence analysis and comparative genomics such as sequence databases, search methods such as BLAST, repeat-masking, megablast, etc., sequence annotation databases and annotation query, and sequence alignment datasets.
  • Data namespace for accessing a given resource with seamless data relationship management. Easy data sharing that includes transparent access over network protocols.
  • High performance graph representation of interval data

If anyone has any spare grant money lying around for graduate student work, it would be really interesting to see an interdisciplinary history of graph databases.

We may not be able to avoid re-inventing the wheel but perhaps we could re-invent stronger wheels more quickly.
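To make the “graph representation of interval data” idea concrete without installing Pygr, here is a minimal Python sketch using NetworkX. The interval names and coordinates are invented and this is not Pygr’s API, just the underlying idea: intervals as nodes, overlaps as edges.

```python
# Conceptual sketch (not Pygr's API): genomic intervals become graph nodes and
# overlaps between them become edges, so annotation queries are graph queries.
# Interval names and coordinates are invented for illustration.
import networkx as nx

intervals = {
    "geneA":   ("chr1", 1_000, 5_000),
    "geneB":   ("chr1", 4_500, 9_000),
    "repeat1": ("chr1", 8_800, 9_200),
    "geneC":   ("chr2", 2_000, 3_000),
}

def overlaps(a, b):
    """True if two (chrom, start, end) intervals share any bases."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

g = nx.Graph()
g.add_nodes_from(intervals)
names = list(intervals)
for i, x in enumerate(names):
    for y in names[i + 1:]:
        if overlaps(intervals[x], intervals[y]):
            g.add_edge(x, y, relation="overlaps")

# Which annotations touch geneB?
print(sorted(g.neighbors("geneB")))   # ['geneA', 'repeat1']
```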

Announcing Neo4j on Windows Azure – Post

Filed under: Graphs,Neo4j,NoSQL — Patrick Durusau @ 12:54 pm

Announcing Neo4j on Windows Azure

Peter Neubauer and Magnus Mårtensson write:

Neo4j has a ‘j’ appended to the name. And now it is available on Windows Azure? This proves that in the most unlikely of circumstances sometimes beautiful things can emerge. Microsoft has promised Java to be a valued “first class citizen” on Windows Azure. In this blog post we will show that it is no problem at all to host a sophisticated and complex server product such as the Neo4j graph database server on Window Azure. Since Neo4j has a REST API over HTTP you can speak to this server from your regular .NET (or Java) applications, inside or outside of the cloud just as easily as you speak to Windows Azure Storage.
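Since the REST API over HTTP is the interesting part for interoperability, here is a minimal Python sketch of talking to such a server from outside the JVM. The host, port, property values, and relationship type are my assumptions, and I am assuming the classic /db/data node and relationship endpoints are exposed; adjust the base URL for an Azure-hosted instance.

```python
# Sketch: create two nodes and a relationship over Neo4j's HTTP REST API.
# Assumes a server at localhost:7474 exposing the classic /db/data endpoints;
# property values and the relationship type are invented for illustration.
import requests

BASE = "http://localhost:7474/db/data"

def create_node(props):
    resp = requests.post(f"{BASE}/node", json=props)
    resp.raise_for_status()
    return resp.json()["self"]            # URI of the newly created node

def relate(from_uri, to_uri, rel_type):
    resp = requests.post(f"{from_uri}/relationships",
                         json={"to": to_uri, "type": rel_type})
    resp.raise_for_status()
    return resp.json()["self"]

alice = create_node({"name": "Alice"})
topic = create_node({"name": "Topic Maps"})
relate(alice, topic, "INTERESTED_IN")
```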

Would be interesting if the cloud proves to be the impetus for the next step in interoperability.

The more opportunities there are for interoperability, the greater the need I see for topic maps to govern semantic interoperability.

So long as data interchange/interoperability is largely theoretical, there isn’t much sense in being concerned.

When your nearest competitor is gaining ground or pulling away because of data interchange/interoperability, it is an entirely different matter.

February 15, 2011

TSearch Primer

Filed under: PostgreSQL,SQL,TSearch — Patrick Durusau @ 2:05 pm

TSearch Primer

From the website:

TSearch is a Full-Text Search engine that is packaged with PostgreSQL. The key developers of TSearch are Oleg Bartunov and Teodor Sigaev who have also done extensive work with GiST and GIN indexes used by PostGIS, PgSphere and other projects. For more about how TSearch and OpenFTS got started check out A Brief History of FTS in PostgreSQL. Check out the TSearch Official Site if you are interested in related TSearch tips or interested in donating to this very worthy project.

Tsearch is different from regular string searching in PostgreSQL in a couple of key ways.

  1. It is well-suited for searching large blobs of text since each word is indexed using a Generalized Inverted Index (GIN) or Generalized Search Tree (GiST) and searched using text search vectors. GIN is generally used for indexing. Search vectors are at word and phrase boundaries.
  2. TSearch has a concept of Linguistic significance using various language dictionaries, ISpell, thesaurus, stop words, etc. therefore it can ignore common words and equate like meaning terms and phrases.
  3. TSearch is for the most part case insensitive.
  4. While various dictionaries and configs are available out of the box with TSearch, one can create new ones and customize existing further to cater to specific niches within industries – e.g. medicine, pharmaceuticals, physics, chemistry, biology, legal matters.

Short introduction to TSearch, which is part of PostgreSQL.

Should be of interest to topic mappers using PostgreSQL.
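As a minimal sketch of the tsvector/tsquery/GIN workflow described above, driven from Python: the table, columns, and search terms are invented, and psycopg2 plus an existing database are assumed.

```python
# Sketch of PostgreSQL full-text search with a GIN index. Table name, column
# names, and search terms are invented for illustration.
import psycopg2

conn = psycopg2.connect("dbname=example")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS docs (
        id   serial PRIMARY KEY,
        body text,
        tsv  tsvector
    )
""")
# Populate the search vector and index it; stemming and stop words come from
# the 'english' text search configuration.
cur.execute("UPDATE docs SET tsv = to_tsvector('english', body)")
cur.execute("CREATE INDEX IF NOT EXISTS docs_tsv_idx ON docs USING gin(tsv)")

cur.execute("""
    SELECT id, ts_rank(tsv, query) AS rank
    FROM docs, to_tsquery('english', 'topic & map') AS query
    WHERE tsv @@ query
    ORDER BY rank DESC
""")
print(cur.fetchall())
conn.commit()
```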

Auto Completion

Filed under: Authoring Topic Maps,Redis — Patrick Durusau @ 1:54 pm

Auto-completion is a feature that I find useful in a number of applications.

I suspect users would find that to be the case for topic map authoring and navigation software.

One article to look at is: Auto Complete with Redis.

Which was cited by: Announcing Soulmate, A Redis-Backed Service For Fast Autocompleting

The second item being an application complete with an interface.

From the Soulmate announcement:

Inspired by Auto Complete with Redis, Soulmate uses sorted sets to build an index of partially completed words and the corresponding top matching items, and provides a simple sinatra app to query them.

Here’s a quick overview of what the initial version of Soulmate supports:

  • Provide suggestions for multiple types of items in a single query (at SeatGeek we’re autocompleting for performers, events, and venues)
  • Results are ordered by a user-specified score
  • Arbitrary metadata for each item (at SeatGeek we’re storing both a url and a subtitle)

I rather like the idea of arbitrary metadata.

Could be a utility that presents snippets to paste into a topic map?
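For the curious, a minimal sketch of the sorted-set prefix index described in the announcement. The key naming, item names, and scores are my own inventions, not Soulmate’s actual schema.

```python
# Sketch of prefix autocompletion with Redis sorted sets: every prefix of an
# item's name becomes a key whose sorted set ranks matching items by score.
# Key naming, items, and scores are invented; not Soulmate's actual schema.
import redis

r = redis.Redis()

def index_item(name, score, kind="venue"):
    for i in range(1, len(name) + 1):
        r.zadd(f"ac:{kind}:{name[:i].lower()}", {name: score})

def suggest(prefix, kind="venue", limit=5):
    # Highest-scored completions first.
    return r.zrevrange(f"ac:{kind}:{prefix.lower()}", 0, limit - 1)

index_item("Madison Square Garden", 95)
index_item("Madison Theater", 40)
print(suggest("madi"))   # [b'Madison Square Garden', b'Madison Theater']
```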

UMBEL – Reference Concept Ontology and Vocabulary 1.0

Filed under: Ontology,UMBEL,Vocabularies — Patrick Durusau @ 1:41 pm

UMBEL – Reference Concept Ontology and Vocabulary 1.0 has been released!

From the website:

This is the official Web site for the UMBEL Vocabulary and Reference Concept Ontology (namespace: umbel). UMBEL is the Upper Mapping and Binding Exchange Layer, designed to help content interoperate on the Web.

UMBEL provides two valuable functions:

  • First, it is a vocabulary for the construction of concept-based domain ontologies, designed to act as references for the linking and mapping of external content, and
  • Second, it is its own broad, general reference structure of 28,000 concepts, which provides a scaffolding to orient other datasets and domain vocabularies.

The mappings in Annex F: Mapping with UMBEL use owl:sameAs and umbel:isLike.

I would prefer more specific reasons for mapping, particularly given the varying uses of owl:sameAs, which could mean just about anything.

Still, this is a valuable data set, although I would use it for mappings with more specific reasoning disclosed as part of the mapping.
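To make the contrast concrete, a small rdflib sketch of the two mapping predicates side by side; the concept URIs and the UMBEL namespace URI below are my assumptions, for illustration only.

```python
# Sketch: the same external concept mapped with owl:sameAs (asserts identity)
# versus umbel:isLike (asserts a looser resemblance). The concept URIs and
# the UMBEL namespace URI are assumptions for illustration only.
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import OWL

UMBEL = Namespace("http://umbel.org/umbel#")
g = Graph()
g.bind("umbel", UMBEL)
g.bind("owl", OWL)

reference = URIRef("http://umbel.org/umbel/rc/Bank")
external = URIRef("http://example.org/finance/Bank")

g.add((external, OWL.sameAs, reference))    # strong claim: identical concepts
g.add((external, UMBEL.isLike, reference))  # weaker claim: mere resemblance

print(g.serialize(format="turtle"))
```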

PS: It has a really cool logo!

Webmail for Millions, Powered by Erlang

Filed under: Erlang,Hibari,NoSQL — Patrick Durusau @ 11:38 am

Webmail for Millions, Powered by Erlang

From the website:

Scott Lystig Fritchie presents the architecture and lessons learned implementing a webmail system in Erlang, using UBF and Hibari, a distributed key-value store, to accommodate a large user base.

UBF? (new to me)

From http://norton.github.com/ubf:

UBF is the “Universal Binary Format”, designed and implemented by Joe Armstrong. UBF is a language for transporting and describing complex data structures across a network. It has three components:

  • UBF(A) is a “language neutral” data transport format, roughly equivalent to well-formed XML.
  • UBF(B) is a programming language for describing types in UBF(A) and protocols between clients and servers. This layer is typically called the “protocol contract”. UBF(B) is roughly equivalent to Verified XML, XML-schemas, SOAP and WDSL.
  • UBF(C) is a meta-level protocol used between a UBF client and a UBF server.

Potential lessons for those developing scalable topic map applications.

Scala: Introduction to Scala for Java Programmers

Filed under: Java,Scala — Patrick Durusau @ 11:27 am

Scala: Introduction to Scala for Java Programmers by Adam Rabung.

Useful for Java programmers looking at Scala for topic map development.

WinCouch

Filed under: CouchDB,NoSQL — Patrick Durusau @ 11:25 am

WinCouch

From the website:

The one-click CouchDB package for Windows like the Jan’s CouchDBX for Mac OSX.

  • Based on CouchDB-1.0.2 binaries from Dave Cottlehuber.
  • Used the GeckoFX to embed Mozilla Gecko (Firefox) into the application.

A CouchDB package for Windows.

I tried to access the www.geckofx.org website on several days but was unable to connect. I was able to connect to http://code.google.com/p/geckofx/, which points to www.geckofx.org. I think this could be of interest to topic map application developers.

If someone knows the status of the www.geckofx.org site, please post a note here. Thanks!

Cassandra 0.7.1 Release

Filed under: Cassandra,NoSQL — Patrick Durusau @ 11:06 am

Cassandra 0.7.1

Largest production cluster reported to be 100 TB spread over 150 machines.

It occurs to me that most topic map engines support SQL backends.

I will be checking in on the SQL world for recent developments that are relevant to topic maps.

Modern Science and the Bayesian-Frequentist Controversy – Post

Filed under: Bayesian Models — Patrick Durusau @ 11:04 am

Modern Science and the Bayesian-Frequentist Controversy is a post by John Myles White, citing an essay by Bradley Efron, Modern Science and the Bayesian-Frequentist Controversy.

Topic maps will be using statistical means to manage large data sets so I commend the posting and essay to you for review.

February 14, 2011

Substring search algorithm

Filed under: String Matching — Patrick Durusau @ 2:58 pm

Substring search algorithm by Leonid Volnitsky.

From the website:

Described new online substring search algorithm which allows faster string traversal. Presented here implementation is substantially faster than any other online substring search algorithms.

Will be interesting to see which topic map engines incorporate this new substring search algorithm first.

Data Mining Video

Filed under: Data Mining — Patrick Durusau @ 1:44 pm

How I OCR hundreds of hours of video.

A very useful posting for anyone interested in mining the text overlays displayed during TV coverage.

Here the context is legislative coverage but I assume the same principles apply in other contexts.

One topic map aspect would be to create mappings to other materials involving the same parties or issues.
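The linked post describes the author’s own pipeline; purely as a generic sketch of the same idea (sample frames, crop the overlay region, OCR it), assuming OpenCV and pytesseract. The file path, crop region, and sampling interval are invented.

```python
# Generic sketch of OCR-ing text overlays from video: sample roughly one frame
# per second, crop the lower part of the frame where overlays usually sit, and
# OCR it. Not the linked post's pipeline; path, crop, and interval are invented.
import cv2
import pytesseract

cap = cv2.VideoCapture("legislature.mp4")
fps = int(cap.get(cv2.CAP_PROP_FPS)) or 30

frame_no = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_no % fps == 0:
        height = frame.shape[0]
        overlay = frame[int(height * 0.75):, :]   # lower quarter of the frame
        text = pytesseract.image_to_string(overlay).strip()
        if text:
            print(frame_no // fps, text)
    frame_no += 1

cap.release()
```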

PyML

Filed under: Machine Learning — Patrick Durusau @ 1:44 pm

PyML is an interactive object-oriented framework for machine learning in Python.

From the website:

PyML has been tested on Mac OS X and Linux. Some components are in C++ so it’s not automatically portable.

Here are some key features of “PyML”:

  • Classifiers: Support Vector Machines (SVM), nearest neighbor classifiers, ridge regression
  • Multi-class methods (one-against-one and one-against-rest)
  • Feature selection (filter methods, RFE, multiplicative update)
  • Model selection
  • Syntax for combining classifiers
  • Classifier testing (cross-validation, error rates, ROC curves, statistical test for comparing classifiers)

If you are running Mac OS X, please let me know what you think about this package.
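I have not tried PyML myself, so rather than guess at its API, here is an equivalent scikit-learn sketch of the core feature set listed above, an SVM classifier evaluated with cross-validation, on a synthetic dataset.

```python
# Not PyML's API: a scikit-learn equivalent of the features listed above
# (SVM classifier plus cross-validation), run on a synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
clf = SVC(kernel="rbf", C=1.0)

scores = cross_val_score(clf, X, y, cv=5)      # 5-fold cross-validation
print("accuracy per fold:", scores.round(3))
print("mean accuracy:", scores.mean().round(3))
```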

Free Encyclopedia of Interactive Design, Usability and User Experience

Filed under: Interface Research/Design,Visualization — Patrick Durusau @ 1:38 pm

Free Encyclopedia of Interactive Design, Usability and User Experience

A resource like this will not make non-design types (ask me about the cover on my first book some time) into designers.

It may give you a greater appreciation for design issues and the ability to at least sense when something is wrong.

So you will know when to ask designers for their assistance.

Interfaces that are intuitive and inviting will go a long way towards selling a paradigm.

NetworkX Introduction: Hacking social networks using the Python programming language

Filed under: Social Networks — Patrick Durusau @ 11:34 am

NetworkX Introduction: Hacking social networks using the Python programming language

Aric Hagberg (Los Alamos National Laboratory) and Drew Conway (New York University) for the 2011 Sunbelt Conference on Social Networks.
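If you want to try the ideas before opening the slides, here is a minimal NetworkX sketch; the names and edges are invented.

```python
# Tiny social-network example with NetworkX: build a who-knows-whom graph and
# rank people by degree centrality. Names and edges are invented.
import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ("alice", "bob"), ("alice", "carol"),
    ("bob", "carol"), ("carol", "dave"),
])

centrality = nx.degree_centrality(g)
for person, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{person}: {score:.2f}")
```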

I ran across this while looking at the “Data Bootcamp” materials that Drew Conway posted.

Still reading The Social Life of Information, and I remembered that not all topic map folks use Perl ;-), which makes two good reasons to mention this resource.

Seriously, information only exists (in any meaningful sense) in social networks.

It could be that information exists independently of us, but how interesting is that?

Even the arguments about the existence of information in our absence take place in our presence. How ironic is that?

Technology vs. Teaching?

Filed under: Examples,Marketing,Topic Maps — Patrick Durusau @ 11:25 am

The reported experiences with technology at the “Data Bootcamp” tutorial at O’Reilly’s Strata Conference 2011 (installation and other woes) made me think about technology and teaching for topic maps.

Is it a question of technology vs. teaching?

If technology gets in the way of teaching, does the same happen for users?

I don’t know of any user studies where users are presented with a particular topic map interface and a list of questions to answer or tasks to perform.

Has anyone done such studies?

It would be really good to have a public archive of videos of such sessions (with permission of the participants).

****
PS: For topic map presentations/workshops, it would be good to record comments so tests of the presentation/workshop could be done in advance.

I have done presentations where the slides were perfectly clear to me when I wrote them. At presentation time I had to temporize to remember what I was trying to say. You can imagine the difficulty the audience was having. 😉

“Data Bootcamp” tutorial at O’Reilly’s Strata Conference 2011

Filed under: Data Analysis,Data Mining — Patrick Durusau @ 10:39 am

“Data Bootcamp” tutorial at O’Reilly’s Strata Conference 2011

All the materials from the “Data Bootcamp.”

I haven’t had time to review the materials but am looking forward to it.

Machine Learning Lectures (Video)

Filed under: Machine Learning — Patrick Durusau @ 10:34 am

Machine Learning by Andrew Ng, Stanford University, on iTunes.

These videos will not make the Saturday night movie schedule in most households but will definitely repay close study.

Augmented topic map authoring is a necessary part of creating topic maps for large data sets.

February 13, 2011

Elliptics in production

Filed under: Elliptics,NoSQL — Patrick Durusau @ 1:51 pm

Elliptics in production

From the project website:

Elliptics network is a fault tolerant distributed hash table object storage.

The network does not use dedicated servers to maintain the metadata information, it supports redundant objects storage and implements transactional data update. Small to medium sized write benchmarks can be found (its the latest to date, other presented earlier) in the appropriate blog section.

Distributed hash table design allows not to use dedicated metadata servers which frequently become points of failure in the classical storages, instead user can connect to any server in the network and all requests will be forwarded to the needed nodes, one can also lookup the needed server and connect there directly. It can really be called a cloud of losely connected equivalent nodes. Joining node will automatically connect to the needed servers according to the network topology, it can store data in different configurable backends like file IO storage, eblob backend or using own IO storage backend.

Protocol allows to implement own data storage using specific features for the deploying project and generally extend data communication with infinite number of the extensions. One of the implemented examples is remote command execution, which can be used as a load balancing job manager.

Hard to say which of the NoSQL solutions will make useful backends or other components in a topic map system.

But, I would rather err on the side of being inclusive.
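The “no dedicated metadata servers” design described above comes down to consistent hashing: any node (or client) can compute which node owns a key. A toy Python sketch of the idea, not Elliptics’ actual routing code:

```python
# Toy consistent-hash ring: any client or node can locate the owner of a key
# without consulting a metadata server. Illustrates the DHT idea only; this is
# not Elliptics' actual routing code.
import bisect
import hashlib

def h(value):
    return int(hashlib.sha1(value.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, replicas=64):
        # Place several virtual points per node on the ring for better balance.
        self._points = sorted(
            (h(f"{node}-{i}"), node) for node in nodes for i in range(replicas)
        )
        self._hashes = [p for p, _ in self._points]

    def owner(self, key):
        idx = bisect.bisect(self._hashes, h(key)) % len(self._hashes)
        return self._points[idx][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.owner("user:42"), ring.owner("topic:opera"))
```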

How It Works – The “Musical Brain”

Filed under: Data Source,Music Retrieval,Natural Language Processing — Patrick Durusau @ 1:46 pm

How It Works – The “Musical Brain”

I found this following the links in the Million Song Dataset post.

One aspect, among others, that I found interesting, was the support for multiple ID spaces.

I am curious about the claim it works by:

Analyzing every song on the web to extract key, tempo, rhythm and timbre and other attributes — understanding every song in the same way a musician would describe it

Leaving aside the ambitious claims about NLP processing made elsewhere on that page, I find it curious that there is a uniform method for describing music.

Or perhaps they mean that the “Musical Brain” uses only one description uniformly across the music it evaluates. I can buy that. And it could well be a useful exercise.

At least from the perspective of generating raw data that could then be mapped to other nomenclatures used by musicians.

I wonder if the Rolling Stone uses the same nomenclature as the “Musical Brain?” Will have to check.

Suggestions for other music description languages? Mappings to the one used by the “Musical Brain?”

BTW, before I forget, the “Musical Brain” offers a free API (for non-commercial use) to its data.

Would appreciate hearing about your experiences with the API.

Semantic Multimedia

Filed under: Conferences,Multimedia,Semantics — Patrick Durusau @ 1:45 pm

Special Issue on Semantic Multimedia.

The Journal of Semantic Computing has issued the following call for papers:

In the new millennium Multimedia Computing plays an increasingly important role as more and more users produce and share a constantly growing amount of multimedia documents. The sheer number of documents available in large media repositories or even the World Wide Web makes indexing and retrieval of multimedia documents as well as browsing and annotation more important tasks than ever before. Research in this area is of great importance because of the very limited understanding of the semantics of such data sources as well as the limited ways in which they can be accessed by the users today. The field of Semantic Computing has much to offer with respect to these challenges. This special issue invites articles that bring together Semantic Computing and Multimedia to address the challenges arising by the constant growth of Multimedia.

Important Dates
June 3, 2011: Submissions due
August 3, 2011: Notification date
October 18, 2011: Final versions due

Personally I would argue there is “…very limited understanding of the semantics of … [all] data sources….” 😉

Multimedia documents are more popular and expected, so the failure there may be more visible.

Apache Lucene 3.0 Tutorial

Filed under: Authoring Topic Maps,Indexing,Lucene — Patrick Durusau @ 1:34 pm

Apache Lucene 3.0 Tutorial by Bob Carpenter.

At 20 pages it isn’t your typical “Hello World” introduction. 😉

It should be the first document you hand a semi-technical person about Lucene.

Discovering the vocabulary of the documents/domain for which you are building a topic map is a critical first step.

Indexing documents gives you an important check on the accuracy and completeness of the information you are given by domain “experts” and users.

There will be terms that are transparent to them and can only be clarified if you ask.
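Lucene itself is Java, but the “discover the vocabulary first” step can be sketched in a few lines of Python. The tokenization, stop words, and sample documents below are simplified assumptions, not Lucene code.

```python
# Sketch of vocabulary discovery: tokenize a document collection, drop a few
# stop words, and list the most frequent terms as candidate subjects for a
# topic map. Tokenization, stop words, and documents are simplified examples.
import re
from collections import Counter

documents = [
    "Topic maps model subjects and the associations between them.",
    "Lucene builds an inverted index over the terms in each document.",
    "Domain experts rarely notice the terms that are obvious to them.",
]
stop_words = {"the", "and", "an", "to", "in", "that", "are", "over", "between"}

terms = Counter()
for doc in documents:
    for token in re.findall(r"[a-z]+", doc.lower()):
        if token not in stop_words:
            terms[token] += 1

for term, count in terms.most_common(10):
    print(f"{term:12} {count}")
```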

Text Analysis with LingPipe, Version 0.3

Filed under: Authoring Topic Maps,Indexing,LingPipe — Patrick Durusau @ 1:30 pm

Text Analysis with LingPipe 4 (Version 0.3) By Bob Carpenter.

On the importance of this book see: LingPipe Home.

O’Reilly Book Sale

Filed under: Books,MongoDB,NoSQL,Topic Maps — Patrick Durusau @ 7:14 am

OK, ok, one more not-strictly topic map item and I promise no more for today!

Buy 2, Get 1 Free at O’Reilly caught my eye this morning.

I think it is justified to appear here for two reasons:

1) It has a lot of books, such as those on databases, that are relevant to implementing topic map systems.

But, just as importantly:

2) The O’Reilly online catalog illustrates the need for topic maps.

Look at the catalog listings under Other Databases for MongoDB (you may have heard about it). Now look under Database Design and Analysis. Oops! There you will find: MongoDB: The Definitive Guide. (at least as of 13 February 2011, 7:01 AM Eastern time)

One way (not the only way) to implement a topic map here would result in a single source of updates across the catalog. And the catalog could also act as a resource pointer to other materials. The Other Resources listing for the MongoDB book isn’t terribly inspiring.

*****
PS: I am hopeful the interest in NoSQL will drive greater exploration of MySQL, PostgreSQL and Oracle databases in general and as part of topic map systems in particular.

Programming Scala

Filed under: Merging,Scala — Patrick Durusau @ 6:51 am

Programming Scala by Dean Wampler and Alex Payne.

Experimental book at O’Reilly Labs.

Seems to be the day for not-strictly topic map posts, but I think Scala is going to be important both for topic maps and for scalable programming in general.

I suspect that reflects my personal view that functional approaches to merging are more likely to be successful with topic maps than approaches that rely upon mutable objects.

Comments about your experience with Scala, particularly with regard to topic maps and with this book most welcome!

Practical Transformation Using XSLT and XPath

Filed under: XPath,XSLT,XTM — Patrick Durusau @ 6:17 am

A new edition of Practical Transformation Using XSLT and XPath by Ken Holman is out.

While not topic map specific, ;-), this is one of the two resources you need for transformations getting to (or from) topic maps using XSLT and XPath. The other one, would be: XSLT 2.0 and XPath 2.0: programmer’s reference. (You can also use both of these for non-topic map, XML based work.)

While you’re looking at Ken’s training resources, note his series on UBL (Universal Business Language).

I mention that because the greater the exposure of business systems the greater the need for the mapping of semantics (that means topic maps).

dygraphs JavaScript Visualization Library

Filed under: Graphics,Javascript,Time Series — Patrick Durusau @ 5:50 am

dygraphs JavaScript Visualization Library

From the website:

dygraphs is an open source JavaScript library that produces interactive, zoomable charts of time series. It is designed to display dense data sets and enable users to explore and interpret them.

If your topic map contains or can be viewed as a time series, this graphics library may be of interest to you.

Software for Non-Human Users?

The description of: Emerging Intelligent Data and Web Technologies (EIDWT-2011) is a call for software designed for non-human users.

The Social Life of Information by John Seely Brown and Paul Duguid, makes it clear that human users don’t want to share data because sharing data represents a loss of power/status.

A poll of the readers of CACM or Computer would report a universal experience of working in an office where information is hoarded by individuals in order to increase their own status or power.

9/11 was preceded and followed, to this day, by a non-sharing of intelligence data. Even national peril cannot overcome the non-sharing reflex with regard to data.

EIDWT-2011, and conferences like it, are predicated on a sharing of data known not to exist, at least among human users.

Hence, I suspect the call must be directed at software for non-human users.

Emerging Intelligent Data and Web Technologies (EIDWT-2011)

2nd International Conference on Emerging Intelligent Data and Web Technologies (EIDWT-2011)

From the announcement:

The 2-nd International Conference on Emerging Intelligent Data and Web Technologies (EIDWT-2011) is dedicated to the dissemination of original contributions that are related to the theories, practices and concepts of emerging data technologies yet most importantly of their applicability in business and academia towards a collective intelligence approach. In particular, EIDWT-2011 will discuss advances about utilizing and exploiting data generated from emerging data technologies such as Data Centers, Data Grids, Clouds, Crowds, Mashups, Social Networks and/or other Web 2.0 implementations towards a collaborative and collective intelligence approach leading to advancements of virtual organizations and their user communities. This is because, current and future Web and Web 2.0 implementations will store and continuously produce a vast amount of data, which if combined and analyzed through a collective intelligence manner will make a difference in the organizational settings and their user communities. Thus, the scope of EIDWT-2011 is to discuss methods and practices (including P2P) which bring various emerging data technologies together to capture, integrate, analyze, mine, annotate and visualize data – made available from various community users – in a meaningful and collaborative for the organization manner. Finally, EIDWT-2011 aims to provide a forum for original discussion and prompt future directions in the area.

Important Dates:

Submission Deadline: March 10, 2011
Authors Notification: May 10, 2011
Author Registration: June 10, 2011
Final Manuscript: July 1, 2011
Conference Dates: September 7 – 9, 2011

February 12, 2011

Learning Kernel Classifiers

Filed under: Kernel Methods,R — Patrick Durusau @ 5:28 pm

Learning Kernel Classifiers by Ralf Herbrich.

A bit dated (2001) and available on the web only in partial form, but it may still be a useful work.

The source code listings appear to be complete and are mostly written in R.

Interested in how anyone sees this versus more recent works on kernel classifiers.
