Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

April 23, 2011

NoSQL, NewSQL and Beyond:…

Filed under: Marketing,NoSQL — Patrick Durusau @ 8:21 pm

NoSQL, NewSQL and Beyond: The answer to SPRAINed relational databases

From the post:

The 451 Group’s new long format report on emerging database alternatives, NoSQL, NewSQL and Beyond, is now available.

The report examines the changing database landscape, investigating how the failure of existing suppliers to meet the performance, scalability and flexibility needs of large-scale data processing has led to the development and adoption of alternative data management technologies.

There is one point that I think presents an opportunity for topic maps:

Polyglot persistence, and the associated trend toward polyglot programming, is driving developers toward making use of multiple database products depending on which might be suitable for a particular task.

I don’t know if the report covers the reasons for polyglot persistence as I don’t have access to the “full” version of the report. Maybe someone who does can say if the report covers why the polyglot nature of IT resources is immune to attempts at its reduction.

April 22, 2011

Square Pegs and Round Holes in the NOSQL World

Filed under: Graphs,Key-Value Stores,Marketing,Neo4j,NoSQL — Patrick Durusau @ 1:06 pm

Square Pegs and Round Holes in the NOSQL World

Jim Webber reviews why graph databases (such as Neo4J) are better for storing graphs than Key-Value, Document or relational datastores.

He concludes:

In these kind of situations, choosing a non-graph store for storing graphs is a gamble. You may find that you’ve designed your graph topology far too early in the system lifecycle and lose the ability to evolve the structure and perform business intelligence on your data. That’s why Neo4j is cool – it keeps graph and application concerns separate, and allows you to defer data modelling decisions to more responsible points throughout the lifetime of your application.

You know, we could say the same thing about topic maps, that you don’t have to commit to all modeling decisions up front.

Something to think about.

April 21, 2011

HBase Do’s and Don’ts

Filed under: HBase,NoSQL — Patrick Durusau @ 12:36 pm

HBase Do’s and Don’ts

From the post:

We at Cloudera are big fans of HBase. We love the technology, we love the community and we’ve found that it’s a great fit for many applications. Successful uses of HBase have been well documented and as a result, many organizations are considering whether HBase is a good fit for some of their applications. The impetus for my talk and this follow up blog post is to clarify some of the good applications for HBase, warn against some poor applications and highlight important steps to a successful HBase deployment.

Helpful review of HBase.

April 20, 2011

Local and Distributed Traversal Engines

Filed under: Graphs,Gremlin,Neo4j,NoSQL,TinkerPop — Patrick Durusau @ 2:19 pm

Local and Distributed Traversal Engines

Marko Rodriguez on graph traversal engines:

In the graph database space, there are two types of traversal engines: local and distributed. Local traversal engines are typically for single-machine graph databases and are used for real-time production applications. Distributed traversal engines are typically for multi-machine graph databases and are used for batch processing applications. This divide is quite sharp in the community, but there is nothing that prevents the unification of both models. A discussion of this divide and its unification is presented in this post.

If you are interested in graphs and topic maps, definitely an effort to watch.

Adopting Apache Hadoop in the Federal Government

Filed under: Hadoop,Lucene,NoSQL,Solr — Patrick Durusau @ 2:17 pm

Adopting Apache Hadoop in the Federal Government

Background:

The United States federal government’s USASearch program provides hosted search services for government affiliate organizations, shares APIs and web services, and operates the government’s official search engine at Search.USA.gov. The USASearch affiliate program offers free search services to any federal, state, local, tribal, or territorial government agency. Several hundred websites make use of this service, ranging from the smallest municipality to larger federal sites like weather.gov and usa.gov. The USASearch program leverages the Bing API as the basis for its web results and then augments the user search experience by providing a variety of government-centric information such as related search topics and highlighted editorial content. The entire system is comprised of a suite of open-source tools and resources, including Apache Solr/Lucene, OpenCalais, and Apache Hadoop. Of these, our usage of Hadoop is the most recent. We began using Cloudera’s Distribution including Apache Hadoop (CDH3) for the first time in the Fall, and since then we’ve seen our usage grow every month— not just in scale, but in scope as well. But before highlighting everything the USASearch program is doing with Hadoop today, I should explain why we began using it in the first place.

Thoughts on how to relate topic maps to technologies that already have their foot in the door?

April 17, 2011

Good Relationships: The Spring Data Graph Guide Book

Filed under: Graphs,NoSQL — Patrick Durusau @ 5:25 pm

Good Relationships: The Spring Data Graph Guide Book (1.0.0-BUILD-SNAPSHOT)

From About this book:

Welcome to the Spring Data Graph Guide Book. Thank you for taking the time to get an in depth look into Spring Data Graph. This project is part of the Spring Data project, which brings the convenient programming model of the Spring Framework to modern NOSQL databases. Spring Data Graph, as the name alludes to, aims to provide support for graph databases. It currently supports Neo4j.

It was written by developers for developers. Hopefully we’ve created a document that is well received by our peers.

If you have any feedback on Spring Data Graph or this book, please provide it via the SpringSource JIRA, the SpringSource NOSQL Forum, github comments or issues, or the Neo4j mailing list.

This book is presented as a duplex book, a term coined by Martin Fowler. A duplex book consists of at least two parts. The first part is an easily accessible tutorial that gives the reader an overview of the topics contained in the book. It contains lots of examples and discussion topics. This part of the book is highly suited for cover-to-cover reading.

We chose a tutorial describing the creation of a web application that allows movie enthusiasts to find their favorite movies, rate them, connect with fellow movie geeks, and enjoy social features such as recommendations. The application is running on Neo4j using Spring Data Graph and the well-known Spring Web Stack.

April 14, 2011

There’s no schema for Science – CouchDB in Research

Filed under: CouchDB,NoSQL — Patrick Durusau @ 7:23 am

There’s no schema for Science – CouchDB in Research

Erlang Factory 2011 video of presentation by Nitin Borwankar on CouchDB.

From the website:

The cutting edge and constantly evolving nature of scientific research makes it very hard to use relational databases to model scientific data. When a hypothesis changes, the observations change and the schema changes – large volumes of data may have to be migrated. This makes it very hard for researchers and they end up using spreadsheets and flat files since they are more flexible. Enter CouchDB and the schemaless model. The talk will take three real world examples and generalize to extract some principles and help identify where you might apply these.
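The schemaless point is easy to see in miniature. Here is a toy sketch in plain Python dicts (standing in for CouchDB JSON documents, not using CouchDB itself): observations from two phases of a study coexist without migrating older records to a new schema.

```python
# Hypothetical observations: the hypothesis changed mid-study, so newer
# documents carry a "ph" field that older ones lack. No migration needed.
observations = [
    {"_id": "obs1", "sample": "A", "temp_c": 21.4},             # early hypothesis
    {"_id": "obs2", "sample": "B", "temp_c": 22.0, "ph": 6.8},  # revised: pH added
]

# A map-style query (the shape CouchDB views take) simply skips
# documents that lack the field.
ph_readings = [(doc["_id"], doc["ph"]) for doc in observations if "ph" in doc]
print(ph_readings)  # [('obs2', 6.8)]
```

The relational alternative would be an ALTER TABLE and a backfill every time the hypothesis shifts, which is exactly the pain the talk describes.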

March 30, 2011

CouchDB Tutorial: Starting to relax with CouchDB

Filed under: CouchDB,NoSQL — Patrick Durusau @ 12:36 pm

CouchDB Tutorial: Starting to relax with CouchDB

From Alex Popescu’s myNoSQL blog, a pointer to a useful tutorial on CouchDB.

CouchDB homepage

The Little MongoDB Book

Filed under: MongoDB,NoSQL — Patrick Durusau @ 12:35 pm

The Little MongoDB Book

From the webpage:

I’m happy to freely release The Little MongoDB Book; an ebook meant to help people get familiar with MongoDB and answer some of the more common questions they have.

Not complete but a useful short treatment.

March 29, 2011

Contrary to popular belief, SQL and noSQL are really just two sides of the same coin

Filed under: NoSQL,SQL — Patrick Durusau @ 12:48 pm

Contrary to popular belief, SQL and noSQL are really just two sides of the same coin

From the article:

In this article we present a mathematical data model for the most common noSQL databases—namely, key/value relationships—and demonstrate that this data model is the mathematical dual of SQL’s relational data model of foreign-/primary-key relationships. Following established mathematical nomenclature, we refer to the dual of SQL as coSQL. We also show how a single generalization of the relational algebra over sets—namely, monads and monad comprehensions—forms the basis of a common query language for both SQL and noSQL. Despite common wisdom, SQL and coSQL are not diabolically opposed, but instead deeply connected via beautiful mathematical theory.

Just as Codd’s discovery of relational algebra as a formal basis for SQL shifted the database industry from a monopolistically competitive market to an oligopoly and thus propelled a billion-dollar industry around SQL and foreign-/primary-key stores, we believe that our categorical data-model formalization model and monadic query language will allow the same economic growth to occur for coSQL key-value stores.

Considering the authors’ claim that the current SQL oligopoly is worth $32 billion and still growing in double digits, color me interested!

😉

Since they are talking about query languages, maybe the TMQL editors should take a look as well.
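You can see the arrow-reversal the authors describe in miniature. The following is my own toy sketch in Python, not the paper’s categorical machinery: the same query written as a comprehension over a SQL-style table (child rows point at parents via a foreign key) and over a key-value store (parents point at their children).

```python
# SQL view: rows reference their parent via a foreign key.
authors = [{"id": 1, "name": "Codd"}, {"id": 2, "name": "Gray"}]
books = [{"title": "Relational Model", "author_id": 1},
         {"title": "Transaction Processing", "author_id": 2}]

# Join via foreign key: the child record points at the parent.
sql_result = [(b["title"], a["name"])
              for b in books
              for a in authors
              if b["author_id"] == a["id"]]

# coSQL view: the parent object embeds (points at) its children.
kv_store = {
    1: {"name": "Codd", "books": ["Relational Model"]},
    2: {"name": "Gray", "books": ["Transaction Processing"]},
}

# The "dual" comprehension: the arrow now runs parent-to-child.
cosql_result = [(title, author["name"])
                for author in kv_store.values()
                for title in author["books"]]

print(sorted(sql_result) == sorted(cosql_result))  # same answers, dual shapes
```

Both comprehensions are instances of the same monad-comprehension pattern the paper builds on, which is the sense in which SQL and coSQL share a query language.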

MongoDB Manual

Filed under: Database,MongoDB,NoSQL — Patrick Durusau @ 12:46 pm

MongoDB Manual

More of a placeholder for myself than anything else.

I am going to create a page of links to the documentation for all the popular DB projects.

MongoDB with Style

Filed under: MongoDB,NoSQL — Patrick Durusau @ 12:46 pm

MongoDB with Style

One of the more amusing introductions to use of MongoDB.

March 26, 2011

Hypertable 0.9.5.0

Filed under: Hypertable,NoSQL — Patrick Durusau @ 5:19 pm

Hypertable 0.9.5.0

My first encounter with this project led me to: http://www.hypertable.com, which is a commercial venture offering support for open source software.

Except that wasn’t really clear from the .com homepage.

I finally tracked links back to: http://code.google.com/p/hypertable/ to discover its GNU GPL v2 license.

The list of ventures using Hypertable is an impressive one.

Linking to the documentation at the .org site from the .com site would be a real plus.

A bit more attention to the .com site might attract more business, use cases, that sort of thing.

March 23, 2011

Microsoft Research Watch: AI, NoSQL and Microsoft’s Big Data Future

Filed under: Artificial Intelligence,Graphs,NoSQL — Patrick Durusau @ 6:01 am

Microsoft Research Watch: AI, NoSQL and Microsoft’s Big Data Future

From ReadWriteCloud channel:

Probase is a Microsoft Research project described as an “ongoing project that focuses on knowledge acquisition and knowledge serving.” Its primary goal is to “enable machines to understand human behavior and human communication.” It can be compared to Cyc, DBpedia or Freebase in that it is attempting to compile a massive collection of structured data that can be used to power artificial intelligence applications.

It’s powered by a new graph database called Trinity, which is also a Microsoft Research project. Trinity was spotted today by MyNoSQL blogger Alex Popescu, and that led us to Probase. Neither project seems to be available to the public yet.

Err, did they say graph database?

Now, if they can just avoid the one-world-semantic trap this could prove to be very interesting.

Well, it will be interesting in any case but avoiding that particular dead end would give MS a robustness that would be hard to match.

NoSQL: Guides, Tutorials, Books, Papers

Filed under: NoSQL — Patrick Durusau @ 5:59 am

NoSQL: Guides, Tutorials, Books, Papers

If you are new to NoSQL or just want to see what beginner material is available (to avoid writing your own), Alex Popescu has a growing collection of materials on NoSQL.

Bookmark it to send to your co-workers and even clients.

March 20, 2011

a practical guide to noSQL

Filed under: Marketing,NoSQL — Patrick Durusau @ 1:26 pm

a practical guide to noSQL by Denise Mura strikes me as deeply problematic.

First, realize that Denise is describing the requirements that a MarkLogic server is said to meet.

That may or may not be the same as your requirements.

The starting point for evaluating any software, MarkLogic (which I happen to like) or not, must be with your requirements.

I mention this in part because I can think of several organizations, and more than one government agency, that have bought software that met a vendor’s requirements, but not their own.

The result was a sale for the vendor and a large software dog that everyone kept tripping over. Pride and an unwillingness to admit error kept it around for a very long time.

Take for example her claim that MarkLogic “deliver[s] real-time updates, search, and retrieval results….” Well, OK, but if I run weekly reports on data that is uploaded on a daily basis, then real-time updates, search, and retrieval results may not be one of my requirements.

You need to start with your requirements (you do have written requirements, yes?) and not those of a vendor or what “everyone else” requires.

The same lesson holds true for construction of a topic map. It is your world view that it needs to reflect.

Second, it can also be used as a lesson in reading closely.

For example, of Lucene, Solr, and Sphinx, Denise says:

Search engines lie to you all the time in ways that are not always obvious because they need to take shortcuts to make performance targets. In other words, they don’t provide for a way to guarantee accuracy.

It isn’t clear from the context what lies Denise thinks we are being told, or what it would mean to “guarantee accuracy.”

I can’t think of any obvious ways that a search engine has ever lied to me, much less any non-obvious ones. (That may be because they are non-obvious.)

There are situations where noSQL, SQL, MarkLogic and topic map solutions are entirely appropriate. But as a consumer you will need to cut through promotional rhetoric to make the choice that is right for you.

March 19, 2011

Implementing Replica Set Priorities

Filed under: MongoDB,NoSQL — Patrick Durusau @ 5:54 pm

In Implementing Replica Set Priorities, Kristina Chodorow explains replica set priorities:

Replica set priorities will, very shortly, be allowed to vary between 0.0 and 100.0. The member with the highest priority that can reach a majority of the set will be elected master. (The change is done and works, but is being held up by 1.8.0… look for it after that release.) Implementing priorities was kind of an interesting problem, so I thought people might be interested in how it works. Following in the grand distributed system lit tradition I’m using the island nation of Replicos to demonstrate.

Should be of interest for anyone planning distributed topic map stores/distributions.
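The election rule in the quote is compact enough to sketch. This is my own toy model, not MongoDB’s actual election code: the member with the highest priority that can reach a majority of the set wins.

```python
def elect_master(members, can_reach):
    """Pick the member with the highest priority that can reach a majority.

    members: name -> priority (0.0 to 100.0); 0.0 means never electable.
    can_reach: name -> set of members it can reach (including itself).
    Hypothetical helper, simplified from the behavior Kristina describes.
    """
    majority = len(members) // 2 + 1
    candidates = [m for m in members
                  if members[m] > 0 and len(can_reach[m]) >= majority]
    return max(candidates, key=lambda m: members[m], default=None)

members = {"A": 1.0, "B": 2.0, "C": 3.0}
# C is partitioned off and can only see itself; A and B see each other.
reach = {"A": {"A", "B"}, "B": {"A", "B"}, "C": {"C"}}
print(elect_master(members, reach))  # B: highest priority with a majority
```

Note that C, despite having the highest priority, cannot be elected because it cannot reach a majority of the set.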

March 17, 2011

The Joy of Indexing

Filed under: Indexing,MongoDB,NoSQL,Subject Identity — Patrick Durusau @ 6:52 pm

The Joy of Indexing

Interesting blog post on indexing by Kyle Banker of MongoDB.

Recommended in part for understanding the limits of traditional indexing.

Ask yourself, what is the index in Kyle’s examples indexing?

Kyle says the examples are indexing recipes, but is that really true?

Or is it the case that the index is indexing the occurrence of a string at a location in the text?

Not exactly the same thing.

That is to say there is a difference between a token that appears in a text and a subject we think about when we see that token.

It is what enables us to say that two or more words that are spelled differently are synonyms.

Something other than the two words as strings is what we are relying on to make the claim that they are synonyms.

A traditional indexing engine, of the sort described here, can only index the strings it encounters in the text.

What would be more useful would be an indexing engine that indexed the subjects in a text.

I think we would call such a subject-indexing engine a topic map engine. Yes?
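The contrast is easy to demonstrate. Here is a toy sketch of my own (not Kyle’s code): a word index can only find the exact string it saw, while a subject index first maps synonymous tokens onto a single subject identifier.

```python
docs = {1: "the car sped away", 2: "the automobile sped away"}

# Word index: token -> set of documents containing that exact string.
word_index = {}
for doc_id, text in docs.items():
    for token in text.split():
        word_index.setdefault(token, set()).add(doc_id)

# Subject index: normalize tokens to subject identifiers first.
# The synonym table is the extra knowledge a string index lacks.
synonyms = {"car": "subj:automobile", "automobile": "subj:automobile"}
subject_index = {}
for doc_id, text in docs.items():
    for token in text.split():
        subject = synonyms.get(token, token)
        subject_index.setdefault(subject, set()).add(doc_id)

print(word_index["car"])                 # {1} -- misses doc 2
print(subject_index["subj:automobile"])  # {1, 2} -- finds both
```

The interesting question, of course, is where the synonym table comes from and how its entries are themselves identified, which is exactly the ground a topic map engine has to cover.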

Questions:

  1. Do you agree/disagree that a word indexing engine is not a subject indexing engine? (3-5 pages, no citations)
  2. What would you change about a word indexing engine (if anything) to make it a subject indexing engine? (3-5 pages, no citations)
  3. What texts/subjects would you use as test cases for your engine? (3-5 pages, citations of the test documents)

MongoDB 1.8 Released

Filed under: MongoDB,NoSQL — Patrick Durusau @ 6:49 pm

MongoDB 1.8 Released

Highlights from the announcement:

Cassandra – London Podcasts

Filed under: Cassandra,NoSQL — Patrick Durusau @ 6:48 pm

Cassandra – London Podcasts

Podcasts from the London Cassandra User Group.

  • Cassandra – Thrift Application, Jools Enticknap: 21 February 2011
  • Cassandra in TWEETMEME, Nick Telford: 21 February 2011
  • Cassandra Meetup: 17th Jan 2011
  • Cassandra London Meetup, Jake Luciani: 8th Dec 2010

March 16, 2011

Hank: A Fast, Open-Source, Batch-Updatable, Distributed Key-Value Store

Filed under: Hank,Key-Value Stores,NoSQL — Patrick Durusau @ 3:17 pm

Hank: A Fast, Open-Source, Batch-Updatable, Distributed Key-Value Store

From the RapLeaf blog:

We’re really excited to announce the open-source debut of a cool piece of Rapleaf’s internal infrastructure, a distributed database project we call Hank.

Our use case is very particular: we have tons of data that needs to get processed, producing a lot of data points for individual people, which then need to be made randomly accessible so they can be served through our API. You can think of it as the “process and publish” pattern.

For the processing component, Hadoop and Cascading were an obvious choice. However, making our results randomly accessible for the API was more challenging. We couldn’t find an existing solution that was fast, scalable, and perhaps most importantly, wouldn’t degrade performance during updates. Our API needs to have lightning-fast responses so that our customers can use it in realtime to personalize their users’ experiences, and it’s just not acceptable for us to have periods where reads contend with writes while we’re updating.

Requirements:

  1. Random reads need to be fast – reliably on the order of a few milliseconds.
  2. Datastores need to scale to terabytes, with keys and values on the order of kilobytes.
  3. We need to be able to push out hundreds of millions of updates a day, but they don’t have to happen in realtime. Most will come from our Hadoop cluster.
  4. Read performance should not suffer while updates are in progress.

Non-requirements

  1. During the update process, it doesn’t matter if there is more than one version of our datastores available. Our application is tolerant of this inconsistency.
  2. We have no need for random writes.

If you read updates as merges then the relevance of this posting to topic maps becomes a bit clearer. 😉
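One way to satisfy requirement 4 together with non-requirement 1 is to build the new version off to the side and publish it in a single swap, so reads never contend with the batch update. This is my own sketch of that pattern, not Hank’s implementation:

```python
class VersionedStore:
    """Readers always hit the current version; batch updates swap it wholesale."""

    def __init__(self, data):
        self.live = data           # the version readers see

    def get(self, key):
        return self.live.get(key)  # never blocked by an in-progress update

    def batch_update(self, new_data):
        # Build or receive the complete new version, then publish it in
        # one step. Until the swap, some readers may still see the old
        # version -- the inconsistency the use case explicitly tolerates.
        self.live = new_data

store = VersionedStore({"alice": 1})
assert store.get("alice") == 1
store.batch_update({"alice": 2, "bob": 3})
assert store.get("bob") == 3
```

Read that swap as a merge of a new version into the published view and the topic map parallel gets clearer still.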

Not all topic map systems will have the same requirements and non-requirements.

(This resource pointed out to me by Jack Park.)

March 15, 2011

Couchbase

Filed under: CouchDB,Membase,NoSQL — Patrick Durusau @ 5:15 am

Couchbase

From the website:

Couchbase Server is powered by Apache CouchDB, the industry’s most advanced and widely deployed document database technology. It boasts many advanced NoSQL capabilities, such as the ability to execute complex queries, to maintain indices and to store data with ACID transaction semantics. Plus it incorporates geospatial indexing so developers can easily create location-aware applications. Couchbase Server provides an exceptionally flexible data management platform, offering the rich data management operations that developers expect from their database.

Couchbase Server is simple.

  • Flexible Views and Querying. Built-in javascript-based map/reduce indexing engine is a powerful way to analyze and query your data.
  • Schemaless Data Repository. Couchbase document model is a perfect fit for web applications, providing significant data flexibility.
  • Geo-spatial Indexing. Built-in GeoCouch lets developers easily create location-aware apps.

Couchbase Server is fast.

  • Durable Speed Without Compromising Safety. You get safety and speed with our architecture, no compromises.
  • Indexing. Rapidly retrieve data in any format you demand, across clusters.

Couchbase Server is elastic.

  • Peer-to-Peer Replication. Unmatched peer-based replication capabilities, each replica allowing full queries, updates and additions.
  • Mobile Synchronization. Couchbase is ported to popular mobile devices and because it doesn’t depend on a constant Internet connection, users can access their data anytime, anywhere.
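The map/reduce view idea from the feature list is worth a small illustration. Couchbase’s real views are JavaScript functions run by the server; this is just the same shape sketched in Python, with hypothetical order documents:

```python
docs = [
    {"type": "order", "city": "Paris", "total": 10},
    {"type": "order", "city": "Oslo", "total": 7},
    {"type": "order", "city": "Paris", "total": 5},
]

def map_fn(doc):
    # Emit (city, total) for each order document; skip everything else.
    if doc["type"] == "order":
        yield doc["city"], doc["total"]

def reduce_fn(values):
    # Combine the values emitted under one key.
    return sum(values)

# Group emitted rows by key, then reduce each group.
groups = {}
for doc in docs:
    for key, value in map_fn(doc):
        groups.setdefault(key, []).append(value)

view = {key: reduce_fn(vals) for key, vals in groups.items()}
print(view)  # {'Paris': 15, 'Oslo': 7}
```

The server-side version maintains this as an incremental index rather than recomputing it per query, which is where the “rapidly retrieve data” claim comes from.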

Expiring columns

Filed under: Cassandra,NoSQL — Patrick Durusau @ 5:08 am

Expiring columns

In Cassandra 0.7, there are expiring columns.

From the blog:

Sometimes, data comes with an expiration date, either by its nature or because it’s simply intractable to keep all of a rapidly growing dataset indefinitely.

In most databases, the only way to deal with such expiring data is to write a job running periodically to delete what is expired. Unfortunately, this is usually both error-prone and inefficient: not only do you have to issue a high volume of deletions, but you often also have to scan through lots of data to find what is expired.

Fortunately, Cassandra 0.7 has a better solution: expiring columns. Whenever you insert a column, you can specify an optional TTL (time to live) for that column. When you do, the column will expire after the requested amount of time and be deleted auto-magically (though asynchronously — see below). Importantly, this was designed to be as low-overhead as possible.

Now there is an interesting idea!

Goes along with the idea that a topic map does not (should not?) present a timeless view of information. That is, a topic map should maintain state so that we can determine what was known at any particular time.

Take a simple example, a call for papers for a conference. It could be that a group of conferences all share the same call for papers, the form, submission guidelines, etc. And that call for papers is associated with each conference by an association.

Shouldn’t we be able to set an expiration date on that association so that at some point in time, all those facilities are no longer available for that conference? Perhaps it switches over to another set of properties in the same association to note that the submission dates have passed? That would remove the necessity of expiring the association.

But there are cases where associations do expire, or at least end. Divorce is an unhappy example. Being hired is a happier one.

Something to think about.
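The expiring-columns mechanism from the quote, applied to associations, might look something like this. The API here is entirely hypothetical (neither Cassandra’s nor any topic map engine’s), just a sketch of TTL-on-insert with lazy expiry on read:

```python
import time

class Associations:
    """Toy store where each association can carry an optional TTL."""

    def __init__(self):
        self._items = {}  # name -> (value, expires_at or None)

    def insert(self, name, value, ttl=None, now=None):
        now = time.time() if now is None else now
        expires = None if ttl is None else now + ttl
        self._items[name] = (value, expires)

    def get(self, name, now=None):
        now = time.time() if now is None else now
        value, expires = self._items.get(name, (None, None))
        if expires is not None and now >= expires:
            return None  # expired: treated as if deleted
        return value

assoc = Associations()
# The call-for-papers association lives for 30 time units.
assoc.insert("cfp-open", "conference-2011", ttl=30, now=1000)
print(assoc.get("cfp-open", now=1010))  # 'conference-2011' -- still live
print(assoc.get("cfp-open", now=1040))  # None -- past its expiration
```

Cassandra deletes expired columns asynchronously rather than on read, but the user-visible contract is the same: set a TTL at insert time and the cleanup job disappears.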

Getting Started with CouchDB

Filed under: CouchDB,NoSQL — Patrick Durusau @ 5:07 am

Getting Started with CouchDB

A tutorial introduction to CouchDB.

Fairly brief but covers most of the essentials.

Note to self: Would not be a bad model for a topic map tutorial introduction.

Redis, from the ground up

Filed under: NoSQL,Redis — Patrick Durusau @ 5:04 am

Redis, from the ground up

Mark J. Russo:

A deep dive into Redis’ origins, design decisions, feature set, and a look at a few potential applications.

Not all that you would want to know about Redis but enough to develop an appetite for more!

March 13, 2011

Zotonic – The Erlang CMS

Filed under: NoSQL,Zotonic — Patrick Durusau @ 4:25 pm

Zotonic – The Erlang CMS

From the documentation:

The Zotonic data model has two tables at its core. The rsc (resource aka page) table and the edge table. All other tables are for access control, visitor administration, configuration and other purposes.

For simplicity of communication the rsc record is often referred to as a page. As every rsc record can have their own page on the web site.

Zotonic is a mix between a traditional database and a triple store. Some page (rsc record) properties are stored as columns, some are serialized in a binary column and some are represented as directed edges to other pages.

In Zotonic there is no real distinction between rsc records that are a person, a news item, a video or something else. The only difference is the category of the rsc record. And the rsc’s category can be changed. Even categories and predicates are represented as rsc records and can, subsequently, have their own page on the web site.

Interesting last sentence: “Even categories and predicates are represented as rsc records and can, subsequently, have their own page on the web site.”

And one assumes the same to be true for categories and predicates in those “own page[s] on the web site.”
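The two-table core is simple enough to sketch. This is plain Python standing in for Zotonic’s actual schema, with made-up record contents: everything, including categories and predicates, is an rsc record, and edges are directed triples over rsc ids.

```python
# rsc table: every "thing" is a record here, categories and predicates included.
rsc = {
    1: {"title": "Category: person"},      # a category is itself an rsc
    2: {"title": "Predicate: author_of"},  # so is a predicate
    3: {"title": "Alice", "category": 1},
    4: {"title": "A news item"},
}

# edge table: (subject_rsc, predicate_rsc, object_rsc)
edges = [(3, 2, 4)]  # Alice author_of "A news item"

# Every rsc record, category or not, can have its own page:
def page_url(rsc_id):
    return "/page/%d" % rsc_id

authored = [rsc[obj]["title"] for subj, pred, obj in edges
            if subj == 3 and rsc[pred]["title"] == "Predicate: author_of"]
print(authored)     # ['A news item']
print(page_url(1))  # '/page/1' -- the category's own page
```

Because the predicate is just another rsc id, describing the predicate itself is no different from describing any other page, which is the reflexivity noted above.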

Questions:

  1. What use would you make of a CMS in a library environment? (3-5 pages, no citations)
  2. What subject identity issues are left unresolved by a CMS, such as Zotonic? (3-5 pages, no citations)
  3. What use cases would you write for your library director/board/funding organization to add subject identity management to Zotonic? (3-5 pages, no citations)

It isn’t enough that you recognize a problem and have a cool solution, even an effective one.

That is a necessary but not sufficient condition for success.

An effective librarian can:

  1. Recognize an information problem
  2. Find an effective solution for it (within resource/budget constraints)
  3. Communicate #1 and #2 to others, especially decision makers

I know lots of people who can do #1.

A fair number who can do #2, but who sit around at lunch or the snack machine and bitch about how if they were in charge things would be different. Yeah, probably worse.

The trick is to be able to do #3.

March 12, 2011

Social Analytics on MongoDB

Filed under: Analytics,MongoDB,NoSQL — Patrick Durusau @ 6:48 pm

In Social Analytics on MongoDB, Patrick Stokes of Buddy Media gives a highly entertaining presentation on MongoDB and its adoption by Buddy Media.

Unfortunately the slides don’t display during the presentation.

Still, it is refreshing in its honesty about the development process.

PS: I have written to ask about where to find the slides.

Update

You can find the slides at: http://www.slideshare.net/pstokes2/social-analytics-with-mongodb

Neo4j 1.3 “Abisko Lampa” M04 – Size really does matter

Filed under: Graphs,Neo4j,NoSQL — Patrick Durusau @ 6:47 pm

Neo4j 1.3 “Abisko Lampa” M04 – Size really does matter

Fourth milestone release on the way to Neo4J 1.3 release.

From the announcement:

A database can now contain 32 billion nodes, 32 billion relationships and 64 billion properties. Before this, you had to make do with a puny 4 billion nodes, 4 billion relationships and 4 billion properties. Finally, every single person on Earth can have their own personal node! And did we mention this is happening without adding even one byte to the size of your database?

Well, they are graph database, not population, experts. (The current world population being just a few shy of 7 billion. At last count.)

😉

Still, shadows of things that will be, must be.

Learn MongoDB Basics

Filed under: MongoDB,NoSQL — Patrick Durusau @ 6:47 pm

Learn MongoDB Basics

Covers the basics of the MongoDB.

One nice aspect is the immediate feedback/reinforcement of the principles being taught.

I can imagine someone creating a resource for topic maps along these lines.

Hopefully both a web as well as local version.

I don’t think the benefits of using topic maps are in dispute.

What is unclear is how to convey those benefits to users.

Comments/suggestions?

Questions:

  1. What aspects of the Learn MongoDB site did you find the least/most helpful? (2-3 pages, no citations)
  2. How would you construct a topic map tutorial using this as a guide? (4-6 pages, no citations)
  3. What other illustrations would you use to convey the advantages of topic maps? (4-6 pages, no citations)

March 11, 2011

agamemnon

Filed under: Cassandra,Graphs,NoSQL — Patrick Durusau @ 6:59 pm

agamemnon

From the website:

Agamemnon is a thin library built on top of pycassa. It allows you to use the Cassandra database (http://cassandra.apache.org) as a graph database. Much of the api was inspired by the excellent neo4j.py project (http://components.neo4j.org/neo4j.py/snapshot/)

Thanks to Jack Park for pointing this out!
