Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

May 12, 2011

Pigs, Bees, and Elephants: A Comparison of Eight MapReduce Languages

Filed under: Hadoop,MapReduce,R — Patrick Durusau @ 7:56 am

Pigs, Bees, and Elephants: A Comparison of Eight MapReduce Languages

Antonio Piccolboni’s review has been summarized as:

  • Java Hadoop (mature and efficient, but verbose and difficult to program)
  • Cascading (brings an SQL-like flavor to Java programming with Hadoop)
  • Pipes/C++ (a C++ interface to programming on Hadoop)
  • Hive (a high-level SQL-like language for Hadoop, concise and expressive but limited in flexibility)
  • Pig (a new high-level language for Hadoop)
  • Rhipe (an R package for map-reduce programming with Hadoop)
  • Dumbo (a Hadoop library for Python)
  • Cascalog (a powerful but obtuse Lisp-based interface to Hadoop)

Read Piccolboni’s review for yourself and see what you think.
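
To make the comparison concrete, here is the canonical word count in Dumbo's Python idiom, sketched after Dumbo's documented example (run via the dumbo launcher against a Hadoop installation; paths shown are placeholders):

```python
# Canonical Dumbo word count (a sketch, after Dumbo's documented example).
# Run with the dumbo launcher, e.g.:
#   dumbo start wordcount.py -hadoop /path/to/hadoop -input in.txt -output out

def mapper(key, value):
    # key is the byte offset of the line, value is one line of input text
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # values is an iterator over the counts emitted for this word
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer, combiner=reducer)
```

Compare that to the dozens of lines the equivalent Java Hadoop job requires, and the appeal of the scripting-level options is obvious.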

Beyond the Polar Bear

Filed under: Interface Research/Design,Search Interface — Patrick Durusau @ 7:55 am

Beyond the Polar Bear

Webinar:

Date: Thursday, May 26, 2011, 11:30am-12:30pm (EDT)

From the post:

The BBC’s new Food site (bbc.co.uk/food) is completely rebuilt using principles of domain and data modeling. Domain-driven design breaks down complex subjects into the things people usually think about. With food, it’s stuff like ‘dishes’, ‘ingredients’ and ‘chefs’. The parts of the model inter-relate far more organically than a traditional top-down hierarchy.

A logical domain model makes site navigation mirror the way people explore knowledge. By intersecting across subjects, links themselves become facts, allowing humans and machines to learn through undirected user journeys. This paradigm shift from labeling boxes to taming rich data is a vital skill for the modern IA.

In this webinar, we’ll explore how to design for a semantic ‘web of data’, using case studies from the BBC’s Food and Natural History products. You’ll learn how to unlock the potential of your content, create scalable navigation patterns, achieve simply fabulous SEO and step confidently into the world of open linked data.
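
Nothing to do with the BBC's actual implementation, but as a toy sketch of the modeling style being described, the "things people usually think about" become first-class types whose links are navigable from either end (names and data here are purely illustrative):

```python
from dataclasses import dataclass, field
from typing import List

# Toy domain model in the BBC Food style: dishes, ingredients, chefs.
# (Illustrative only; the BBC's actual model and names will differ.)

@dataclass
class Chef:
    name: str

@dataclass
class Ingredient:
    name: str
    dishes: List["Dish"] = field(default_factory=list)  # back-links

@dataclass
class Dish:
    name: str
    chef: Chef
    ingredients: List[Ingredient] = field(default_factory=list)

    def add_ingredient(self, ingredient: Ingredient) -> None:
        # The link is itself a fact, navigable from either end:
        # a dish lists its ingredients, an ingredient lists its dishes.
        self.ingredients.append(ingredient)
        ingredient.dishes.append(self)

eggs = Ingredient("eggs")
omelette = Dish("omelette", Chef("Delia Smith"))
omelette.add_ingredient(eggs)
print([d.name for d in eggs.dishes])  # ['omelette']
```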

Not cheap: ASIS&T Members: $25 Non-Members: $59

I need to check on my ASIS&T dues status.

This could well be worth the price of admission.

Information Heterogeneity and Fusion

Filed under: Data Fusion,Heterogeneous Data,Information Integration,Mapping — Patrick Durusau @ 7:54 am

2nd International Workshop on Information Heterogeneity and Fusion in Recommender Systems (HetRec 2011)

Important Dates:

Paper submission deadline: 25th July 2011
Notification of acceptance: 19th August 2011
Camera-ready version due: 12th September 2011
Workshop: 23rd or 27th October 2011

Datasets are also being made available. Just in case you can’t find any heterogeneous data lying around. 😉

Looks like a perfect venue for topic map papers. (Not to mention that a re-usable mapping between recommender systems looks like a commercial opportunity.)

From the website:

In recent years, increasing attention has been given to finding ways for combining, integrating and mediating heterogeneous sources of information for the purpose of providing better personalized services in many information seeking and e-commerce applications. Information heterogeneity can indeed be identified in any of the pillars of a recommender system: the modeling of user preferences, the description of resource contents, the modeling and exploitation of the context in which recommendations are made, and the characteristics of the suggested resource lists.

Almost all current recommender systems are designed for specific domains and applications, and thus usually try to make best use of a local user model, using a single kind of personal data, and without explicitly addressing the heterogeneity of the existing personal information that may be freely available (on social networks, homepages, etc.). Recognizing this limitation, among other issues: a) user models could be based on different types of explicit and implicit personal preferences, such as ratings, tags, textual reviews, records of views, queries, and purchases; b) recommended resources may belong to several domains and media, and may be described with multilingual metadata; c) context could be modeled and exploited in multi-dimensional feature spaces; d) and ranked recommendation lists could be diverse according to particular user preferences and resource attributes, oriented to groups of users, and driven by multiple user evaluation criteria.

The aim of HetRec workshop is to bring together students, faculty, researchers and professionals from both academia and industry who are interested in addressing any of the above forms of information heterogeneity and fusion in recommender systems. We would like to raise awareness of the potential of using multiple sources of information, and look for sharing expertise and suitable models and techniques.

Another dire need is for strong datasets, and one of our aims is to establish benchmarks and standard datasets on which the problems could be investigated. In this edition, we make available on-line datasets with heterogeneous information from several social systems. These datasets can be used by participants to experiment and evaluate their recommendation approaches, and be enriched with additional data, which may be published at the workshop website for future use.

May 11, 2011

Topic Map Metrics

Filed under: Topic Maps — Patrick Durusau @ 6:58 pm

I just saw a tweet asking how to measure how “well” a topic map organizes information. In other words, what is the payoff of new topics and associations?

Well, yes, but you have to define what you mean by “well” organized.

Two quick examples but I need to return to this topic fairly soon.

Example 1.

You are writing a topic map about George W. Bush, one of the celebratory kind about how he pursued terrorists, etc.

Would you include a topic and association about Switzerland not being on his travel agenda because of outstanding warrants for his arrest as a suspected war criminal? http://www.schweizmagazin.ch/news/schweiz/5627-Sommaruga-will-Bush-verhaften.html

Would inclusion of that information make your topic map better organized? More complete?

Example 2.

You are contributing to a public topic map about gene mappings.

Due to internal research, you know that while accurate, a proposed association will lead nowhere in terms of drug research. It is also very likely your competitors will think you slipped up by posting it and will run off to pursue it.

Do you post less information than you know about the subject? That doesn’t make the information you do post inaccurate.

Has that lessened the organizational value of the topic map?

The question is a complicated one.

I think Sam Hunting would say it is a matter of contract. What sort of information was warranted to be in the map, what degree of completeness, accuracy, etc.? That gets more difficult when we start talking about public topic maps.

It is something that I think merits a lot of exploration.

The Functional Web

Filed under: Functional Programming — Patrick Durusau @ 6:58 pm

The Functional Web

Steve Vinoski’s column from the IEEE Internet Computing magazine.

Very nice set of columns on functional programming and the Web.

Introduction to programming in Erlang, Part 1

Filed under: Erlang — Patrick Durusau @ 6:57 pm

Introduction to programming in Erlang, Part 1

IBM’s developerWorks has discovered Erlang.

Summary:

Erlang is a multi-purpose programming language used primarily for developing concurrent and distributed systems. It began as a proprietary programming language used by Ericsson for telephony and communications applications. Released as open source in 1998, Erlang has become more popular in recent years thanks to its use in high profile projects, such as the Facebook chat system, and in innovative open source projects, such as the CouchDB document-oriented database management system. In this article, you will learn about Erlang, and how its functional programming style compares with other programming paradigms such as imperative, procedural and object-oriented programming. You will learn how to create your first program, a Fibonacci recursive function. Next, you will go through the basics of the Erlang language, which can be difficult at first for those used to C, C++, Java™, and Python.
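
The “first program” the summary mentions is the recursive Fibonacci function. Its shape, sketched here in Python rather than Erlang, is the case-by-case recursive definition the article builds up:

```python
def fib(n: int) -> int:
    # Base cases mirror the Erlang clauses fib(0) and fib(1);
    # the recursive case mirrors fib(N) -> fib(N-1) + fib(N-2).
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print([fib(i) for i in range(10)])  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```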

Comprehensive LaTeX symbol list

Filed under: TeX/LaTeX — Patrick Durusau @ 6:57 pm

Comprehensive LaTeX symbol list

Scott Pakin has assembled a list of LaTeX symbols, current as of 2009.

If you start authoring serious papers on topic maps you are likely to be using TeX/LaTeX.
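
To give a flavor of the list in use, here is a hypothetical fragment of the kind that turns up in topic maps papers (the merging notation is invented for illustration):

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
% Hypothetical: two topics merge iff their subject identifier sets intersect.
$ t_1 \equiv t_2 \iff \operatorname{si}(t_1) \cap \operatorname{si}(t_2) \neq \emptyset $
\end{document}
```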

Data Stream Mining Techniques

Filed under: Data Mining,Data Streams — Patrick Durusau @ 6:56 pm

An analytical framework for data stream mining techniques based on challenges and requirements by Mahnoosh Kholghi and Mohammadreza Keyvanpour.

Abstract:

A growing number of applications that generate massive streams of data need intelligent data processing and online analysis. Real-time surveillance systems, telecommunication systems, sensor networks and other dynamic environments are such examples. The imminent need for turning such data into useful information and knowledge augments the development of systems, algorithms and frameworks that address streaming challenges. The storage, querying and mining of such data sets are highly computationally challenging tasks. Mining data streams is concerned with extracting knowledge structures represented in models and patterns in non stopping streams of information. Generally, two main challenges are designing fast mining methods for data streams and need to promptly detect changing concepts and data distribution because of highly dynamic nature of data streams. The goal of this article is to analyze and classify the application of diverse data mining techniques in different challenges of data stream mining. In this paper, we present the theoretical foundations of data stream analysis and propose an analytical framework for data stream mining techniques.

The paper is an interesting collection of work on mining data streams and its authors should be encouraged to continue their research in this field.

However, the current version is in serious need of editing, both for language usage and for organization. For example, it is hard to relate Table 2 (data stream mining techniques) to the analytical framework that is the focus of the article.

May 10, 2011

Promoting Topic Maps

Filed under: Marketing — Patrick Durusau @ 3:35 pm

The recent release of QuaaxTM made me wonder: how many people comfortable with the LAMP stack are aware of it?

There is the Topic Maps Tools page by Lars Marius Garshol, which is always the first place I look for new software, but how many non-topic map users would know to look there?

I am thinking that we need a two-fold strategy:

1) Use Lars’ lists of software to create “flyers” as it were for particular languages/platforms. (Communities that use Python are probably not interested in C++ libraries.)

2) “Distribute” those flyers when appropriate (no spamming) in discussions in other communities. With pointers back to Topic Maps Tools.

The metadata associated with the current listings makes the tools easy to find, but that is a pull information model.

I am thinking more along the lines of a push information model.

If you think about advertising, it is all based on a push information model.

Maybe there is a lesson there for the topic maps community.

Parasail

Filed under: Parallelism — Patrick Durusau @ 3:33 pm

ParaSail (Parallel Specification and Implementation Language) is a programming language with inherent parallelism.

More correctly, it represents the ongoing development of a programming language with inherent parallelism.

Extending parallel processing to be a more general solution is all the rage and it is hard to choose winners and losers.

It isn’t too early to start thinking about parallel processing of semantics.

Some resources on ParaSail to get you started in that direction:

Designing ParaSail, a new programming language: A blog on the development of ParaSail.

Map/Reduce in ParaSail; Parameterized operations: A Map/Reduce post that may pique your interest in ParaSail (a minimal sketch of the pattern follows this list).

ParaSail Programming Language: Google Discussion Group on ParaSail.

An Introduction to ParaSail: Parallel Specification and Implementation Language

ParaSail Reference Manual — First Complete Draft
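
ParaSail's syntax aside, the map/reduce pattern mentioned above is easy to experiment with in any language. A minimal Python sketch of the same shape (nothing to do with ParaSail itself, just the computation):

```python
from multiprocessing import Pool

def square(x: int) -> int:
    return x * x

if __name__ == "__main__":
    # The map step runs in parallel across worker processes;
    # the reduce step (sum) runs sequentially in the parent.
    with Pool() as pool:
        total = sum(pool.map(square, range(10)))
    print(total)  # 0 + 1 + 4 + ... + 81 = 285
```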

QuaaxTM – 0.6.2

Filed under: QuaaxTM,Topic Map Software — Patrick Durusau @ 3:32 pm

QuaaxTM – 0.6.2

From the website:

QuaaxTM is a PHP Topic Maps engine which supports ISO/IEC 13250-2 Topic Maps Data Model (TMDM). QuaaxTM is an implementation of the core and index interfaces of PHPTMAPI. PHPTMAPI is based on the TMAPI specification and provides a standardized API for PHP5 to access and process data held in a topic map.

QuaaxTM uses MySQL with InnoDB as storage engine and therefore benefits from transaction support and referential integrity.

Brisk: Simpler, More Reliable, High-Performance Hadoop Solution

Filed under: Brisk,Cassandra,Hadoop — Patrick Durusau @ 3:30 pm

DataStax Releases Dramatically Simpler, More Reliable, High-Performance Hadoop Solution

From NoSQLDatabases’ coverage of Brisk, a second-generation Hadoop solution from DataStax.

From the post:

Today, DataStax, the commercial leader in Apache Cassandra™, released DataStax’ Brisk – a second-generation open-source Hadoop distribution that eliminates the key operational complexities with deploying and running Hadoop and Hive in production. Brisk is powered by Cassandra and offers a single platform containing a low-latency database for extremely high-volume web and real-time applications, while providing tightly coupled Hadoop and Hive analytics.

Download Brisk -> Here.

Hypertable 0.9.5.0 pre-release

Filed under: Hypertable,NoSQL — Patrick Durusau @ 3:30 pm

Stability Improvements in the Hypertable 0.9.5.0 pre-release

From the Hypertable blog:

We recently announced the Hypertable 0.9.5.0 pre-release. Even though we’ve labelled it as a “pre” release, it is one of the biggest and most important Hypertable releases to date. Among other things, it includes a complete re-write of the Master, to fix some known stability problems. It represents a significant amount of work as can be seen by the following code change statistics:

  • 512 files changed
  • 30,633 line insertions
  • 14,354 line deletions

The following describes problems that existed in prior releases and how they were solved, and highlights other stability improvements included in the 0.9.5.0 pre-release.

Details on the recent “pre-release” of Hypertable.

Special Issue on Linked Data for Science and Education

Filed under: Linked Data,LOD — Patrick Durusau @ 3:29 pm

Special Issue on Linked Data for Science and Education

The Semantic Web Journal has posted a call for papers on linked data for science and education.

Important dates:

Deadline for submissions: May 31 2011
Reviews due: July 15 2011
Final versions of accepted papers due: August 12 2011

Apologies, I missed this announcement when it came out in early February, 2011.

From the call:

The number of universities, research organizations, publishers and funding agencies contributing to the Linked Data cloud is constantly increasing. The Linked Data paradigm has been identified as a lightweight approach for data dissemination and integration, opening up new opportunities for the organization, integration, archiving and retrieval of research results and educational material. Obviously, this novel approach also raises new challenges regarding the integrity, adoption, use and sustainability of contents. A number of case studies from universities and research communities already demonstrate that Linked Data is not merely a novel way of exposing data on the Web, but that its principles help integrating related data, connecting scientists working on related topics, and improving scientific and educational workflows. The next challenges in creating a true Web of scientific and educational data include dealing with provenance, mapping vocabularies (i.e., ontologies), and organizational issues such as assessing costs and ensuring persistence and performance. In this special issue of the Semantic Web Journal, we want to collect the state of the art in Linked Data for science and education and identify upcoming challenges, focusing on technological aspects as well as social and legal implications.

Well, I like that:

The next challenges in creating a true Web of scientific and educational data include dealing with provenance, mapping vocabularies (i.e., ontologies), and organizational issues such as assessing costs and ensuring persistence and performance.

Link data together and then hope we can sort it out on the other end.

Doesn’t that sound a lot like Google?

Index data together and then hope we can sort it out on the other end.

May 9, 2011

Objectivity Infinite Graph (timed associations?)

Filed under: Associations,Graphs,InfiniteGraph — Patrick Durusau @ 10:39 am

Objectivity Infinite Graph

Curt Monash reports on his conversation with Darren Wood, the lead developer of the Infinite Graph database product.

It is from last June (2010), but I think after reading it you will agree it was worth bringing up.

A couple of goodies from his thoughts on edges:

  • Edges are first-class citizens in Infinite Graph, just as nodes are.
  • In Infinite Graph, edges can also have effectiveness date intervals. E.g., if you live at an address for a certain period, that’s when the edge connecting you to it is valid.

The second point, edges with date intervals, may have a bearing on a recent series of posts by Robert Cerny to the Topicmapmail list. (See: “Temporal validity of subject indicators?” in the second quarter archives, early May 2011)

Is that timing for an association?

Tracking the relationships in Sex and the City would require such an ability.
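
As a minimal sketch, independent of Infinite Graph's actual API, here is what a time-scoped edge (in topic map terms, a temporally valid association) might look like; the dates are hypothetical:

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

@dataclass
class TimedEdge:
    """An edge valid only during an interval; valid_to=None means still valid."""
    source: str
    target: str
    label: str
    valid_from: date
    valid_to: Optional[date] = None

    def valid_on(self, day: date) -> bool:
        return self.valid_from <= day and (self.valid_to is None or day <= self.valid_to)

# Hypothetical data, loosely after the show:
edges: List[TimedEdge] = [
    TimedEdge("carrie", "big", "dating", date(1998, 6, 7), date(1999, 10, 3)),
    TimedEdge("carrie", "aidan", "dating", date(2000, 6, 4), date(2001, 7, 15)),
]

# Query the graph as it stood on a given day:
snapshot = [e for e in edges if e.valid_on(date(2000, 7, 1))]
print([(e.source, e.target) for e in snapshot])  # [('carrie', 'aidan')]
```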

TinkerPop Releases – Gremlin 1.0/Rexster 0.3

Filed under: Blueprints,Frames,Gremlin,Pipes,Rexster — Patrick Durusau @ 10:35 am

Marko Rodriguez announced the release of Gremlin 1.0 and Rexster 0.3 (graph server) along with other releases:

Blueprints 0.7 (Patrick):
https://github.com/tinkerpop/blueprints/wiki/Release-Notes
A property graph interface.

Frames 0.2 (Huff and Puff):
https://github.com/tinkerpop/frames/wiki/Release-Notes
An object to graph framework.

Gremlin 1.0 (Gremlin):
https://github.com/tinkerpop/gremlin/wiki/Release-Notes
A graph traversal language.

Pipes 0.5 (Drain):
https://github.com/tinkerpop/pipes/wiki/Release-Notes
A data flow framework using process graphs.

Rexster 0.3 (Dog Eat Dog):
https://github.com/tinkerpop/rexster/wiki/Release-Notes
A RESTful graph shell.

Stats of the Union tells health stories in America

Filed under: Data Source,Marketing,Visualization — Patrick Durusau @ 10:35 am

Stats of the Union tells health stories in America

From FlowingData.com, news of an iPad app that:

maps the status of health in America. Browse, pan, zoom, and explore through a number of demographics and breakdowns.

I don’t have an iPad (or iPhone) but both are venues of opportunity for topic maps.

It isn’t hard to imagine a topic map that takes the same information in Stats of the Union and adds in data that correlates obesity with the density of fast-food restaurants, making zoning decisions for the same a matter of public health.

To answer the question “Why are you fat?” with a localized “McDonald’s, Wendy’s, Arby’s, etc.”

Nice visualizations from what I could see on the video.

Just a thought, to personalize the obesity app, you could map in frequent customers who are, ahem, extra large sizes. (With their consent of course. I wouldn’t bother asking McDonalds.)

Perhaps a new slogan: Topic maps, focusing information to a sharp point.

What do you think?

Google at CHI 2011

Google at CHI 2011

From the Google blog:

Google has an increasing presence at ACM CHI: Conference on Human Factors in Computing Systems, which is the premiere conference for Human Computer Interaction research. Eight Google papers will appear at the conference. These papers not only touch on our core areas such as Search, Chrome and Android but also demonstrate our growing effort in new areas where HCI is essential, such as new search user interfaces, gesture-based interfaces and cross-device interaction. They showcase our efforts to address user experiences in diverse situations. Googlers are playing active roles in the conference in many other ways too: participating in conference committees, hosting panels, organizing workshops and teaching courses, as well as running demos and 1:1 sessions at Google’s booth.

The post also has a complete set of links to papers from Google and other materials.

I remember reading something recently about modulating the amount of information sent to a user based on their current activity level. That is, a person engaged in a task requiring immediate attention (does watching American Idol count?) is sent less information than a person doing something less important (watching a presidential address).

Is merging affected by my activity level, or just the delivery of less than all the results?

leveldb

Filed under: leveldb,NoSQL — Patrick Durusau @ 10:33 am

leveldb

A NoSQL library.

From the website:

LevelDB is a library that implements a fast persistent key-value store.

Features

  • Keys and values are arbitrary byte arrays.
  • Data is stored sorted by key.
  • Callers can provide a custom comparison function to override the sort order.
  • The basic operations are Put(key,value), Get(key), Delete(key).
  • Multiple changes can be made in one atomic batch.
  • Users can create a transient snapshot to get a consistent view of data.
  • Forward and backward iteration is supported over the data.
  • Data is automatically compressed using the Snappy compression library.
  • External activity (file system operations etc.) is relayed through a virtual interface so users can customize the operating system interactions.
  • Detailed documentation about how to use the library is included with the source code.

Limitations

  • This is not a SQL database. It does not have a relational data model, it does not support SQL queries, and it has no support for indexes.
  • Only a single process (possibly multi-threaded) can access a particular database at a time.
  • There is no client-server support builtin to the library. An application that needs such support will have to wrap their own server around the library.
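
As a sketch of how those basic operations look from Python, assuming the plyvel binding (LevelDB's native API is C++, and method names vary across bindings; the database path is a placeholder):

```python
import plyvel  # one of several Python bindings for LevelDB

db = plyvel.DB("/tmp/testdb", create_if_missing=True)

# Basic operations: keys and values are arbitrary byte arrays.
db.put(b"user:42", b"alice")
print(db.get(b"user:42"))  # b'alice'
db.delete(b"user:42")

# Multiple changes applied in one atomic batch.
with db.write_batch() as batch:
    batch.put(b"a", b"1")
    batch.put(b"b", b"2")

# Iteration in key order (data is stored sorted by key).
for key, value in db.iterator():
    print(key, value)

db.close()
```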

iQvoc 3.0 released

Filed under: SKOS,Vocabularies — Patrick Durusau @ 10:32 am

iQvoc 3.0 released

A SKOS tool that is described on its “about” page as:

iQvoc is a web-based open source tool for managing vocabularies (classifications, thesauri, etc.). It combines an intuitive user interface with Semantic Web standards.

The navigation is intuitive, providing direct links and hierarchical tree visualizations. All common browsers are supported. Due to iQvoc’s modular architecture, its appearance can be easily and extensively customized.

iQvoc covers a comprehensive range of capabilities:

  • support for multiple languages in both the user interface and the content corpus (i.e. labels, notes etc.)
  • import/export of existing SKOS vocabularies
  • editorial control and workflow
  • notes and annotations
  • use of the vocabulary within the Linked Data network
  • modularity and extensibility

Big Data – Demo

Filed under: BigData,Marketing — Patrick Durusau @ 10:32 am

The slides from JAX2011 by Pavlo Baron are as informative and entertaining as any set I have ever seen.

Big Data: the slide deck.

If more people did slides like these, fewer people would be asleep or doing email during presentations.

Big Data – Demo For JAX2011

Pavlo blogs about his demo at JAX2011.

Big-Data-Demo-2

The source code for Pavlo’s demo.

*****

Every serious data project visits the factors Pavlo lists. The ones that succeed anyway.

May 8, 2011

Modeling Network Evolution Using Graph Motifs

Filed under: Evolutionary,Graph Motif Model,Graphs — Patrick Durusau @ 6:17 pm

Modeling Network Evolution Using Graph Motifs by Drew Conway.

Abstract:

Network structures are extremely important to the study of political science. Much of the data in its subfields are naturally represented as networks. This includes trade, diplomatic and conflict relationships. The social structure of several organizations is also of interest to many researchers, such as the affiliations of legislators or the relationships among terrorists. A key aspect of studying social networks is understanding the evolutionary dynamics and the mechanism by which these structures grow and change over time. While current methods are well suited to describe static features of networks, they are less capable of specifying models of change and simulating network evolution. In the following paper I present a new method for modeling network growth and evolution. This method relies on graph motifs to generate simulated network data with particular structural characteristics. This technique departs notably from current methods both in form and function. Rather than a closed-form model, or stochastic implementation from a single class of graphs, the proposed “graph motif model” provides a framework for building flexible and complex models of network evolution. The paper proceeds as follows: first a brief review of the current literature on network modeling is provided to place the graph motif model in context. Next, the graph motif model is introduced, and a simple example is provided. As a proof of concept, three classic random graph models are recovered using the graph motif modeling method: the Erdos-Renyi binomial random graph, the Watts-Strogatz “small world” model, and the Barabasi-Albert preferential attachment model. In the final section I discuss the results of these simulations and the subsequent advantages and disadvantages presented by using this technique to model social networks.

Now there’s an interesting idea.

Modeling the evolution of topic maps.

I wonder if the more interesting evolution would reflect the authoring of the topic map or the subject matter of the topic map?

I suppose that would depend upon the author and/or the subject of the map.
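
If you want to play with the three baseline models the abstract recovers, networkx ships generators for all of them; a quick sketch for producing comparison graphs (sizes and parameters chosen arbitrarily):

```python
import networkx as nx

n = 1000  # number of nodes

er = nx.erdos_renyi_graph(n, p=0.01)          # binomial random graph
ws = nx.watts_strogatz_graph(n, k=10, p=0.1)  # "small world" model
ba = nx.barabasi_albert_graph(n, m=5)         # preferential attachment

for name, g in [("Erdos-Renyi", er), ("Watts-Strogatz", ws), ("Barabasi-Albert", ba)]:
    degrees = [d for _, d in g.degree()]
    # Preferential attachment should show a much heavier-tailed degree distribution.
    print(name, g.number_of_edges(), max(degrees))
```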


Update:

Graph Motif Modeling software now available as distributed Python package

Graph Motif Model Documentation

Using Neo4j with Vaadin Part 1:…

Filed under: Interface Research/Design — Patrick Durusau @ 6:16 pm

Using Neo4j with Vaadin Part 1: Creating the Architecture

In case you aren’t familiar with Vaadin.

Server-side Java code, so no dependence on plugins or JavaScript.

Another candidate for topic map UIs.

Linked Data in JSON

Filed under: JSON,Linked Data — Patrick Durusau @ 6:15 pm

A mailing list has been created for Linked Data in JSON.

Manu Sporny has posted Updated JSON-LD Draft with a summary of changes and links for those already familiar with the draft.

You will be encountering it so it will be helpful to follow the discussion.
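
For a flavor of what the markup looks like, a minimal example (keywords shown in the later standardized form; the 2011 draft differed in some details):

```json
{
  "@context": {
    "name": "http://xmlns.com/foaf/0.1/name"
  },
  "name": "Manu Sporny"
}
```

The @context maps the plain key "name" onto the FOAF name property, so ordinary-looking JSON carries linked data semantics.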

Nitrogen Web Framework

Filed under: Erlang,Riak — Patrick Durusau @ 6:14 pm

Nitrogen Web Framework

From the website:

Nitrogen Web Framework is the fastest way to develop interactive web applications in full-stack Erlang.

Whether you are working with Riak (also programmed in Erlang) or not, this web framework may be of interest.

May 7, 2011

On the dangers of personalization

Filed under: Data Silos,Filters,Personalization — Patrick Durusau @ 6:06 pm

On the dangers of personalization

From the post:

We’re getting our search results seriously edited and, I bet, most of us don’t even know it. I didn’t. One Google engineer says that their search engine uses 57 signals to personalize your search results, even when you’re logged out.

Do we really want to live in a web bubble?

What I find interesting about this piece is that it describes a data silo but from the perspective of an individual.

Think about it.

A data silo is based on data that is filtered and stored.

Personalization is based on data that is filtered and presented.

Do you see any difference?

Structuring data integration models and data integration architecture

Filed under: Data Integration — Patrick Durusau @ 5:51 pm

Structuring data integration models and data integration architecture

By Anthony David Giordano

From the post:

In this excerpt from Data Integration Blueprint and Modeling, readers will learn how to build a business case for a new data integration design process and how to improve the development process for data integration modeling. Readers will also get tips on leveraging process modeling for data integration and designing data integration architecture models, plus definitions for three data integration modeling types – physical, logical and conceptual.

Interesting enough that I bought a copy of the book.

Mostly to see where in the data integration design process it would make the most sense to pitch topic maps.

It may also offer clues about where topic maps would fit best in data integration tools.

If you are familiar with this book, please comment.

Cassandra – New Beta

Filed under: Cassandra,NoSQL — Patrick Durusau @ 5:50 pm

Cassandra – New Beta

Version 0.8.0 beta2 has been posted!

Changes.

NoSQL Databases

Filed under: NoSQL — Patrick Durusau @ 5:49 pm

NoSQL Databases

I saw this on the High Scalability blog. It’s a 120+ page overview of NoSQL databases by Christof Strauch, from Stuttgart Media University.

Christof is quoted as saying the goals of the paper were:

The paper aims at giving a systematic and thorough introduction and overview of the NoSQL field by assembling information dispersed among blogs, wikis and scientific papers. It firstly discusses reasons, rationales and motives for the development and usage of nonrelational database systems. These can be summarized by the need for high scalability, the processing of large amounts of data, the ability to distribute data among many (often commodity) servers, consequently a distribution-aware design of DBMSs.

The paper then introduces fundamental concepts, techniques and patterns that are commonly used by NoSQL databases to address consistency, partitioning, storage layout, querying, and distributed data processing. Important concepts like eventual consistency and ACID vs. BASE transaction characteristics are discussed along with a number of notable techniques such as multi-version storage, vector clocks, state vs. operational transfer models, consistent hashing, MapReduce, and row-based vs. columnar vs. log-structured merge tree persistence.

As a first class of NoSQL databases, key-value-stores are examined by looking at the proprietary, fully distributed, eventual consistent Amazon Dynamo store as well as popular opensource key-value-stores like Project Voldemort, Tokyo Cabinet/Tyrant and Redis.

In the following, document stores are being observed by reviewing CouchDB and MongoDB as the two major representatives of this class of NoSQL databases. Lastly, the paper takes a look at column-stores by discussing Google’s Bigtable, Hypertable and HBase, as well as Apache Cassandra which integrates the full-distribution and eventual consistency of Amazon’s Dynamo with the data model of Google’s Bigtable.
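
Several of the techniques the survey covers are small enough to sketch directly. For instance, consistent hashing, as used by Dynamo and Cassandra, fits in a few lines of Python (a minimal sketch with virtual nodes; real systems add replication and failure handling):

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    """Map a string to a point on the ring via MD5."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

class ConsistentHashRing:
    """Minimal consistent hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        self._keys = []   # sorted hash positions on the ring
        self._map = {}    # hash position -> node name
        for node in nodes:
            for i in range(vnodes):
                h = _hash(f"{node}#{i}")
                bisect.insort(self._keys, h)
                self._map[h] = node

    def node_for(self, key: str) -> str:
        """Return the first node clockwise from the key's position."""
        h = _hash(key)
        idx = bisect.bisect(self._keys, h) % len(self._keys)
        return self._map[self._keys[idx]]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))  # keys hash to a stable owning node
```

The point of the virtual nodes is that adding or removing a physical node only remaps a small, evenly spread fraction of the keys.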

