Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 7, 2012

Integrating Lucene with HBase

Filed under: Geographic Information Retrieval,HBase,Lucene,Spatial Index — Patrick Durusau @ 5:40 pm

Integrating Lucene with HBase by Boris Lublinsky and Mike Segel.

You have to get to the conclusion for the punch line:

The simple implementation described in this paper fully supports all of the Lucene functionality, as validated by many unit tests from both the Lucene core and contrib modules. It can be used as the foundation for building a very scalable search implementation, leveraging the inherent scalability of HBase and its fully symmetric design, which allows adding any number of processes serving HBase data. It also avoids the need to close an open Lucene index reader to incorporate newly indexed data, which becomes automatically available to users with a possible delay controlled by the cache time-to-live parameter. In the next article we will show how to extend this implementation to incorporate geospatial search support.

Put why your article is important in the introduction as well.

The second article does better:

Implementing Lucene Spatial Support

In our previous article [1], we discussed how to integrate Lucene with HBase for improved scalability and availability. In this article I will show how to extend this implementation with spatial support.

The Lucene spatial contribution package [2, 3, 4, 5] provides powerful support for spatial search, but is limited to finding the closest point. In reality spatial search often has significantly more requirements: for example, which points belong to a given shape (circle, bounding box, polygon), which shapes intersect with a given shape, and so on. The solution presented in this article solves all of the above problems.
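
If you want to experiment with bounding box queries in plain Lucene while waiting on the article’s code, a minimal sketch might look like this. It is only an illustration of the idea, not the authors’ implementation, and the field names (“text”, “lat”, “lon”) are my assumptions:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class BoundingBoxQuery {

    // Build a query matching documents whose point lies inside the given box
    // and which also match a keyword term. The field names are assumptions;
    // they must match whatever the indexing code used.
    public static Query boundingBox(String keyword,
                                    double minLat, double maxLat,
                                    double minLon, double maxLon) {
        BooleanQuery query = new BooleanQuery();
        query.add(new TermQuery(new Term("text", keyword)), Occur.MUST);
        query.add(NumericRangeQuery.newDoubleRange("lat", minLat, maxLat, true, true), Occur.MUST);
        query.add(NumericRangeQuery.newDoubleRange("lon", minLon, maxLon, true, true), Occur.MUST);
        return query;
    }
}
```

A circle or polygon test would still need a second, finer-grained filtering pass, which is presumably where the article’s approach comes in.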

March 6, 2012

Neo4j Heroku Challengers – Vote Now

Filed under: Heroku,Neo4j — Patrick Durusau @ 8:10 pm

Neo4j Heroku Challengers – Vote Now

From the post:

The Neo4j Heroku Challenge has closed, leaving a brilliant collection of projects to highlight developing with Neo4j using a broad range of languages and frameworks. With the challenge closed to entries, it is time for the voting! Let’s take a look at the challengers to see who deserves your support.

A wide variety of apps based on Neo4j in the cloud!

Take a look and see what looks good to you! (Don’t forget to vote!)

Computer Algorithms: Merge Sort

Filed under: Algorithms,Merge Sort — Patrick Durusau @ 8:10 pm

Computer Algorithms: Merge Sort

From the post:

Basically, sorting algorithms can be divided into two main groups: those based on comparisons and those that are not. I have already posted about some of the algorithms in the first group. Insertion sort, bubble sort and Shell sort are based on the comparison model. The problem with these three algorithms is that their complexity is O(n²), so they are very slow.

So is it possible to sort a list of items by comparing their items faster than O(n²)? The answer is yes, and here’s how we can do it.

Nice illustrations!

I suspect that algorithms and graph algorithms in particular are going to become fairly common here. Suggestions of work or posts that I should cover most welcome!
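
For readers who prefer code to pictures, here is a minimal merge sort in Java. It follows the standard divide-and-conquer recipe rather than any particular listing from the post:

```java
import java.util.Arrays;

public class MergeSort {

    // Sort by recursively splitting the array in half, sorting each half,
    // then merging the two sorted halves. Runs in O(n log n) comparisons.
    public static int[] sort(int[] a) {
        if (a.length <= 1) {
            return a;
        }
        int mid = a.length / 2;
        int[] left = sort(Arrays.copyOfRange(a, 0, mid));
        int[] right = sort(Arrays.copyOfRange(a, mid, a.length));
        return merge(left, right);
    }

    // Merge two sorted arrays into one sorted array.
    private static int[] merge(int[] left, int[] right) {
        int[] out = new int[left.length + right.length];
        int i = 0, j = 0, k = 0;
        while (i < left.length && j < right.length) {
            out[k++] = (left[i] <= right[j]) ? left[i++] : right[j++];
        }
        while (i < left.length) out[k++] = left[i++];
        while (j < right.length) out[k++] = right[j++];
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sort(new int[]{5, 1, 4, 2, 8, 3})));
        // prints [1, 2, 3, 4, 5, 8]
    }
}
```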

Stanford – Delayed Classes – Enroll Now!

If you have been waiting for notices about the delayed Stanford courses for Spring 2012, your wait is over!

Even if you signed up for more information, you must register at the course webpage to take the course.

Details as I have them on 6 March 2012 (check course pages for official information):

  • Cryptography – starts March 12th.
  • Design and Analysis of Algorithms Part 1 – starts March 12th.
  • Game Theory – starts March 19th.
  • Natural Language Processing – starts March 12th.
  • Probabilistic Graphical Models – starts March 19th.

You may be asking yourself, “Are all these courses useful for topic maps?”

I would answer by pointing out that librarians and indexers rely on a broad knowledge of the world to make information more accessible to users.

By way of contrast, “big data” and Google have made it less accessible.

Something to think about while you are registering for one or more of these courses!

Cloudera Manager | Activity Monitoring & Operational Reports Demo Video

Filed under: Cloud Computing,Cloudera,Hadoop — Patrick Durusau @ 8:10 pm

Cloudera Manager | Activity Monitoring & Operational Reports Demo Video by Jon Zuanich.

From the post:

In this demo video, Philip Zeyliger, a software engineer at Cloudera, discusses the Activity Monitoring and Operational Reports in Cloudera Manager.

Activity Monitoring

The Activity Monitoring feature in Cloudera Manager consolidates all Hadoop cluster activities into a single, real-time view. This capability lets you see who is running what activities on the Hadoop cluster, both at the current time and through historical activity views. Activities are either individual MapReduce jobs or those that are part of larger workflows (via Oozie, Hive or Pig).

Operational Reports

Operational Reports provide a visualization of current and historical disk utilization by user, user groups and directory. In addition, it tracks MapReduce activity on the Hadoop cluster by job, user, group or job ID. These reports are aggregated over selected time periods (hourly, daily, weekly, etc.) and can be exported as XLS or CSV files.

It is a sign of Hadoop’s maturity that professional management interfaces have started to appear.

Hadoop has always been manageable. The question was how to find someone willing to marry your cluster. And what happened in the case of a divorce?

Professional management tools enable a less intimate relationship between your cluster and its managers. Not to mention the availability of a larger pool of managers for your cluster.

One request: please avoid the default security options on Vimeo videos. They should be embeddable and downloadable in all cases.

running mahout collocations over common crawl text

Filed under: Common Crawl,Mahout — Patrick Durusau @ 8:09 pm

running mahout collocations over common crawl text by Mat Kelcey.

From the post:

Common Crawl is a publicly available 30TB web crawl taken between September 2009 and September 2010. As a small project I decided to extract and tokenise the visible text of the web pages in this dataset. All the code to do this is on github.

Can you answer Mat’s question about the incidence of Lithuanian pages? (Please post here.)

Extending the GATK for custom variant comparisons using Clojure

Filed under: Bioinformatics,Biomedical,Clojure,MapReduce — Patrick Durusau @ 8:09 pm

Extending the GATK for custom variant comparisons using Clojure by Brad Chapman.

From the post:

The Genome Analysis Toolkit (GATK) is a full-featured library for dealing with next-generation sequencing data. The open-source Java code base, written by the Genome Sequencing and Analysis Group at the Broad Institute, exposes a Map/Reduce framework allowing developers to code custom tools taking advantage of support for: BAM Alignment files through Picard, BED and other interval file formats through Tribble, and variant data in VCF format.

Here I’ll show how to utilize the GATK API from Clojure, a functional, dynamic programming language that targets the Java Virtual Machine. We’ll:

  • Write a GATK walker that plots variant quality scores using the Map/Reduce API.
  • Create a custom annotation that adds a mean neighboring base quality metric using the GATK VariantAnnotator.
  • Use the VariantContext API to parse and access variant information in a VCF file.

The Clojure variation library is freely available and is part of a larger project to provide variant assessment capabilities for the Archon Genomics XPRIZE competition.

Interesting data, commercial potential, cutting edge technology and subject identity issues galore. What more could you want?

mlpy: Machine Learning Python

Filed under: Machine Learning,Python — Patrick Durusau @ 8:09 pm

mlpy: Machine Learning Python by Davide Albanese, Roberto Visintainer, Stefano Merler, Samantha Riccadonna, Giuseppe Jurman, and Cesare Furlanello.

Abstract:

mlpy is a Python Open Source Machine Learning library built on top of NumPy/SciPy and the GNU Scientific Libraries. mlpy provides a wide range of state-of-the-art machine learning methods for supervised and unsupervised problems, and it is aimed at finding a reasonable compromise among modularity, maintainability, reproducibility, usability and efficiency. mlpy is multiplatform, works with Python 2 and 3, and is distributed under GPL3 at the project website.

There must have been a publication requirement, because the paper doesn’t really add anything to the already excellent documentation at the project site. It is more of a short summary/overview sort of document.

The software, on the other hand, deserves your close attention.

I guess the authors got the memo on GPL licensing? 😉

That’s not science: the FSF’s analysis of GPL usage

Filed under: Licensing — Patrick Durusau @ 8:09 pm

That’s not science: the FSF’s analysis of GPL usage by Matthew Aslett.

From the post:

The Free Software Foundation has responded to our analysis of figures that indicate that the proportion of open source projects using the GPL is in decline.

Specifically, FSF executive director John Sullivan gave a presentation at FOSDEM which asked “Is copyleft being framed”. You can find his slides here, a write-up about the presentation here, and Slashdot discussion here.

Most of the opposition to the earlier posts on this subject addressed perceived problems with the underlying data, specifically that it comes from Black Duck, which does not publish details of its methodology. John’s response is no exception. “That’s not science,” he asserts, with regards to the lack of clarity.

This is a valid criticism, which is why – prompted by Bradley M Kuhn – I previously went to a lot of effort to analyze data from Rubyforge, Freshmeat, ObjectWeb and the Free Software Foundation collected and published by FLOSSmole, only to find that it confirmed the trend suggested by Black Duck’s figures. I was personally therefore happy to use Black Duck’s figures for our update.

I wasn’t really sure why this was an issue until I followed the link to On the continuing decline of the GPL, where I read:

Our projection also suggests that permissive licenses (specifically in this case, MIT/Apache/BSD/Ms-PL) will account for close to 30% of all open source software by September 2012, up from 15% in June 2009 (we don’t have a figure for June 2008 unfortunately).

Permissive licenses work for me, both for data as well as software.

Think of it this way: commercial use of data or software is like other forms of commercial activity; the first one is always free. Do good work and you will attract the attention of those who would like to have it all the time.

MPC – Minnesota Population Center

Filed under: Census Data,Government Data — Patrick Durusau @ 8:09 pm

MPC – Minnesota Population Center

I mentioned the Integrated Public Use Microdata Series (IPUMS-USA) data set last year which self-describes as:

IPUMS-USA is a project dedicated to collecting and distributing United States census data. Its goals are to:

  • Collect and preserve data and documentation
  • Harmonize data
  • Disseminate the data absolutely free!

Use it for GOOD — never for EVIL

There are international data and more U.S. data sets that may be of interest.

Using your Lucene index as input to your Mahout job – Part I

Filed under: Clustering,Collocation,Lucene,Mahout — Patrick Durusau @ 8:08 pm

Using your Lucene index as input to your Mahout job – Part I

From the post:

This blog shows you how to use an upcoming Mahout feature, the lucene2seq program or https://issues.apache.org/jira/browse/MAHOUT-944. This program reads the contents of stored fields in your Lucene index and converts them into text sequence files, to be used by a Mahout text clustering job. The tool contains both a sequential and MapReduce implementation and can be run from the command line or from Java using a bean configuration object. In this blog I demonstrate how to use the sequential version on an index of Wikipedia.

Access to original text can help with improving clustering results. See the blog post for details.
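
In rough outline, the conversion amounts to reading a stored field from a Lucene index and appending it to a Hadoop SequenceFile. The sketch below is my own illustration of that step, not the Mahout code; the field name “body” and the paths are placeholders:

```java
import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

public class LuceneToSequenceFile {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Read documents from a local Lucene index; "body" must be a *stored* field.
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/index")));
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path("/path/to/output/part-0"), Text.class, Text.class);
        try {
            for (int i = 0; i < reader.maxDoc(); i++) {
                if (reader.isDeleted(i)) continue;          // skip deleted docs
                String body = reader.document(i).get("body");
                if (body != null) {
                    writer.append(new Text(Integer.toString(i)), new Text(body));
                }
            }
        } finally {
            writer.close();
            reader.close();
        }
    }
}
```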

March 5, 2012

Bad vs Good Search Experience

Filed under: Interface Research/Design,Lucene,Search Interface,Searching — Patrick Durusau @ 7:53 pm

Bad vs Good Search Experience by Emir Dizdarevic.

From the post:

The Problem

This article will show how a bad search solution can be improved. We will demonstrate how to build an enterprise search solution relatively easily using Apache Lucene/Solr.

We took a local ad site as an example of a bad search experience.

We crawled the ad site with Apache Nutch, using a couple of home grown plugins to fetch only the data we want and not the whole site. Stay tuned for a separate article on this topic.

‘BAD’ search is based on real search results from the ad site, i.e. how the website search currently works. ‘GOOD’ search is based on the same data, but indexed with Apache Lucene/Solr (inverted index).

BAD Search: We assume that it’s based on exact-match criteria or something similar to a ‘%like%’ database statement. To simulate this behavior we used a content field that is tokenized by whitespace and lowercased, and used phrase queries every time. This is the closest we could get to the existing ad site search solution, but even this bad version performed better.

An excellent post in part because of the detailed example but also to show that improving search results is an iterative process.
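
You can reproduce the gist of the ‘BAD’ versus ‘GOOD’ comparison in a few lines of Lucene. The sketch below contrasts an exact phrase query over whitespace-tokenized content with a parsed, analyzed query; the analyzer choices are my assumptions, not the exact configuration from the post:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class BadVsGoodQuery {

    // "BAD": behaves like an exact/%like% match -- every query term must occur
    // as an adjacent phrase in the lowercased, whitespace-tokenized content.
    public static Query badQuery(String userInput) {
        PhraseQuery phrase = new PhraseQuery();
        for (String token : userInput.trim().toLowerCase().split("\\s+")) {
            phrase.add(new Term("content", token));
        }
        return phrase;
    }

    // "GOOD": run the input through a real analyzer and the query parser,
    // so individual terms can match independently and be ranked by relevance.
    public static Query goodQuery(String userInput) throws Exception {
        QueryParser parser = new QueryParser(Version.LUCENE_35, "content",
                new StandardAnalyzer(Version.LUCENE_35));
        return parser.parse(userInput);
    }
}
```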

Enjoy!

Java Remote Method Invocation (RMI) for Bioinformatics

Filed under: Bioinformatics,Java,Remote Method Invocation (RMI) — Patrick Durusau @ 7:53 pm

Java Remote Method Invocation (RMI) for Bioinformatics by Pierre Lindenbaum.

From the post:

Java Remote Method Invocation (Java RMI) enables the programmer to create distributed Java technology-based to Java technology-based applications, in which the methods of remote Java objects can be invoked from other Java virtual machines*, possibly on different hosts.“[Oracle] In the current post a Java client will send a Java class to the server that will analyze a DNA sequence fetched from the NCBI, using RMI.

Distributed computing, both to the client and server, is likely to form part of a topic map solution. This example is one drawn from bioinformatics but the principles are generally applicable.
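
If RMI is new to you, a minimal remote interface and server registration look roughly like this. The interface and the toy analysis method are placeholders of my own, not Pierre’s code:

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

// Remote interface: every remotely callable method must declare RemoteException.
interface SequenceAnalyzer extends Remote {
    double gcContent(String dnaSequence) throws RemoteException;
}

public class AnalyzerServer implements SequenceAnalyzer {

    @Override
    public double gcContent(String dnaSequence) throws RemoteException {
        if (dnaSequence.isEmpty()) return 0.0;
        int gc = 0;
        for (char c : dnaSequence.toCharArray()) {
            if (c == 'G' || c == 'C') gc++;
        }
        return (double) gc / dnaSequence.length();
    }

    public static void main(String[] args) throws Exception {
        // Export the implementation and bind it in a local RMI registry.
        SequenceAnalyzer stub =
                (SequenceAnalyzer) UnicastRemoteObject.exportObject(new AnalyzerServer(), 0);
        Registry registry = LocateRegistry.createRegistry(1099);
        registry.rebind("SequenceAnalyzer", stub);
        System.out.println("SequenceAnalyzer bound; clients can now call gcContent() remotely.");
    }
}
```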

Models for Hierarchical Data with SQL and PHP

Filed under: Adjacency List,Closure Table,Nested Sets,Path Enumeration — Patrick Durusau @ 7:53 pm

Models for Hierarchical Data with SQL and PHP by Bill Karwin.

From the description:

Tree-like data relationships are common, but working with trees in SQL usually requires awkward recursive queries. This talk describes alternative solutions in SQL, including:

  • Adjacency List
  • Path Enumeration
  • Nested Sets
  • Closure Table

Code examples will show using these designs in PHP, and offer guidelines for choosing one design over another.

May be of interest as a consumer of data stored using one of these designs or even using them more directly.
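
As a taste of one of those designs, here is a closure table sketch, wrapped in JDBC so it stays runnable. The schema, the column names and the in-memory H2 database are my assumptions:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ClosureTableDemo {

    public static void main(String[] args) throws Exception {
        // Any JDBC URL works; an in-memory H2 database is assumed here.
        Connection conn = DriverManager.getConnection("jdbc:h2:mem:tree");
        Statement st = conn.createStatement();

        st.execute("CREATE TABLE category (id INT PRIMARY KEY, name VARCHAR(50))");
        // Closure table: one row per (ancestor, descendant) pair, including self-pairs.
        st.execute("CREATE TABLE category_paths (ancestor INT, descendant INT, depth INT)");

        st.execute("INSERT INTO category VALUES (1, 'root'), (2, 'child'), (3, 'grandchild')");
        st.execute("INSERT INTO category_paths VALUES " +
                   "(1,1,0), (2,2,0), (3,3,0), " +   // self references
                   "(1,2,1), (2,3,1), (1,3,2)");      // ancestor/descendant pairs

        // Fetch the whole subtree under node 1 with a single join -- no recursion needed.
        ResultSet rs = st.executeQuery(
                "SELECT c.name, p.depth FROM category c " +
                "JOIN category_paths p ON c.id = p.descendant " +
                "WHERE p.ancestor = 1 ORDER BY p.depth");
        while (rs.next()) {
            System.out.println(rs.getString("name") + " (depth " + rs.getInt("depth") + ")");
        }
        conn.close();
    }
}
```

The win is that subtree queries become a single join; the cost is maintaining the extra path rows on every insert or move.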

Bio4j 0.7, some numbers

Filed under: Bio4j,Graphs,Neo4j — Patrick Durusau @ 7:52 pm

Bio4j 0.7, some numbers by Pablo Pareja Tobes.

From the post:

There have already been a good few posts showing different uses and applications of Bio4j, but what about Bio4j data itself?

Today I’m going to show you some basic statistics about the different types of nodes and relationships Bio4j is made up of.

Just as a heads up, here are the general numbers of Bio4j 0.7 :

  • Number of Relationships: 530,642,683
  • Number of Nodes: 76,071,411
  • Relationship types: 139
  • Node types: 38

The numbers speak for themselves. More information at Pablo’s post.

An Evidential Logic for Multi-Relational Networks

Filed under: Description Logic,Evidential Logic — Patrick Durusau @ 7:52 pm

An Evidential Logic for Multi-Relational Networks by Marko A. Rodriguez and Joe Geldart.

Slide presentation on description and evidential logic.

By the same title, see the article by these authors:

An Evidential Logic for Multi-Relational Networks

Abstract:

Multi-relational networks are used extensively to structure knowledge. Perhaps the most popular instance, due to the widespread adoption of the Semantic Web, is the Resource Description Framework (RDF). One of the primary purposes of a knowledge network is to reason; that is, to alter the topology of the network according to an algorithm that uses the existing topological structure as its input. There exist many such reasoning algorithms. With respect to the Semantic Web, the bivalent, monotonic reasoners of the RDF Schema (RDFS) and the Web Ontology Language (OWL) are the most prevalent. However, nothing prevents other forms of reasoning from existing in the Semantic Web. This article presents a non-bivalent, non-monotonic, evidential logic and reasoner that is an algebraic ring over a multi-relational network equipped with two binary operations that can be composed to execute various forms of inference. Given its multi-relational grounding, it is possible to use the presented evidential framework as another method for structuring knowledge and reasoning in the Semantic Web. The benefits of this framework are that it works with arbitrary, partial, and contradictory knowledge while, at the same time, it supports a tractable approximate reasoning process.

Of the two I would recommend the paper over the slides. Just a fuller presentation. (Despite having the same name, these could be represented as separate topics in a topic map.)

Whose Requirements Are They Anyway?

Filed under: Requirements — Patrick Durusau @ 7:52 pm

Over the last 4,000+ postings I have read an even larger number of presentations, papers, etc.

We all start discussions from what we know best so those presentations/papers/etc. started with a position, product or technology best known to the author.

No surprise there.

What happens next is no surprise either but it isn’t the best next step, at least for users/customers.

Your requirements, generally stated, can be best met by the author’s product or technology.

I am certainly not blameless in that regard but is it the best way to approach a user/customer’s requirements?

By “best way” I mean a solution that meets the user/customer’s requirements, whether that includes your product/technology or not.

Which means changing the decision making process from:

  1. Choose SQL, NoSQL, Semantic Web, Linked Data, Topic Maps, Graphs, Cloud, non-Cloud, Web, non-Web, etc.
  2. Create solution based on choice in #1

to:

  1. Define user/customer requirements
  2. Evaluate cost of meeting requirements against various technology options
  3. Decide on solution based on information from #2
  4. Create solution

I can’t give you the identity, but I once consulted with a fairly old (100+ years) organization that had been sold a state-of-the-art publishing system plus installation. It was like a $500K dog that you had to step over going in the door. Great product for its intended application space, utterly useless for the publishing work flow of the organization.

We all know stories like that one, both in the private sector and at various levels of government around the world. I know a real horror story about an open source application that required support (they all do) and regularly fell over on its side, requiring experts to be flown in from another country. Failing wasn’t one of the requirements for the application, but open source mania led to its installation.

I like open source projects and serve as the editor of the format (ODF) for several of them. But, choosing a technology based on ideology and not practical requirements is a bad choice. (full stop)

It’s unreasonable to expect vendors to urge users/customers to critically evaluate their requirements against a range of products.

Users are going to have to step up and either perform those comparisons themselves or hire non-competing consultants to assist them.

A vendor whose product is intended to meet your requirements (rather than their requirement of making the sale) won’t object.

Perhaps that could be the first test of continuing discussions with a vendor?

Trees in the Database: Advanced Data Structures

Filed under: Data Structures,Database,PostgreSQL,RDBMS,SQL,Trees — Patrick Durusau @ 7:52 pm

Trees in the Database: Advanced Data Structures

Lorenzo Alberton writes:

Despite the NoSQL movement trying to flag traditional databases as a dying breed, the RDBMS keeps evolving and adding new powerful weapons to its arsenal. In this talk we’ll explore Common Table Expressions (SQL-99) and how SQL handles recursion, breaking the bi-dimensional barriers and paving the way to more complex data structures like trees and graphs, and how we can replicate features from social networks and recommendation systems. We’ll also have a look at window functions (SQL:2003) and the advanced reporting features they make finally possible. The first part of this talk will cover several different techniques to model a tree data structure into a relational database: parent-child (adjacency list) model, materialized path, nested sets, nested intervals, hybrid models, Common Table Expressions. Then we’ll move one step forward and see how we can model a more complex data structure, i.e. a graph, with concrete examples from today’s websites. Starting from real-world examples of social networks’ and recommendation systems’ features, and with the help of some graph theory, this talk will explain how to represent and traverse a graph in the database. Finally, we will take a look at Window Functions and how they can be useful for data analytics and simple inline aggregations, among other things. All the examples have been tested on PostgreSQL >= 8.4.

Very impressive presentation!

Definitely makes me want to dust off my SQL installations and manuals for a closer look!
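
To give a flavor of the Common Table Expression material, here is what a recursive subtree query over a plain adjacency list can look like from JDBC. The schema and connection details are my own placeholders; the SQL assumes PostgreSQL 8.4 or later:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class RecursiveCteDemo {

    // WITH RECURSIVE walks an adjacency-list table without application-side loops.
    private static final String SUBTREE_QUERY =
        "WITH RECURSIVE subtree AS ( " +
        "  SELECT id, parent_id, name, 0 AS depth FROM node WHERE id = ? " +
        "  UNION ALL " +
        "  SELECT n.id, n.parent_id, n.name, s.depth + 1 " +
        "  FROM node n JOIN subtree s ON n.parent_id = s.id " +
        ") " +
        "SELECT id, name, depth FROM subtree ORDER BY depth, id";

    public static void main(String[] args) throws Exception {
        // Connection details are placeholders for a local PostgreSQL instance.
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/demo", "demo", "demo");
        PreparedStatement ps = conn.prepareStatement(SUBTREE_QUERY);
        ps.setInt(1, 1);   // root of the subtree we want
        ResultSet rs = ps.executeQuery();
        while (rs.next()) {
            System.out.printf("%d %s (depth %d)%n",
                    rs.getInt("id"), rs.getString("name"), rs.getInt("depth"));
        }
        conn.close();
    }
}
```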

“Modern” Algorithms and Data Structures (Bloom Filters, Merkle Trees)

Filed under: Bloom Filters,Cassandra,HBase,Merkle Trees — Patrick Durusau @ 7:51 pm

“Modern” Algorithms and Data Structures (Bloom Filters, Merkle Trees) by Lorenzo Alberton.

From the description:

The first part of a series of talks about modern algorithms and data structures, used by nosql databases like HBase and Cassandra. An explanation of Bloom Filters and several derivates, and Merkle Trees.

Looking forward to more of this series!
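
As a reminder of how small the core idea is, here is a toy Bloom filter in Java. The double-hashing scheme is a common textbook choice, not something taken from the talk:

```java
import java.util.BitSet;

public class BloomFilter {

    private final BitSet bits;
    private final int size;
    private final int numHashes;

    public BloomFilter(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    // Derive k bit positions from two base hashes (double hashing).
    private int position(String item, int i) {
        int h1 = item.hashCode();
        int h2 = (h1 >>> 16) | 1;               // cheap second hash, forced odd
        int idx = (h1 + i * h2) % size;
        return idx < 0 ? idx + size : idx;
    }

    public void add(String item) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(item, i));
        }
    }

    // May return false positives, never false negatives.
    public boolean mightContain(String item) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(item, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BloomFilter filter = new BloomFilter(1 << 16, 4);
        filter.add("cassandra");
        filter.add("hbase");
        System.out.println(filter.mightContain("cassandra")); // true
        System.out.println(filter.mightContain("riak"));      // false (with high probability)
    }
}
```

Sizing the bit array and the number of hashes against the expected item count is what keeps the false-positive rate manageable.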

March 4, 2012

Multiperspective

Filed under: Content Management System (CMS),Neo4j,Scala — Patrick Durusau @ 7:18 pm

Multiperspective

A hosted “content management system,” aka, a hosted website solution. Based on Scala and Neo4j.

I suspect that Scala and Neo4j make it easier for the system developers to offer a hosted website solution.

I am not sure that in a hosted solution the average web developer will notice the difference.

Still, unless you want a “custom” domain name, the service is “free” with some restrictions.

I would be interested to hear whether you can tell that it is Scala and Neo4j powering the usual services.

From “Under the hood”

Multispective.com is a next-generation content management system. In this post we will look at how this system works and how this setup can benefit our users.

Unlike most other content management systems, multispective.com is entirely built in the programming language Scala, which means it runs on the rock-solid and highly performant Java Virtual Machine.

Scala offers us a highly powerful programming model, greatly cutting back the amount of software we had to write, while its powerful type system reduces the number of potential coding errors.

Another unique feature of multispective.com is the use of the Neo4j database engine.

Nearly all content management systems in use today, store their information in a Relational Database Management System (RDBMS), a proven technology ubiquitous around the ICT spectrum.

Relational Database Management Systems are very useful and have become extremely robust through decades of improvements, but they are not very well suited for highly connected data.

The world-wide-web is highly connected and in our search for the right technology for our software, we decided a different approach towards storage of data was needed.

Neo4j ended up being the preferred solution for our storage needs. This database engine is based upon the model of the property graph. Where an RDBMS stores information in tables, Neo4j stores information as nodes and relationships, where both can contain properties.

The data model of the property-graph is extremely simple, so it’s easy to reason about.

There were two main advantages to a graph-database for us. First of all, relationships are explicitly stored in the database. This makes navigating over complex networked data possible while maintaining a reasonable performance. Secondly, a graph database does not require a schema.
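
The property graph model is easy to see in code. A minimal example with the embedded Neo4j Java API follows; the properties and relationship type are made up for illustration and have nothing to do with multispective.com’s actual schema:

```java
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.Transaction;
import org.neo4j.kernel.EmbeddedGraphDatabase;

public class PropertyGraphExample {

    public static void main(String[] args) {
        // Nodes and relationships both carry properties -- that is the whole model.
        GraphDatabaseService graphDb = new EmbeddedGraphDatabase("target/pages-db");
        Transaction tx = graphDb.beginTx();
        try {
            Node home = graphDb.createNode();
            home.setProperty("title", "Home");

            Node about = graphDb.createNode();
            about.setProperty("title", "About");

            Relationship link = home.createRelationshipTo(
                    about, DynamicRelationshipType.withName("LINKS_TO"));
            link.setProperty("label", "about us");

            tx.success();
        } finally {
            tx.finish();
        }
        graphDb.shutdown();
    }
}
```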

Why Are There So Few Efforts to Text Mine the Open Access Subset of PubMed Central?

Filed under: Data Mining,PubMed — Patrick Durusau @ 7:17 pm

Why Are There So Few Efforts to Text Mine the Open Access Subset of PubMed Central? by Casey Bergman.

From the post:

The open access movement in scientific publishing has two broad aims: (i) to make scientific articles more broadly accessible and (ii) to permit unrestricted re-use of published scientific content. From its humble beginnings in 2001 with only two journals, PubMed Central (PMC) has grown to become the world’s largest repository of full-text open-access biomedical articles, containing nearly 2.4 million biomedical articles that can be freely downloaded by anyone around the world. Thus, while holding only ~11% of the total published biomedical literature, PMC can be viewed clearly as a major success in terms of making the biomedical literature more broadly accessible.

However, I argue that PMC has yet to catalyze similar success on the second goal of the open-access movement — unrestricted re-use of published scientific content. This point became clear to me when writing the discussions for two papers that my lab published last year. In digging around for references to cite, I was struck by how difficult it was to find examples of projects that applied text-mining tools to the entire set of open-access articles from PubMed Central. Unsure if this was a reflection of my ignorance or the actual state of the art in the field, I canvassed the biological text mining community, the bioinformatics community and two major open-access publishers for additional examples of text-mining on the entire open-access subset of PMC.

Surprisingly, I found that after a decade of existence only ~15 articles* have ever been published that have used the entire open-access subset of PMC for text-mining research. In other words, less than 2 research articles per year are being published that actually use the open-access contents of PubMed Central for large-scale data mining or service provision. I find the lack of uptake of PMC by text-mining researchers to be rather astonishing, considering it is an incredibly rich archive of the combined output of thousands of scientists worldwide.

Good question.

Suggestions for answers? (post to the original posting)

BTW, Casey includes a listing of the articles based on mining of the open-access contents of PubMed Central.

What other open access data sets suffer from a lack of use? Comments on why?

Spring Data Neo4j 2.1.0 Milestone 1 Released

Filed under: Neo4j,Spring Data — Patrick Durusau @ 7:17 pm

Spring Data Neo4j 2.1.0 Milestone 1 Released

Don’t be the last one in your office to see the latest milestone!

From the post:

Since the last release of Spring Data Neo4j we have worked on a number of issues that you raised as important improvements and extensions.

Thanks to Mark Spitzler, Oliver Gierke, Rajaram Ganeshan, Laurent Pireyn for their contributions and all the other community members for the feedback and discussions.

We want to encourage you to give it a try, especially the new things and send us your feedback.

We are aware of the issues that are still open and want to address them by the 2.1 release, which is planned for the end of March – aligned with Neo4j 1.7.

….

Changes in version 2.1.M1 (2012-03-02)

  • DATAGRAPH-181 added support for unique entities with template.getOrCreateNode and @Indexed(unique=true)
  • DATAGRAPH-198 added support for custom target type, e.g. storing a Date converted to a Long @GraphProperty(propertyType=Long.class)
  • DATAGRAPH-102 fixed type representation in graph with support for @TypeAlias to allow shorter type-identifiers in the graph
  • DATAGRAPH-204 pom.xml cleanup (repositories) and dependency to SFW is now range from 3.0.7.RELEASE – 3.2
  • DATAGRAPH-185 cypher queries for single fields return null on no results
  • DATAGRAPH-182 allow @RelatedTo on Single RelationshipEntity fields + internal refactorings
  • DATAGRAPH-202 provide a getRelationshipsBetween() method in Neo4jTemplate
  • GH-#34 Fix for using Neo4j High-Availability
  • DATAGRAPH-176 Added debug log output for cypher and gremlin query as well as derived query methods
  • DATAGRAPH-186 default value for readonly relationship collections
  • DATAGRAPH-173 fixed verify method for interfaces, added interface support for type-representation strategies
  • DATAGRAPH-169 Backquoting all variable parts of derived finder queries to accommodate for non-identifier names.
  • DATAGRAPH-164 Added methods to determine stored java type to neo4j-template and crud-repository
  • DATAGRAPH-166 fixed multiple sort parameters
  • documentation updates

NoSQL Data Modeling Techniques

Filed under: Data Models,NoSQL — Patrick Durusau @ 7:17 pm

NoSQL Data Modeling Techniques by Ilya Katsov.

From the post:

NoSQL databases are often compared by various non-functional criteria, such as scalability, performance, and consistency. This aspect of NoSQL is well-studied both in practice and theory because specific non-functional properties are often the main justification for NoSQL usage and fundamental results on distributed systems like the CAP theorem are well applicable to NoSQL systems. At the same time, NoSQL data modeling is not so well studied and lacks the systematic theory found in relational databases. In this article I provide a short comparison of NoSQL system families from the data modeling point of view and digest several common modeling techniques.

To explore data modeling techniques, we have to start with some more or less systematic view of NoSQL data models that preferably reveals trends and interconnections. The following figure depicts imaginary “evolution” of the major NoSQL system families, namely, Key-Value stores, BigTable-style databases, Document databases, Full Text Search Engines, and Graph databases:

Very complete and readable coverage of NoSQL data modeling techniques!

A must read if you are interested in making good choices between NoSQL solutions.

This post could profitably be turned into a book-length treatment with longer and more varied examples.

Social networks in the database: using a graph database

Filed under: Graph Databases,Neo4j,Social Networks — Patrick Durusau @ 7:17 pm

Social networks in the database: using a graph database

The Neo4j response to Lorenzo Alberton’s post on social networks in a relational database.

From the post:

Recently Lorenzo Alberton gave a talk on Trees In The Database where he showed the most used approaches to storing trees in a relational database. Now he has moved on to an even more interesting topic with his article Graphs in the database: SQL meets social networks. Right from the beginning of his excellent article Alberton puts this technical challenge in a proper context:

Graphs are ubiquitous. Social or P2P networks, thesauri, route planning systems, recommendation systems, collaborative filtering, even the World Wide Web itself is ultimately a graph! Given their importance, it’s surely worth spending some time in studying some algorithms and models to represent and work with them effectively.

After a brief explanation of what a graph data structure is, the article goes on to show how graphs can be represented in a table-based database. The rest of the article shows in detail how an adjacency list model can be used to represent a graph in a relational database. Different examples are used to illustrate what can be done in this way.

Graph databases and Neo4j in particular offer advantages when used with graphs but the Neo4j post overlooks several points.

Unlike graph databases, SQL databases are nearly, if not entirely, ubiquitous. It may well be that a user’s first “taste” of graph processing comes via an SQL database and leads them to expect more graph capabilities than an SQL solution can offer.

As Lorenzo points out in his posting, performance will vary depending upon the graph operations you need to perform. True for SQL databases and graph databases as well. Having a graph database doesn’t mean all graph algorithms run efficiently on your data set.

Finally:

A table-based system makes a good fit for static and simple data structures, ….

Isn’t going to ring true for anyone familiar with Oracle, PostgreSQL, MySQL, SQL Server, Informix, DB2 or any number of other “table-based systems.”

Graphs in the database: SQL meets social networks

Filed under: Database,Graphs,Social Networks,SQL — Patrick Durusau @ 7:17 pm

Graphs in the database: SQL meets social networks by Lorenzo Alberton.

If you are interested in graphs, SQL databases, Common Table Expressions (CTEs), together or in any combination, this is the article for you!

Lorenzo walks the reader through the basics of graphs with an emphasis on understanding how SQL techniques can be successfully used, depending upon your requirements.

From the post:

Graphs are ubiquitous. Social or P2P networks, thesauri, route planning systems, recommendation systems, collaborative filtering, even the World Wide Web itself is ultimately a graph! Given their importance, it’s surely worth spending some time in studying some algorithms and models to represent and work with them effectively. In this short article, we’re going to see how we can store a graph in a DBMS. Given how much attention my talk about storing a tree data structure in the db received, it’s probably going to be interesting to many. Unfortunately, the Tree models/techniques do not apply to generic graphs, so let’s discover how we can deal with them.
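
As a tiny taste of the approach, an adjacency list edge table and a friends-of-friends query might look like this. The schema and data are my own, not Lorenzo’s, and an in-memory H2 database is assumed:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class FriendOfFriend {

    public static void main(String[] args) throws Exception {
        // In-memory H2 database assumed, purely for demonstration.
        Connection conn = DriverManager.getConnection("jdbc:h2:mem:graph");
        Statement st = conn.createStatement();

        // A graph as an adjacency list: one row per directed edge.
        st.execute("CREATE TABLE friends (person VARCHAR(20), friend VARCHAR(20))");
        st.execute("INSERT INTO friends VALUES " +
                   "('alice','bob'), ('bob','carol'), ('bob','dave'), ('alice','carol')");

        // Friends-of-friends of alice, excluding alice and the people she already knows.
        ResultSet rs = st.executeQuery(
                "SELECT DISTINCT f2.friend FROM friends f1 " +
                "JOIN friends f2 ON f1.friend = f2.person " +
                "WHERE f1.person = 'alice' " +
                "AND f2.friend <> 'alice' " +
                "AND f2.friend NOT IN (SELECT friend FROM friends WHERE person = 'alice')");
        while (rs.next()) {
            System.out.println(rs.getString(1));   // prints dave
        }
        conn.close();
    }
}
```

The same join pattern extends to deeper traversals, but each extra hop adds another self-join, which is where graph databases start to look attractive.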

Redis (and Jedis) – delightfully simple and focused NoSQL

Filed under: Jedis,Redis — Patrick Durusau @ 7:17 pm

Redis (and Jedis) – delightfully simple and focused NoSQL by Ashwin Jayaprakash.

A very nice reminder that Redis may be the solution you need.

Redis is an open source NoSQL project that I had not paid much attention to, largely because it didn’t seem very special at the time, nor did it have a good persistence and paging story. Also, there is/was so much noise out there, the loudest being Memcached, Hadoop, Cassandra, Voldemort, Riak, MongoDB etc., that it slipped my mind.

Last weekend I thought I’d give Redis another try. This time I just wanted to see Redis for what it is and not compare it with other solutions. So, as it says on the site:

Redis is an open source, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets.

Seemed interesting enough to warrant another look. There are so many projects that need:

  • Simple, fast, light
  • In-memory (with optional checkpointing)
  • Fault tolerant / Sharded / Distributed
  • Shared access from many processes and machines
  • Some real data structures instead of just wimpy key-value
  • Flexible storage format – without needing crummy layers to hide/overcome limitations
  • Clean Java API

So, I downloaded the Windows port of Redis and Jedis JAR for the Java API.

  1. Unzip Redis Windows zip file 
  2. Copy the Jedis JAR file
  3. Go to the 64bit or 32bit folder and start “redis-server.exe”
  4. Write a simple Java program that uses Jedis to talk to the Redis server
  5. That’s it
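
Step 4 in miniature, using the Jedis API. The key names are arbitrary and a Redis server is assumed to be listening on the default port:

```java
import java.util.List;

import redis.clients.jedis.Jedis;

public class JedisQuickstart {

    public static void main(String[] args) {
        Jedis jedis = new Jedis("localhost", 6379);

        // Plain key/value.
        jedis.set("greeting", "hello from Jedis");
        System.out.println(jedis.get("greeting"));

        // A real data structure: a list used as a simple queue.
        jedis.lpush("jobs", "index-docs");
        jedis.lpush("jobs", "send-mail");
        List<String> pending = jedis.lrange("jobs", 0, -1);
        System.out.println(pending);

        // A hash, roughly a map stored under one key.
        jedis.hset("user:1", "name", "Patrick");
        System.out.println(jedis.hget("user:1", "name"));

        jedis.disconnect();
    }
}
```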

Explain.solr.pl

Filed under: Explain.solr.pl,Solr — Patrick Durusau @ 7:16 pm

Explain.solr.pl enables you to explore why you obtained particular results from Solr.

Useful for debugging/testing Solr queries but also for developing an intuitive feel for your data.

From the install.txt file:

Requirements:

  • ruby 1.8.7
  • rubygems
  • bundler
  • postgresql as database backend (probably works with other SQL servers, but this was not tested)

Solarium: Solr library for PHP

Filed under: PHP,Solarium,Solr — Patrick Durusau @ 7:16 pm

Solarium: Solr library for PHP

A problem with describing Solr tools is deciding which features to list:

  • Facet support
  • Query building API
  • Complex update queries
  • Query inheritance
  • Plugin system
  • DisMax support
  • Configuration mode
  • Spatial search
  • MoreLikeThis
  • Highlighting
  • Grouping
  • Spellcheck
  • Stats component
  • Terms queries
  • Distributed search
  • Analysis
  • Client adapters
  • Term and phrase escaping
  • Loadbalancing plugin
  • PostBigRequest plugin
  • CustomizeRequest plugin
  • Use-at-will structure
  • Developed using continuous integration
  • Solarium uses the PSR-0 standard

So I dodged that bullet by listing all of Solarium’s features. 😉

You won’t (and probably shouldn’t) try to use all of them for your first interface. But this listing should give you ideas for how your library interface can change and grow in the future.

A track record of successful use of technologies like PHP, Solarium and Solr isn’t going to harm your job prospects.

Jogger: almost like named_scopes

Filed under: JRuby,Named Scopes,Pipes,Traversal — Patrick Durusau @ 7:15 pm

Jogger: almost like named_scopes

From the post:

We talked about graph databases in this and this blog post. As you might have read we’re big fans of a graph database called neo4j, and we’re using it together with JRuby. In this post we’ll share a little piece of code we created to make expressing graph traversals super easy and fun.

Jogger – almost like named_scopes

Jogger is a JRuby gem that enables lazy people to do very expressive graph traversals with the great pacer gem. If you don’t know what the pacer gem is, you should probably check pacer out first. (And don’t miss the pacer section at the end of the post.)

Remember the named_scopes from back in the days when you were using Rails? Jogger gives you named traversals and is a little bit like named scopes. Jogger groups multiple pacer traversals together and gives them a name. Pacer traversals are like pipes. What are pipes? Pipes are great!!

The most important conceptual difference is that the order in which named traversals are called matters, while it usually doesn’t matter in which order you call named scopes.

A nice way to make common traversals accessible by name.

Does the “order of calling” scopes in topic maps matter? At least for the current TMDM I think not, because scopes are additive. That is, the value covered by a set of scopes must be valid in each scope individually.

