Joins « Another Word For It

Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

August 2, 2018

Visual Guide to Data Joins – Leigh Tami

Filed under: .Net,Data Aggregation,Data Integration,Data Science,Joins — Patrick Durusau @ 7:06 pm

Leigh Tami created a graphic involving a person and a coat to explain data set joins.

Scaling it down won’t do it justice here so see the original.

Preview any data science book with this image in mind. If it doesn’t match or exceed this explanation of joins, pass it by.


November 15, 2013

OhmDB

Filed under: Database,Graphs,Joins,NoSQL,SQL,Traversal — Patrick Durusau @ 7:52 pm

OhmDB

Billed as:

The Irresistible Database for Java Combining Great RDBMS and NoSQL Features.

Supposed to appear by the end of November 2013, so it isn’t clear yet whether SQL and NoSQL are about to be joined by Irresistible as a database category.

The following caught my eye:

Very fast joins with graph-based relations

A single join has O(1) time complexity. A combination of multiple joins is internally processed as graph traversal with smart query optimization.

Without details, “very fast” has too wide a range of meanings to be very useful.

I don’t agree with the evaluation of Performance for RDBMS as “Limited.” People keep saying that as a truism when performance of any data store depends upon the architecture, data model, caching, etc.

I saw a performance test recently that depended upon (hopefully) a misunderstanding of one of the subjects of comparison. No surprise that the misunderstood subject did really poorly in the comparison.

On the other hand, I am looking forward to the release of OhmDB as an early holiday surprise!

PS: I did subscribe to the newsletter on the theory that enough legitimate email might drown out the spam I get.


June 30, 2013

Elasticsearch and Joining

Filed under: ElasticSearch,Indexing,Joins — Patrick Durusau @ 1:51 pm

Elasticsearch and Joining by Felix Hürlimann.

From the post:

With the success of elasticsearch, people, including us, start to explore the possibilities and mightiness of the system. Including border cases for which the underlying core, Lucene, never was originally intended or optimized for. One of the many requests that come up pretty quickly is the whish for joining data across types or indexes, similar to an SQL join clause that combines records from two or more tables in a database. Unfortunately full join support is not (yet?) available out of the box. But there are some possibilities and some attempts to solve parts of issue. This post is about summarizing some of the ideas in this field.

To illustrate the different ideas, let’s work with the following example: we would like to index documents and comments with a one to many relationship between them. Each comment has an author and we would like to answer the question: Give me all documents that match a certain query and a specific author has commented on it.

A variety of options are explored, including some new features of Elasticsearch.
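To make the example question concrete, here is a minimal sketch in plain Python, not Elasticsearch. It assumes one possible modeling choice: comments are stored inside their parent document, so the "join" disappears and the question becomes a single filter. The data, the field names and the `docs_matching` helper are all hypothetical, and I am not claiming this is one of the options from the post.

```python
# Hypothetical data: each document carries its comments inline (denormalized).
documents = [
    {"id": 1, "text": "joins in elasticsearch",
     "comments": [{"author": "John Doe", "text": "nice"}]},
    {"id": 2, "text": "joins in lucene",
     "comments": [{"author": "Jane Roe", "text": "thanks"}]},
    {"id": 3, "text": "faceting",
     "comments": [{"author": "John Doe", "text": "more please"}]},
]


def docs_matching(docs, term, author):
    """Ids of documents whose text contains `term` and that have
    at least one comment written by `author`."""
    return [
        doc["id"]
        for doc in docs
        if term in doc["text"].split()
        and any(c["author"] == author for c in doc["comments"])
    ]


matches = docs_matching(documents, "joins", "John Doe")
```

The trade-off with this shape is that every new comment rewrites the whole parent document, which is exactly the cost that separate comment documents plus a join would avoid.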

Would you model documents with comments as an association?

Would you query on roles when searching for such a comment by a specific author on such a document?


March 12, 2013

Elasticsearch and Joining

Filed under: ElasticSearch,Joins,Lucene — Patrick Durusau @ 2:40 pm

Elasticsearch and Joining by Felix Hürlimann.

From the post:

With the success of elasticsearch, people, including us, start to explore the possibilities and mightiness of the system. Including border cases for which the underlying core, Lucene, never was originally intended or optimized for. One of the many requests that come up pretty quickly is the whish for joining data across types or indexes, similar to an SQL join clause that combines records from two or more tables in a database. Unfortunately full join support is not (yet?) available out of the box. But there are some possibilities and some attempts to solve parts of issue. This post is about summarizing some of the ideas in this field.

To illustrate the different ideas, let’s work with the following example: we would like to index documents and comments with a one to many relationship between them. Each comment has an author and we would like to answer the question: Give me all documents that match a certain query and a specific author has commented on it.

The latest beta release of Elasticsearch is described as:

If you have more complex requirements for join, a new feature introdcued in the latest beta release may can help you. It introduces another feature that allows for a kind of join by looking up filter terms in another index or type. This allows then e.g. for queries like ‘Show me all comments from documents that relate to this document and the author is ‘John Doe’.

The “looking up” in a different index or type sounds quite interesting.

Have you looked at the new beta of Elasticsearch?


July 1, 2012

Cascading map-side joins over HBase for scalable join processing

Filed under: HBase,Joins,Linked Data,LOD,MapReduce,RDF,SPARQL — Patrick Durusau @ 4:45 pm

Cascading map-side joins over HBase for scalable join processing by Martin Przyjaciel-Zablocki, Alexander Schätzle, Thomas Hornung, Christopher Dorner, and Georg Lausen.

Abstract:

One of the major challenges in large-scale data processing with MapReduce is the smart computation of joins. Since Semantic Web datasets published in RDF have increased rapidly over the last few years, scalable join techniques become an important issue for SPARQL query processing as well. In this paper, we introduce the Map-Side Index Nested Loop Join (MAPSIN join) which combines scalable indexing capabilities of NoSQL storage systems like HBase, that suffer from an insufficient distributed processing layer, with MapReduce, which in turn does not provide appropriate storage structures for efficient large-scale join processing. While retaining the flexibility of commonly used reduce-side joins, we leverage the effectiveness of map-side joins without any changes to the underlying framework. We demonstrate the significant benefits of MAPSIN joins for the processing of SPARQL basic graph patterns on large RDF datasets by an evaluation with the LUBM and SP2Bench benchmarks. For most queries, MAPSIN join based query execution outperforms reduce-side join based execution by an order of magnitude.

Some topic map applications include Linked Data/RDF processing capabilities.

The salient comment here being: “For most queries, MAPSIN join based query execution outperforms reduce-side join based execution by an order of magnitude.”
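The general idea of a map-side index nested loop join can be sketched in a few lines of Python. This is only an illustration of the idea in the abstract, not the authors' implementation: a plain dict stands in for the HBase index, a list of slices stands in for MapReduce input splits, and the triples and the `map_side_join` function are hypothetical.

```python
from collections import defaultdict

# Hypothetical RDF-style triples (subject, predicate, object).
triples = [
    ("alice", "worksFor", "acme"),
    ("bob", "worksFor", "initech"),
    ("acme", "locatedIn", "berlin"),
    ("initech", "locatedIn", "austin"),
    ("carol", "worksFor", "globex"),
]

# Stand-in for an indexed store such as an HBase table,
# keyed here by (subject, predicate). Assumption: a dict is enough
# to show the lookup pattern.
index = defaultdict(list)
for s, p, o in triples:
    index[(s, p)].append(o)


def map_side_join(split):
    """Map function for the pattern
    ?person worksFor ?org . ?org locatedIn ?city
    Each worksFor triple probes the index for its join partner,
    so no shuffle or reduce phase is needed."""
    for s, p, o in split:
        if p != "worksFor":
            continue
        for city in index.get((o, "locatedIn"), []):
            yield (s, o, city)


# Two input splits, each processed independently by a "mapper".
splits = [triples[:3], triples[3:]]
results = sorted(row for split in splits for row in map_side_join(split))
```

A reduce-side join would instead ship both sides of the pattern across the network grouped by `?org`; here each mapper only pays for the index lookups it actually needs.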


March 20, 2012

Worst-case Optimal Join Algorithms

Filed under: Algorithms,Database,Joins — Patrick Durusau @ 3:52 pm

Worst-case Optimal Join Algorithms by Hung Q. Ngo, Ely Porat, Christopher Ré, and Atri Rudra.

Abstract:

Efficient join processing is one of the most fundamental and well-studied tasks in database research. In this work, we examine algorithms for natural join queries over many relations and describe a novel algorithm to process these queries optimally in terms of worst-case data complexity. Our result builds on recent work by Atserias, Grohe, and Marx, who gave bounds on the size of a full conjunctive query in terms of the sizes of the individual relations in the body of the query. These bounds, however, are not constructive: they rely on Shearer’s entropy inequality which is information-theoretic. Thus, the previous results leave open the question of whether there exist algorithms whose running time achieve these optimal bounds. An answer to this question may be interesting to database practice, as it is known that any algorithm based on the traditional select-project-join style plans typically employed in an RDBMS are asymptotically slower than the optimal for some queries. We construct an algorithm whose running time is worst-case optimal for all natural join queries. Our result may be of independent interest, as our algorithm also yields a constructive proof of the general fractional cover bound by Atserias, Grohe, and Marx without using Shearer’s inequality. This bound implies two famous inequalities in geometry: the Loomis-Whitney inequality and the Bollobás-Thomason inequality. Hence, our results algorithmically prove these inequalities as well. Finally, we discuss how our algorithm can be used to compute a relaxed notion of joins.

With reference to the optimal join problem the authors say:

Implicitly, this problem has been studied for over three decades: a modern RDBMS use decades of highly tuned algorithms to efficiently produce query results. Nevertheless, as we described above, such systems are asymptotically suboptimal – even in the above simple example of (1). Our main result is an algorithm that achieves asymptotically optimal worst-case running times for all conjunctive join queries.

The authors’ strategy involves evaluating the keys in a join and dividing those keys into separate sets. The information used by the authors has always been present, just not used in join processing. (pp. 2-3 of the article)

There are a myriad of details to be mastered in the article but I suspect this line of thinking may be profitable in many situations where “join” operations are relevant.
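To see why pairwise join plans can lose, here is a small Python experiment on a triangle query over R(A,B), S(B,C), T(A,C). It is emphatically not the algorithm from the paper. The instance built by `make_relation` is my own assumption, chosen so that the first pairwise join blows up, and `triangles` is a simple attribute-at-a-time join using set intersections, shown only for comparison.

```python
from collections import defaultdict


def make_relation(m):
    """Pairs (0, j) and (i, 0) for 0 <= i, j <= m: 2m + 1 tuples."""
    return {(0, j) for j in range(m + 1)} | {(i, 0) for i in range(m + 1)}


def adjacency(rel):
    """Map each first component to the set of its second components."""
    adj = defaultdict(set)
    for x, y in rel:
        adj[x].add(y)
    return adj


def pairwise_intermediate_size(R, S):
    """Number of tuples in R(A,B) join S(B,C), the first step of a
    pairwise plan, before T is consulted at all."""
    s_by_b = adjacency(S)
    return sum(len(s_by_b.get(b, ())) for _, b in R)


def triangles(R, S, T):
    """Attribute-at-a-time join: bind a, then b, then c, intersecting
    candidate sets so that T prunes candidates before they are emitted."""
    r, s, t = adjacency(R), adjacency(S), adjacency(T)
    out = []
    for a in r.keys() & t.keys():
        for b in r[a]:
            if b not in s:
                continue
            for c in s[b] & t[a]:
                out.append((a, b, c))
    return out


m = 100
R = S = T = make_relation(m)
intermediate = pairwise_intermediate_size(R, S)
result = triangles(R, S, T)
```

With m = 100 each relation has 201 tuples, the intermediate result of joining R and S has 10301 tuples, and the final answer has only 301. The intermediate grows quadratically in m while the answer grows linearly, which is the kind of gap the abstract is pointing at.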


March 12, 2012

Joins with MapReduce

Filed under: Joins,MapReduce — Patrick Durusau @ 8:05 pm

Joins with MapReduce by Buddhika Chamith.

From the post:

I have been reading up on Join implementations available for Hadoop for past few days. In this post I recap some techniques I learnt during the process. The joins can be done at both Map side and Join side according to the nature of data sets of to be joined.

Covers examples of different types of joins.

Is there a MapReduce source with a wider range of examples? I am thinking it would be useful to have a fairly full set of examples for joins using MapReduce.
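As a starting point for such a set, here is a minimal sketch of a reduce-side join, simulated in plain Python rather than Hadoop. The `users` and `orders` data and all the function names are hypothetical. The `shuffle` function stands in for the framework's group-by-key step between map and reduce.

```python
from collections import defaultdict

# Hypothetical inputs: users(user_id, name) and orders(order_id, user_id).
users = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(100, 1), (101, 1), (102, 3)]


def map_users(record):
    """Emit the join key with a tag saying which data set the value is from."""
    user_id, name = record
    yield user_id, ("U", name)


def map_orders(record):
    order_id, user_id = record
    yield user_id, ("O", order_id)


def shuffle(pairs):
    """Group mapper output by key, as the framework would before reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(key, values):
    """Separate the tagged values and emit their cross product (inner join)."""
    names = [v for tag, v in values if tag == "U"]
    order_ids = [v for tag, v in values if tag == "O"]
    for name in names:
        for order_id in order_ids:
            yield key, name, order_id


mapped = [kv for r in users for kv in map_users(r)]
mapped += [kv for r in orders for kv in map_orders(r)]
joined = sorted(
    row
    for key, values in shuffle(mapped).items()
    for row in reduce_join(key, values)
)
```

The tagging is what lets a single reducer tell the two inputs apart. User 2 has no orders and so drops out, which makes this an inner join.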


February 2, 2012

Query time joining in Lucene

Filed under: Joins,Lucene — Patrick Durusau @ 3:40 pm

Query time joining in Lucene

From the post:

Recently query time joining has been added to the Lucene join module in the Lucene svn trunk. The query time joining will be included in the Lucene 4.0 release and there is a possibility that it will also be included in Lucene 3.6.

Lets say we have articles and comments. With the query time join you can store these entities as separate documents. Each comment and article can be updates without re-indexing large parts of your index. Even better would be to store articles in an article index and comments in a comment index! In both cases a comment would have a field containing the article identifier.

Joins based upon matching terms in different indexes.

Work is not finished yet so now would be the time to contribute your experiences or opinions.
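For readers who want the shape of a join on matching terms, here is a minimal two-phase sketch in plain Python. It does not use the Lucene API. The two lists stand in for an article index and a comment index, and `query_time_join` and the field names are hypothetical.

```python
# Hypothetical "indexes": lists of dicts standing in for indexed documents.
article_index = [
    {"id": "a1", "title": "query time joining"},
    {"id": "a2", "title": "block joins"},
    {"id": "a3", "title": "faceting"},
]
comment_index = [
    {"article_id": "a1", "author": "john", "text": "great"},
    {"article_id": "a3", "author": "jane", "text": "hm"},
    {"article_id": "a3", "author": "john", "text": "useful"},
]


def query_time_join(from_docs, from_field, from_filter, to_docs, to_field):
    """Phase 1: collect the join terms from the 'from' side documents
    that match the filter. Phase 2: keep the 'to' side documents whose
    join field matches one of those terms."""
    terms = {doc[from_field] for doc in from_docs if from_filter(doc)}
    return [doc for doc in to_docs if doc[to_field] in terms]


# Articles that have at least one comment by "john".
articles = query_time_join(
    comment_index, "article_id",
    lambda comment: comment["author"] == "john",
    article_index, "id",
)
article_ids = [doc["id"] for doc in articles]
```

Because the join happens at query time, a comment can be added or changed without touching the article it belongs to, which is the benefit the post describes.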
