Archive for the ‘Data Structures’ Category

Exotic Functional Data Structures: Hitchhiker Trees

Sunday, September 18th, 2016

Description:

Functional data structures are awesome–they’re the foundation of many functional programming languages, allowing us to express complex logic immutably and efficiently. There is one unfortunate limitation: these data structures must fit on the heap, limiting their lifetime to that of the process. Several years ago, Datomic appeared as the first functional database that addressed these limitations. However, there hasn’t been much activity in the realm of scalable (gigabytes to terabytes) functional data structures.

In this talk, we’ll first review some of the fundamental principles of functional data structures, particularly trees. Next, we’ll review what a B tree is and why it’s better than other trees for storage. Then, we’ll learn about a cool variant of a B tree called a fractal tree, how it can be made functional, and why it has phenomenal performance. Finally, we’ll unify these concepts to understand the Hitchhiker tree, an open-source functionally persistent fractal tree. We’ll also briefly look at an example API for using Hitchhiker trees that allows your application’s state to be stored off-heap, in the spirit of the 2014 paper “Fast Database Restarts at Facebook”.
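
If “functionally persistent” is new to you, the trick underneath every tree in this talk is path copying: an update copies only the nodes along the path it touches and shares everything else with the previous version. A minimal sketch in Python (my illustration of the principle, not the Hitchhiker tree itself):

    class Node:
        """Immutable BST node: insert copies only the root-to-leaf path."""
        def __init__(self, key, left=None, right=None):
            self.key, self.left, self.right = key, left, right

    def insert(node, key):
        if node is None:
            return Node(key)
        if key < node.key:
            return Node(node.key, insert(node.left, key), node.right)
        if key > node.key:
            return Node(node.key, node.left, insert(node.right, key))
        return node  # key already present; reuse the whole subtree

    t1 = insert(insert(insert(None, 2), 1), 3)
    t2 = insert(t1, 0)  # t1 is untouched; t2 shares t1's right subtree

B trees and fractal trees fatten the nodes so that each copied path touches far fewer blocks, which is what makes the same idea practical on disk.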

David Greenberg (profile)

Hitchhiker Trees (GitHub)

Fast Database Restarts at Facebook by Aakash Goel, Bhuwan Chopra, Ciprian Gerea, Dhruv Matani, Josh Metzler, Fahim Ul Haq, and Janet Wiener.

You could have searched for all the information I have included, but isn’t it more convenient to have it “already found?”

International Conference on Learning Representations – Accepted Papers

Monday, February 8th, 2016

International Conference on Learning Representations – Accepted Papers

From the conference overview:

It is well understood that the performance of machine learning methods is heavily dependent on the choice of data representation (or features) on which they are applied. The rapidly developing field of representation learning is concerned with questions surrounding how we can best learn meaningful and useful representations of data. We take a broad view of the field, and include in it topics such as deep learning and feature learning, metric learning, kernel learning, compositional models, non-linear structured prediction, and issues regarding non-convex optimization.

Despite the importance of representation learning to machine learning and to application areas such as vision, speech, audio and NLP, there was no venue for researchers who share a common interest in this topic. The goal of ICLR has been to help fill this void.

That should give you an idea of the range of data representations/features that you will encounter in the eighty (80) papers accepted for the conference.

ICLR 2016 will be held May 2-4, 2016, at the Caribe Hilton, San Juan, Puerto Rico.

Time to review How To Read A Paper!

Enjoy!

I first saw this in a tweet by Hugo Larochelle.

Mutable Data Structures

Sunday, November 15th, 2015

The best illustration of the dangers of mutable data structures I have found to date.

Mutable data structures and algorithms based upon them were necessary when computer memory was far more limited than it is today.*

Why are you still using mutable data structures, and algorithms designed around hardware constraints that no longer exist?

I first saw this in a tweet by the Software Exorcist.

* Granting there are cases, CERN comes to mind, where the memory requirements for some applications exceed available memory. You aren’t working at CERN, are you?

Rollin’ Trees, yo

Sunday, January 11th, 2015

Rollin’ Trees, yo by Clark Feusier.

From the post:

I like trees. All kinds of trees — concrete and abstract. Redwoods, Oaks, search trees, decision trees, fruit trees, DOM trees, Christmas trees, and more.

They are powerful beyond common recognition. Oxygen, life, shelter, food, beauty, computational efficiency, and more are provided by trees when we interact with them in the right ways.

Don’t get offended when I say this:

you don’t like trees enough

Before I can make you feel bad about taking trees for granted, I need you to be very familiar with trees and their uses. Once you understand the tree, you will feel bad for not appreciating it enough. Then, you will start appreciating trees, as well as using them in the situations for which they are perfectly suited. Good.

Oh, I don’t mind using trees for “…situations for which they are perfectly suited.” What I object to is using trees to model texts, which, outside the artificial strictures of markup, are definitely not trees!

I suppose the simplest (and most common) case of non-tree-like behavior for texts is where a sentence crosses a page boundary. If you think of the page as a container, then the markup for the start and end of a sentence “overlaps” the markers for the page boundary.

Another easy example is where a quotation starts in the middle of one sentence and ends at the end of the following sentence. Any markup for the first sentence is going to “overlap” the markup for the start of the quotation.

For all that, this is a good review of trees and worth your time to read. Just don’t allow yourself to be limited to trees when thinking about texts.

I first saw this in a tweet by Anna Pawlicka.

Using Clojure To Generate Java To Reimplement Clojure

Thursday, November 13th, 2014

Using Clojure To Generate Java To Reimplement Clojure by Zach Tellman.

From the post:

Most data structures are designed to hold arbitrary amounts of data. When we talk about their complexity in time and space, we use big O notation, which is only concerned with performance characteristics as n grows arbitrarily large. Understanding how to cast an O(n) problem as O(log n) or even O(1) is certainly valuable, and necessary for much of the work we do at Factual. And yet, most instances of data structures used in non-numerical software are very small. Most lists are tuples of a few entries, and most maps are a few keys representing different facets of related data. These may be elements in a much larger collection, but this still means that the majority of operations we perform are on small instances.

But except in special cases, like 2 or 3-vectors that represent coordinates, it’s rarely practical to specify that a particular tuple or map will always have a certain number of entries. And so our data structures have to straddle both cases, behaving efficiently at all possible sizes. Clojure, however, uses immutable data structures, which means it can do an end run on this problem. Each operation returns a new collection, which means that if we add an element to a small collection, it can return something more suited to hold a large collection.
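
To make the “end run” concrete, here is a hedged Python sketch (my own, not Tellman’s generated code) of an immutable map that keeps a few entries in a flat tuple and promotes itself to a hash-backed representation when it grows:

    class SmallMap:
        """Immutable map backed by a flat tuple of (key, value) pairs.
        Linear scans beat hashing at tiny sizes; assoc returns a new map."""
        PROMOTE_AT = 8

        def __init__(self, pairs=()):
            self.pairs = tuple(pairs)

        def get(self, key, default=None):
            for k, v in self.pairs:
                if k == key:
                    return v
            return default

        def assoc(self, key, value):
            pairs = tuple((k, v) for k, v in self.pairs if k != key) + ((key, value),)
            cls = LargeMap if len(pairs) > self.PROMOTE_AT else SmallMap
            return cls(pairs)  # may return a different concrete type

    class LargeMap(SmallMap):
        """Hash-backed representation for bigger maps; same interface."""
        def __init__(self, pairs=()):
            self.pairs = tuple(pairs)
            self.d = dict(self.pairs)

        def get(self, key, default=None):
            return self.d.get(key, default)

Because every operation returns a new value, the switch of representation is invisible to callers: they only ever see the map interface.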

Tellman describes this problem and his solution in Predictably Fast Clojure. (The URL is to a time mark but I think the entire video is worth your time.)

If that weren’t cool enough, Tellman details the creation of 1000 lines of Clojure that generate 5500 lines of Java so his proposal can be rolled into Clojure.

What other data structures can be different when immutability is a feature?

Elementary Algorithms

Sunday, July 27th, 2014

Elementary Algorithms by Xinyu LIU.

From the github page:

AlgoXY is a free book about elementary algorithms and data structures. This book doesn’t only focus on an imperative (or procedural) approach, but also includes purely functional algorithms and data structures. It doesn’t require readers to master any programming languages, because all the algorithms are described using mathematical functions and pseudocode.

For reference and implementation purposes, source code in C, C++, Haskell, Python, Scheme/Lisp is available in addition to the book.

The contents of the book are provided under GNU FDL and the source code is under GNU GPLv3.

The PDF version can be downloaded from github: https://github.com/liuxinyu95/AlgoXY/blob/algoxy/preview/elementary-algorithms.pdf?raw=true

This book is also available online at: https://sites.google.com/site/algoxy/home

I was concerned when the HTML version for trie was only 2 pages long. You need to view the PDF version, which for trie is some forty (40) pages, to get an idea of the coverage of any particular algorithm.

I first saw this in a tweet by 0xAX.

Finger trees:…

Wednesday, July 23rd, 2014

Finger trees: a simple general-purpose data structure by Ralf Hinze and Ross Paterson.

Abstract:

We introduce 2-3 finger trees, a functional representation of persistent sequences supporting access to the ends in amortized constant time, and concatenation and splitting in time logarithmic in the size of the smaller piece. Representations achieving these bounds have appeared previously, but 2-3 finger trees are much simpler, as are the operations on them. Further, by defining the split operation in a general form, we obtain a general purpose data structure that can serve as a sequence, priority queue, search tree, priority search queue and more.

Before the Hinze and Paterson article you may want to read: 2-3 finger trees in ASCII by Jens Nicolay.

Note 2-3 finger trees go unmentioned in Purely Functional Data Structures by Chris Okasaki.

Other omissions of note?

Data Structures in Clojure:…

Tuesday, February 4th, 2014

Data Structures in Clojure: Singly-Linked Lists by Max Countryman.

From the post:

This series is about the implementation of common data structures. Throughout the series we will be implementing these data structures ourselves, exploring how they work. Our implementations will be done in Clojure. Consequently this tutorial is also about Lisp and Clojure. It is important to note that, unlike the standard collection of data structures found in Clojure, which are persistent, ours will be mutable.

To start with, we will explore a linked list implementation using deftype (more on this later) to define a JVM class from Clojure-land. This implementation will be expanded to include in-place reversal. Finally we will utilize Clojure’s built-in interfaces to give our linked list access to some of the methods Clojure provides.
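
The post works in Clojure with deftype; as a language-neutral crib for the same structure, here is a mutable singly-linked list with in-place reversal, sketched in Python:

    class Node:
        def __init__(self, value, next=None):
            self.value, self.next = value, next

    class LinkedList:
        """Mutable singly-linked list: O(1) push, reversal re-points links in place."""
        def __init__(self):
            self.head = None

        def push(self, value):
            self.head = Node(value, self.head)

        def reverse(self):
            prev, node = None, self.head
            while node:
                node.next, prev, node = prev, node, node.next
            self.head = prev

        def __iter__(self):
            node = self.head
            while node:
                yield node.value
                node = node.next

    lst = LinkedList()
    for x in (1, 2, 3):
        lst.push(x)
    lst.reverse()
    print(list(lst))  # [1, 2, 3]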

If you aren’t going to invent your own computer language, why not learn an existing one better?

The next post is on hash tables.

Enjoy!

Tangle Machines

Wednesday, September 11th, 2013

Daniel Moskovich has started a thread on what he calls: “Tangle Machines.”

The series started with Tangle Machines- Positioning claim.

From that post:

Avishy Carmi and I are in the process of finalizing a preprint on what we call “tangle machines”, which are knot-like objects which store and process information. Topologically, these roughly correspond to embedded rack-coloured networks of 2-spheres connected by line segments. Tangle machines aren’t classical knots, or 2-knots, or knotted handlebodies, or virtual knots, or even w-knots. They’re a new object of study which I would like to market.

Below is my marketing strategy.

My positioning claim is:

  • Tangle machines blaze a trail to information topology.

My three supporting points are:

  • Tangle machines pre-exist in the sense of Plato. If you look at a knot from the perspective of information theory, you are inevitably led to their definition.
  • Tangle machines are interesting mathematical objects with rich algebraic structure which present a plethora of new and interesting questions with information theoretic content.
  • Tangle machines provide a language in which one might model “real-world” classical and quantum interacting processes in a new and useful way.

Next post, I’ll introduce tangle machines. Right now, I’d like to preface the discussion with a content-free pseudo-philosophical rant, which argues that different approaches to knot theory give rise to different `most natural’ objects of study.

You know where Daniel lost me in his sales pitch. 😉

But, it’s Daniel’s story so read on “as though” knots “pre-exist,” even though he later concedes the pre-existing claim cannot be proven or even tested.

Tangle Machines- Part 1 by Daniel Moskovich begins:

In today’s post, I will define tangle machines. In subsequent posts, I’ll realize them topologically and describe how we study them and more about what they mean.

To connect to what we already know, as a rough first approximation, a tangle machine is an algebraic structure obtained from taking a knot diagram coloured by a rack, then building a graph whose vertices correspond to the arcs of the diagram and whose edges correspond to crossings (the overcrossing arc is a single unit- so it “acts on” one undercrossing arc to change its colour and to convert it into another undercrossing arc). Such considerations give rise to a combinatorial diagrammatic-algebraic setup, and tangle machines are what comes from taking this setup seriously. One dream is that this setup is well-suited to modeling mutually interacting processes which satisfy a natural `conservation law’- and to move in a very applied direction of actually identifying tangle machines inside data.

To whet your appetite, below is a pretty figure illustrating a 9_26 knot hiding inside a synthetic collection of phase transitions between anyons (an artificial and unrealistic collection; the hope is to find such things inside real-world data):
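
The rack doing the colouring is a small piece of algebra, concrete enough to check by machine. Here is a Python sketch using the dihedral quandle on Z_5, a standard example of a rack (my choice of example, not the paper’s):

    # Dihedral quandle on Z_5: a <| b = (2b - a) mod 5. A rack is exactly the
    # algebra of arc colours at crossings: the overcrossing arc b "acts on"
    # the undercrossing arc a, turning its colour into a <| b.
    n = 5

    def act(a, b):
        return (2 * b - a) % n

    # Self-distributivity: (a <| b) <| c == (a <| c) <| (b <| c) for all triples.
    assert all(act(act(a, b), c) == act(act(a, c), act(b, c))
               for a in range(n) for b in range(n) for c in range(n))

    # For each b, the map a -> a <| b is a bijection of the colours.
    assert all(len({act(a, b) for a in range(n)}) == n for b in range(n))

Each crossing in a diagram applies exactly one such action, which is what lets a machine “process” colours as information.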

Hard to say if in the long run this “new” data structure will be useful or not.

Do stay tuned for future developments.

Intrinsic vs. Extrinsic Structure

Wednesday, August 14th, 2013

Intrinsic vs. Extrinsic Structure by Jesse Johnson.

From the post:

At this point, I think it will be useful to introduce an idea from geometry that is very helpful in pure mathematics, and that I find helpful for understanding the geometry of data sets. This idea is the difference between the intrinsic structure of an object (such as a data set) and its extrinsic structure. Have you ever gone into a building, walked down a number of different halls and through different rooms, and when you finally got to where you’re going and looked out the window, you realized that you had no idea which direction you were facing, or which side of the building you were actually on? The intrinsic structure of a building has to do with how the rooms, halls and staircases connect up to each other. The extrinsic structure is how these rooms, halls and staircases sit with respect to the outside world. So, while you’re inside the building you may be very aware of the intrinsic structure, but completely lose track of the extrinsic structure.

You can see a similar distinction with subway maps, such as the famous London tube map. This map records how the different tube stops connect to each other, but it distorts how the stops sit within the city. In other words, the coordinates on the tube map do not represent the physical/GPS coordinates of the different stops. But while you’re riding a subway, the physical coordinates of the different stops are much less important than the inter-connectivity of the stations. In other words, the intrinsic structure of the subway is more important (while you’re riding it) than the extrinsic structure. On the other hand, if you were walking through a city, you would be more interested in the extrinsic structure of the city since, for example, that would tell you the distance in miles (or kilometers) between you and your destination.

Data sets also have both intrinsic and extrinsic structure, though there isn’t a sharp line between where the intrinsic structure ends and the extrinsic structure begins. These are more intuitive terms than precise definitions. In the figure below, which shows three two-dimensional data sets, the set on the left has an intrinsic structure very similar to that of the middle data set: Both have two blobs of data points connected by a narrow neck of data points. However, in the data set on the left the narrow neck forms a roughly straight line. In the center, the tube curves around, so that the entire set roughly follows a circle.
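
A tiny numeric illustration of the distinction: sample points along a semicircle and compare the straight-line (extrinsic) distance between its endpoints with the distance walked along the curve (intrinsic):

    import math

    # Eleven points along a semicircle from (1, 0) to (-1, 0).
    points = [(math.cos(i * math.pi / 10), math.sin(i * math.pi / 10))
              for i in range(11)]

    extrinsic = math.dist(points[0], points[-1])         # straight chord: 2.0
    intrinsic = sum(math.dist(points[i], points[i + 1])  # along the curve: ~3.13
                    for i in range(len(points) - 1))
    print(extrinsic, intrinsic)

The two endpoints are extrinsically close but intrinsically far apart, much like points at either end of the curved neck in Jesse’s example.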

I am looking forward to this series of posts from Jesse.

Bearing in mind that the structure of a data set is impacted by collection goals, methods, and other factors.

Matters that are not (usually) represented in the data set per se.

Hybrid Indexes for Repetitive Datasets

Monday, June 24th, 2013

Hybrid Indexes for Repetitive Datasets by H. Ferrada, T. Gagie, T. Hirvola, and S. J. Puglisi.

Abstract:

Advances in DNA sequencing mean databases of thousands of human genomes will soon be commonplace. In this paper we introduce a simple technique for reducing the size of conventional indexes on such highly repetitive texts. Given upper bounds on pattern lengths and edit distances, we preprocess the text with LZ77 to obtain a filtered text, for which we store a conventional index. Later, given a query, we find all matches in the filtered text, then use their positions and the structure of the LZ77 parse to find all matches in the original text. Our experiments show this also significantly reduces query times.
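
The heavy lifting here is the LZ77 parse: each phrase is either a fresh character or a copy of text seen earlier, so highly repetitive collections collapse into very few phrases. A naive quadratic sketch in Python (the paper, of course, uses an efficient parser):

    def lz77_parse(text):
        """Greedy LZ77 factorization: each phrase is the longest match
        starting at an earlier position, or a single literal character."""
        phrases, i = [], 0
        while i < len(text):
            best_len, best_pos = 0, -1
            for j in range(i):  # naive search for the longest earlier match
                length = 0
                while i + length < len(text) and text[j + length] == text[i + length]:
                    length += 1
                if length > best_len:
                    best_len, best_pos = length, j
            if best_len == 0:
                phrases.append((text[i], None))       # literal
                i += 1
            else:
                phrases.append((best_pos, best_len))  # (source position, length)
                i += best_len
        return phrases

    print(lz77_parse("abababab"))  # [('a', None), ('b', None), (0, 6)]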

Need another repetitive data set?

Have you thought about topic maps?

If there is to be any merging in a topic map, there must be multiple topics that represent the same subjects.

This technique may be overkill for topic maps that merge only rarely, but if you had the endless repetition found in linked data versions of Wikipedia data, it would be quite useful.

That might knock down the “Some-Smallish-Number” of triples count and so would be disfavored.

On the other hand, there are other data sets with massive replication (think phone records) where fast querying could be an advantage.

Succinct data structures for representing equivalence classes

Monday, June 24th, 2013

Succinct data structures for representing equivalence classes by Moshe Lewenstein, J. Ian Munro, and Venkatesh Raman.

Abstract:

Given a partition of an n element set into equivalence classes, we consider time-space tradeoffs for representing it to support the query that asks whether two given elements are in the same equivalence class. This has various applications including for testing whether two vertices are in the same component in an undirected graph or in the same strongly connected component in a directed graph.

We consider the problem in several models.

— Concerning labeling schemes where we assign labels to elements and the query is to be answered just by examining the labels of the queried elements (without any extra space): if each vertex is required to have a unique label, then we show that a label space of (\sum_{i=1}^n \lfloor {n \over i} \rfloor) is necessary and sufficient. In other words, \lg n + \lg \lg n + O(1) bits of space are necessary and sufficient for representing each of the labels. This slightly strengthens the known lower bound and is in contrast to the known necessary and sufficient bound of \lceil \lg n \rceil for the label length, if each vertex need not get a unique label.

–Concerning succinct data structures for the problem when the n elements are to be uniquely assigned labels from label set {1, 2, …n}, we first show that \Theta(\sqrt n) bits are necessary and sufficient to represent the equivalence class information. This space includes the space for implicitly encoding the vertex labels. We can support the query in such a structure in O(\lg n) time in the standard word RAM model. We then develop structures resulting in one where the queries can be supported in constant time using O({\sqrt n} \lg n) bits of space. We also develop space efficient structures where union operation along with the equivalence query can be answered fast.
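
For contrast with the succinct bounds above, the everyday pointer-based answer to the equivalence query is union-find, which spends a machine word per element rather than Θ(√n) bits in total. A standard Python sketch:

    class UnionFind:
        """Disjoint sets with path halving and union by size;
        same(x, y) answers the equivalence query in near-constant amortized time."""
        def __init__(self, n):
            self.parent = list(range(n))
            self.size = [1] * n

        def find(self, x):
            while self.parent[x] != x:
                self.parent[x] = self.parent[self.parent[x]]  # path halving
                x = self.parent[x]
            return x

        def union(self, x, y):
            rx, ry = self.find(x), self.find(y)
            if rx == ry:
                return
            if self.size[rx] < self.size[ry]:
                rx, ry = ry, rx
            self.parent[ry] = rx
            self.size[rx] += self.size[ry]

        def same(self, x, y):
            return self.find(x) == self.find(y)

    uf = UnionFind(5)
    uf.union(0, 1)
    uf.union(1, 2)
    print(uf.same(0, 2), uf.same(0, 3))  # True False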

On the down side, this technique would not support merging based on arbitrary choice of properties.

On the up side, this technique does support merging based on pre-determined properties for merging.

The latter being the more common case, I commend this article to you for a close read.

A Trillion Dollar Math Trick

Saturday, June 1st, 2013

A Trillion Dollar Math Trick by Dick Lipton.

Dick reviews a presentation by Mike Stonebraker at the TTI Vanguard meeting on “Ginormous Systems” in DC.

In part:

In Mike’s wonderful talk he made seven points about the past, present, and the future of database technology. He has a great track record, so likely he is mostly right on his guesses. One of his predictions was about a way of re-organizing databases that has several remarkable properties:

  • It speeds up database operations 50x. That is to say, on typical queries—ones that companies actually do—it is fifty times faster than classical database implementations. As a theorist we like speedups, especially asymptotic ones. But 50x is pretty cool. That is enough to change a query from an hour to a minute.
  • It is not a new idea. But the time is finally right, and Mike predicts that future databases will use this method.
  • It is an idea that no one seems to know who invented it. I asked Mike, I asked other experts at the conference, and all shrugged and said effectively: “I have no idea.” Curious.

Let’s look quickly at the way databases work, and then consider the trick.

I won’t spoil the surprise for you; see Dick’s post for the details.

BTW, read the comments on historical uses of the same idea.

Then think about how to apply to topic maps.

I first saw this in Christophe Lalanne’s A bag of tweets / May 2013.

Data Structures in Riak

Thursday, May 2nd, 2013

Data Structures in Riak (NoSQL Matters Cologne 2013) by Sean Cribbs.

From the description:

Since the beginning, Riak has supported high write-availability using Dynamo-style multi-valued keys – also known as conflicts or siblings. The tradeoff for this type of availability is that the application must include logic to resolve conflicting updates. While it is convenient to say that the application can reason best about conflicts, ad hoc resolution is error-prone and can result in surprising anomalies, like the reappearing item problem in Dynamo’s shopping cart.

What is needed is a more formal and general approach to the problem of conflict resolution for complex data structures. Luckily, there are some formal strategies in recent literature, including Conflict-Free Replicated Data Types (CRDTs) and Bloom^L lattices. We’ll review these strategies and cover some recent work we’ve done toward adding automatically-convergent data structures to Riak.
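
The simplest convergent structure is small enough to sketch whole. Here is a state-based grow-only counter (G-counter) in Python, my own minimal rendering of the standard construction rather than Riak’s code:

    class GCounter:
        """State-based grow-only counter CRDT: one slot per replica.
        Merge takes the element-wise max, so all replicas converge."""
        def __init__(self):
            self.counts = {}

        def increment(self, replica_id, n=1):
            self.counts[replica_id] = self.counts.get(replica_id, 0) + n

        def value(self):
            return sum(self.counts.values())

        def merge(self, other):
            merged = GCounter()
            for r in set(self.counts) | set(other.counts):
                merged.counts[r] = max(self.counts.get(r, 0), other.counts.get(r, 0))
            return merged

    a, b = GCounter(), GCounter()
    a.increment("A"); a.increment("A"); b.increment("B")
    print(a.merge(b).value(), b.merge(a).value())  # 3 3 -- same either way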

OK, it’s a slide deck so you will have to supply the prose parts yourself.

There are references to published literature with URLs so it may take a little work, but you will be better off for it. 😉

Strange Loop 2013

Saturday, April 27th, 2013

Strange Loop 2013

Dates:

  • Call for presentation opens: Apr 15th, 2013
  • Call for presentation ends: May 9, 2013
  • Speakers notified by: May 17, 2013
  • Registration opens: May 20, 2013
  • Conference dates: Sept 18-20th, 2013

From the webpage:

Below is some guidance on the kinds of topics we are seeking and have historically accepted.

  • Frequently accepted or desired topics: functional programming, logic programming, dynamic/scripting languages, new or emerging languages, data structures, concurrency, database internals, NoSQL databases, key/value stores, big data, distributed computing, queues, asynchronous or dataflow concurrency, STM, web frameworks, web architecture, performance, virtual machines, mobile frameworks, native apps, security, biologically inspired computing, hardware/software interaction, historical topics.
  • Sometimes accepted (depends on topic): Java, C#, testing frameworks, monads
  • Rarely accepted (nothing wrong with these, but other confs cover them well): Agile, JavaFX, J2EE, Spring, PHP, ASP, Perl, design, layout, entrepreneurship and startups, game programming

It isn’t clear why Strange Loop claims to have “archives:”

2009, 2010, 2011, 2012

As far as I can tell, these are listings with bios of prior presentations, but no substantive content.

Am I missing something?

Saddle

Friday, April 5th, 2013

Saddle

From the webpage:

Saddle is a data manipulation library for Scala that provides array-backed, indexed, one- and two-dimensional data structures that are judiciously specialized on JVM primitives to avoid the overhead of boxing and unboxing.

Saddle offers vectorized numerical calculations, automatic alignment of data along indices, robustness to missing (N/A) values, and facilities for I/O.

Saddle draws inspiration from several sources, among them the R programming language & statistical environment, the numpy and pandas Python libraries, and the Scala collections library.

I have heard some one- and two-dimensional data structures can be quite useful. 😉

Something to play with over the weekend.

Enjoy!

AI Algorithms, Data Structures, and Idioms…

Tuesday, March 19th, 2013

AI Algorithms, Data Structures, and Idioms in Prolog, Lisp and Java by George F. Luger and William A. Stubblefield.

From the introduction:

Writing a book about designing and implementing representations and search algorithms in Prolog, Lisp, and Java presents the authors with a number of exciting opportunities.

The first opportunity is the chance to compare three languages that give very different expression to the many ideas that have shaped the evolution of programming languages as a whole. These core ideas, which also support modern AI technology, include functional programming, list processing, predicate logic, declarative representation, dynamic binding, meta-linguistic abstraction, strong-typing, meta-circular definition, and object-oriented design and programming. Lisp and Prolog are, of course, widely recognized for their contributions to the evolution, theory, and practice of programming language design. Java, the youngest of this trio, is both an example of how the ideas pioneered in these earlier languages have shaped modern applicative programming, as well as a powerful tool for delivering AI applications on personal computers, local networks, and the world wide web.

Where could you go wrong with comparing Prolog, Lisp and Java?

Either for the intellectual exercise or because you want a better understanding of AI, a resource to enjoy!

Introducing Parquet: Efficient Columnar Storage for Apache Hadoop

Thursday, March 14th, 2013

Introducing Parquet: Efficient Columnar Storage for Apache Hadoop by Justin Kestelyn.

From the post:

We’d like to introduce a new columnar storage format for Hadoop called Parquet, which started as a joint project between Twitter and Cloudera engineers.

We created Parquet to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.

Parquet is built from the ground up with complex nested data structures in mind. We adopted the repetition/definition level approach to encoding such data structures, as described in Google’s Dremel paper; we have found this to be a very efficient method of encoding data in non-trivial object schemas.

Parquet is built to support very efficient compression and encoding schemes. Parquet allows compression schemes to be specified on a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented. We separate the concepts of encoding and compression, allowing Parquet consumers to implement operators that work directly on encoded data without paying decompression and decoding penalty when possible.

Parquet is built to be used by anyone. The Hadoop ecosystem is rich with data processing frameworks, and we are not interested in playing favorites. We believe that an efficient, well-implemented columnar storage substrate should be useful to all frameworks without the cost of extensive and difficult to set up dependencies.
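
For a quick feel of columnar access from Python, here is a hedged sketch using the pyarrow bindings, which appeared well after this post (treat the file name and options as illustrative):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # A column of nested lists -- the kind of structure the
    # repetition/definition-level encoding is designed for.
    table = pa.table({
        "user": ["alice", "bob"],
        "logins": [[1, 2, 3], [4]],
    })
    pq.write_table(table, "example.parquet", compression="snappy")

    # The columnar payoff: read one column without touching the others.
    print(pq.read_table("example.parquet", columns=["logins"]))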

Under heavy development so watch closely!

PersistIT [B+ Tree]

Thursday, March 7th, 2013

PersistIT: A fast, transactional, Java B+Tree library

From the webpage:

Akiban PersistIT is a key/value data storage library written in Java™. Key features include:

  • Support for highly concurrent transaction processing with multi-version concurrency control
  • Optimized serialization and deserialization mechanism for Java primitives and objects
  • Multi-segment keys to enable a natural logical key hierarchy
  • Support for long records
  • Implementation of a persistent SortedMap
  • Extensive management capability including command-line and GUI tools

For more information

I mention this primarily because of the multi-segment keys, which I suspect could be useful for type hierarchies.

Possibly other uses as well but that is the first one that came to mind.
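
As a sketch of why segmented keys suit hierarchies, transposed into Python tuples rather than PersistIT’s Java API (the key names are mine, purely illustrative): segment-wise ordering makes every subtree of the key hierarchy a contiguous range, which a single scan can pick out.

    import bisect

    # Multi-segment keys as tuples: (kind, subkind, id) sorts segment by
    # segment, so a logical hierarchy becomes contiguous key ranges.
    keys = sorted([
        ("animal", "cat", 1), ("animal", "cat", 2),
        ("animal", "dog", 1), ("plant", "oak", 1),
    ])
    lo = bisect.bisect_left(keys, ("animal", "cat"))
    hi = bisect.bisect_left(keys, ("animal", "cat", float("inf")))
    print(keys[lo:hi])  # every key under the ("animal", "cat") prefix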

A Simple, Combinatorial Algorithm for Solving…

Tuesday, March 5th, 2013

A Simple, Combinatorial Algorithm for Solving SDD Systems in Nearly-Linear Time by Jonathan A. Kelner, Lorenzo Orecchia, Aaron Sidford, and Zeyuan Allen-Zhu.

Abstract:

In this paper, we present a simple combinatorial algorithm that solves symmetric diagonally dominant (SDD) linear systems in nearly-linear time. It uses very little of the machinery that previously appeared to be necessary for such an algorithm. It does not require recursive preconditioning, spectral sparsification, or even the Chebyshev Method or Conjugate Gradient. After constructing a “nice” spanning tree of a graph associated with the linear system, the entire algorithm consists of the repeated application of a simple (non-recursive) update rule, which it implements using a lightweight data structure. The algorithm is numerically stable and can be implemented without the increased bit-precision required by previous solvers. As such, the algorithm has the fastest known running time under the standard unit-cost RAM model. We hope that the simplicity of the algorithm and the insights yielded by its analysis will be useful in both theory and practice.

In one popular account, the importance of the discovery was described this way:

The real value of the MIT paper, Spielman says, is in its innovative theoretical approach. “My work and the work of the folks at Carnegie Mellon, we’re solving a problem in numeric linear algebra using techniques from the field of numerical linear algebra,” he says. “Jon’s paper is completely ignoring all of those techniques and really solving this problem using ideas from data structures and algorithm design. It’s substituting one whole set of ideas for another set of ideas, and I think that’s going to be a bit of a game-changer for the field. Because people will see there’s this set of ideas out there that might have application no one had ever imagined.”

Thirty-two pages of tough sledding but if the commentaries are correct, this paper may have a major impact on graph processing.

Making Sense of Others’ Data Structures

Sunday, February 3rd, 2013

Making Sense of Others’ Data Structures by Eruditio Loginquitas.

From the post:

Coming in as an outsider to others’ research always requires an investment of time and patience. After all, how others conceptualize their fields, and how they structure their questions and their probes, and how they collect information, and then how they represent their data all reflect their understandings, their theoretical and analytical approaches, their professional training, and their interests. When professionals collaborate, they will approach a confluence of understandings and move together in a semi-united way. Individual researchers—not so much. But either way, for an outsider, there will have to be some adjustment to understand the research and data. Professional researchers strive to control for error and noise at every stage of the research: the hypothesis, literature review, design, execution, publishing, and presentation.

Coming into a project after the data has been collected and stored in Excel spreadsheets means that the learning curve is high in yet another way: data structures. While the spreadsheet itself seems pretty constrained and defined, there is no foregone conclusion that people will necessarily represent their data a particular way.

Data structures as subjects. What a concept! 😉

Data structures, contrary to some, are not self-evident or self-documenting.

Not to mention that data structures, like ourselves, are in a constant state of evolution as our understanding or perception of data changes.

Mine is not the counsel of despair, but of encouragement to consider the costs/benefits of capturing data structure subject identities just as more traditional subjects.

It may be that costs or other constraints prevent such capture, but you may also miss benefits if you don’t ask.

How much did it cost, at each transition in episodic data governance efforts, to re-establish data structure subject identities?

Could be that more money spent now would get an enterprise off the perpetual cycle of data governance.

Schemaless Data Structures

Saturday, January 12th, 2013

Schemaless Data Structures by Martin Fowler.

From the first slide:

In recent years, there’s been an increasing amount of talk about the advantages of schemaless data. Being schemaless is one of the main reasons for interest in NoSQL databases. But there are many subtleties involved in schemalessness, both with respect to databases and in-memory data structures. These subtleties are present both in the meaning of schemaless and in the advantages and disadvantages of using a schemaless approach.

Martin points out that “schemaless” does not mean the lack of a schema but rather the lack of an explicit schema.

Sounds a great deal like the implicit subjects that topic maps have the ability to make explicit.

Is there a continuum of explicitness for any given subject/schema?

Starting from entirely implied, followed by an explicit representation, then further explication as in a data dictionary, and at some distance from the start, a subject defined as a set of properties, which are themselves defined as sets of properties, in relationships with other sets of properties.
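
To make that continuum concrete, here is a small Python sketch of its first two stops: the same record with its schema entirely implicit, then made explicit as a declared type:

    from dataclasses import dataclass

    # Implicit schema: nothing stops a writer from omitting or renaming
    # fields, so the "schema" lives in every piece of code that reads it.
    order = {"id": 42, "total": 9.99, "customer": "alice"}

    # One step toward explicitness: the same subject as a declared type.
    @dataclass
    class Order:
        id: int
        total: float
        customer: str

    explicit = Order(**order)  # raises TypeError if a field is missing or renamed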

How far you go down that road depends on your requirements.

Report on XLDB Tutorial on Data Structures and Algorithms

Tuesday, October 16th, 2012

Report on XLDB Tutorial on Data Structures and Algorithms by Michael Bender.

From the post:

The tutorial was organized as follows:

  • Module 0: Tutorial overview and introductions. We describe an observed (but not necessary) tradeoff in ingestion, querying, and freshness in traditional databases.
  • Module 1: I/O model and cache-oblivious analysis.
  • Module 2: Write-optimized data structures. We give the optimal trade-off between inserts and point queries. We show how to build data structures that lie on this tradeoff curve.
  • Module 2 continued: Write-optimized data structures perform writes much faster than point queries; this asymmetry affects the design of an ACID compliant database.
  • Module 3: Case study – TokuFS. How to design and build a write-optimized file system.
  • Module 4: Page-replacement algorithms. We give relevant theorems on the performance of page-replacement strategies such as LRU.
  • Module 5: Index design, including covering indexes.
  • Module 6: Log-structured merge trees and fractional cascading.
  • Module 7: Bloom filters.

These algorithms and data structures are used both in NoSQL implementations such as MongoDB, HBase and in SQL-oriented implementations such as MySQL and TokuDB.

The slides are available here.

A tutorial offered by Michael and Bradley C. Kuszmaul at the 6th XLDB conference.

If you are committed to defending your current implementation choices against all comers, don’t bother with the slides.

If you want a peek at one future path in data structures, get the slides. You won’t be disappointed.
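
As a taste of Module 7, here is a minimal Bloom filter sketch in Python (illustrative only; real deployments size m and k from a target false-positive rate):

    import hashlib

    class BloomFilter:
        """Minimal Bloom filter: k hash functions over an m-bit array.
        Membership tests may yield false positives, never false negatives."""
        def __init__(self, m=1024, k=3):
            self.m, self.k = m, k
            self.bits = bytearray(m // 8 + 1)

        def _positions(self, item):
            for i in range(self.k):
                digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.m

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    bf = BloomFilter()
    bf.add("neo4j")
    print("neo4j" in bf, "riak" in bf)  # True False (false positives are possible)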

Convergent and Commutative Replicated Data Types [Warning: Heavy Sledding Ahead]

Thursday, October 11th, 2012

A comprehensive study of Convergent and Commutative Replicated Data Types (PDF file) by Marc Shapiro, Nuno M. Preguiça, Carlos Baquero, and Marek Zawirski.

Abstract:

Eventual consistency aims to ensure that replicas of some mutable shared object converge without foreground synchronisation. Previous approaches to eventual consistency are ad-hoc and error-prone. We study a principled approach: to base the design of shared data types on some simple formal conditions that are sufficient to guarantee eventual consistency. We call these types Convergent or Commutative Replicated Data Types (CRDTs). This paper formalises asynchronous object replication, either state based or operation based, and provides a sufficient condition appropriate for each case. It describes several useful CRDTs, including container data types supporting both add and remove operations with clean semantics, and more complex types such as graphs, monotonic DAGs, and sequences. It discusses some properties needed to implement non-trivial CRDTs.

I found this following a link in the readme for riak_dt, which said:

WHAT?

Currently under initial development, riak_dt is a platform for convergent data types. It’s built on riak core and deployed with riak. All of our current work is around supporting fast, replicated, eventually consistent counters (though more data types are in the repo, and on the way.) This work is based on the paper – A Comprehensive study of Convergent and Commutative Replicated Data Types – which you may find an interesting read.

WHY?

Riak’s current model for handling concurrent writes is to store sibling values and present them to the client for resolution on read. The client must encode the logic to merge these into a single, meaningful value, and then inform Riak by doing a further write. Convergent data types remove this burden from the client, as their structure guarantees they will deterministically converge to a single value. The simplest of these data types is a counter.
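
Beyond counters, the paper’s container types follow the same state-merge pattern. Here is a two-phase set (2P-set) sketched in Python from the paper’s description: adds and removes each accumulate in a grow-only set, and merge is plain set union on both, so replicas converge regardless of delivery order.

    class TwoPhaseSet:
        """State-based 2P-set CRDT. A removed element can never be
        re-added, which is what keeps the merge deterministic."""
        def __init__(self):
            self.added = set()
            self.removed = set()

        def add(self, e):
            self.added.add(e)

        def remove(self, e):
            if e in self.added:
                self.removed.add(e)

        def value(self):
            return self.added - self.removed

        def merge(self, other):
            merged = TwoPhaseSet()
            merged.added = self.added | other.added
            merged.removed = self.removed | other.removed
            return merged

The remove-forever restriction is the price of the clean semantics; the paper’s observed-remove set lifts it at some extra cost.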

I haven’t thought of merging of subject representatives as a quest for “consistency” but that is one way to think about it.

The paper is forty-seven pages long and has forty-four references, most of which I suspect are necessary to fully appreciate the work.

Having said that, I suspect it will be well worth the effort.

Advanced Data Structures [Jeff Erickson, UIUC]

Tuesday, October 9th, 2012

Advanced Data Structures

From the description:

This course will survey important developments in data structures that have not (yet) worked their way into the standard computer science curriculum. The precise topics will depend on the interests and background of the course participants; see the current schedule for details. Potential topics include self-adjusting binary search trees, dynamic trees and graphs, persistent data structures, geometric data structures, kinetic data structures, I/O-efficient and cache-oblivious data structures, hash tables and Bloom filters, data structures that beat information-theoretic lower bounds, and applications of these data structures in computational geometry, combinatorial optimization, systems and networking, databases, and other areas of computer science.

The course page has links to similar courses.

For hardy souls exploring data structures per se or for specialized topic maps, an annotated bibliography of readings.

If you haven’t seen it, visit Jeff’s homepage.

To see what Jeff has been up to lately: DBLP: Jeff Erickson.

Memobot

Friday, October 5th, 2012

Memobot

From the webpage:

Memobot is a data structure server written in clojure. It speaks Redis protocol, so any standard redis client can work with it.

For those interested in data structures, Clojure, or both.
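
Since it speaks the Redis protocol, any stock client should exercise it. A hedged sketch with the redis-py client (host, port, and command coverage are my assumptions; check Memobot’s README for what it actually serves):

    import redis  # pip install redis

    r = redis.Redis(host="localhost", port=6379)  # assumed defaults
    r.set("lang", "clojure")
    print(r.get("lang"))  # b'clojure'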

C Linked List Data Structure Explained ….. [Neo4j internals]

Friday, August 31st, 2012

C Linked List Data Structure Explained with an Example C Program by Himanshu Arora.

From the post:

Linked list is one of the fundamental data structures in C.

Knowledge of linked lists is a must for C programmers. This article explains the fundamentals of the C linked list with an example C program.

Linked list is a dynamic data structure whose length can be increased or decreased at run time.

How are linked lists different from arrays? Consider the following points:

  • An array is a static data structure: its length cannot be altered at run time. A linked list, on the other hand, is a dynamic data structure.
  • In an array, all the elements are kept at consecutive memory locations, while in a linked list the elements (or nodes) may be kept at any location but are still connected to each other.

When to prefer linked lists over arrays? Linked lists are preferred mostly when you don’t know the volume of data to be stored. For example, in an employee management system, one cannot use arrays as they are of fixed length while any number of new employees can join. In scenarios like these, linked lists (or other dynamic data structures) are used, as their capacity can be increased (or decreased) at run time (as and when required).

My Neo4J Internals (update) post pointed you to resources on Neo4j’s use of linked lists.

You may find this explanation of linked list data structures in C helpful.

Be aware that Knuth (TAOCP, vol. 1, page 233, 3rd ed.) suggests “nodes” may also be referenced as records, entities, beads, items or elements. (“Item,” and “element” being variants found in TAOCP). I am sure there are others.

Question: Assume you have discovered an example of a variant name for “node.” (One of these or another one.) How would you use that discovery in formulating a search?

A Provably Correct Scalable Concurrent Skip List

Thursday, August 16th, 2012

From High Scalability, a report on the paper: A Provably Correct Scalable Concurrent Skip List.

From the post:

In MemSQL Architecture we learned one of the core strategies MemSQL uses to achieve their need for speed is lock-free skip lists. Skip lists are used to efficiently handle range queries. Making the skip-lists lock-free helps eliminate contention and make writes fast.

If this all sounds a little pie-in-the-sky then here’s a very good paper on the subject that might help make it clearer: A Provably Correct Scalable Concurrent Skip List.

The cited paper is by Maurice Herlihy, Yossi Lev, Victor Luchangco, and Nir Shavit. The authors shared Sun Microsystems as an employer so you know the paper is dated.

For more background on lock-free data structures, including Keir Fraser’s “Practical lock freedom” dissertation, see: Practical lock-free data structures.
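
If the underlying structure is the unfamiliar part, here is a plain single-threaded skip list sketch in Python. Making insertions like this safe under concurrent, lock-free updates is the paper’s contribution, and exactly what this sketch does not attempt:

    import random

    class Node:
        def __init__(self, key, level):
            self.key = key
            self.forward = [None] * level  # one successor pointer per level

    class SkipList:
        """Each key appears in a random number of levels, so searches
        skip ahead and run in O(log n) expected time."""
        MAX_LEVEL = 16

        def __init__(self):
            self.head = Node(None, self.MAX_LEVEL)
            self.level = 1

        def _random_level(self):
            level = 1
            while random.random() < 0.5 and level < self.MAX_LEVEL:
                level += 1
            return level

        def insert(self, key):
            update = [self.head] * self.MAX_LEVEL
            node = self.head
            for i in range(self.level - 1, -1, -1):
                while node.forward[i] and node.forward[i].key < key:
                    node = node.forward[i]
                update[i] = node
            level = self._random_level()
            self.level = max(self.level, level)
            new = Node(key, level)
            for i in range(level):
                new.forward[i] = update[i].forward[i]
                update[i].forward[i] = new

        def contains(self, key):
            node = self.head
            for i in range(self.level - 1, -1, -1):
                while node.forward[i] and node.forward[i].key < key:
                    node = node.forward[i]
            node = node.forward[0]
            return node is not None and node.key == key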

Binary Search Tree

Friday, June 29th, 2012

Binary Search Tree by Stoimen Popov.

Nothing new but clearly explained and well illustrated, two qualities that make this post worth mentioning.

To say nothing of the related posts at the bottom of this one that cover related material in an equally effective manner.

BTW, if you do use these illustrations in slides or teaching, give credit where credit is due. It will encourage others to contribute as well.

Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform

Wednesday, May 2nd, 2012

Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform by Anthony J. Cox, Markus J. Bauer, Tobias Jakobi, and Giovanna Rosone.

Abstract:

Motivation

The Burrows-Wheeler transform (BWT) is the foundation of many algorithms for compression and indexing of text data, but the cost of computing the BWT of very large string collections has prevented these techniques from being widely applied to the large sets of sequences often encountered as the outcome of DNA sequencing experiments. In previous work, we presented a novel algorithm that allows the BWT of human genome scale data to be computed on very moderate hardware, thus enabling us to investigate the BWT as a tool for the compression of such datasets.

Results

We first used simulated reads to explore the relationship between the level of compression and the error rate, the length of the reads and the level of sampling of the underlying genome and compare choices of second-stage compression algorithm.

We demonstrate that compression may be greatly improved by a particular reordering of the sequences in the collection and give a novel `implicit sorting’ strategy that enables these benefits to be realised without the overhead of sorting the reads. With these techniques, a 45x coverage of real human genome sequence data compresses losslessly to under 0.5 bits per base, allowing the 135.3Gbp of sequence to fit into only 8.2Gbytes of space (trimming a small proportion of low-quality bases from the reads improves the compression still further).

This is more than 4 times smaller than the size achieved by a standard BWT-based compressor (bzip2) on the untrimmed reads, but an important further advantage of our approach is that it facilitates the building of compressed full text indexes such as the FM-index on large-scale DNA sequence collections.
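
If the BWT itself is new to you, the transform fits in a few lines; the hard part, which this paper addresses, is computing it at genome scale without materializing every rotation. A naive Python sketch:

    def bwt(text, sentinel="$"):
        """Naive Burrows-Wheeler transform: sort all rotations of
        text + sentinel and keep the last column. O(n^2 log n) --
        real tools build it via suffix arrays or the paper's method."""
        s = text + sentinel
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(rotation[-1] for rotation in rotations)

    print(bwt("banana"))  # annb$aa -- equal characters cluster, aiding compression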

Important work for several reasons.

First, if the human genome is thought of as “big data,” it opens the possibility that compressed full text indexes can be build for other instances of “big data.”

Second, indexing is similar to topic mapping in the sense that pointers to information about a particular subject are gathered to a common location. Indexes often account for synonyms (see also) and distinguish the use of the same word for different subjects (polysemy).

Third, depending on the granularity of tokenizing and indexing, index entries should be capable of recombination to create new index entries.

Source code for this approach:

Code to construct the BWT and SAP-array on large genomic data sets is part of the BEETL library, available as a github repository at git@github.com:BEETL/BEETL.git.

Comments?