Archive for the ‘Data Structures’ Category

Data Structures in Riak

Thursday, May 2nd, 2013

Data Structures in Riak (NoSQL Matters Cologne 2013) by Sean Cribbs.

From the description:

Since the beginning, Riak has supported high write-availability using Dynamo-style multi-valued keys – also known as conflicts or siblings. The tradeoff for this type of availability is that the application must include logic to resolve conflicting updates. While it is convenient to say that the application can reason best about conflicts, ad hoc resolution is error-prone and can result in surprising anomalies, like the reappearing item problem in Dynamo’s shopping cart.

What is needed is a more formal and general approach to the problem of conflict resolution for complex data structures. Luckily, there are some formal strategies in recent literature, including Conflict-Free Replicated Data Types (CRDTs) and BloomL lattices. We’ll review these strategies and cover some recent work we’ve done toward adding automatically-convergent data structures to Riak.

OK, it’s a slide deck so you will have to supply the prose parts yourself.

There are references to published literature with URLs so it may take a little work, but you will be better off for it. ;-)

Strange Loop 2013

Saturday, April 27th, 2013

Strange Loop 2013

Dates:

  • Call for presentation opens: Apr 15th, 2013
  • Call for presentation ends: May 9, 2013
  • Speakers notified by: May 17, 2013
  • Registration opens: May 20, 2013
  • Conference dates: Sept 18-20th, 2013

From the webpage:

Below is some guidance on the kinds of topics we are seeking and have historically accepted.

  • Frequently accepted or desired topics: functional programming, logic programming, dynamic/scripting languages, new or emerging languages, data structures, concurrency, database internals, NoSQL databases, key/value stores, big data, distributed computing, queues, asynchronous or dataflow concurrency, STM, web frameworks, web architecture, performance, virtual machines, mobile frameworks, native apps, security, biologically inspired computing, hardware/software interaction, historical topics.
  • Sometimes accepted (depends on topic): Java, C#, testing frameworks, monads
  • Rarely accepted (nothing wrong with these, but other confs cover them well): Agile, JavaFX, J2EE, Spring, PHP, ASP, Perl, design, layout, entrepreneurship and startups, game programming

It isn’t clear why Strange Loop claims to have “archives:”

2009201020112012

As far as I can tell, these are listings with bios of prior presentations, but no substantive content.

Am I missing something?

Saddle

Friday, April 5th, 2013

Saddle

From the webpage:

Saddle is a data manipulation library for Scala that provides array-backed, indexed, one- and two-dimensional data structures that are judiciously specialized on JVM primitives to avoid the overhead of boxing and unboxing.

Saddle offers vectorized numerical calculations, automatic alignment of data along indices, robustness to missing (N/A) values, and facilities for I/O.

Saddle draws inspiration from several sources, among them the R programming language & statistical environment, the numpy and pandas Python libraries, and the Scala collections library.

I have heard some one and two dimensional data structures can be quite useful. ;-)

Something to play with over the weekend.

Enjoy!

AI Algorithms, Data Structures, and Idioms…

Tuesday, March 19th, 2013

AI Algorithms, Data Structures, and Idioms in Prolog, Lisp and Java by George F. Luger and William A. Stubblefield.

From the introduction:

Writing a book about designing and implementing representations and search algorithms in Prolog, Lisp, and Java presents the authors with a number of exciting opportunities.

The first opportunity is the chance to compare three languages that give very different expression to the many ideas that have shaped the evolution of programming languages as a whole. These core ideas, which also support modern AI technology, include functional programming, list processing, predicate logic, declarative representation, dynamic binding, meta-linguistic abstraction, strong-typing, meta-circular definition, and object-oriented design and programming. Lisp and Prolog are, of course, widely recognized for their contributions to the evolution, theory, and practice of programming language design. Java, the youngest of this trio, is both an example of how the ideas pioneered in these earlier languages have shaped modern applicative programming, as well as a powerful tool for delivering AI applications on personal computers, local networks, and the world wide web.

Where could you go wrong with comparing Prolog, Lisp and Java?

Either for the intellectual exercise or because you want a better understanding of AI, a resource to enjoy!

Introducing Parquet: Efficient Columnar Storage for Apache Hadoop

Thursday, March 14th, 2013

Introducing Parquet: Efficient Columnar Storage for Apache Hadoop by Justin Kestelyn.

From the post:

We’d like to introduce a new columnar storage format for Hadoop called Parquet, which started as a joint project between Twitter and Cloudera engineers.

We created Parquet to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.

Parquet is built from the ground up with complex nested data structures in mind. We adopted the repetition/definition level approach to encoding such data structures, as described in Google’s Dremel paper; we have found this to be a very efficient method of encoding data in non-trivial object schemas.

Parquet is built to support very efficient compression and encoding schemes. Parquet allows compression schemes to be specified on a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented. We separate the concepts of encoding and compression, allowing Parquet consumers to implement operators that work directly on encoded data without paying decompression and decoding penalty when possible.

Parquet is built to be used by anyone. The Hadoop ecosystem is rich with data processing frameworks, and we are not interested in playing favorites. We believe that an efficient, well-implemented columnar storage substrate should be useful to all frameworks without the cost of extensive and difficult to set up dependencies.

Under heavy development so watch closely!

PersistIT [B+ Tree]

Thursday, March 7th, 2013

PersistIT: A fast, transactional, Java B+Tree library

From the webpage:

Akiban PersistIT is a key/value data storage library written in Java™. Key features include:

  • Support for highly concurrent transaction processing with multi-version concurrency control
  • Optimized serialization and deserialization mechanism for Java primitives and objects
  • Multi-segment keys to enable a natural logical key hierarchy
  • Support for long records
  • Implementation of a persistent SortedMap
  • Extensive management capability including command-line and GUI tools

For more information

I mention this primarily because of the multi-segment keys, which I suspect could be useful for type hierarchies.

Possibly other uses as well but that is the first one that came to mind.

A Simple, Combinatorial Algorithm for Solving…

Tuesday, March 5th, 2013

A Simple, Combinatorial Algorithm for Solving SDD Systems in Nearly-Linear Time by Jonathan A. Kelner, Lorenzo Orecchia, Aaron Sidford, Zeyuan Allen Zhu.

Abstract:

In this paper, we present a simple combinatorial algorithm that solves symmetric diagonally dominant (SDD) linear systems in nearly-linear time. It uses very little of the machinery that previously appeared to be necessary for a such an algorithm. It does not require recursive preconditioning, spectral sparsification, or even the Chebyshev Method or Conjugate Gradient. After constructing a “nice” spanning tree of a graph associated with the linear system, the entire algorithm consists of the repeated application of a simple (non-recursive) update rule, which it implements using a lightweight data structure. The algorithm is numerically stable and can be implemented without the increased bit-precision required by previous solvers. As such, the algorithm has the fastest known running time under the standard unit-cost RAM model. We hope that the simplicity of the algorithm and the insights yielded by its analysis will be useful in both theory and practice.

In one popular account, the importance of the discovery was described this way:

The real value of the MIT paper, Spielman says, is in its innovative theoretical approach. “My work and the work of the folks at Carnegie Mellon, we’re solving a problem in numeric linear algebra using techniques from the field of numerical linear algebra,” he says. “Jon’s paper is completely ignoring all of those techniques and really solving this problem using ideas from data structures and algorithm design. It’s substituting one whole set of ideas for another set of ideas, and I think that’s going to be a bit of a game-changer for the field. Because people will see there’s this set of ideas out there that might have application no one had ever imagined.”

Thirty-two pages of tough sledding but if the commentaries are correct, this paper may have a major impact on graph processing.

Making Sense of Others’ Data Structures

Sunday, February 3rd, 2013

Making Sense of Others’ Data Structures by Eruditio Loginquitas.

From the post:

Coming in as an outsider to others’ research always requires an investment of time and patience. After all, how others conceptualize their fields, and how they structure their questions and their probes, and how they collect information, and then how they represent their data all reflect their understandings, their theoretical and analytical approaches, their professional training, and their interests. When professionals collaborate, they will approach a confluence of understandings and move together in a semi-united way. Individual researchers—not so much. But either way, for an outsider, there will have to be some adjustment to understand the research and data. Professional researchers strive to control for error and noise at every stage of the research: the hypothesis, literature review, design, execution, publishing, and presentation.

Coming into a project after the data has been collected and stored in Excel spreadsheets means that the learning curve is high in yet another way: data structures. While the spreadsheet itself seems pretty constrained and defined, there is no foregone conclusion that people will necessarily represent their data a particular way.

Data structures as subjects. What a concept! ;-)

Data structures, contrary to some, are not self-evident or self-documenting.

Not to mention that like ourselves, are in a constant state of evolution as our understanding or perception of data changes.

Mine is not the counsel of despair, but of encouragement to consider the costs/benefits of capturing data structure subject identities just as more traditional subjects.

It may be costs or other constraints prevent such capture but you may also miss benefits if you don’t ask.

How much did it cost for each transition in episodic data governance efforts to re-establish data structure subject identities?

Could be that more money spent now would get an enterprise off the perpetual cycle of data governance.

Schemaless Data Structures

Saturday, January 12th, 2013

Schemaless Data Structures by Martin Fowler.

From the first slide:

In recent years, there’s been an increasing amount of talk about the advantages of schemaless data. Being schemaless is one of the main reasons for interest in NoSQL databases. But there are many subtleties involved in schemalessness, both with respect to databases and in-memory data structures. These subtleties are present both in the meaning of schemaless and in the advantages and disadvantages of using a schemaless approach.

Martin points out that “schemaless” does not mean the lack of a schema but rather the lack of an explicit schema.

Sounds a great deal like the implicit subjects that topic maps have the ability to make explicit.

Is there a continuum of explicitness for any given subject/schema?

Starting from entirely implied, followed by an explicit representation, then further explication as in a data dictionary, and at some distance from the start, a subject defined as a set of properties, which are themselves defined as sets of properties, in relationships with other sets of properties.

How far you go down that road depends on your requirements.

Report on XLDB Tutorial on Data Structures and Algorithms

Tuesday, October 16th, 2012

Report on XLDB Tutorial on Data Structures and Algorithms by Michael Bender.

From the post:

The tutorial was organized as follows:

  • Module 0: Tutorial overview and introductions. We describe an observed (but not necessary) tradeoff in ingestion, querying, and freshness in traditional database.
  • Module 1: I/O model and cache-oblivious analysis.
  • Module 2: Write-optimized data structures. We give the optimal trade-off between inserts and point queries. We show how to build data structures that lie on this tradeoff curve.
  • Module 2 continued: Write-optimized data structures perform writes much faster than point queries; this asymmetry affects the design of an ACID compliant database.
  • Module 3: Case study – TokuFS. How to design and build a write-optimized file systems.
  • Module 4: Page-replacement algorithms. We give relevant theorems on the performance of page-replacement strategies such as LRU.
  • Module 5: Index design, including covering indexes.
  • Module 6: Log-structured merge trees and fractional cascading.
  • Module 7: Bloom filters.

These algorithms and data structures are used both in NoSQL implementations such as MongoDB, HBase and in SQL-oriented implementations such as MySQL and TokuDB.

The slides are available here.

A tutorial offered by Michael and Bradley C. Kuszmaul at the 6th XLDB conference.

If you are committed to defending your current implementation choices against all comers, don’t bother with the slides.

If you want a peek at one future path in data structures, get the slides. You won’t be disappointed.

Convergent and Commutative Replicated Data Types [Warning: Heavy Sledding Ahead]

Thursday, October 11th, 2012

A comprehensive study of Convergent and Commutative Replicated Data Types (PDF file) Marc Shapiro, Nuno M. Preguiça, Carlos Baquero, Marek Zawirski.

Abstract:

Eventual consistency aims to ensure that replicas of some mutable shared object converge without foreground synchronisation. Previous approaches to eventual consistency are ad-hoc and error-prone. We study a principled approach: to base the design of shared data types on some simple formal conditions that are sufficient to guarantee eventual consistency. We call these types Convergent or Commutative Replicated Data Types (CRDTs). This paper formalises asynchronous object replication, either state based or operation based, and provides a sufficient condition appropriate for each case. It describes several useful CRDTs, including container data types supporting both add and remove operations with clean semantics, and more complex types such as graphs, montonic DAGs, and sequences. It discusses some properties needed to implement non-trivial CRDTs.

I found this following a link in the readme for riak dt which said:

WHAT?

Currently under initial development, riak_dt is a platform for convergent data types. It’s built on riak core and deployed with riak. All of our current work is around supporting fast, replicated, eventually consistent counters (though more data types are in the repo, and on the way.) This work is based on the paper – A Comprehensive study of Convergent and Commutative Replicated Data Types – which you may find an interesting read.

WHY?

Riak’s current model for handling concurrent writes is to store sibling values and present them to the client for resolution on read. The client must encode the logic to merge these into a single, meaningful value, and then inform Riak by doing a further write. Convergent data types remove this burden from the client, as their structure guarantees they will deterministically converge to a single value. The simplest of these data types is a counter.

I haven’t thought of merging of subject representatives as a quest for “consistency” but that is one way to think about it.

The paper is forty-seven pages long and has forty-four references, most of which I suspect are necessary to fully appreciate the work.

Having said that, I suspect it will be well worth the effort.

Advanced Data Structures [Jeff Erickson, UIUC]

Tuesday, October 9th, 2012

Advanced Data Structures

From the description:

This course will survey important developments in data structures that have not (yet) worked their way into the standard computer science curriculum. The precise topics will depend on the interests and background of the course participants; see the current schedule for details. Potential topics include include self-adjusting binary search trees, dynamic trees and graphs, persistent data structures, geometric data structures, kinetic data structures, I/O-efficient and cache-oblivious data structures, hash tables and Bloom filters, data structures that beat information-theoretic lower bounds, and applications of these data structures in computational geometry, combinatorial optimization, systems and networking, databases, and other areas of computer science.

The course page has links to similar courses.

For hardy souls exploring data structures per se or for specialized topic maps, an annotated bibliography of readings.

If you haven’t seen it, visit Jeff’s homepage.

To see what Jeff has been up to lately: DBLP: Jeff Erickson.

Memobot

Friday, October 5th, 2012

Memobot

From the webpage:

Memobot is a data structure server written in clojure. It speaks Redis protocol, so any standard redis client can work with it.

For interests in data structures, Clojure or both.

C Linked List Data Structure Explained ….. [Neo4j internals]

Friday, August 31st, 2012

C Linked List Data Structure Explained with an Example C Program by Himanshu Arora.

From the post:

Linked list is one of the fundamental data structures in C.

Knowledge of linked lists is must for C programmers. This article explains the fundamentals of C linked list with an example C program.

Linked list is a dynamic data structure whose length can be increased or decreased at run time.

How Linked lists are different from arrays? Consider the following points :

  • An array is a static data structure. This means the length of array cannot be altered at run time. While, a linked list is a dynamic data structure.
  • In an array, all the elements are kept at consecutive memory locations while in a linked list the elements (or nodes) may be kept at any location but still connected to each other.

When to prefer linked lists over arrays? Linked lists are preferred mostly when you don’t know the volume of data to be stored. For example, In an employee management system, one cannot use arrays as they are of fixed length while any number of new employees can join. In scenarios like these, linked lists (or other dynamic data structures) are used as their capacity can be increased (or decreased) at run time (as an when required).

My Neo4J Internals (update) post pointed you to resources on Neo4j’s use of linked lists.

You may find this explanation of linked list data structures in C helpful.

Be aware that Knuth (TACP, vol. 1, page 233, 3rd ed.) suggests “nodes” may also be referenced as records, entities, beads, items or elements. (“Item,” and “element” being variants found in TACP). I am sure there are others.

Question: Assume you have discovered an example of a variant name for “node.” (One of these or another one.) How would you use that discovery in formulating a search?

A Provably Correct Scalable Concurrent Skip List

Thursday, August 16th, 2012

From High Scalability, report of Paper: A Provably Correct Scalable Concurrent Skip List.

From the post:

In MemSQL Architecture we learned one of the core strategies MemSQL uses to achieve their need for speed is lock-free skip lists. Skip lists are used to efficiently handle range queries. Making the skip-lists lock-free helps eliminate contention and make writes fast.

If this all sounds a little pie-in-the-sky then here’s a very good paper on the subject that might help make it clearer: A Provably Correct Scalable Concurrent Skip List.

The cited paper is by Maurice Herlihy, Yossi Lev, Victor Luchangco, and Nir Shavit. The authors shared Sun Microsystems as an employer so you know the paper is dated.

For more background on lock-free data structures, including Keir Fraser’s “Practical lock freedom” dissertation, see: Practical lock-free data structures.

Binary Search Tree

Friday, June 29th, 2012

Binary Search Tree by Stoimen Popov.

Nothing new but clearly explained and well illustrated, two qualities that make this post merit mentioning.

To say nothing of the related posts at the bottom of this one that cover related material in an equally effective manner.

BTW, if you do use these illustrations in slides or teaching, give credit where credit is due. It will encourage others to contribute as well.

Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform

Wednesday, May 2nd, 2012

Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform by Anthony J. Cox, Markus J. Bauer, Tobias Jakobi, and Giovanna Rosone.

Abstract:

Motivation

The Burrows-Wheeler transform (BWT) is the foundation of many algorithms for compression and indexing of text data, but the cost of computing the BWT of very large string collections has prevented these techniques from being widely applied to the large sets of sequences often encountered as the outcome of DNA sequencing experiments. In previous work, we presented a novel algorithm that allows the BWT of human genome scale data to be computed on very moderate hardware, thus enabling us to investigate the BWT as a tool for the compression of such datasets.

Results

We first used simulated reads to explore the relationship between the level of compression and the error rate, the length of the reads and the level of sampling of the underlying genome and compare choices of second-stage compression algorithm.

We demonstrate that compression may be greatly improved by a particular reordering of the sequences in the collection and give a novel `implicit sorting’ strategy that enables these benefits to be realised without the overhead of sorting the reads. With these techniques, a 45x coverage of real human genome sequence data compresses losslessly to under 0.5 bits per base, allowing the 135.3Gbp of sequence to fit into only 8.2Gbytes of space (trimming a small proportion of low-quality bases from the reads improves the compression still further).

This is more than 4 times smaller than the size achieved by a standard BWT-based compressor (bzip2) on the untrimmed reads, but an important further advantage of our approach is that it facilitates the building of compressed full text indexes such as the FM-index on large-scale DNA sequence collections.

Important work for several reasons.

First, if the human genome is thought of as “big data,” it opens the possibility that compressed full text indexes can be build for other instances of “big data.”

Second, indexing is similar to topic mapping in the sense that pointers to information about a particular subject are gathered to a common location. Indexes often account for synonyms (see also) and distinguish the use of the same word for different subjects (polysemy).

Third, depending on the granularity of tokenizing and indexing, index entries should be capable of recombination to create new index entries.

Source code for this approach:

Code to construct the BWT and SAP-array on large genomic data sets is part of the BEETL library, available as a github respository at git@github.com:BEETL/BEETL.git.

Comments?

Big Data Reference Model (And Panopticons)

Tuesday, April 10th, 2012

Big Data Reference Model

Michael Nygard writes:

A project that approaches Big Data as a purely technical challenge will not deliver results. It is about more than just massive Hadoop clusters and number-crunching. In order to deliver value, a Big Data project has to enable change and adaptation. This requires that there are known problems to be solved. Yet, identifying the problem can be the hardest part. It’s often the case that you have to collect some information to even discover what problem to solve. Deciding how to solve that problem creates a need for more information and analysis. This is an empirical discovery loop similar to that found in any research project or Six Sigma initiative.

Michael takes you on a sensible loop of discover and evaluation, making you more likely (no guarantees) to succeed with your next “big data” project. In particular see the following caution:

… it is tempting to think that we could build a complete panopticon: a universal data warehouse with everything in the company. This is an expensive endeavor, and not a historically successful path. Whether structured or unstructured, any data store is suited to answer some questions but not others. No matter how much you invest in building the panopticon, there will be dimensions you don’t think to support. It is better to skip the massive up-front time and expense, focusing instead on making it very fast and easy to add new data sources or new elements to existing sources.

I like the term panopticon. In part because if its historical association with prisons.

Data warehouses/structures are prisons and suited better for one purpose (or group of purposes) than another.

We must build prisons for today and leave tomorrow’s prisons for tomorrow.

The problem that topic maps trys to address is how to safely transfer prisoners from today’s prisons to tomorrows? Which is made more complicated by some people still using old prisons, sometimes generations of prisons older than most people. Not to mention the variety of prisons across businesses, governments, nationalities.

All of them have legitimate purposes and serve some purpose now, else their users would have migrated their prisoners to a new prison.

I will have to think about the prison metaphor. I think it works fairly well.

Comments?

Trees in the Database: Advanced Data Structures

Monday, March 5th, 2012

Trees in the Database: Advanced Data Structures

Lorenzo Alberton writes:

Despite the NoSQL movement trying to flag traditional databases as a dying breed, the RDBMS keeps evolving and adding new powerful weapons to its arsenal. In this talk we’ll explore Common Table Expressions (SQL-99) and how SQL handles recursion, breaking the bi-dimensional barriers and paving the way to more complex data structures like trees and graphs, and how we can replicate features from social networks and recommendation systems. We’ll also have a look at window functions (SQL:2003) and the advanced reporting features they make finally possible. The first part of this talk will cover several different techniques to model a tree data structure into a relational database: parent-child (adjacency list) model, materialized path, nested sets, nested intervals, hybrid models, Common Table Expressions. Then we’ll move one step forward and see how we can model a more complex data structure, i.e. a graph, with concrete examples from today’s websites. Starting from real-world examples of social networks’ and recommendation systems’ features, and with the help of some graph theory, this talk will explain how to represent and traverse a graph in the database. Finally, we will take a look at Window Functions and how they can be useful for data analytics and simple inline aggregations, among other things. All the examples have been tested on PostgreSQL >= 8.4.

Very impressive presentation!

Definitely makes me want to dust off my SQL installations and manuals for a closer look!

Combinatorial Algorithms and Data Structures

Sunday, February 19th, 2012

Combinatorial Algorithms and Data Structures

In the Berkeley course listed I posted earlier, this course listing came up as a 404.

After a little digging I found it (it has links to the prior versions of the class) and I thought you might want something challenging to start the week!

GraphInsight

Wednesday, February 1st, 2012

GraphInsight

From the webpage:

Interative graph exploration

GraphInsight is a visualization software that lets you explore graph data through high quality interactive representations.

(video omitted)

Data exploration and knowledge extraction from graphs is of great interest nowadays: Knowledge is disseminated in social networks, and services are powered by cloud computing platforms. Data miners deal with graphs every day.

Humans are extremely good in identifying patterns and outliers. We believe that interacting visually with your data can give you a better intuition, and higher confidence on what you are looking for.

The video is just a little over one (1) minute long and is worth seeing.

Won’t tell you how to best display your data but does illustrate some of the capabilities of the software.

There are a number of graph rendering packages already but interactive ones are less common.

Now if we can have interactive graph software that hides/displays the graph underlying a text with all of the sub-graphs related to its content. So that it starts to mimic regular reading practice that goes off on tangents and finds support for ideas in unlikely spaces, that would be something really different.

Data Structures and Algorithms

Thursday, January 5th, 2012

Data Structures and Algorithms with Object-Oriented Design Patterns in Java by Bruno R. Preiss.

From Goals:

The primary goal of this book is to promote object-oriented design using Java and to illustrate the use of the emerging object-oriented design patterns. Experienced object-oriented programmers find that certain ways of doing things work best and that these ways occur over and over again. The book shows how these patterns are used to create good software designs. In particular, the following design patterns are used throughout the text: singleton, container, enumeration, adapter and visitor.

Virtually all of the data structures are presented in the context of a single, unified, polymorphic class hierarchy. This framework clearly shows the relationships between data structures and it illustrates how polymorphism and inheritance can be used effectively. In addition, algorithmic abstraction is used extensively when presenting classes of algorithms. By using algorithmic abstraction, it is possible to describe a generic algorithm without having to worry about the details of a particular concrete realization of that algorithm.

A secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context. In the past when the topics in this book were taught at the graduate level, an author could rely on students having the needed background in mathematics. However, because the book is targeted for second- and third-year students, it is necessary to fill in the background as needed. To the extent possible without compromising correctness, the presentation fosters intuitive understanding of the concepts rather than mathematical rigor.

Noticed in David Eppstein’s Link Roundup.

Open Data Structures

Thursday, January 5th, 2012

Open Data Structures by Pat Morin.

From “about:”

Open Data Structures covers the implementation and analysis of data structures for sequences (lists), queues, priority queues, unordered dictionaries, and ordered dictionaries.

Data structures presented in the book include stacks, queues, deques, and lists implemented as arrays and linked-list; space-efficient implementations of lists; skip lists; hash tables and hash codes; binary search trees including treaps, scapegoat trees, and red-black trees; and heaps, including implicit binary heaps and randomized meldable heaps.

The data structures in this book are all fast, practical, and have provably good running times. All data structures are rigorously analyzed and implemented in Java and C++. The Java implementations implement the corresponding interfaces in the Java Collections Framework.

The book and accompanying source code are free (libre and gratis) and are released under a Creative Commons Attribution License. Users are free to copy, distribute, use, and adapt the text and source code, even commercially. The book’s LaTeX sources, Java/C++ sources, and build scripts are available through github.

Noticed in David Eppstein’s Link Roundup.

Extreme Cleverness: Functional Data Structures in Scala

Tuesday, December 20th, 2011

Extreme Cleverness: Functional Data Structure in Scala

From the description:

Daniel Spiewak shows how to create immutable data that supports structural sharing, such as: Singly-linked List, Banker’s Queue, 2-3 Finger Tree, Red-Black Tree, Patricia Trie, Bitmapped Vector Trie.

Every now and again I see a presentation that is head and shoulders above even very good presentations. This is one of those.

The coverage of the Bitmapped Vector Trie merits your close attention. Amazing performance characteristics.

Satisfy yourself, see: http://github.com/djspiewak/extreme-cleverness

balanced binary search trees exercise for algorithms and data structures class

Wednesday, November 30th, 2011

balanced binary search trees exercise for algorithms and data structures class by René Pichardt.

From the post:

I created some exercises regarding binary search trees. This time there is no coding involved. My experience from teaching former classes is that many people have a hard time understanding why trees are usefull and what the dangers of these trees is. Therefor I have created some straight forward exercises that nevertheless involve some work and will hopefully help the students to better understand and internalize the concepts of binary search tress which are in my oppinion one of the most fundamental and important concepts in a class about algorithms and data structures.

I visited René’s blog because of the Google n gram post but could not leave without mentioning these exercises.

Great teaching technique!

What parts of topic maps should be illustrated with similar exercises?

PS: Still working on it but I am thinking that the real power of topic maps lies in its lack of precision or rather that a topic map can be as precise or as loose as need be. No pre-set need to have a decidable outcome. Or perhaps rather, it can have a decidable outcome that is the decidable outcome because I say it is so. ;-)

Munnecke, Heath Records and VistA (NoSQL 35 years old?)

Sunday, November 6th, 2011

Tom Munnecke is the inventor of Veterans Health Information Systems and Technology Architecture (VISTA), which is the core for half of the operational electronic health records in existence today.

From the VISTA monograph:

In 1996, the Chief Information Office introduced VISTA, which is the Veterans Health Information Systems and Technology Architecture. It is a rich, automated environment that supports day-to-day operations at local Department of Veterans Affairs (VA) health care facilities.

VISTA is built on a client-server architecture, which ties together workstations and personal computers with graphical user interfaces at Veterans Health Administration (VHA) facilities, as well as software developed by local medical facility staff. VISTA also includes the links that allow commercial off-the-shelf software and products to be used with existing and future technologies. The Decision Support System (DSS) and other national databases that might be derived from locally generated data lie outside the scope of VISTA.

When development began on the Decentralized Hospital Computer Program (DHCP) in the early 1980s, information systems were in their infancy in VA medical facilities and emphasized primarily hospital-based activities. DHCP grew rapidly and is used by many private and public health care facilities throughout the United States and the world. Although DHCP represented the total automation activity at most VA medical centers in 1985, DHCP is now only one part of the overall information resources at the local facility level. VISTA incorporates all of the benefits of DHCP as well as including the rich array of other information resources that are becoming vital to the day-to-day operations at VA medical facilities. It represents the culmination of DHCP’s evolution and metamorphosis into a new, open system, client-server based environment that takes full advantage of commercial solutions, including those provided by Internet technologies.

Yeah, you caught the alternative expansion of DHCP. Surprised me the first time I saw it.

A couple of other posts/resources on Munnecke to consider:

Some of my original notes on the design of VistA and Rehashing MUMPS/Data Dictionary vs. Relational Model.

From the MUMPS/Data Dictionary post:

This is another never-ending story, now going 35 years. It seems that there are these Mongolean hordes of people coming over the horizon, saying the same thing about treating medical informatics as just another transaction processing system. They know banking, insurance, or retail, so therefore they must understand medical informatics as well.

I looked very seriously at the relational model, and rejected it because I thought it was too rigid for the expression of medical informatics information. I made a “grand tour” of the leading medical informatics sites to look at what was working for them. I read and spoke extensively with Chris Date http://en.wikipedia.org/wiki/Christopher_J._Date , Stanford CS prof Gio Wiederhold http://infolab.stanford.edu/people/gio.html (who was later to become the major professor of PhD dropout Sergy Brin), and Wharton professor Richard Hackathorn. I presented papers at national conventions AFIPS and SCAMC, gave colloquia at Stanford, Harvard Medical School, Linkoping University in Sweden, Frankfurt University in Germany, and Chiba University in Japan.

So successful, widespread and mainstream NoSQL has been around for 35 years? ;-)

How to beat the CAP theorem

Sunday, October 30th, 2011

How to beat the CAP theorem by Nathan Marz.

After the Storm video, I ran across this post by Nathan and just had to add it as well!

From the post:

The CAP theorem states a database cannot guarantee consistency, availability, and partition-tolerance at the same time. But you can’t sacrifice partition-tolerance (see here and here), so you must make a tradeoff between availability and consistency. Managing this tradeoff is a central focus of the NoSQL movement.

Consistency means that after you do a successful write, future reads will always take that write into account. Availability means that you can always read and write to the system. During a partition, you can only have one of these properties.

Systems that choose consistency over availability have to deal with some awkward issues. What do you do when the database isn’t available? You can try buffering writes for later, but you risk losing those writes if you lose the machine with the buffer. Also, buffering writes can be a form of inconsistency because a client thinks a write has succeeded but the write isn’t in the database yet. Alternatively, you can return errors back to the client when the database is unavailable. But if you’ve ever used a product that told you to “try again later”, you know how aggravating this can be.

The other option is choosing availability over consistency. The best consistency guarantee these systems can provide is “eventual consistency”. If you use an eventually consistent database, then sometimes you’ll read a different result than you just wrote. Sometimes multiple readers reading the same key at the same time will get different results. Updates may not propagate to all replicas of a value, so you end up with some replicas getting some updates and other replicas getting different updates. It is up to you to repair the value once you detect that the values have diverged. This requires tracing back the history using vector clocks and merging the updates together (called “read repair”).

I believe that maintaining eventual consistency in the application layer is too heavy of a burden for developers. Read-repair code is extremely susceptible to developer error; if and when you make a mistake, faulty read-repairs will introduce irreversible corruption into the database.

So sacrificing availability is problematic and eventual consistency is too complex to reasonably build applications. Yet these are the only two options, so it seems like I’m saying that you’re damned if you do and damned if you don’t. The CAP theorem is a fact of nature, so what alternative can there possibly be?

Nathan finds a way and it is as clever as his coding for Storm.

Take your time and read slowly. See what you think. Comments welcome!

Data Structures for Range-Sum Queries (slides)

Saturday, October 29th, 2011

Data Structures for Range-Sum Queries (slides) by Paul Butler.

From the post:

This week I attended the Canadian Undergraduate Mathematics Conference. I enjoyed talks from a number of branches of mathematics, and gave a talk of my own on range-sum queries. Essentially, range-aggregate queries are a class of database queries which involve taking an aggregate (in SQL terms, SUM, AVG, COUNT, MIN, etc.) over a set of data where the elements are filtered by simple inequality operators (in SQL terms, WHERE colname {<, <=, =, >=, >} value AND …). Range-sum queries are the subset of those queries where SUM is the aggregation function.

Due to the nature of the conference, I did my best to make things as accessible to someone with a general mathematics background rather than assuming familiarity with databases or order notation.

I’ve put the slides (pdf link, embedded below also) online. They may be hard to follow as slides, but I hope they pique your interest enough to check out the papers referenced at the end if that’s the sort of thing that interests you. I may turn them into a blog post at some point. The presentation begins with tabular data and shows some of the insights that led to the Dynamic Data Cube, which is a clever data structure for answering range-sum queries.

I will run down the links and see what other materials I can find on the “Dynamic Data Cube” (this post is from 2010). Data structures for range-sum queries look quite interesting.

Timetric

Thursday, October 27th, 2011

Timetric: Everything you need to publish data and research online

Billed as having more than three (3) million public statistics.

Looks like an interesting data source.

Anyone have experience with this site in particular?

Functional Data Structures – Chris Okasaki Publications

Sunday, September 18th, 2011

Functional Data Structures – Chris Okasaki Publications

I was trying to find a paper that Daniel Spiewak mentions in: Extreme Cleverness: Functional Data Structures in Scala when I ran across this listing of publications by Chris Okasaki.

Facing the choice of burying the reference in what seems like an endless list of bookmarks or putting it in my blog where I may find it again and/or it may benefit someone else, I chose the latter course.

Enjoy.