Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

April 10, 2012

Big Data Reference Model (And Panopticons)

Filed under: BigData,Data Structures,Data Warehouse,Panopticon — Patrick Durusau @ 6:40 pm

Big Data Reference Model

Michael Nygard writes:

A project that approaches Big Data as a purely technical challenge will not deliver results. It is about more than just massive Hadoop clusters and number-crunching. In order to deliver value, a Big Data project has to enable change and adaptation. This requires that there are known problems to be solved. Yet, identifying the problem can be the hardest part. It’s often the case that you have to collect some information to even discover what problem to solve. Deciding how to solve that problem creates a need for more information and analysis. This is an empirical discovery loop similar to that found in any research project or Six Sigma initiative.

Michael takes you on a sensible loop of discovery and evaluation, making you more likely (no guarantees) to succeed with your next “big data” project. In particular, see the following caution:

… it is tempting to think that we could build a complete panopticon: a universal data warehouse with everything in the company. This is an expensive endeavor, and not a historically successful path. Whether structured or unstructured, any data store is suited to answer some questions but not others. No matter how much you invest in building the panopticon, there will be dimensions you don’t think to support. It is better to skip the massive up-front time and expense, focusing instead on making it very fast and easy to add new data sources or new elements to existing sources.

I like the term panopticon. In part because of its historical association with prisons.

Data warehouses and data structures are prisons, better suited to one purpose (or group of purposes) than another.

We must build prisons for today and leave tomorrow’s prisons for tomorrow.

The problem that topic maps try to address is how to safely transfer prisoners from today’s prisons to tomorrow’s. That transfer is made more complicated by some people still using old prisons, sometimes generations of prisons older than most people. Not to mention the variety of prisons across businesses, governments, nationalities.

All of them have legitimate purposes and serve some purpose now, else their users would have migrated their prisoners to a new prison.

I will have to think about the prison metaphor. I think it works fairly well.

Comments?

March 5, 2012

Trees in the Database: Advanced Data Structures

Filed under: Data Structures,Database,PostgreSQL,RDBMS,SQL,Trees — Patrick Durusau @ 7:52 pm

Trees in the Database: Advanced Data Structures

Lorenzo Alberton writes:

Despite the NoSQL movement trying to flag traditional databases as a dying breed, the RDBMS keeps evolving and adding new powerful weapons to its arsenal. In this talk we’ll explore Common Table Expressions (SQL-99) and how SQL handles recursion, breaking the bi-dimensional barriers and paving the way to more complex data structures like trees and graphs, and how we can replicate features from social networks and recommendation systems. We’ll also have a look at window functions (SQL:2003) and the advanced reporting features they make finally possible.

The first part of this talk will cover several different techniques to model a tree data structure into a relational database: parent-child (adjacency list) model, materialized path, nested sets, nested intervals, hybrid models, Common Table Expressions. Then we’ll move one step forward and see how we can model a more complex data structure, i.e. a graph, with concrete examples from today’s websites. Starting from real-world examples of social networks’ and recommendation systems’ features, and with the help of some graph theory, this talk will explain how to represent and traverse a graph in the database.

Finally, we will take a look at Window Functions and how they can be useful for data analytics and simple inline aggregations, among other things. All the examples have been tested on PostgreSQL >= 8.4.

Very impressive presentation!

Definitely makes me want to dust off my SQL installations and manuals for a closer look!
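Before you do, a minimal sketch of the adjacency-list model plus a recursive CTE may whet your appetite. The table and data here are my own invention, not from the talk, and it runs against SQLite (which, like PostgreSQL >= 8.4, supports WITH RECURSIVE):

```python
# A sketch of the adjacency-list model and a recursive CTE, run from
# Python against an in-memory SQLite database. Table and data are
# hypothetical examples, not taken from the talk.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER, label TEXT);
INSERT INTO node VALUES
  (1, NULL, 'root'), (2, 1, 'fruit'), (3, 1, 'vegetables'),
  (4, 2, 'citrus'), (5, 4, 'orange');
""")

# The CTE seeds itself with one row, then repeatedly joins children
# onto the rows found so far -- recursion in plain SQL.
rows = conn.execute("""
WITH RECURSIVE subtree(id, label, depth) AS (
  SELECT id, label, 0 FROM node WHERE label = 'fruit'
  UNION ALL
  SELECT n.id, n.label, s.depth + 1
  FROM node n JOIN subtree s ON n.parent_id = s.id
)
SELECT label, depth FROM subtree ORDER BY depth
""").fetchall()

for label, depth in rows:
    print("  " * depth + label)   # fruit / citrus / orange
```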

February 19, 2012

Combinatorial Algorithms and Data Structures

Filed under: Combinatorics,Data Structures — Patrick Durusau @ 8:41 pm

Combinatorial Algorithms and Data Structures

In the Berkeley course list I posted earlier, this course came up as a 404.

After a little digging I found it (it has links to the prior versions of the class) and I thought you might want something challenging to start the week!

February 1, 2012

GraphInsight

Filed under: Data Analysis,Data Structures,Graphs,Visualization — Patrick Durusau @ 4:38 pm

GraphInsight

From the webpage:

Interactive graph exploration

GraphInsight is a visualization software that lets you explore graph data through high quality interactive representations.

(video omitted)

Data exploration and knowledge extraction from graphs is of great interest nowadays: Knowledge is disseminated in social networks, and services are powered by cloud computing platforms. Data miners deal with graphs every day.

Humans are extremely good at identifying patterns and outliers. We believe that interacting visually with your data can give you better intuition and higher confidence in what you are looking for.

The video is just a little over one (1) minute long and is worth seeing.

It won’t tell you how best to display your data, but it does illustrate some of the capabilities of the software.

There are a number of graph rendering packages already but interactive ones are less common.

Now imagine interactive graph software that hides/displays the graph underlying a text, along with all of the sub-graphs related to its content, so that it starts to mimic regular reading practice, which goes off on tangents and finds support for ideas in unlikely places. That would be something really different.

January 5, 2012

Data Structures and Algorithms

Filed under: Data Structures — Patrick Durusau @ 4:05 pm

Data Structures and Algorithms with Object-Oriented Design Patterns in Java by Bruno R. Preiss.

From Goals:

The primary goal of this book is to promote object-oriented design using Java and to illustrate the use of the emerging object-oriented design patterns. Experienced object-oriented programmers find that certain ways of doing things work best and that these ways occur over and over again. The book shows how these patterns are used to create good software designs. In particular, the following design patterns are used throughout the text: singleton, container, enumeration, adapter and visitor.

Virtually all of the data structures are presented in the context of a single, unified, polymorphic class hierarchy. This framework clearly shows the relationships between data structures and it illustrates how polymorphism and inheritance can be used effectively. In addition, algorithmic abstraction is used extensively when presenting classes of algorithms. By using algorithmic abstraction, it is possible to describe a generic algorithm without having to worry about the details of a particular concrete realization of that algorithm.

A secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context. In the past when the topics in this book were taught at the graduate level, an author could rely on students having the needed background in mathematics. However, because the book is targeted for second- and third-year students, it is necessary to fill in the background as needed. To the extent possible without compromising correctness, the presentation fosters intuitive understanding of the concepts rather than mathematical rigor.

Noticed in David Eppstein’s Link Roundup.
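To give a flavor of one of those patterns, here is a toy visitor sketched in Python rather than the book’s Java; the classes and names are mine, invented for illustration. The point is that the algorithm (here, summing) lives in one place, apart from the structures it traverses:

```python
# A minimal visitor-pattern sketch (hypothetical classes, not the book's):
# each structure only knows how to accept a visitor; the algorithm itself
# is written once, in the visitor.
class Leaf:
    def __init__(self, value):
        self.value = value
    def accept(self, visitor):
        return visitor.visit_leaf(self)

class Pair:
    def __init__(self, left, right):
        self.left, self.right = left, right
    def accept(self, visitor):
        return visitor.visit_pair(self)

class SumVisitor:
    def visit_leaf(self, leaf):
        return leaf.value
    def visit_pair(self, pair):
        return pair.left.accept(self) + pair.right.accept(self)

tree = Pair(Leaf(1), Pair(Leaf(2), Leaf(3)))
print(tree.accept(SumVisitor()))  # 6
```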

Open Data Structures

Filed under: Data Structures,Java — Patrick Durusau @ 4:04 pm

Open Data Structures by Pat Morin.

From “about:”

Open Data Structures covers the implementation and analysis of data structures for sequences (lists), queues, priority queues, unordered dictionaries, and ordered dictionaries.

Data structures presented in the book include stacks, queues, deques, and lists implemented as arrays and linked-list; space-efficient implementations of lists; skip lists; hash tables and hash codes; binary search trees including treaps, scapegoat trees, and red-black trees; and heaps, including implicit binary heaps and randomized meldable heaps.

The data structures in this book are all fast, practical, and have provably good running times. All data structures are rigorously analyzed and implemented in Java and C++. The Java implementations implement the corresponding interfaces in the Java Collections Framework.

The book and accompanying source code are free (libre and gratis) and are released under a Creative Commons Attribution License. Users are free to copy, distribute, use, and adapt the text and source code, even commercially. The book’s LaTeX sources, Java/C++ sources, and build scripts are available through github.

Noticed in David Eppstein’s Link Roundup.
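One item on that list, the randomized meldable heap, is small enough to sketch. This is my Python transliteration of the idea, not the book’s Java or C++: meld two heaps by keeping the smaller root and recursing into a randomly chosen child, which keeps the expected work logarithmic.

```python
import random

class Node:
    def __init__(self, x):
        self.x, self.left, self.right = x, None, None

def meld(a, b):
    """Merge two heaps in expected O(log n) time."""
    if a is None:
        return b
    if b is None:
        return a
    if b.x < a.x:
        a, b = b, a                 # keep the smaller root on top
    if random.random() < 0.5:       # the coin flip does the balancing
        a.left = meld(a.left, b)
    else:
        a.right = meld(a.right, b)
    return a

def insert(heap, x):
    return meld(heap, Node(x))

def delete_min(heap):
    return heap.x, meld(heap.left, heap.right)

heap = None
for x in [5, 1, 4, 2]:
    heap = insert(heap, x)
smallest, heap = delete_min(heap)
print(smallest)  # 1
```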

December 20, 2011

Extreme Cleverness: Functional Data Structures in Scala

Filed under: Data Structures,Functional Programming,Scala — Patrick Durusau @ 8:22 pm

Extreme Cleverness: Functional Data Structures in Scala

From the description:

Daniel Spiewak shows how to create immutable data that supports structural sharing, such as: Singly-linked List, Banker’s Queue, 2-3 Finger Tree, Red-Black Tree, Patricia Trie, Bitmapped Vector Trie.

Every now and again I see a presentation that is head and shoulders above even very good presentations. This is one of those.

The coverage of the Bitmapped Vector Trie merits your close attention. Amazing performance characteristics.

Satisfy yourself, see: http://github.com/djspiewak/extreme-cleverness
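If you want to poke at the core trick, structural sharing, before (or after) watching, here is a toy sketch in Python, with tuples standing in for immutable cons cells; the talk’s own examples are in Scala:

```python
# A toy sketch of structural sharing with an immutable singly-linked list:
# tuples stand in for cons cells, so "updating" never mutates, it allocates
# a new cell that points at the old (shared) tail.
def cons(head, tail):
    return (head, tail)          # tail is another cell or None

def to_list(cell):
    out = []
    while cell is not None:
        head, cell = cell
        out.append(head)
    return out

base = cons(1, cons(2, cons(3, None)))
extended = cons(0, base)         # O(1): no copying, base is shared intact

assert to_list(base) == [1, 2, 3]
assert to_list(extended) == [0, 1, 2, 3]
assert extended[1] is base       # the old version is still there, unchanged
print("both versions live:", to_list(base), to_list(extended))
```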

November 30, 2011

balanced binary search trees exercise for algorithms and data structures class

Filed under: Algorithms,Data Structures,Search Trees — Patrick Durusau @ 8:11 pm

balanced binary search trees exercise for algorithms and data structures class by René Pickhardt.

From the post:

I created some exercises regarding binary search trees. This time there is no coding involved. My experience from teaching former classes is that many people have a hard time understanding why trees are useful and what the dangers of these trees are. Therefore I have created some straightforward exercises that nevertheless involve some work and will hopefully help the students to better understand and internalize the concepts of binary search trees, which are in my opinion one of the most fundamental and important concepts in a class about algorithms and data structures.
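The danger the exercises get at is easy to demonstrate for yourself. A quick sketch (my illustration, not one of the exercises): feed a naive, unbalanced BST its keys in sorted order and it degenerates into a linked list, while a random insertion order keeps the height logarithmic.

```python
import random

class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def height(root):
    return 0 if root is None else 1 + max(height(root.left), height(root.right))

n = 500
worst = None
for k in range(n):            # sorted insertions: the degenerate case
    worst = insert(worst, k)

keys = list(range(n))
random.shuffle(keys)
typical = None
for k in keys:                # random insertions: height grows like O(log n)
    typical = insert(typical, k)

print(height(worst))    # 500 -- a linked list in disguise
print(height(typical))  # typically around 20, i.e. a small multiple of log2(n)
```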

I visited René’s blog because of the Google n gram post but could not leave without mentioning these exercises.

Great teaching technique!

What parts of topic maps should be illustrated with similar exercises?

PS: Still working on it but I am thinking that the real power of topic maps lies in its lack of precision or rather that a topic map can be as precise or as loose as need be. No pre-set need to have a decidable outcome. Or perhaps rather, it can have a decidable outcome that is the decidable outcome because I say it is so. 😉

November 6, 2011

Munnecke, Heath Records and VistA (NoSQL 35 years old?)

Filed under: Data Management,Data Structures,Medical Informatics,MUMPS — Patrick Durusau @ 5:42 pm

Tom Munnecke is the inventor of Veterans Health Information Systems and Technology Architecture (VISTA), which is the core for half of the operational electronic health records in existence today.

From the VISTA monograph:

In 1996, the Chief Information Office introduced VISTA, which is the Veterans Health Information Systems and Technology Architecture. It is a rich, automated environment that supports day-to-day operations at local Department of Veterans Affairs (VA) health care facilities.

VISTA is built on a client-server architecture, which ties together workstations and personal computers with graphical user interfaces at Veterans Health Administration (VHA) facilities, as well as software developed by local medical facility staff. VISTA also includes the links that allow commercial off-the-shelf software and products to be used with existing and future technologies. The Decision Support System (DSS) and other national databases that might be derived from locally generated data lie outside the scope of VISTA.

When development began on the Decentralized Hospital Computer Program (DHCP) in the early 1980s, information systems were in their infancy in VA medical facilities and emphasized primarily hospital-based activities. DHCP grew rapidly and is used by many private and public health care facilities throughout the United States and the world. Although DHCP represented the total automation activity at most VA medical centers in 1985, DHCP is now only one part of the overall information resources at the local facility level. VISTA incorporates all of the benefits of DHCP as well as including the rich array of other information resources that are becoming vital to the day-to-day operations at VA medical facilities. It represents the culmination of DHCP’s evolution and metamorphosis into a new, open system, client-server based environment that takes full advantage of commercial solutions, including those provided by Internet technologies.

Yeah, you caught the alternative expansion of DHCP. Surprised me the first time I saw it.

A couple of other posts/resources on Munnecke to consider:

Some of my original notes on the design of VistA and Rehashing MUMPS/Data Dictionary vs. Relational Model.

From the MUMPS/Data Dictionary post:

This is another never-ending story, now going 35 years. It seems that there are these Mongolian hordes of people coming over the horizon, saying the same thing about treating medical informatics as just another transaction processing system. They know banking, insurance, or retail, so therefore they must understand medical informatics as well.

I looked very seriously at the relational model, and rejected it because I thought it was too rigid for the expression of medical informatics information. I made a “grand tour” of the leading medical informatics sites to look at what was working for them. I read and spoke extensively with Chris Date http://en.wikipedia.org/wiki/Christopher_J._Date , Stanford CS prof Gio Wiederhold http://infolab.stanford.edu/people/gio.html (who was later to become the major professor of PhD dropout Sergey Brin), and Wharton professor Richard Hackathorn. I presented papers at national conventions AFIPS and SCAMC, gave colloquia at Stanford, Harvard Medical School, Linkoping University in Sweden, Frankfurt University in Germany, and Chiba University in Japan.

So successful, widespread and mainstream NoSQL has been around for 35 years? 😉

October 30, 2011

How to beat the CAP theorem

Filed under: CAP,Data Structures,Database — Patrick Durusau @ 7:05 pm

How to beat the CAP theorem by Nathan Marz.

After the Storm video, I ran across this post by Nathan and just had to add it as well!

From the post:

The CAP theorem states a database cannot guarantee consistency, availability, and partition-tolerance at the same time. But you can’t sacrifice partition-tolerance (see here and here), so you must make a tradeoff between availability and consistency. Managing this tradeoff is a central focus of the NoSQL movement.

Consistency means that after you do a successful write, future reads will always take that write into account. Availability means that you can always read and write to the system. During a partition, you can only have one of these properties.

Systems that choose consistency over availability have to deal with some awkward issues. What do you do when the database isn’t available? You can try buffering writes for later, but you risk losing those writes if you lose the machine with the buffer. Also, buffering writes can be a form of inconsistency because a client thinks a write has succeeded but the write isn’t in the database yet. Alternatively, you can return errors back to the client when the database is unavailable. But if you’ve ever used a product that told you to “try again later”, you know how aggravating this can be.

The other option is choosing availability over consistency. The best consistency guarantee these systems can provide is “eventual consistency”. If you use an eventually consistent database, then sometimes you’ll read a different result than you just wrote. Sometimes multiple readers reading the same key at the same time will get different results. Updates may not propagate to all replicas of a value, so you end up with some replicas getting some updates and other replicas getting different updates. It is up to you to repair the value once you detect that the values have diverged. This requires tracing back the history using vector clocks and merging the updates together (called “read repair”).

I believe that maintaining eventual consistency in the application layer is too heavy of a burden for developers. Read-repair code is extremely susceptible to developer error; if and when you make a mistake, faulty read-repairs will introduce irreversible corruption into the database.

So sacrificing availability is problematic and eventual consistency is too complex to reasonably build applications. Yet these are the only two options, so it seems like I’m saying that you’re damned if you do and damned if you don’t. The CAP theorem is a fact of nature, so what alternative can there possibly be?

Nathan finds a way and it is as clever as his coding for Storm.

Take your time and read slowly. See what you think. Comments welcome!
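To make the read-repair burden Nathan describes concrete, here is a toy vector-clock sketch (node names and replica states are hypothetical): the clocks tell you when two replica values are causally ordered and when they have truly diverged, at which point the application must merge them and write back the combined clock.

```python
# A toy sketch of the bookkeeping behind "read repair": vector clocks
# detect whether two replica values are ordered or concurrent.
def compare(vc_a, vc_b):
    """Return 'a<=b', 'b<=a', or 'concurrent'."""
    nodes = set(vc_a) | set(vc_b)
    a_le_b = all(vc_a.get(n, 0) <= vc_b.get(n, 0) for n in nodes)
    b_le_a = all(vc_b.get(n, 0) <= vc_a.get(n, 0) for n in nodes)
    if a_le_b:
        return "a<=b"
    if b_le_a:
        return "b<=a"
    return "concurrent"

def merge(vc_a, vc_b):
    # After the application merges the divergent values, the repaired
    # write carries the pointwise maximum of both clocks.
    return {n: max(vc_a.get(n, 0), vc_b.get(n, 0))
            for n in set(vc_a) | set(vc_b)}

replica1 = {"node_a": 2, "node_b": 1}   # hypothetical replica states
replica2 = {"node_a": 1, "node_b": 3}
print(compare(replica1, replica2))      # 'concurrent' -> app must merge
print(merge(replica1, replica2))        # {'node_a': 2, 'node_b': 3}
```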

October 29, 2011

Data Structures for Range-Sum Queries (slides)

Filed under: Data Structures,Query Language — Patrick Durusau @ 7:29 pm

Data Structures for Range-Sum Queries (slides) by Paul Butler.

From the post:

This week I attended the Canadian Undergraduate Mathematics Conference. I enjoyed talks from a number of branches of mathematics, and gave a talk of my own on range-sum queries. Essentially, range-aggregate queries are a class of database queries which involve taking an aggregate (in SQL terms, SUM, AVG, COUNT, MIN, etc.) over a set of data where the elements are filtered by simple inequality operators (in SQL terms, WHERE colname {<, <=, =, >=, >} value AND …). Range-sum queries are the subset of those queries where SUM is the aggregation function.

Due to the nature of the conference, I did my best to make things as accessible to someone with a general mathematics background rather than assuming familiarity with databases or order notation.

I’ve put the slides (pdf link, embedded below also) online. They may be hard to follow as slides, but I hope they pique your interest enough to check out the papers referenced at the end if that’s the sort of thing that interests you. I may turn them into a blog post at some point. The presentation begins with tabular data and shows some of the insights that led to the Dynamic Data Cube, which is a clever data structure for answering range-sum queries.

I will run down the links and see what other materials I can find on the “Dynamic Data Cube” (this post is from 2010). Data structures for range-sum queries look quite interesting.
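The one-dimensional ancestor of such structures is easy to sketch (my example, not from the slides): precompute prefix sums and any range-sum query becomes two lookups. As I understand it, the Dynamic Data Cube generalizes this kind of precomputation to multiple dimensions while still supporting updates.

```python
# A minimal sketch of the idea behind range-sum structures, in one
# dimension: precompute prefix sums so any SUM over [lo, hi] is two
# array lookups instead of a scan.
data = [3, 1, 4, 1, 5, 9, 2, 6]

prefix = [0]
for x in data:
    prefix.append(prefix[-1] + x)

def range_sum(lo, hi):
    """Sum of data[lo..hi] inclusive, in O(1)."""
    return prefix[hi + 1] - prefix[lo]

assert range_sum(2, 5) == 4 + 1 + 5 + 9
assert range_sum(0, 7) == sum(data)
```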

October 27, 2011

Timetric

Filed under: Data Mining,Data Structures — Patrick Durusau @ 4:46 pm

Timetric: Everything you need to publish data and research online

Billed as having more than three (3) million public statistics.

Looks like an interesting data source.

Anyone have experience with this site in particular?

September 18, 2011

Functional Data Structures – Chris Okasaki Publications

Filed under: Data Structures,Functional Programming — Patrick Durusau @ 7:28 pm

Functional Data Structures – Chris Okasaki Publications

I was trying to find a paper that Daniel Spiewak mentions in: Extreme Cleverness: Functional Data Structures in Scala when I ran across this listing of publications by Chris Okasaki.

Facing the choice of burying the reference in what seems like an endless list of bookmarks or putting it in my blog where I may find it again and/or it may benefit someone else, I chose the latter course.

Enjoy.

September 1, 2011

An Introduction to Clojure and Its Capabilities for Data Manipulation

Filed under: Clojure,Data Structures — Patrick Durusau @ 6:01 pm

An Introduction to Clojure and Its Capabilities for Data Manipulation by Jean-François “Jeff” Héon.

From the post:

I mainly use Java at work in an enterprise setting, but I’ve been using Clojure at work for small tasks like extracting data from log files or generating or transforming Java code. What I do could be done with more traditional tools like Perl, but I like the readability of Clojure combined with its Java interoperability. I particularly like the different ways functions can be used in Clojure to manipulate data.

I will only be skimming the surface of Clojure in this short article and so will present a simplified view of the concepts. My goal is for the reader to get to know enough about Clojure to decide if it is worth pursuing further using longer and more complete introduction material already available.

I will start with a mini introduction to Clojure, followed by an overview of sequences and functions combination, and finish off with a real-world example.

You will encounter immutable data structures so be forewarned.

I wonder to what degree mutable data structures arose originally due to lack of storage space and processor limitations? Will have to make a note to check that out.
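For readers who don’t know Clojure, the style translates. Here is the flavor of a sequence pipeline over log lines, transliterated into Python generators; the log format is made up for illustration, not taken from the article:

```python
# A sketch of the sequence-pipeline style, in Python rather than Clojure:
# lazy stages, each a pure transformation, composed end to end.
from collections import Counter

lines = [
    "2011-08-31 12:00:01 INFO  start",
    "2011-08-31 12:00:02 ERROR disk full",
    "2011-08-31 12:00:03 ERROR disk full",
    "2011-08-31 12:00:04 INFO  done",
]

fields = (line.split(maxsplit=3) for line in lines)
messages = (msg for _date, _time, level, msg in fields if level == "ERROR")
print(Counter(messages).most_common(1))  # [('disk full', 2)]
```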

July 25, 2011

Stratified B-Tree and Versioned Dictionaries

Filed under: B-trees,Data Structures — Patrick Durusau @ 6:41 pm

Stratified B-Tree and Versioned Dictionaries by Andy Twigg (Acunu). (video)

Abstract:

A classic versioned data structure in storage and computer science is the copy-on-write (CoW) B-tree — it underlies many of today’s file systems and databases, including WAFL, ZFS, Btrfs and more. Unfortunately, it doesn’t inherit the B-tree’s optimality properties; it has poor space utilization, cannot offer fast updates, and relies on random IO to scale. Yet, nothing better has been developed since. We describe the ‘stratified B-tree’, which beats all known semi-external memory versioned B-trees, including the CoW B-tree. In particular, it is the first versioned dictionary to achieve optimal tradeoffs between space, query and update performance.

I haven’t had time to watch the video but you can find some other resources on stratified B-Trees at Andy’s post All about stratified B-trees.

July 20, 2011

The Britney Spears Problem

Filed under: Data Streams,Data Structures,Topic Maps — Patrick Durusau @ 1:05 pm

The Britney Spears Problem by Brian Hayes.

From the article:

Back in 1999, the operators of the Lycos Internet portal began publishing a weekly list of the 50 most popular queries submitted to their Web search engine. Britney Spears—initially tagged a “teen songstress,” later a “pop tart”—was No. 2 on that first weekly tabulation. She has never fallen off the list since then—440 consecutive appearances when I last checked. Other perennials include Pamela Anderson and Paris Hilton. What explains the enduring popularity of these celebrities, so famous for being famous? That’s a fascinating question, and the answer would doubtless tell us something deep about modern culture. But it’s not the question I’m going to take up here. What I’m trying to understand is how we can know Britney’s ranking from week to week. How are all those queries counted and categorized? What algorithm tallies them up to see which terms are the most frequent? (emphasis added)

Deeply interesting discussion on the analysis of stream data and algorithms for the same. Very much worth a close read if you are working on or interested in such issues.
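One family of techniques the article covers is “frequent items” counting, which tracks heavy hitters with a fixed number of counters no matter how long the stream runs. A minimal Misra-Gries-style sketch, with a toy stream of my own rather than Lycos data:

```python
# A sketch of Misra-Gries "frequent items": at most k-1 counters survive,
# and any item occurring more than n/k times is guaranteed to be among them.
def misra_gries(stream, k):
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No room: decrement every counter; evict counters at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters  # candidate heavy hitters (counts are lower bounds)

queries = ["britney"] * 6 + ["pamela"] * 3 + ["paris", "lycos", "mp3"]
print(misra_gries(queries, k=3))   # 'britney' survives; rare terms do not
```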

The article concludes:

All this mathematics and algorithmic engineering seems like a lot of work just for following the exploits of a famous “pop tart.” But I like to think the effort might be justified. Years from now, someone will type “Britney Spears” into a search engine and will stumble upon this article listed among the results. Perhaps then a curious reader will be led into new lines of inquiry. (emphasis added)

But what if the user enters “pop tart?” Will they still find this article? Or will it be “hit” number 100,000, which almost no one reaches? As of 20 July 2011, there were some 13 million “hits” for “pop tart” on a popular search engine. I suspect at least some of them are not about Britney Spears.

So, should I encounter a resource about Britney Spears, using the term “pop tart,” how am I going to accumulate those for posterity?

Or do we all have to winnow search chaff for ourselves?*

*Question for office managers: How much time do you think your staff spends winnowing search chaff already winnowed by another user in your office?

July 1, 2011

…filling space — without cubes

Filed under: Algorithms,Data Structures,Mathematics — Patrick Durusau @ 2:56 pm

Princeton researchers solve problem filling space — without cubes

From the post:

Whether packing oranges into a crate, fitting molecules into a human cell or getting data onto a compact disc, wasted space is usually not a good thing.

Now, in findings published June 20 in the Proceedings of the National Academy of Sciences, Princeton University chemist Salvatore Torquato and colleagues have solved a conundrum that has baffled mathematical minds since ancient times — how to fill three-dimensional space with multi-sided objects other than cubes without having any gaps.

The discovery could lead to scientists finding new materials and could lead to advances in communications systems and computer security.

“You know you can fill space with cubes,” Torquato said, “We were looking for another way.” In the article “New Family of Tilings of Three-Dimensional Euclidean Space by Tetrahedra and Octahedra,” he and his team show they have solved the problem.

Not immediately useful for topic maps but will be interesting to see if new data structures emerge from this work.

See the article: New Family of Tilings of Three-Dimensional Euclidean Space by Tetrahedra and Octahedra (pay-per-view site)

June 12, 2011

Dremel: Interactive Analysis of Web-Scale Datasets

Filed under: BigData,Data Analysis,Data Structures,Dremel,MapReduce — Patrick Durusau @ 4:10 pm

Google, along with Bing and Yahoo!, has been attracting a lot of discussion for venturing into web semantics without asking permission.

However that turns out, please don’t miss:

Dremel: interactive analysis of web-scale datasets

Abstract:

Dremel is a scalable, interactive ad hoc query system for analysis of read-only nested data. By combining multilevel execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. In this paper, we describe the architecture and implementation of Dremel, and explain how it complements MapReduce-based computing. We present a novel columnar storage representation for nested records and discuss experiments on few-thousand node instances of the system.

I am still working through the article but “…aggregation queries over trillion-row tables in seconds,” is obviously of interest for a certain class of topic map.
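The columnar layout is a big part of that speed, and the kernel of the idea fits in a few lines. This toy sketch stripes flat records into per-field arrays; Dremel’s actual contribution is doing this for nested, repeated fields (via repetition and definition levels), which this deliberately ignores:

```python
# A toy sketch of why a columnar layout helps scan-heavy aggregation:
# values of one field sit contiguously, so a query touches only that column.
rows = [
    {"doc_id": 1, "lang": "en", "clicks": 3},
    {"doc_id": 2, "lang": "de", "clicks": 7},
    {"doc_id": 3, "lang": "en", "clicks": 2},
]

# Row store -> column store ("striping" each field into its own array).
columns = {field: [r[field] for r in rows] for field in rows[0]}

# An aggregation now reads one array; the other fields are never touched.
total_clicks = sum(columns["clicks"])
print(total_clicks)  # 12
```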

May 3, 2011

Introducing Druid: Real-Time Analytics at a Billion Rows Per Second

Filed under: Data Structures,Dataset — Patrick Durusau @ 1:17 pm

Introducing Druid: Real-Time Analytics at a Billion Rows Per Second

A general overview of Druid and the choices that led up to it.

The next post is said to have details about the architecture, etc.

From what I read here, holding all data in memory is one critical part of the solution.

That, and having data that can be held in smallish cells.

Tossing blobs, ASCII or binary, into cells might cause a problem.

We won’t know until the software is available for use by a diverse audience.

I mention it here as an example of defining data sets and requirements in such a way that scalable architectures can be developed for that particular set of requirements.

There is nothing wrong with having a solution that works best for a particular application.

Ballpoint pens are wonderful writing devices but fail miserably as hammers.

A software or technology solution that works for your problem is far more valuable than software that solves the general case but not yours.

May 2, 2011

Algoviz.org

Filed under: Algorithms,Data Structures,Visualization — Patrick Durusau @ 10:34 am

Algoviz.org: The Algorithm Visualization Portal

From the website:

AlgoViz.org is a gathering place for users and developers of algorithm visualizations and animations (AVs). It is a gateway to AV-related services, collections, and resources.

An amazing resource for algorithm visualization and animations. The “catalog” has over 500 entries, along with an annotated bibliography of papers on algorithm visualization, field reports of the use of visualizations in the classroom, forums and other resources.

Visualization of merging algorithms is going to take on increasing importance as TMCL and TMQL increase the range of merging opportunities.

Building on prior techniques and experiences with visualization seems like a good idea.

March 27, 2011

Copy-on-write B-tree finally beaten.

Filed under: Data Structures — Patrick Durusau @ 3:14 pm

Copy-on-write B-tree finally beaten by Andy Twigg, Andrew Byde, Grzegorz Miłoś, Tim Moreton, John Wilkes and Tom Wilkie.

Abstract:

A classic versioned data structure in storage and computer science is the copy-on-write (CoW) B-tree – it underlies many of today’s file systems and databases, including WAFL, ZFS, Btrfs and more. Unfortunately, it doesn’t inherit the B-tree’s optimality properties; it has poor space utilization, cannot offer fast updates, and relies on random IO to scale. Yet, nothing better has been developed since. We describe the ‘stratified B-tree’, which beats the CoW B-tree in every way. In particular, it is the first versioned dictionary to achieve optimal tradeoffs between space, query and update performance. Therefore, we believe there is no longer a good reason to use CoW B-trees for versioned data stores.

I was browsing a CS blog aggregator when I ran across this. Looked like it would be interesting for anyone writing a versioned data store for a topic map application.

A more detailed account appears as: A. Byde and A. Twigg. Optimal query/update tradeoffs in versioned dictionaries. http://arxiv.org/abs/1103.2566. ArXiv e-prints, March 2011.
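The copy-on-write discipline itself is simple to sketch. Here is path copying on a plain binary search tree in Python; the papers concern B-trees, but the cost they attack, copying a root-to-leaf path on every update while old versions stay readable, shows up the same way:

```python
# A sketch of copy-on-write "path copying" on a plain BST: each insert
# copies only the nodes on the root-to-leaf path; everything else is
# shared, and every earlier version remains readable.
class Node:
    __slots__ = ("key", "left", "right")
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def insert(root, key):
    if root is None:
        return Node(key)
    if key < root.key:
        return Node(root.key, insert(root.left, key), root.right)
    return Node(root.key, root.left, insert(root.right, key))

v1 = None
for k in [5, 3, 8]:
    v1 = insert(v1, k)
v2 = insert(v1, 4)            # new version: copies nodes 5 and 3 only

assert v2.right is v1.right   # untouched subtree is shared, not copied
```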

******
The Copy-on-write B-tree finally beaten paper has been updated: See: http://arxiv.org/abs/1103.4282v2

March 2, 2011

OSCON Data 2011 Call for Participation

Filed under: Conferences,Data Analysis,Data Mining,Data Models,Data Structures — Patrick Durusau @ 7:07 am

OSCON Data 2011 Call for Participation

Deadline: 11:59pm 03/14/2011 PDT

From the website:

The O’Reilly OSCON Data conference is the first of its kind: bringing together open source culture and data hackers to cover data management at a very practical level. From disks and databases through to big data and analytics, OSCON Data will have instruction and inspiration from the people who actually do the work.

OSCON Data will take place July 25-27, 2011, in Portland, Oregon. We’ll be co-located with OSCON itself.

Proposals should include as much detail about the topic and format for the presentation as possible. Vague and overly broad proposals don’t showcase your skills and knowledge, and our volunteer reviewers aren’t mind readers. The more you can tell us, the more likely the proposal will be selected.

Proposals that seem like a “vendor pitch” will not be considered. The purpose of OSCON Data is to enlighten, not to sell.

Submit a proposal.

Yes, it is right before Balisage but I think worth considering if you are on the West Coast and can’t get to Balisage this year or if you are feeling really robust. 😉

Hmmm, I wonder how a proposal that merges the indexes of the different NoSQL volumes from O’Reilly would be received? You are aware that O’Reilly is re-creating the X-Windows problem that was the genesis of both topic maps and DocBook?

I will have to write that up in detail at some point. I wasn’t there but have spoken to some of the principals who were. Plus I have the notes, etc.

February 25, 2011

…a grain of salt

Filed under: Data Analysis,Data Models,Data Structures,Marketing — Patrick Durusau @ 5:46 pm

Benjamin Bock asked me recently about how I would model a mole of salt in a topic map.

That is a good question but I think we had better start with a single grain of salt and then work our way up from there.

At first blush, and only at first blush, many subjects look quite easy to represent in a topic map.

A grain of salt looks simple at first glance: just create a PSI (Published Subject Identifier), put that as the subjectIdentifier on a topic and be done with it.

Well…, except that I don’t want to talk about a particular grain of salt, I want to talk about salt more generally.

OK, one of those, I see.

Alright, same answer as before, except make the PSI for salt in general, not some particular grain of salt.

Well,…., except that when I go to the Wikipedia article on salt, Salt, I find that salt is a compound of chlorine and sodium.

A compound, oh, that means something made up of more than one subject. In a particular type of relationship.

Sounds like an association to me.

Of a particular type, an ionic association. (I looked it up, see: Ionic Compound)

And this association between chlorine and sodium has several properties reported by Wikipedia, here are just a few of them:

  • Molar mass: 58.443 g/mol
  • Appearance: Colorless/white crystalline solid
  • Odor: Odorless
  • Density: 2.165 g/cm3
  • Melting point: 801 °C, 1074 K, 1474 °F
  • Boiling point: 1413 °C, 1686 K, 2575 °F
  • … and several others.

If you are interested in scientific/technical work, please be aware of CAS, a work product of the American Chemical Society, with a very impressive range of unique identifiers. (56 million organic and inorganic substances, 62 million sequences, and they have a counter that increments while you are on the page.)

Note that unlike my suggestion, CAS takes the assign-a-unique-identifier view for the substances, sequences and chemicals that they curate.

Oh, sorry, got interested in CAS as a source for subject identification. In fact, that is a nice segue to consider how to represent the millions and millions of compounds.

We could create associations with the various components being role players, but then we would have to reify those associations in order to hang additional properties off of them. Well, technically speaking, in XTM we would create non-occurrence occurrences and type those to hold the additional properties.

Sorry, I was presuming the decision to represent compounds as associations. Shout out when I start to presume that sort of thing. 😉

The reason I would represent compounds as associations is that the components of the associations are then subjects I can talk about and even add additional properties to, or create mappings between.
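To make that concrete, a toy sketch in Python, with the association rendered as a plain dictionary; the type names, role names and PSIs are invented for illustration, not drawn from any published subject set:

```python
# A toy sketch (hypothetical names and PSIs throughout) of the modeling
# choice above: the compound as an association whose role players are the
# component subjects, with the compound's properties on the association.
salt = {
    "type": "ionic-compound",              # association type
    "roles": {                             # role players are first-class
        "cation": "http://example.org/psi/sodium",
        "anion":  "http://example.org/psi/chlorine",
    },
    "properties": {
        "molar-mass": "58.443 g/mol",
        "melting-point": "801 °C",
    },
}

# Because sodium and chlorine are addressable subjects, other statements
# (or mappings to CAS identifiers) can attach to them directly.
for role, player in salt["roles"].items():
    print(role, "->", player)
```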

I suspect that CAS has chemistry from the 1800s fairly well covered, but what about older texts? Substances from before then may not be of interest to commercial chemists but certainly would be of interest to historians and other scholars.

Use of a topic map plus the CAS identifiers would enable scholars studying older materials to effectively share information about older texts, which have different designations for substances than CAS would record.

You could argue that I could use a topic for compounds, much as CAS does, and rely upon searching in order to discover relationships.

Tis true, tis true, but my modeling preference is for relationships seen as subjects, although I must confess I would prefer a next generation syntax that avoids the reification overhead of XTM.

Given the prevalence of complex relationships/associations, as you can see from the CAS index, I think a simplification of the representation of associations is warranted.

Sorry, I never did quite reach Benjamin’s question about a mole of salt, but I will take up that gauntlet again tomorrow.

We will see that measurement (which figured into his questions about recipes as well) is an interesting area of topic map design.
*****

PS: Comments and/or suggestions on areas to post about are most welcome. Subject analysis for topic maps is not unlike cataloging in library science, to a degree, except that what classification you assign is entirely the work product of your experience, reading and analysis. There are no fixed answers, only the ones that you find the most useful.

February 13, 2011

Software for Non-Human Users?

The description of Emerging Intelligent Data and Web Technologies (EIDWT-2011) is a call for software designed for non-human users.

The Social Life of Information by John Seely Brown and Paul Duguid makes it clear that human users don’t want to share data, because sharing data represents a loss of power/status.

A poll of the readers of CACM or Computer would report a universal experience of working in an office where information is hoarded by individuals in order to increase their own status or power.

9/11 was preceded, and has been followed to this day, by non-sharing of intelligence data. Even national peril cannot overcome the non-sharing reflex with regard to data.

EIDWT-2011, and conferences like it, are predicated on a sharing of data known not to exist, at least among human users.

Hence, I suspect the call must be directed at software for non-human users.

Emerging Intelligent Data and Web Technologies (EIDWT-2011)

2nd International Conference on Emerging Intelligent Data and Web Technologies (EIDWT-2011)

From the announcement:

The 2nd International Conference on Emerging Intelligent Data and Web Technologies (EIDWT-2011) is dedicated to the dissemination of original contributions that are related to the theories, practices and concepts of emerging data technologies, and most importantly to their applicability in business and academia towards a collective intelligence approach. In particular, EIDWT-2011 will discuss advances in utilizing and exploiting data generated from emerging data technologies such as Data Centers, Data Grids, Clouds, Crowds, Mashups, Social Networks and/or other Web 2.0 implementations towards a collaborative and collective intelligence approach leading to advancements of virtual organizations and their user communities. This is because current and future Web and Web 2.0 implementations will store and continuously produce a vast amount of data, which if combined and analyzed in a collective intelligence manner will make a difference in organizational settings and their user communities. Thus, the scope of EIDWT-2011 is to discuss methods and practices (including P2P) which bring various emerging data technologies together to capture, integrate, analyze, mine, annotate and visualize data – made available from various community users – in a meaningful and collaborative manner for the organization. Finally, EIDWT-2011 aims to provide a forum for original discussion and to prompt future directions in the area.

Important Dates:

Submission Deadline: March 10, 2011
Authors Notification: May 10, 2011
Author Registration: June 10, 2011
Final Manuscript: July 1, 2011
Conference Dates: September 7 – 9, 2011

January 28, 2011

Functional Data Structures – Post

Filed under: Data Structures,Topic Map Software,Topic Maps — Patrick Durusau @ 7:18 am

On the Theoretical Computer Science blog the following question was asked:

What’s new in purely functional data structures since Okasaki?

Since Chris Okasaki’s 1998 book “Purely functional data structures”, I haven’t seen too many new exciting purely functional data structures appear; I can name just a few:…

What follows is a listing of resources that will be of interest to topic map researchers.

January 23, 2011

Multi-Relational Graph Structures: From Algebra to Application

Filed under: Data Structures,Graphs,Neo4j — Patrick Durusau @ 4:55 pm

Multi-Relational Graph Structures: From Algebra to Application

Important review of graph structures and the development of research on the same over the last couple of decades.

Doesn’t answer the question of what will be the hot application that puts topic maps on every desktop.

Does bring us a little closer to an application that would merit that kind of usage.

A Path Algebra for Mapping Multi-Relational Networks to Single-Relational Networks

Filed under: Data Structures,Graphs,Neo4j,Networks — Patrick Durusau @ 4:54 pm

A Path Algebra for Mapping Multi-Relational Networks to Single-Relational Networks

A proposal for re-using existing algorithms, designed for single-relational networks, with multi-relational networks.

By mapping multi-relational networks onto single-relational networks.

Makes me wonder if heterogeneous identifications could be mapped in a similar way to a single identifier?

Or would there be too much information loss?

Depends on the circumstances and goals.
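The simplest instance of such a mapping is easy to sketch (a toy graph of my own; the paper’s path algebra also composes and weights relations, which this ignores):

```python
# A toy sketch of projecting a multi-relational graph down to a
# single-relational one: keep only selected edge types and drop the
# labels, so single-relational algorithms can run on the result.
edges = [
    ("alice", "knows",    "bob"),
    ("alice", "worksFor", "acme"),
    ("bob",   "knows",    "carol"),
    ("carol", "worksFor", "acme"),
]

def project(edges, keep_types):
    """Drop edge labels, keeping only edges whose type is in keep_types."""
    return {(s, o) for s, p, o in edges if p in keep_types}

knows_graph = project(edges, {"knows"})
print(knows_graph)   # {('alice', 'bob'), ('bob', 'carol')}
```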

Distributed Graph Databases and the Emerging Web of Data

Filed under: Data Structures,Graphs,Neo4j — Patrick Durusau @ 4:52 pm

Distributed Graph Databases and the Emerging Web of Data

Marko A. Rodriguez on distributed graph databases.

I follow his presentation up to the point where he says: Directed multi-relational graph: heterogeneous set of links. (page 6 of 79) and then The multi-relational graph is a very natural representation of the world. (page 22 of 79).

I fully agree that a multi-relational graph is a good start, but what about heterogeneous ways to identify the subjects represented by nodes and links (edges)?

I suppose that goes hand in hand with using URIs as the single identifiers for the subjects represented by nodes and edges.

Presuming one identifier is one way to resolve heterogeneous identification but not a very satisfactory one, at least to me.

January 10, 2011

Engineering basic algorithms of an in-memory text search engine

Filed under: Data Structures,Indexing,Search Engines — Patrick Durusau @ 4:37 pm

Engineering basic algorithms of an in-memory text search engine by Frederik Transier and Peter Sanders.

Keywords: inverted index, in-memory search engine, randomization

Abstract:

Inverted index data structures are the key to fast text search engines. We first investigate one of the predominant operations on inverted indexes, which asks for intersecting two sorted lists of document IDs of different lengths. We explore compression and performance of different inverted list data structures. In particular, we present Lookup, a new data structure that allows intersection in expected time linear in the smaller list.

Based on this result, we present the algorithmic core of a full text data base that allows fast Boolean queries, phrase queries, and document reporting using less space than the input text. The system uses a carefully choreographed combination of classical data compression techniques and inverted-index-based search data structures. Our experiments show that inverted indexes are preferable over purely suffix-array-based techniques for in-memory (English) text search engines.

A similar system is now running in practice in each core of the distributed data base engine TREX of SAP.

An interesting comparison of inverted indexes with suffix-arrays.
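The intersection primitive is worth seeing in miniature. Here is a sketch of intersecting two sorted posting lists in time roughly linear in the smaller one, using doubling (“galloping”) probes into the larger list; this is my illustration of the general technique, not the paper’s Lookup structure:

```python
# A sketch of sorted-list intersection that is cheap when one list is
# much shorter: for each ID in the small list, gallop forward through
# the large list, then binary-search the bracketed window.
from bisect import bisect_left

def intersect(small, large):
    result, lo = [], 0
    for doc_id in small:
        # Gallop: double the probe distance until we overshoot doc_id...
        step = 1
        while lo + step < len(large) and large[lo + step] < doc_id:
            step *= 2
        # ...then binary-search only the bracketed window.
        lo = bisect_left(large, doc_id, lo, min(lo + step + 1, len(large)))
        if lo < len(large) and large[lo] == doc_id:
            result.append(doc_id)
    return result

print(intersect([3, 7, 42], list(range(0, 100, 3))))  # [3, 42]
```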

I am troubled by the “reconstruct the input” aspects of the paper.

While it is understandable, and in some cases more efficient, for data to be held in a localized data structure, my question is: what do we do when data exceeds local storage capacity?

Think about the data held by Lexis/Nexis, for example. Where would we put it while creating a custom data structure for its access?

There are data sets, important data sets, that have to be accessed in place.

And those data sets need to be addressed using topic maps.

*****
You may recall from the TAO paper by Steve Pepper the illustration of topics, associations and occurrences floating above a data set.

While topic map formats have been useful in many ways, they have distracted from the vision of topic maps as an information overlay as opposed to yet-another-format.

Formats are just that, formats. Pick one.

