Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

January 10, 2011

Efficient set intersection for inverted indexing

Filed under: Data Structures,Information Retrieval,Sets — Patrick Durusau @ 4:08 pm

Efficient set intersection for inverted indexing Authors: J. Shane Culpepper, Alistair Moffat Keywords: Compact data structures, information retrieval, set intersection, set representation, bitvector, byte-code

Abstract:

Conjunctive Boolean queries are a key component of modern information retrieval systems, especially when Web-scale repositories are being searched. A conjunctive query q is equivalent to a |q|-way intersection over ordered sets of integers, where each set represents the documents containing one of the terms, and each integer in each set is an ordinal document identifier. As is the case with many computing applications, there is tension between the way in which the data is represented, and the ways in which it is to be manipulated. In particular, the sets representing index data for typical document collections are highly compressible, but are processed using random access techniques, meaning that methods for carrying out set intersections must be alert to issues to do with access patterns and data representation. Our purpose in this article is to explore these trade-offs, by investigating intersection techniques that make use of both uncompressed “integer” representations, as well as compressed arrangements. We also propose a simple hybrid method that provides both compact storage, and also faster intersection computations for conjunctive querying than is possible even with uncompressed representations.
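The |q|-way intersection the abstract describes can be sketched with plain uncompressed integer lists. This is a minimal merge-based version, not the authors' hybrid method; the function names and sample postings are mine.

```python
def intersect_two(a, b):
    """Merge-style intersection of two sorted lists of document ids."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def conjunctive_query(postings):
    """|q|-way intersection: start with the shortest list so each
    pass shrinks the candidate set as quickly as possible."""
    sets = sorted(postings, key=len)
    result = sets[0]
    for s in sets[1:]:
        result = intersect_two(result, s)
        if not result:       # empty intersection: stop early
            break
    return result

# Postings for three hypothetical query terms:
print(conjunctive_query([[1, 3, 5, 9], [2, 3, 5, 8, 9], [3, 4, 5, 9, 10]]))
# → [3, 5, 9]
```

The trade-offs the paper studies start exactly here: once the lists are compressed, this kind of random-access merging interacts badly with the decoding pattern, which is what motivates their hybrid representation.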

The treatment of set intersection caught my attention.

Unlike document sets, topic maps have restricted sets of properties or property values that will form the basis for set intersection (merging in topic maps lingo).

Topic maps also differ in that identity-bearing properties are never ignored, whereas an inverted index can include terms that are ignored in a particular query.

What impact those characteristics will have on set intersection for topic maps remains a research question.

Memoirs of a Graph Addict: Despair to Redemption

Filed under: Data Structures,Graphs,Neo4j — Patrick Durusau @ 11:24 am

Memoirs of a Graph Addict: Despair to Redemption Author: Marko A. Rodriguez

Alex Popescu says this slide deck covers:

  • graph structures
  • graph databases
  • graph applications
  • TinkerPop product suite

Which is true but omits that Marko also covers:

  • 2014 – 75% decrease in world population – rise of dynamically distributed democracy
  • 2018 – Eudaemonic Engine: Seeking Virtue Through Circuitry
  • 2023 – Universal Computer: A Single Computational Substrate
  • 2030 – “Man learns to encode themselves into the URI namespace…”

The graph parts are useful; your mileage may vary on the rest.

December 12, 2010

Krati – A persistent high-performance data store

Filed under: Data Structures,NoSQL — Patrick Durusau @ 5:53 pm

Krati – A persistent high-performance data store

From the website:

Krati is a simple persistent data store with very low latency and high throughput. It is designed for easy integration with read-write-intensive applications with little effort in tuning configuration, performance and JVM garbage collection….

Simply put, Krati

  • supports varying-length data arrays
  • supports key-value data store access
  • performs append-only writes in batches
  • has write-ahead redo logs and periodic checkpointing
  • has automatic data compaction (i.e. garbage collection)
  • is memory-resident (or OS page cache resident) yet persistent
  • allows single-writer and multiple readers

Or you can think of Krati as

  • Berkeley DB JE backed by hash-based indexing rather than B-tree
  • A hashtable with disk persistency at the granularity of update batch
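The append-only batch writes and redo log in the list above can be illustrated with a toy in-memory store. This is a sketch of the idea only, not Krati's actual (Java) API; the class and method names are mine.

```python
class BatchedStore:
    """Toy key-value store illustrating append-only batch writes with
    a redo log and a hash-based index, loosely in the spirit of Krati."""

    def __init__(self, batch_size=3):
        self.batch_size = batch_size
        self.pending = []    # in-memory write batch
        self.redo_log = []   # append-only log, one entry per flushed batch
        self.index = {}      # hash-based index: key -> latest value

    def put(self, key, value):
        self.pending.append((key, value))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.redo_log.append(list(self.pending))  # one sequential append
            for k, v in self.pending:
                self.index[k] = v                     # last write wins
            self.pending.clear()

    def get(self, key):
        # The single writer sees its own unflushed writes.
        for k, v in reversed(self.pending):
            if k == key:
                return v
        return self.index.get(key)

store = BatchedStore(batch_size=2)
store.put("a", 1)
store.put("b", 2)      # batch full: triggers a flush
store.put("a", 3)      # sits in the pending batch
print(store.get("a"))  # → 3
```

The point of batching is that the durable work (the redo-log append) happens once per batch as a sequential write, rather than once per key.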

If you use Krati as part of a topic map application please share your experience.

December 9, 2010

Developing High Quality Data Models – Book

Filed under: Data Models,Data Structures,Ontology — Patrick Durusau @ 11:39 am

Developing High Quality Data Models by Dr. Matthew West is due out in January of 2011. (Pre-order: Elsevier, Amazon)

From the website:

Anyone charged with developing a data model knows that there is a wide variety of potential problems likely to arise before achieving a high quality data model. With dozens of attributes and millions of rows, data modelers are always in danger of inconsistency and inaccuracy. The development of the data model itself could result in difficulties presenting accurate data. The need to improve data models begins with getting it right in the first place.

Developing High Quality Data Models uses real-world examples to show you how to identify a number of data modeling principles and analysis techniques that will enable you to develop data models that consistently meet business requirements. A variety of generic data model patterns that exemplify the principles and techniques discussed build upon one another to give a powerful and integrated generic data model with wide applicability across many disciplines. The principles and techniques outlined in this book are applicable in government and industry, including but not limited to energy exploration, healthcare, telecommunications, transportation, military defense, and so on.

Table of Contents:

Preface
Chapter 1- Introduction
Chapter 2- Entity Relationship Model Basics
Chapter 3- Some types and uses of data models
Chapter 4- Data models and enterprise architecture
Chapter 5- Some observations on data models and data modeling
Chapter 6- Some General Principles for Conceptual, Integration and Enterprise Data Models
Chapter 7- Applying the principles for attributes
Chapter 8- General principles for relationships
Chapter 9- General principles for entity types
Chapter 10- Motivation and overview for an ontological framework
Chapter 12- Classes
Chapter 13- Intentionally constructed objects
Chapter 14- Systems and system components
Chapter 15- Requirements specifications
Chapter 16- Concluding Remarks
Chapter 17- The HQDM Framework Schema

I first became familiar with the work of Dr. West from Ontolog. You can visit his publications page to see why I am looking forward to this book.

Citation of and comments on this work will follow as soon as access and time allow.

December 6, 2010

GT.M High end TP database engine

Filed under: Data Structures,GT.M,node-js,NoSQL,Software — Patrick Durusau @ 4:55 am

GT.M High end TP database engine (Sourceforge)

Description from the commercial version:

The GT.M data model is a hierarchical associative memory (i.e., multi-dimensional array) that imposes no restrictions on the data types of the indexes and the content – the application logic can impose any schema, dictionary or data organization suited to its problem domain. GT.M’s compiler for the standard M (also known as MUMPS) scripting language implements full support for ACID (Atomic, Consistent, Isolated, Durable) transactions, using optimistic concurrency control and software transactional memory (STM) that resolves the common mismatch between databases and programming languages. Its unique ability to create and deploy logical multi-site configurations of applications provides unrivaled continuity of business in the face of not just unplanned events, but also planned events, including planned events that include changes to application logic and schema.
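The "hierarchical associative memory" in the description above behaves like a sparse multi-dimensional array whose subscripts may freely mix strings and numbers. A toy emulation (illustrative only; real GT.M globals are persistent, ordered, and far richer than this, and the names here are mine):

```python
class Global:
    """Toy sketch of a GT.M-style global: a sparse multi-dimensional
    array indexed by arbitrary-typed subscript tuples."""

    def __init__(self):
        self.nodes = {}

    def set(self, *args):
        # Last argument is the value; the rest are subscripts.
        *subscripts, value = args
        self.nodes[tuple(subscripts)] = value

    def get(self, *subscripts):
        return self.nodes.get(tuple(subscripts))

patients = Global()
patients.set("smith", "dob", "1970-01-01")
patients.set("smith", "visit", 1, "2010-11-30")   # numeric subscript
print(patients.get("smith", "visit", 1))          # → 2010-11-30
```

No schema is imposed by the store itself; as the description says, any dictionary or data organization is left to the application logic.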

There are clients for node-js:

http://github.com/robtweed/node-mdbm
http://github.com/robtweed/node-mwire

Local topic map software is interesting and useful, but for scaling to the enterprise level something different is going to be required.

Reports of implementing the TMDM or other topic map legends with a GT.M based system are welcome.

December 3, 2010

Constructions From Dots And Lines

Filed under: Data Structures,Graphs — Patrick Durusau @ 9:47 am

Constructions From Dots And Lines. Authors: Marko A. Rodriguez, Peter Neubauer

Abstract:

A graph is a data structure composed of dots (i.e. vertices) and lines (i.e. edges). The dots and lines of a graph can be organized into intricate arrangements. The ability for a graph to denote objects and their relationships to one another allow for a surprisingly large number of things to be modeled as a graph. From the dependencies that link software packages to the wood beams that provide the framing to a house, most anything has a corresponding graph representation. However, just because it is possible to represent something as a graph does not necessarily mean that its graph representation will be useful. If a modeler can leverage the plethora of tools and algorithms that store and process graphs, then such a mapping is worthwhile. This article explores the world of graphs in computing and exposes situations in which graphical models are beneficial.
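The "dots and lines" of the abstract reduce, in code, to an adjacency structure: vertices as keys, labeled edges as (label, target) pairs. A minimal sketch (the vertex and edge names are mine, not the article's):

```python
# Software-dependency and house-framing examples from the abstract,
# modeled as one labeled adjacency list.
graph = {
    "house":   [("framed_by", "beam_1"), ("framed_by", "beam_2")],
    "pkg_app": [("depends_on", "pkg_lib")],
    "pkg_lib": [],
    "beam_1":  [],
    "beam_2":  [],
}

def neighbors(g, v, label=None):
    """Vertices reachable from v in one hop, optionally filtered by edge label."""
    return [t for (lbl, t) in g.get(v, []) if label is None or lbl == label]

print(neighbors(graph, "house", "framed_by"))  # → ['beam_1', 'beam_2']
```

The article's point survives the toy: representing something this way is only worthwhile if it lets you bring graph tools and algorithms to bear on it.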

A relatively short (11 pages) and entertaining (it takes all kinds you know) treatment of graphs and their properties.

The depiction of the types of graphs, and of the possibilities for combining them, in figure 2 alone is worth downloading the article for, but please read it in full.

Questions:

  1. Evaluate the graph types for suitability representing topic maps.
  2. Which graph type looks like the best fit for topic maps? (3-5 pages, no citations)
  3. Which graph type looks like the worst fit for topic maps? (3-5 pages, no citations)

November 20, 2010

Associations: The Kind They Pay For

Filed under: Associations,Authoring Topic Maps,Data Mining,Data Structures — Patrick Durusau @ 4:56 pm

Fun at a Department Store: Data Mining Meets Switching Theory Author(s): Anna Bernasconi, Valentina Ciriani, Fabrizio Luccio, Linda Pagli Keywords: SOP, Implicants, Data Mining, Frequent Itemsets, Blulife

Abstract:

In this paper we introduce new algebraic forms, SOP+ and DSOP+, to represent functions f: {0,1}^n → ℕ, based on arithmetic sums of products. These expressions are a direct generalization of the classical SOP and DSOP forms.

We propose optimal and heuristic algorithms for minimal SOP+ and DSOP+ synthesis. We then show how the DSOP+ form can be exploited for Data Mining applications. In particular we propose a new compact representation for the database of transactions to be used by the LCM algorithms for mining frequent closed itemsets.
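The key move in the abstract, replacing Boolean OR with arithmetic sum so that f maps {0,1}^n to the naturals instead of to {0,1}, can be sketched directly. The term encoding here is mine, not the paper's notation:

```python
# An arithmetic sum-of-products (SOP+-style) form: f(x) is the
# *number* of product terms that x satisfies.
# Each product term maps a variable index to its required value;
# unmentioned variables are don't-cares.
terms = [
    {0: 1, 1: 1},    # x0 AND x1
    {1: 1},          # x1
    {0: 0, 2: 1},    # NOT x0 AND x2
]

def evaluate(terms, x):
    """Arithmetic sum over products: count the satisfied terms."""
    return sum(all(x[i] == v for i, v in t.items()) for t in terms)

print(evaluate(terms, (1, 1, 0)))  # terms 1 and 2 satisfied → 2
print(evaluate(terms, (0, 1, 1)))  # terms 2 and 3 satisfied → 2
```

Taking `evaluate(...) > 0` instead of the raw count collapses this back to a classical SOP, which is the sense in which SOP+ generalizes it.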

A new technique for extracting associations between items present (or absent) in transactions (sales transactions).

Of interest to people with the funds to pay for data mining and topic maps.

Topic maps are useful to bind the mining of such associations to other information systems, such as supply chains.

Questions:

  1. How would you use data mining of transaction associations to guide collection development? (3-5 pages, with citations)
  2. How would you use topic maps with the mining of transaction associations? (3-5 pages, no citations)
  3. How would you bind an absence of data to other information? (3-5 pages, no citations)

Observation: Intelligence agencies recognize the absence of data as an association. Binding that absence to other data is a job for topic maps.

October 31, 2010

OpenII

Filed under: Data Structures,Heterogeneous Data,Information Retrieval,Software — Patrick Durusau @ 7:20 pm

OpenII

From the website:

OpenII is a collaborative effort spearheaded by The MITRE Corporation and Google to create a suite of open-source tools for information integration. The project is leveraging the latest developments in research on information integration to create a platform on which integration applications can be built and further research can be conducted.

The motivation for OpenII is that although a significant amount of research has been conducted on information integration, and several commercial systems have been deployed, many information integration applications are still hard to build. In research, we often innovate on a specific aspect of information integration, but then spend much of our time building (and rebuilding) other components that we need in order to validate our contributions. As a result, the research prototypes that have been built are generally not reusable and do not inter-operate with each other. On the applications side, information integration comes in many flavors, and therefore it is hard for commercial products to serve all the needs. Our goal is to create tools that can be applied in a variety of architectural contexts and can easily be tailored to the needs of particular domains.

OpenII tools include, among others, wrappers for common data sources, tools for creating matches and mappings between disparate schemas, a tool for searching collections of schemas and extending schemas, and run-time tools for processing queries over heterogeneous data sources.

The M3 metamodel:

The fundamental building block in M3 is the entity. An entity represents information about a set of related real-world objects. Associated with each entity is a set of attributes that indicate what information is captured about each entity. For simplicity, we assume that at most one value can be associated with each attribute of an entity.
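A minimal reading of the M3 building block described above: an entity with single-valued attributes. The class and names are mine, not OpenII's:

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    """Toy M3-style entity: a named thing with single-valued attributes."""
    name: str
    attributes: dict = field(default_factory=dict)  # attribute -> one value

    def set(self, attr, value):
        # At most one value per attribute, per the M3 simplification:
        # a later set overwrites, it does not accumulate.
        self.attributes[attr] = value

person = Entity("Person")
person.set("surname", "Durusau")
person.set("surname", "West")        # overwrites: single-valued
print(person.attributes["surname"])  # → West
```

The single-value-per-attribute simplification is exactly where subject identity gets squeezed out, which is the basis of the comment below.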

The project could benefit from a strong injection of subject identity based thinking and design.

October 6, 2010

Fast and Compact Web Graph Representations

Filed under: Data Structures,Graphs,Navigation,Searching,Software — Patrick Durusau @ 7:10 am

Fast and Compact Web Graph Representations Authors: Francisco Claude, Gonzalo Navarro Keywords: Compression, Web Graph, Data Structures

Abstract:

Compressed graph representations, in particular for Web graphs, have become an attractive research topic because of their applications in the manipulation of huge graphs in main memory. The state of the art is well represented by the WebGraph project, where advantage is taken of several particular properties of Web graphs to offer a trade-off between space and access time. In this paper we show that the same properties can be exploited with a different and elegant technique that builds on grammar-based compression. In particular, we focus on Re-Pair and on Ziv-Lempel compression, which, although cannot reach the best compression ratios of WebGraph, achieve much faster navigation of the graph when both are tuned to use the same space. Moreover, the technique adapts well to run on secondary memory and in distributed scenarios. As a byproduct, we introduce an approximate Re-Pair version that works efficiently with severely limited main memory.
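Re-Pair, the grammar-based compressor the abstract builds on, repeatedly replaces the most frequent adjacent pair of symbols with a fresh nonterminal, recording a grammar rule for each replacement. A naive sketch of the idea (production Re-Pair, and the approximate version the paper introduces, are far more careful about memory and speed):

```python
from collections import Counter

def repair(seq, min_freq=2):
    """Naive Re-Pair: replace the most frequent adjacent pair with a
    new symbol until no pair occurs at least min_freq times."""
    seq = list(seq)
    rules = {}
    next_sym = 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < min_freq:
            break
        sym = f"R{next_sym}"        # fresh nonterminal
        next_sym += 1
        rules[sym] = pair           # grammar rule: sym -> pair
        out, i = [], 0
        while i < len(seq):         # left-to-right, non-overlapping
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

compressed, rules = repair("abababab")
print(compressed)  # → ['R1', 'R1']
print(rules)       # → {'R0': ('a', 'b'), 'R1': ('R0', 'R0')}
```

For Web graphs the "sequence" is the concatenated adjacency lists, whose repeated runs of similar neighbors are what make the grammar rules pay off.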

Software & Examples: Fast and Compact Web Graph Representations

As topic maps grow larger and/or memory space becomes smaller (comparatively speaking), compressed graph work becomes increasingly relevant.

Gains in navigation speed are always welcome.
