Another Word For It – Patrick Durusau on Topic Maps and Semantic Diversity

June 22, 2011

Best Practices with Hadoop – Real World Data

Filed under: Hadoop — Patrick Durusau @ 6:38 pm

Best Practices with Hadoop – Real World Data

Thursday, June 23, 2011 11:00 AM – 12:00 PM PDT

From the webpage:

Ventana Research recently completed data collection for the first large-scale, effective research on Hadoop and Information Management. Be among the first to get a glimpse into the results by attending this webinar, sponsored by Cloudera and Karmasphere.

David Menninger of Ventana Research will share some of the preliminary findings from the survey. Register now to learn how to improve your information management efforts with specific knowledge of the best practices for managing large-scale data with Hadoop.

About the presenter:

David Menninger is responsible for the research direction of information technologies at Ventana Research covering major areas including Analytics, Business Intelligence and Information Management. David brings to Ventana Research over two decades of experience, through which he has marketed and brought to market some of the leading edge technologies for helping organizations analyze data to support a range of action-taking and decision-making processes.

I suspect this is mostly a promo for the “details,” but it could still be worth attending.

DocumentLens

Filed under: Bioinformatics,Biomedical,DocumentLens,Navigation — Patrick Durusau @ 6:37 pm

DocumentLens – A Revolution In How Researchers Access Information & Colleagues

From the post:

Keeping up with the flood of scientific information has been challenging…Spotting patterns and extracting useful information has been even harder. DocumentLens™ has just made it easier to gain insightful knowledge from information and to share ideas with collaborators.

Praxeon, Inc., the award-winning Boston-based leader in delivering knowledge solutions for the Healthcare and Life Science communities, today announced the launch of DocumentLens™. Their cloud-based web application helps scientific researchers deal with the ever increasing deluge of online and electronic data and information from peer-reviewed journals, regulatory sites, patents and proprietary sources. DocumentLens provides an easy-to-utilize environment to enrich discovery, enhance idea generation, shorten the investigation time, improve productivity and engage collaboration.

“One of the most challenging problems researchers face is collecting, integrating and understanding new information. Keeping up with peer-reviewed journals, regulatory sites, patents and proprietary sources, even in a single area of research, is time consuming. But failure to keep up with information from many different sources results in knowledge gaps and lost opportunities,” stated Dr. Dennis Underwood, Praxeon CEO.

“DocumentLens is a web-based tool that enables you to ask the research question you want to ask – just as you would ask a colleague,” Underwood went on to say. “You can also dive deeper into research articles, explore the content and ideas using DocumentLens and integrate them with sources that you trust and rely on. DocumentLens takes you not only to the relevant documents, but to the most relevant sections saving an immense amount of time and effort. Our DocumentLens Navigators open up your content, using images and figures, chemistry and important topics. Storylines provide a place to accumulate and share insights with colleagues.”

Praxeon has created www.documentlens.com, a website devoted to the new application that contains background on the use of the software, the Eye of the Lens blog (http://www.documentlens.com/blog), and a live version of DocumentLens™ for visitors to try out free-of-charge to see for themselves firsthand the value of the application.

OK, so I do one of the sandbox pre-composed queries: “What is the incidence and prevalence of dementia?”

and DocumentLens reports back that page 15 of a document has relevant information (note: a particular page, not the entire document). The highlighted material includes:

conducting a collaborative, multicentre trial in FTLD. Such a collaborative effort will certainly be necessary to recruit the cohort of over 200 FTLD patients per trial that may be needed to demonstrate treatment effects in FTLD.[194]

3. Ratnavalli E, Brayne C, Dawson K, et al. The prevalence of frontotemporal dementia. Neurology 2002;58:1615–21. [PubMed: 12058088]

4. Mercy L, Hodges JR, Dawson K, et al. Incidence of early-onset dementias in Cambridgeshire,

8. Gislason TB, Sjogren M, Larsson L, et al. The prevalence of frontal variant frontotemporal dementia and the frontal lobe syndrome in a population based sample of 85 year olds. J Neurol Neurosurg

The first text block has no obvious (or other) relevance to the question of incidence or prevalence of dementia.

The incomplete marking of citations 4 and 8 occurs for no apparent reason.

Like any indexing resource, its value depends on the skill of the indexers.

There are the usual issues: How do I reliably share information with other DocumentLens users, or even with non-DocumentLens users? Can I and other users create interoperable files in parallel? Do we need, or are we required, to have a common vocabulary? How do we integrate materials that use other vocabularies?

(Do send a note to the topic map naysayers. Product first, then start selling it to customers.)

Biodiversity Indexing: Migration from MySQL to Hadoop

Filed under: Hadoop,Hibernate,MySQL — Patrick Durusau @ 6:36 pm

Biodiversity Indexing: Migration from MySQL to Hadoop

From the post:

The Global Biodiversity Information Facility is an international organization, whose mission is to promote and enable free and open access to biodiversity data worldwide. Part of this includes operating a search, discovery and access system, known as the Data Portal; a sophisticated index to the content shared through GBIF. This content includes both complex taxonomies and occurrence data such as the recording of specimen collection events or species observations. While the taxonomic content requires careful data modeling and has its own challenges, it is the growing volume of occurrence data that attracts us to the Hadoop stack.

The Data Portal was launched in 2007. It consists of crawling components and a web application, implemented in a typical Java solution consisting of Spring, Hibernate and SpringMVC, operating against a MySQL database. In the early days the MySQL database had a very normalized structure, but as content and throughput grew, we adopted the typical pattern of denormalisation and scaling up with more powerful hardware. By the time we reached 100 million records, the occurrence content was modeled as a single fixed-width table. Allowing for complex searches containing combinations of species identifications, higher-level groupings, locality, bounding box and temporal filters required carefully selected indexes on the table. As content grew it became clear that real time indexing was no longer an option, and the Portal became a snapshot index, refreshed on a monthly basis, using complex batch procedures against the MySQL database. During this growth pattern we found we were moving more and more operations off the database to avoid locking, and instead partitioned data into delimited files, iterating over those and even performing joins using text files by synthesizing keys, sorting and managing multiple file cursors. Clearly we needed a better solution, so we began researching Hadoop. Today we are preparing to put our first Hadoop process into production.
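
For readers who have not seen this pattern before, the pre-Hadoop workaround the post describes (synthesize keys, sort the delimited files, then walk them with parallel cursors) amounts to a sort-merge join. A toy Python sketch, with hypothetical file names and columns:

    import csv

    def merge_join(occurrences_path, taxa_path):
        # Toy sort-merge join over two tab-delimited files, both pre-sorted by
        # their first column (a synthesized taxon key). Nothing is loaded into
        # memory; each file is walked with its own cursor.
        with open(occurrences_path, newline="") as occ_f, open(taxa_path, newline="") as tax_f:
            occ = csv.reader(occ_f, delimiter="\t")
            tax = csv.reader(tax_f, delimiter="\t")
            occ_row, tax_row = next(occ, None), next(tax, None)
            while occ_row and tax_row:
                if occ_row[0] == tax_row[0]:
                    yield occ_row + tax_row[1:]    # joined record
                    occ_row = next(occ, None)      # many occurrences per taxon
                elif occ_row[0] < tax_row[0]:
                    occ_row = next(occ, None)
                else:
                    tax_row = next(tax, None)

    # Hypothetical usage:
    # for row in merge_join("occurrence_sorted.tsv", "taxon_sorted.tsv"):
    #     print(row)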

Awesome project!

Where would you suggest the use of topic maps and subject identity to improve the project?

June 21, 2011

Investigating thousands (or millions) of documents by visualizing clusters

Filed under: Clustering,Visualization — Patrick Durusau @ 7:11 pm

Investigating thousands (or millions) of documents by visualizing clusters

From the post:

This is a recording of my talk at the NICAR (National Institute of Computer-Assisted Reporting) conference last week, where I discuss some of our recent work at the AP with the Iraq and Afghanistan war logs.

Very good presentation which includes:

  • “A full-text visualization of the Iraq war logs”, a detailed writeup of the technique used to generate the first set of maps presented in the talk.
  • The Glimmer high-performance, parallel multi-dimensional scaling algorithm, which is the software I presented in the live demo portion. It will be the basis of our clustering work going forward. (We are also working on other large-scale visualizations which may be more appropriate for e.g. email dumps.)
  • “Quantitative Discovery from Qualitative Information: A General-Purpose Document Clustering Methodology.” Justin Grimmer, Gary King, 2009. A paper that everyone working in document clustering needs to read. It clearly makes the point that there is no “best” clustering, just different algorithms that correspond to different pre-conceived frames on the story — and gives a method to compare clusterings (though I don’t think it will scale well to millions of docs.)
  • Wikipedia pages for bag of words model, tf-idf, and cosine similarity, the basic text processing techniques we’re using.
  • Gephi, a free graph visualization system, which we used for the one-month Iraq map. It will work up to a few tens of thousands of nodes.
  • Knight News Challenge application for “Overview,” the open-source system we’d like to build for doing this and other kinds of visual explorations of large document sets. If you like our work, why not leave a comment on our proposal?
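
The basic text processing steps in that list (bag of words, tf-idf, cosine similarity) are small enough to sketch in a few lines of Python. This is a toy version with made-up snippets, not the code behind the war logs visualizations:

    import math
    from collections import Counter

    def tf_idf(docs):
        # docs: list of token lists; returns one sparse tf-idf vector (dict) per doc
        n = len(docs)
        df = Counter(term for doc in docs for term in set(doc))
        idf = {term: math.log(n / df[term]) for term in df}
        return [{t: count * idf[t] for t, count in Counter(doc).items()} for doc in docs]

    def cosine(a, b):
        # cosine similarity of two sparse vectors stored as dicts
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
        return dot / norm if norm else 0.0

    # Made-up snippets standing in for documents:
    docs = [text.lower().split() for text in (
        "patrol reports small arms fire",
        "patrol reports ied found and cleared",
        "meeting with local council members",
    )]
    vectors = tf_idf(docs)
    print(cosine(vectors[0], vectors[1]))   # related reports score above zero
    print(cosine(vectors[0], vectors[2]))   # unrelated documents score 0.0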

The war logs are not “yesterday’s news.” People of all nationalities are still dying. And those responsible go unpunished.

Public newspapers and military publications will list prior postings of military personnel. Indexing those against the locations in the war logs using a topic map could produce a first round of people to ask about events and decisions reported in the war logs. (Of course, our guardians have cleansed all the versions I know of and pronounced them safe for our reading. Unaltered versions would work better. I can only imagine what would have happened to Shakespeare as a war document.)

Andrew Ng – Machine Learning Materials

Filed under: Machine Learning — Patrick Durusau @ 7:10 pm

Andrew Ng – Machine Learning Materials

Course materials for the Machine Learning lectures by Andrew Ng on YouTube.

Thanks to @philipmlong for the pointer!

Heavy Committing: DocValues aka. Column Stride Fields in Lucene 4.0

Filed under: Indexing,Lucene,Search Engines — Patrick Durusau @ 7:10 pm

Heavy Committing: DocValues aka. Column Stride Fields in Lucene 4.0 by Simon Willnauer, Apache Lucene PMC.

Abstract:

Lucene 4.0 is on its way to deliver a tremendous amount of new features and improvements. Beside Real-Time Search & Flexible Indexing, DocValues aka Column Stride Fields is one of the “next generation” features. DocValues enable Lucene to efficiently store and retrieve type-safe Document & Value pairs in a column stride fashion, either entirely memory resident with random access or disk resident and iterator based, without the need to un-invert fields. Its final goal is to provide an independently updateable per-document storage for scoring, sorting or even filtering. This talk will introduce the current state of development, implementation details, its features and how DocValues have been integrated into Lucene’s Codec API for full extendability.

Excellent video!
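
The “column stride” idea is easier to see in a toy contrast than in prose. This is not Lucene code, just a Python illustration of the storage difference the abstract describes: an inverted index has to be “un-inverted” to answer per-document questions, while a DocValues-style field keeps one value per document up front.

    # Conceptual contrast only -- not Lucene code.
    # An inverted index maps term -> matching doc ids, so answering
    # "what is the value of field F for document d" (needed for sorting,
    # faceting or scoring) means un-inverting the field first:
    inverted = {"price:10": [0, 2], "price:25": [1]}    # term -> postings

    def uninvert(index, field, num_docs):
        values = [None] * num_docs
        for term, postings in index.items():
            name, value = term.split(":")
            if name == field:
                for doc_id in postings:
                    values[doc_id] = int(value)
        return values

    # A column-stride field stores one value per document up front,
    # so the same lookup is a single array access:
    doc_values = {"price": [10, 25, 10]}                # doc id -> value

    print(uninvert(inverted, "price", 3))   # [10, 25, 10], rebuilt at a cost
    print(doc_values["price"][1])           # 25, read directly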

Category Theory (eBooks)

Filed under: Category Theory — Patrick Durusau @ 7:09 pm

Category Theory (eBooks)

A nice collection of pointers, mostly to eBooks on category theory.

Brisk 1.0 Beta 2 Released

Filed under: Brisk,NoSQL — Patrick Durusau @ 7:08 pm

Brisk 1.0 Beta 2 Released

New Features:

BRISK-12

Apache Pig Integration. See the DataStax Documentation for more information about using Pig in Brisk.

BRISK-89

Job Tracker Failover. See the DataStax Documentation for more information about using the new brisktool movejt command.

BRISK-207

New Snappy Compression Codec built on Google Snappy is now used internally for automatic CassandraFS block compression.

BRISK-180

Automap Cassandra Column Families to Hive Tables in the Brisk Hive Metastore.

BRISK-152

Add a second HDFS layer in CassandraFS for long-term data storage. This is needed because the blocks column family in CFS requires frequent compactions – Hadoop uses it during MapReduce processing to store small files and temporary data. Compaction cleans this temporary data up after it is not needed anymore. Now there is the cfs:/// and cfs-archive:/// endpoints within CFS. The blocks column family in cfs-archive:/// has compaction disabled to improve performance for static data stored in CFS.

June 20, 2011

Massively Parallel Database Startup to Reap Multicore Dividends

Filed under: GPU,NoSQL — Patrick Durusau @ 3:36 pm

Massively Parallel Database Startup to Reap Multicore Dividends

From the post:

The age of multicore couldn’t have come at a better time. Plagued with mounting datasets and a need for suitable architectures to contend with them, organizations are feeling the resource burden in every database corner they look.

German startup ParStream claims it’s found an answer to those problems. The company has come up with a solution to harness the power of GPU computing at the dawn of the manycore plus big data day–and it just might be onto something.

The unique element in ParStream’s offering is that it is able to exploit the coming era of multicore architectures, meaning that it will be able to deliver results faster with lower resource usage. This is in addition to its claim that it can eliminate the need for data decompression entirely, which if it is proven to be the case when their system is available later this summer, could change the way we think about system utilization when performing analytics on large data sets.

ParStream is appearing as an exhibitor at: ISC’11 Supercomputing Conference 2011, June 20 – 22, 2011. I can’t make the conference but am interested in your reactions to the promised demos.

Erlang and First-Person Shooters

Filed under: Erlang — Patrick Durusau @ 3:35 pm

Erlang and First-Person Shooters

Use of Erlang by DemonWare to run an online gaming site with millions of concurrent users.

Mostly an overview.

Makes me wonder:

  1. How are the information needs in a first-person shooter game different, if at all, from the information needs of a topic map user?
  2. Can a war game simulation be used to demonstrate the utility of topic maps in a military situation?
  3. Has semantic impedance been modeled in war game simulations? (Or gathered based on actual experience?)

Hibernate OGM: birth announcement

Filed under: Hibernate,JBoss — Patrick Durusau @ 3:34 pm

Hibernate OGM: birth announcement

From the post:

This is a pretty exciting moment, the first public alpha release of a brand new project: Hibernate OGM. Hibernate OGM stands for Object Grid Mapping and its goal is to offer a full-fledged JPA engine storing data into NoSQL stores. This is a rather long blog entry so I’ve split it into distinct sections from goals to technical detail to future.

Note that I say it’s the first public alpha because the JBoss World Keynote 2011 was powered by Hibernate OGM Alpha 1. Yes it was 100% live and nothing was faked, No it did not crash 🙂 Sanne explained in more detail how we used Hibernate OGM in the demo. This blog entry is about Hibernate OGM itself and how it works.

Congratulations!

Designing and Refining Schema Mappings via Data Examples

Filed under: Database,Mapping,Schema — Patrick Durusau @ 3:34 pm

Designing and Refining Schema Mappings via Data Examples by Bogdan Alexe, Balder ten Cate, Phokion G. Kolaitis, and Wang-Chiew Tan, from SIGMOD ’11.

Abstract:

A schema mapping is a specification of the relationship between a source schema and a target schema. Schema mappings are fundamental building blocks in data integration and data exchange and, as such, obtaining the right schema mapping constitutes a major step towards the integration or exchange of data. Up to now, schema mappings have typically been specified manually or have been derived using mapping-design systems that automatically generate a schema mapping from a visual specification of the relationship between two schemas. We present a novel paradigm and develop a system for the interactive design of schema mappings via data examples. Each data example represents a partial specification of the semantics of the desired schema mapping. At the core of our system lies a sound and complete algorithm that, given a finite set of data examples, decides whether or not there exists a GLAV schema mapping (i.e., a schema mapping specified by Global-and-Local-As-View constraints) that “fits” these data examples. If such a fitting GLAV schema mapping exists, then our system constructs the “most general” one. We give a rigorous computational complexity analysis of the underlying decision problem concerning the existence of a fitting GLAV schema mapping, given a set of data examples. Specifically, we prove that this problem is complete for the second level of the polynomial hierarchy, hence, in a precise sense, harder than NP-complete. This worst-case complexity analysis notwithstanding, we conduct an experimental evaluation of our prototype implementation that demonstrates the feasibility of interactively designing schema mappings using data examples. In particular, our experiments show that our system achieves very good performance in real-life scenarios.

Two observations:

1) The use of data examples may help overcome the difficulty of getting users to articulate “why” a particular mapping should occur.

2) Data examples that support mappings, if preserved, could be used to illustrate for subsequent users “why” particular mappings were made or even should be followed in mappings to additional schemas.

Mapping across revisions of a particular schema or across multiple schemas at a particular time is likely to benefit from this technique.
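
To make the paradigm concrete, here is a hypothetical data example of my own devising (not taken from the paper): a small source instance paired with the target instance the designer expects, which any fitting GLAV mapping must be able to produce.

    # A data example pairs a small source instance with the expected target
    # instance. The schemas, relation names and the rule below are illustrative.
    source_instance = {
        "Patient": [("Alice", "General Hospital"), ("Bob", "City Clinic")],
    }
    expected_target = {
        "Person":    [("Alice",), ("Bob",)],
        "TreatedAt": [("Alice", "General Hospital"), ("Bob", "City Clinic")],
    }
    # A GLAV mapping that fits this example could be written as the constraint:
    #     Patient(name, hospital) -> Person(name) AND TreatedAt(name, hospital)
    # Given a finite set of such examples, the system described in the paper
    # decides whether some fitting GLAV mapping exists and, if so, constructs
    # the "most general" one.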

MAD Skills: New Analysis Practices for Big Data

Filed under: Analytics,BigData,Data Integration,SQL — Patrick Durusau @ 3:33 pm

MAD Skills: New Analysis Practices for Big Data by Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M. Hellerstein, and Caleb Welton.

Abstract:

As massive data acquisition and storage becomes increasingly affordable, a wide variety of enterprises are employing statisticians to engage in sophisticated data analysis. In this paper we highlight the emerging practice of Magnetic, Agile, Deep (MAD) data analysis as a radical departure from traditional Enterprise Data Warehouses and Business Intelligence. We present our design philosophy, techniques and experience providing MAD analytics for one of the world’s largest advertising networks at Fox Audience Network, using the Greenplum parallel database system. We describe database design methodologies that support the agile working style of analysts in these settings. We present data-parallel algorithms for sophisticated statistical techniques, with a focus on density methods. Finally, we reflect on database system features that enable agile design and flexible algorithm development using both SQL and MapReduce interfaces over a variety of storage mechanisms.

I found this passage very telling:

These desires for speed and breadth of data raise tensions with Data Warehousing orthodoxy. Inmon describes the traditional view:

There is no point in bringing data … into the data warehouse environment without integrating it. If the data arrives at the data warehouse in an unintegrated state, it cannot be used to support a corporate view of data. And a corporate view of data is one of the essences of the architected environment [13]

Unfortunately, the challenge of perfectly integrating a new data source into an “architected” warehouse is often substantial, and can hold up access to data for months – or in many cases, forever. The architectural view introduces friction into analytics, repels data sources from the warehouse, and as a result produces shallow incomplete warehouses. It is the opposite of the MAD ideal.

Marketing question for topic maps: Do you want a shallow, incomplete data warehouse?

Admittedly there is more to it: topic maps enable the integration of data structures as well as the data itself. Both are subjects in the topic map view. Not to mention capturing the reasons why certain structures or data were mapped to other structures or data. I think the name for that is an audit trail.

Perhaps we should ask: Does your data integration methodology offer an audit trail?

(See MADLib for the source code growing out of this effort.)

Neo4j for PHP

Filed under: Neo4j,PHP — Patrick Durusau @ 3:32 pm

Neo4j for PHP

From the post:

Lately, I’ve been playing around with the graph database Neo4j and its application to certain classes of problems. Graph databases are meant to solve problems in domains where data relationships can be multiple levels deep. For example, in a relational database, it’s very easy to answer the question “Give me a list of all actors who have been in a movie with Kevin Bacon”:

Using Neo4j from PHP, a scripting language that is popular with librarians.

Vol. 15: Understanding Dynamo — with Andy Gross

Filed under: NoSQL,Riak — Patrick Durusau @ 3:31 pm

Vol. 15: Understanding Dynamo — with Andy Gross

From the webpage:

Basho’s VP of Engineering runs us through the tenets of Dynamo systems. From Consistent Hashing to Vector Clocks, Gossip, Hinted Handoffs and Read Repairs. (Recorded on October 13, 2010 in San Francisco, CA.)

You may want to compare the presentation of Andy Gross at Riak Core: Dynamo Building Blocks. Basically the same material but worded differently.

Faerie: efficient filtering algorithms for approximate dictionary-based entity extraction

Filed under: Algorithms,Entity Extraction,String Matching — Patrick Durusau @ 3:31 pm

Faerie: efficient filtering algorithms for approximate dictionary-based entity extraction by Guoliang Li, Dong Deng, and Jianhua Feng in SIGMOD ’11.

Abstract:

Dictionary-based entity extraction identifies predefined entities (e.g., person names or locations) from a document. A recent trend for improving extraction recall is to support approximate entity extraction, which finds all substrings in the document that approximately match entities in a given dictionary. Existing methods to address this problem support either token-based similarity (e.g., Jaccard Similarity) or character-based dissimilarity (e.g., Edit Distance). It calls for a unified method to support various similarity/dissimilarity functions, since a unified method can reduce the programming efforts, hardware requirements, and the manpower. In addition, many substrings in the document have overlaps, and we have an opportunity to utilize the shared computation across the overlaps to avoid unnecessary redundant computation. In this paper, we propose a unified framework to support many similarity/dissimilarity functions, such as jaccard similarity, cosine similarity, dice similarity, edit similarity, and edit distance. We devise efficient filtering algorithms to utilize the shared computation and develop effective pruning techniques to improve the performance. The experimental results show that our method achieves high performance and outperforms state-of-the-art studies.

Entity extraction should be high on your list of topic map skills. A good set of user defined dictionaries is a good start. Creating those dictionaries as topic maps is an even better start.

On string metrics, you might want to visit Sam’s String Metrics, which lists some thirty algorithms.
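
As a warm-up for the problem Faerie addresses (this is the brute-force baseline, not the Faerie algorithm), dictionary-based approximate extraction can be sketched like this; the dictionary and text are made up:

    def edit_distance(a, b):
        # classic dynamic-programming Levenshtein distance
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def extract_entities(text, dictionary, max_dist=1):
        # Brute force: compare every candidate substring against every entry.
        # Faerie's contribution is sharing computation across overlapping
        # substrings and pruning most of these comparisons.
        tokens = text.lower().split()
        hits = []
        for entry in dictionary:
            width = len(entry.split())
            for i in range(len(tokens) - width + 1):
                candidate = " ".join(tokens[i:i + width])
                d = edit_distance(candidate, entry.lower())
                if d <= max_dist:
                    hits.append((candidate, entry, d))
        return hits

    dictionary = ["frontotemporal dementia", "Cambridgeshire"]   # toy dictionary
    text = "incidence of dementias in Cambridgshire and frontotemporal dementa cohorts"
    print(extract_entities(text, dictionary))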

June 19, 2011

Open Government Data 2011 wrap-up

Filed under: Conferences,Dataset,Government Data,Public Data — Patrick Durusau @ 7:35 pm

Open Government Data 2011 wrap-up by Lutz Maicher.

From the post:

On June 16, 2011 the OGD 2011 – the first Open Data Conference in Austria – took place. Thanks to a lot of preliminary work of the Semantic Web Company the topic open (government) data is very hot in Austria, especially in Vienna and Linz. Hence 120 attendees (see the list here) for the first conference is a real success. Congrats to the organizers. And congrats to the community which made the conference to a very vital and interesting event.

If there is a Second Open Data Conference, it is a venue where topic maps should put in an appearance.

PublicData.EU Launched During DAA

Filed under: Dataset,Government Data,Public Data — Patrick Durusau @ 7:33 pm

PublicData.EU Launched During DAA

From the post:

During the Digital Agenda Assembly this week in Brussels the new portal PublicData.EU was launched in beta. This is a step aimed to make public data easier to find across the EU. As it says on the ‘about’ page:

“In order to unlock the potential of digital public sector information, developers and other prospective users must be able to find datasets they are interested in reusing. PublicData.eu will provide a single point of access to open, freely reusable datasets from numerous national, regional and local public bodies throughout Europe.

Information about European public datasets is currently scattered across many different data catalogues, portals and websites in many different languages, implemented using many different technologies. The kinds of information stored about public datasets may vary from country to country, and from registry to registry. PublicData.eu will harvest and federate this information to enable users to search, query, process, cache and perform other automated tasks on the data from a single place. This helps to solve the “discoverability problem” of finding interesting data across many different government websites, at many different levels of government, and across the many governments in Europe.

In addition to providing access to official information about datasets from public bodies, PublicData.eu will capture (proposed) edits, annotations, comments and uploads from the broader community of public data users. In this way, PublicData.eu will harness the social aspect of working with data to create opportunities for mass collaboration. For example, a web developer might download a dataset, convert it into a new format, upload it and add a link to the new version of the dataset for others to use. From fixing broken URLs or typos in descriptions to substantive comments or supplementary documentation about using the datasets, PublicData.eu will provide up to date information for data users, by data users.”

PublicData.EU is built by the Open Knowledge Foundation as part of the LOD2 project. “PublicData.eu is powered by CKAN, a data catalogue system used by various institutions and communities to manage open data. CKAN and all its components are open source software and used by a wide community of catalogue operators from across Europe, including the UK Government’s data.gov.uk portal.”

Here’s a European marketing opportunity for topic maps. How would a topic map solution be different from what is offered here? (There are similar opportunities in the US as well.)

A Short Tutorial on Doctor Who (and Neo4j)

Filed under: Neo4j,NoSQL — Patrick Durusau @ 7:31 pm

A Short Tutorial on Doctor Who (and Neo4j)

Where: The Skills Matter eXchange, London
When: 29 Jun 2011 Starts at 18:30

From the website:

With June’s Neo4j meeting we’re moving to our new slot of last Wednesday of the month (29 June). But more importantly, we’re going to be getting our hands on the code. We’ve packed a wealth of Doctor Who knowledge into a graph, ready for you to start querying. At the end of 90 minutes and a couple of Koans, you’ll be answering questions about the Doctor Who universe like a die-hard fan. You’ll need a laptop, your Java IDE of choice, and a copy of the Koans, which you can grab from http://bit.ly/neo4j-koan

Not a bad way to spend a late June evening in London.

Someone needs to post this to a Dr. Who fan site. Might attract some folks to Java/Neo4j!

June 18, 2011

Do you like it rough?

Filed under: Query Language — Patrick Durusau @ 5:46 pm

Infobright Rough Query: Approximating Query Results by Alex Popescu.

Interesting post about Infobright’s “rough queries” that return a range of data, which can be mined with more specific queries.

Makes me wonder about the potential for “rough” merging that creates sets of similar subjects, which are themselves subject (sorry) to further refinement. Depends on the amount of resources you want to spend on the merging process.

One level could be that you get the equivalent of a current typical search engine result. Most of it may be relevant to something, maybe even your query, but who wants to slog through > 10,000 "hits"?

The next level could be far greater refinement that gets you down to relevant "hits" in the 1,000 range, with a following level of 100 "hits."

The last level could be an editorial piece with transcluded information from a variety of sources and links to more information. Definitely an editorial product.

Price goes up as the amount of noise goes down.
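
One way to picture the tiers: bucket subjects on a cheap, coarse key first (the "rough" pass), then run a more expensive comparison only inside each bucket. A toy Python sketch, with arbitrary keys and threshold:

    from collections import defaultdict
    from difflib import SequenceMatcher

    names = ["Johann Sebastian Bach", "J. S. Bach", "Johann S. Bach",
             "Jon Bach", "P. D. Q. Bach"]

    # Level 1: cheap "rough" pass -- bucket on the last name only. Plenty of
    # noise, but it narrows the candidate set, much as a rough query narrows
    # results to a range.
    rough = defaultdict(list)
    for name in names:
        rough[name.split()[-1].lower()].append(name)

    # Level 2: a more expensive pairwise comparison inside each bucket,
    # keeping only pairs above a similarity threshold. Further levels (human
    # review, an editorial product) cost more and remove more noise.
    def refine(bucket, threshold=0.6):
        return [(a, b) for i, a in enumerate(bucket) for b in bucket[i + 1:]
                if SequenceMatcher(None, a, b).ratio() >= threshold]

    for key, bucket in rough.items():
        print(key, refine(bucket))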

The Plasma graph query engine

Filed under: Graphs,Plasma — Patrick Durusau @ 5:45 pm

The Plasma graph query engine

From the blog post:

The Neo4J team recently blogged about a graph query language called Cypher, and reading their post got me motivated to write about something I’ve been working on for a while. Plasma is a graph query engine written in Clojure. Currently it sits on top of the Jiraph graph database that was written by Justin Balthrop and Lance Bradley for a genealogy website, Geni. (It would be less than a day’s work to get it running on top of another graph database though.) The query engine is built using a library of query operators that are combined to form dataflow graphs, and it uses Zach Tellman’s asynchronous events library, Lamina, to provide concurrent, non-blocking execution of queries.

Jiraph

Filed under: Clojure,Graphs — Patrick Durusau @ 5:44 pm

Jiraph

From README:

Jiraph is an embedded graph database for Clojure. It is extremely fast and can walk 100,000 edges in about 3 seconds on my laptop. It uses Tokyo Cabinet for backend storage.

Multi-layer Graph

For performance and scalability, graphs in Jiraph are multi-layer graphs. Nodes exist on every layer. In this way, node data can be partitioned across all layers (similar to column families in some nosql databases). For our purposes, we’ll call the node data on a particular layer a node slice. Edges, on the other hand, can only exist on a single layer. All edges on a specific layer generally correspond in some way. The layer name can be thought of as the edge type, or alternatively, multiple similar edge types can exist on one layer.

Though layers can be used to organize your data, the primary motivation for layers is performance. All data for each layer is stored in a separate data store, which partitions the graph and speeds up walks by allowing them to load only the subset of the graph data they need. This means you should strive to put all graph data needed for a particular walk in one layer. This isn’t always possible, but it will improve speed because only one disk read will be required per walk step.

A Jiraph graph is just a clojure map of layer names (keywords) to datatypes that implement the jiraph.layer/Layer protocol.

Nodes and Edges

Every node slice is just a clojure map of attributes. It is conventional to use keywords for the keys, but the values can be arbitrary clojure data structures. Each edge is also a map of attributes. Internally, outgoing edges are stored as a map from node-ids to attributes in the :edges attribute on the corresponding node slice. This way, a node and all its outgoing edges can be loaded with one disk read.
…(more follows)

I like the rewind feature; that could be very helpful.
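
Jiraph itself is Clojure, but the layered layout the README describes is easy to picture in any language. A toy Python illustration (not Jiraph's API): each layer holds its own slice of node data plus one type of edges, so a walk that needs only one layer reads only one store.

    # Conceptual illustration of a multi-layer graph, not Jiraph code.
    # Each layer maps node id -> node slice; outgoing edges live inside the
    # slice under "edges", keyed by the target node id.
    graph = {
        "family": {
            "alice": {"born": 1970, "edges": {"bob": {"rel": "parent"}}},
            "bob":   {"born": 1995, "edges": {}},
        },
        "residence": {
            "alice": {"city": "Oslo", "edges": {"bob": {"since": 2010}}},
            "bob":   {"city": "Oslo", "edges": {}},
        },
    }

    def walk(layer, start, steps):
        # walk outgoing edges on a single layer: one store read per step
        node, path = start, [start]
        for _ in range(steps):
            edges = graph[layer][node]["edges"]
            if not edges:
                break
            node = next(iter(edges))    # toy walk: follow the first edge
            path.append(node)
        return path

    print(walk("family", "alice", 2))   # ['alice', 'bob']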

Neo4j Top Ten

Filed under: Neo4j — Patrick Durusau @ 5:39 pm

The top ten ways to get to know Neo4j

A post from back in February of 2011, on the release of Neo4j 1.0, outlining ten ways to get started with Neo4j.

Which way do you like the best?

VoltDB

Filed under: SQL,VoltDB — Patrick Durusau @ 5:39 pm

VoltDB

From the website:

VoltDB is a blazingly fast relational database system. It is specifically designed for modern software applications that are pushed beyond their limits by high velocity data sources. This new generation of systems – real-time feeds, machine-generated data, micro-transactions, high performance content serving – requires database throughput that can reach millions of operations per second. What’s more, the applications that use this data must be able to scale on demand, provide flawless fault tolerance and give real-time visibility into the data that drives business value.

Note that the “community” version is only for development, testing, tuning. If you want to go to deployment, commercial licensing kicks in.

It’s encouraging to see all the innovation and development in SQL, NoSQL (mis-named, but the name has stuck), graph databases and the like. Only practical experience will decide which ones survive, but in any event data will be more accessible than ever before. Data analysis skills, not data access skills, will come to the fore.

June 17, 2011

Moma, What do URLs in RDF Mean?

Filed under: RDF,Semantic Web — Patrick Durusau @ 7:23 pm

Lars Marius Garshol says in a tweet:

The old “how to find what URIs represent information resources in RDF” issue returns, now with real consequences

pointing to: How to find what URLs in an RDF graph refer to information resources?.

You may also be interested in Jeni Tennison’s summary of a recent TAG meeting on the subject:

URI Definition Discovery and Metadata Architecture

The afternoon session on Tuesday was spent on Jonathan Rees’s work on the Architecture of the World Wide Semantic Web, which covers, amongst other things, what people in semantic web circles call httpRange-14. At core, this is about the kinds of URIs we can use to refer to real-world things, what the response to HTTP requests on those URIs should be, and how we find out information about these resources.

Jonathan has put together a document called Providing and discovering definitions of URIs which covers the various ways that have been suggested over time, including the 303 method that was recommended by the TAG in 2005 and methods that have been suggested by various people since that time.

It’s clear that the 303 method has lots of practical shortcomings for people deploying linked data, and isn’t the way in which URIs are commonly used by Facebook and schema.org, who don’t currently care about using separate URIs for documents and the things those documents are about. We discussed these alongside concerns that we continue to support people who want to do things like describe the license or provenance of a document (as well as the facts that it contains) and don’t introduce anything that is incompatible with the ways in which people who have been following recommended practice are publishing their linked data. The general mood was that we need to support some kind of ‘punning’, whereby a single URI could be used to refer to both a document and a real-world thing, with different properties being assigned to different ‘views’ of that resource.

Jonathan is going to continue to work on the draft, incorporating some other possible approaches. It’s a very contentious topic within the linked data community. My opinion is while we need to provide some ‘good practice’ guides for linked data publishers, we can’t just stick to a theoretical ideal that experience has shown not to be practical. What I’d hope is that the TAG can help to pull together the various arguments for and against different options, and document whatever approach the wider community supports.
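
For context, the 303 method mentioned in the summary comes down to a simple probe. A toy check in Python (assuming the third-party requests library; this is the 2005 TAG heuristic, not any of the newer proposals):

    import requests

    def classify_uri(uri):
        # Rough httpRange-14 heuristic: a 2xx response means the URI identifies
        # an information resource; a 303 See Other points to a description of
        # something else, possibly a real-world thing. (Hash URIs never reach
        # the server at all, so they sidestep the question.)
        resp = requests.head(uri, allow_redirects=False, timeout=10)
        if 200 <= resp.status_code < 300:
            return "information resource"
        if resp.status_code == 303:
            return "see other: described by %s" % resp.headers.get("Location")
        return "undetermined (status %d)" % resp.status_code

    # print(classify_uri("http://dbpedia.org/resource/Berlin"))  # a classic 303 case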

My suggested "best practice" is not to trust linked data, RDF, or topic map data unless it has been tested (and passes) and you trust its point of origin.

Any more than you would print your credit card number and PIN on the side of your car. Blind trust in any data source is a bad idea.

Natural Language Processing with Hadoop and Python

Filed under: Hadoop,Natural Language Processing,Python — Patrick Durusau @ 7:19 pm

Natural Language Processing with Hadoop and Python

From the post:

If you listen to analysts talk about complex data, they all agree, it’s growing, and faster than anything else before. Complex data can mean a lot of things, but to our research group, ever increasing volumes of naturally occurring human text and speech—from blogs to YouTube videos—enable new and novel questions for Natural Language Processing (NLP). The dominating characteristic of these new questions involves making sense of lots of data in different forms, and extracting useful insights.

Now that I think about it, a lot of the input from various intelligence operations consists of “naturally occurring human text and speech….” Anyone can crunch lots of text and speech; the question is being a good enough analyst to extract something useful.
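
Since the post is about exactly this combination, a minimal Hadoop Streaming job is worth sketching. This is a generic token-count example of my own, not code from the post; the jar path and input/output directories are placeholders:

    #!/usr/bin/env python
    # streaming_wordcount.py -- run under Hadoop Streaming, e.g.:
    #   hadoop jar hadoop-streaming.jar \
    #     -input raw_text/ -output token_counts/ \
    #     -mapper "python streaming_wordcount.py mapper" \
    #     -reducer "python streaming_wordcount.py reducer" \
    #     -file streaming_wordcount.py
    import sys
    from itertools import groupby

    def mapper():
        # emit one (token, 1) pair per whitespace-delimited token
        for line in sys.stdin:
            for token in line.lower().split():
                print("%s\t1" % token)

    def reducer():
        # streaming sorts mapper output by key, so group adjacent keys and sum
        pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
        for token, group in groupby(pairs, key=lambda kv: kv[0]):
            print("%s\t%d" % (token, sum(int(count) for _, count in group)))

    if __name__ == "__main__":
        mapper() if sys.argv[1:] == ["mapper"] else reducer()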

Clojure Tutorial For the Non-Lisp Programmer

Filed under: Clojure — Patrick Durusau @ 7:15 pm

Clojure Tutorial For the Non-Lisp Programmer

From the post:

Clojure is a new programming language that uses the Java Virtual Runtime as its platform. Clojure is a dialect of Lisp. The language home page is at http://clojure.org/.

The entire tutorial is written just that directly.

June 16, 2011

How do I start with Machine Learning?

Filed under: Machine Learning — Patrick Durusau @ 3:43 pm

How do I start with Machine Learning?

This question on Hacker News drew a number of useful responses.

Now to just get the equivalent number of high quality responses to: “How do I start with Topic Maps?”

Graph Pattern Matching with Gremlin 1.1

Filed under: Gremlin,Neo4j — Patrick Durusau @ 3:42 pm

Graph Pattern Matching with Gremlin 1.1 by Marko Rodriguez.

From the post:

Gremlin 1.1 was released on June 15, 2011. A major aspect of this release includes traversal-based graph pattern matching. This post provides a review of this functionality.

As usual, a must read if you are interested in Neo4j or Gremlin.

Evaluating Text Extraction Algorithms

Filed under: Data Mining,Text Extraction — Patrick Durusau @ 3:42 pm

Evaluating Text Extraction Algorithms

From the post:

Lately I’ve been working on evaluating and comparing algorithms capable of extracting useful content from arbitrary html documents. Before continuing I encourage you to pass through some of my previous posts, just to get a better feel of what we’re dealing with; I’ve written a short overview, compiled a list of resources if you want to dig deeper and made a feature wise comparison of related software and APIs.

If you’re not simply creating topic map content, you are mining content from other sources, such as texts, to point to or include in a topic map. A good set of posts on tools and issues surrounding that task.
