Another Word For It – Patrick Durusau on Topic Maps and Semantic Diversity

August 24, 2011

RTextTools:…v.1.3 New Release

Filed under: Machine Learning,R — Patrick Durusau @ 6:55 pm

RTextTools: a machine learning library for text classification

From the post:

RTextTools v1.3 was released on August 21, and the package binaries are now available on CRAN. This update fixes a major bug with the stemmers, and it is highly recommended you upgrade to the latest version. Other changes include optimization of existing functions and improvements to the documentation.

Additionally, Duncan Temple Lang has graciously released Rstem on CRAN, meaning that the RTextTools package is now fully installable using the install.packages(“RTextTools”) command within R 2.13+. The repository at install.rtexttools.com will continue to work through the end of September.

From the project's About page:

RTextTools is a free, open source machine learning package for automatic text classification that makes it simple for both novice and advanced users to get started with supervised learning. The package includes nine algorithms for ensemble classification (SVM, SLDA, BOOSTING, BAGGING, RF, GLMNET, TREE, NNET, MAXENT), comprehensive analytics, and thorough documentation.

August 23, 2011

Lucene 4.0: The Revolution

Filed under: Indexing,Lucene — Patrick Durusau @ 6:46 pm

Lucene 4.0: The Revolution by Simon Willnauer.

From the post:

The near-limitless innovative potential of a thriving open source community often has to be tempered by the need for a steady roadmap with version compatibility. As a result, once the decision to break backward compatibility in Lucene 4.0 had been made, it opened the floodgates on a host of step changes, which, together, will deliver a product whose performance is unrecognisable from previous 3.x releases.

One of the most significant changes in Lucene 4.0 is the full switch to using bytes (UTF8) in place of text strings for indexing within the search engine library. This change has improved the efficiency of a number of core processes: the ‘term dictionary’, used as a core part of the index, can now be loaded up to 30 times faster; it uses 10% of the memory; and search speeds are increased by removing the need for string conversion.

This switch to using bytes for indexing has also facilitated one of the main goals for Lucene 4.0, which is ‘flexible indexing’. The data structure for the index format can now be chosen and loaded into Lucene as a pluggable codec. As such, optimised codecs can be loaded to suit the indexing of individual datasets or even individual fields.

The performance enhancements through flexible indexing are highly case specific. However, flexible indexing introduces an entirely new dimension to the Lucene project. New indexing codecs can be developed and existing ones updated without the need for hard-coding within Lucene. There is no longer any need for project-level compromise on the best general-purpose index formats and data structures. A new field of specialised codec development can take place independently from development of the Lucene kernel.

Looks like the time to be learning new features of Lucene 4.0 is now!

Flexible indexing! That sounds very cool.
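
As a rough illustration of what “pluggable codec” means in practice, here is a minimal sketch of selecting a non-default codec when configuring an IndexWriter. It assumes the Lucene 4.x API (IndexWriterConfig.setCodec and the SimpleTextCodec from the lucene-codecs module); exact class names shifted during the 4.0 development cycle, so treat it as indicative rather than definitive.

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.codecs.simpletext.SimpleTextCodec;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.document.TextField;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.RAMDirectory;
  import org.apache.lucene.util.Version;

  public class PluggableCodecSketch {
      public static void main(String[] args) throws Exception {
          Directory dir = new RAMDirectory();
          IndexWriterConfig config =
                  new IndexWriterConfig(Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
          // The codec decides how postings, the term dictionary, etc. are written.
          // SimpleTextCodec writes a human-readable index -- handy for seeing what flexible indexing changes.
          config.setCodec(new SimpleTextCodec());

          IndexWriter writer = new IndexWriter(dir, config);
          Document doc = new Document();
          doc.add(new TextField("body", "flexible indexing with pluggable codecs", Field.Store.NO));
          writer.addDocument(doc);
          writer.close();
      }
  }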

How Far You Can Get Using Machine Learning Black-Boxes [?]

Filed under: Machine Learning — Patrick Durusau @ 6:40 pm

How Far You Can Get Using Machine Learning Black-Boxes [?]

Abstract:

Supervised Learning (SL) is a machine learning research area which aims at developing techniques able to take advantage from labeled training samples to make decisions over unseen examples. Recently, a lot of tools have been presented in order to perform machine learning in a more straightforward and transparent manner. However, one problem that is increasingly present in most of the SL problems being solved is that, sometimes, researchers do not completely understand what supervised learning is and, more often than not, publish results using machine learning black-boxes. In this paper, we shed light over the use of machine learning black-boxes and show researchers how far they can get using these out-of-the-box solutions instead of going deeper into the machinery of the classifiers. Here, we focus on one aspect of classifiers namely the way they compare examples in the feature space and show how a simple knowledge about the classifier’s machinery can lift the results way beyond out-of-the-box machine learning solutions.

Not surprising that understanding how to use a tool leads to better results. A reminder, particularly one that illustrates how to better use a tool, is always welcome.
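
A trivial sketch of the kind of “machinery” the paper is pointing at: how a classifier compares examples in feature space. The same nearest-neighbour question gets a different answer depending on whether the features are scaled, which is exactly the sort of knob an out-of-the-box user never touches. The toy data and numbers below are mine, not the paper's.

  public class DistanceSketch {
      // Euclidean distance in the given feature space.
      static double euclidean(double[] a, double[] b) {
          double sum = 0;
          for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
          return Math.sqrt(sum);
      }

      // Per-feature scaling to [0, 1].
      static double[] scale(double[] x, double[] min, double[] max) {
          double[] s = new double[x.length];
          for (int i = 0; i < x.length; i++) s[i] = (x[i] - min[i]) / (max[i] - min[i]);
          return s;
      }

      public static void main(String[] args) {
          // Feature 0 is "age", feature 1 is "income".
          double[] query = {30, 50000}, a = {31, 52000}, b = {60, 50100};
          System.out.println("raw:    d(q,a)=" + euclidean(query, a) + "  d(q,b)=" + euclidean(query, b));

          double[] min = {20, 40000}, max = {70, 90000};
          double[] q = scale(query, min, max), as = scale(a, min, max), bs = scale(b, min, max);
          System.out.println("scaled: d(q,a)=" + euclidean(q, as) + "  d(q,b)=" + euclidean(q, bs));
          // With raw features, income dominates and b looks closest; after scaling, a does.
      }
  }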

Using Emacs as a front-end for R

Filed under: R — Patrick Durusau @ 6:39 pm

Using Emacs as a front-end for R

From the post:

Back when I was a grad student, I was a devoted Emacs user. I basically used it like an operating system: it wasn’t just my text editor, but also my mail reader, my Web browser, my news reader, and so much more. (I once even asked our sysadmin to change my default shell to /usr/bin/emacs. He refused.) So when I started doing development in the S language, it was inevitable for me to think I could tweak some existing Emacs scripts and make it easier for me to edit S code. Sure enough, this turned into a major project (S-mode) with several collaborators, that culminated in being able to run the S interpreter within Emacs, and get (then-radical) features like command history and transcript management. When R came along, a new team adapted S-mode for R, and ESS — Emacs Speaks Statistics — was born.

Whether you are a grad student or not, if you use Emacs, you really need to take a look at this post.

Modeling Social and Information Networks: Opportunities for Machine Learning

Filed under: Machine Learning — Patrick Durusau @ 6:39 pm

Modeling Social and Information Networks: Opportunities for Machine Learning

The description:

Emergence of the web, social media and online social networking websites gave rise to detailed traces of human social activity. This offers many opportunities to analyze and model behaviors of millions of people. For example, we can now study “planetary scale” dynamics of a full Microsoft Instant Messenger network of 240 million people, with more than 255 billion exchanged messages per month. Many types of data, especially web and “social” data, come in a form of a network or a graph. This tutorial will cover several aspects of such network data: macroscopic properties of network data sets; statistical models for modeling large scale network structure of static and dynamic networks; properties and models of network structure and evolution at the level of groups of nodes and algorithms for extracting such structures. I will also present several applications and case studies of blogs, instant messaging, Wikipedia and web search. Machine learning as a topic will be present throughout the tutorial. The idea of the tutorial is to introduce the machine learning community to recent developments in the area of social and information networks that underpin the Web and other on-line media.

Very good tutorial on social and information networks. Almost 2.5 hours in length.

Slides.

Mulan: A Java Library for Multi-Label Learning

Filed under: Java,Machine Learning — Patrick Durusau @ 6:38 pm

Mulan: A Java Library for Multi-Label Learning

From the website:

Mulan is an open-source Java library for learning from multi-label datasets. Multi-label datasets consist of training examples of a target function that has multiple binary target variables. This means that each item of a multi-label dataset can be a member of multiple categories or annotated by many labels (classes). This is actually the nature of many real world problems such as semantic annotation of images and video, web page categorization, direct marketing, functional genomics and music categorization into genres and emotions. An introduction on mining multi-label data is provided in (Tsoumakas et al., 2010).

Currently, the library includes a variety of state-of-the-art algorithms for performing the following major multi-label learning tasks:

  • Classification. This task is concerned with outputting a bipartition of the labels into relevant and irrelevant ones for a given input instance.
  • Ranking. This task is concerned with outputting an ordering of the labels, according to their relevance for a given data item
  • Classification and ranking. A combination of the two tasks mentioned-above.

In addition, the library offers the following features:

  • Feature selection. Simple baseline methods are currently supported.
  • Evaluation. Classes that calculate a large variety of evaluation measures through hold-out evaluation and cross-validation.
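
For a feel of the API, here is a minimal sketch along the lines of the examples in the Mulan documentation: load a multi-label dataset (an ARFF file plus an XML file naming the label attributes), train the MLkNN learner and cross-validate it. The file names are placeholders, and class names should be checked against the Mulan release you actually use.

  import mulan.classifier.lazy.MLkNN;
  import mulan.data.MultiLabelInstances;
  import mulan.evaluation.Evaluator;
  import mulan.evaluation.MultipleEvaluation;

  public class MulanSketch {
      public static void main(String[] args) throws Exception {
          // The ARFF file holds features and labels; the XML file declares which attributes are labels.
          MultiLabelInstances dataset = new MultiLabelInstances("emotions.arff", "emotions.xml");

          MLkNN learner = new MLkNN();          // multi-label k-nearest neighbours
          Evaluator evaluator = new Evaluator();
          MultipleEvaluation results = evaluator.crossValidate(learner, dataset, 10);

          System.out.println(results);          // hamming loss, ranking measures, etc.
      }
  }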

Query Execution in Column-Oriented Database Systems

Filed under: Column-Oriented,Database,Query Language — Patrick Durusau @ 6:38 pm

Query Execution in Column-Oriented Database Systems by Daniel Abadi (Ph.D. thesis).

Apologies for the length of the quote, but this is an early dissertation on column-oriented database systems and I want to entice you into reading it. Not so much for the techniques, which are now common, but for the analysis.

Abstract:

There are two obvious ways to map a two-dimension relational database table onto a one-dimensional storage interface: store the table row-by-row, or store the table column-by-column. Historically, database system implementations and research have focused on the row-by row data layout, since it performs best on the most common application for database systems: business transactional data processing. However, there are a set of emerging applications for database systems for which the row-by-row layout performs poorly. These applications are more analytical in nature, whose goal is to read through the data to gain new insight and use it to drive decision making and planning.

In this dissertation, we study the problem of poor performance of row-by-row data layout for these emerging applications, and evaluate the column-by-column data layout opportunity as a solution to this problem. There have been a variety of proposals in the literature for how to build a database system on top of column-by-column layout. These proposals have different levels of implementation effort, and have different performance characteristics. If one wanted to build a new database system that utilizes the column-by-column data layout, it is unclear which proposal to follow. This dissertation provides (to the best of our knowledge) the only detailed study of multiple implementation approaches of such systems, categorizing the different approaches into three broad categories, and evaluating the tradeoffs between approaches. We conclude that building a query executer specifically designed for the column-by-column query layout is essential to achieve good performance.

Consequently, we describe the implementation of C-Store, a new database system with a storage layer and query executer built for column-by-column data layout. We introduce three new query execution techniques that significantly improve performance. First, we look at the problem of integrating compression and execution so that the query executer is capable of directly operating on compressed data. This improves performance by improving I/O (less data needs to be read off disk), and CPU (the data need not be decompressed). We describe our solution to the problem of executer extensibility – how can new compression techniques be added to the system without having to rewrite the operator code? Second, we analyze the problem of tuple construction (stitching together attributes from multiple columns into a row-oriented ”tuple”). Tuple construction is required when operators need to access multiple attributes from the same tuple; however, if done at the wrong point in a query plan, a significant performance penalty is paid. We introduce an analytical model and some heuristics to use that help decide when in a query plan tuple construction should occur. Third, we introduce a new join technique, the “invisible join” that improves performance of a specific type of join that is common in the applications for which column-by-column data layout is a good idea.

Finally, we benchmark performance of the complete C-Store database system against other column-oriented database system implementation approaches, and against row-oriented databases. We benchmark two applications. The first application is a typical analytical application for which column-by-column data layout is known to outperform row-by-row data layout. The second application is another emerging application, the Semantic Web, for which column-oriented database systems are not currently used. We find that on the first application, the complete C-Store system performed 10 to 18 times faster than alternative column-store implementation approaches, and 6 to 12 times faster than a commercial database system that uses a row-by-row data layout. On the Semantic Web application, we find that C-Store outperforms other state-of-the-art data management techniques by an order of magnitude, and outperforms other common data management techniques by almost two orders of magnitude. Benchmark queries, which used to take multiple minutes to execute, can now be answered in several seconds.
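
To see why “operating directly on compressed data” pays off, here is a toy sketch (mine, not C-Store's) of a predicate evaluated against a run-length-encoded column: because a sorted column compresses into a few runs, the count can be answered by touching the runs alone, with no decompression and no row reconstruction.

  import java.util.List;

  public class RleColumnScanSketch {
      // One run of a run-length-encoded column: `value` repeated `count` times.
      record Run(String value, long count) {}

      public static void main(String[] args) {
          // A sorted "country" column of four million rows compresses to three runs.
          List<Run> country = List.of(
                  new Run("DE", 1_000_000),
                  new Run("FR", 750_000),
                  new Run("US", 2_250_000));

          // SELECT count(*) WHERE country = 'US' -- answered without decompressing a single value.
          long matches = country.stream()
                  .filter(r -> r.value().equals("US"))
                  .mapToLong(Run::count)
                  .sum();

          System.out.println(matches); // 2250000
      }
  }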

Chemical Entity Semantic Specification:…(article)

Filed under: Cheminformatics,RDF,Semantic Web — Patrick Durusau @ 6:37 pm

Chemical Entity Semantic Specification: Knowledge representation for efficient semantic cheminformatics and facile data integration by Leonid L Chepelev and Michel Dumontier, Journal of Cheminformatics 2011, 3:20, doi:10.1186/1758-2946-3-20.

Abstract

Background
Over the past several centuries, chemistry has permeated virtually every facet of human lifestyle, enriching fields as diverse as medicine, agriculture, manufacturing, warfare, and electronics, among numerous others. Unfortunately, application-specific, incompatible chemical information formats and representation strategies have emerged as a result of such diverse adoption of chemistry. Although a number of efforts have been dedicated to unifying the computational representation of chemical information, disparities between the various chemical databases still persist and stand in the way of cross-domain, interdisciplinary investigations. Through a common syntax and formal semantics, Semantic Web technology offers the ability to accurately represent, integrate, reason about and query across diverse chemical information.

Results
Here we specify and implement the Chemical Entity Semantic Specification (CHESS) for the representation of polyatomic chemical entities, their substructures, bonds, atoms, and reactions using Semantic Web technologies. CHESS provides means to capture aspects of their corresponding chemical descriptors, connectivity, functional composition, and geometric structure while specifying mechanisms for data provenance. We demonstrate that using our readily extensible specification, it is possible to efficiently integrate multiple disparate chemical data sources, while retaining appropriate correspondence of chemical descriptors, with very little additional effort. We demonstrate the impact of some of our representational decisions on the performance of chemically-aware knowledgebase searching and rudimentary reaction candidate selection. Finally, we provide access to the tools necessary to carry out chemical entity encoding in CHESS, along with a sample knowledgebase.

Conclusions
By harnessing the power of Semantic Web technologies with CHESS, it is possible to provide a means of facile cross-domain chemical knowledge integration with full preservation of data correspondence and provenance. Our representation builds on existing cheminformatics technologies and, by the virtue of RDF specification, remains flexible and amenable to application- and domain-specific annotations without compromising chemical data integration. We conclude that the adoption of a consistent and semantically-enabled chemical specification is imperative for surviving the coming chemical data deluge and supporting systems science research.

Project homepage: Chemical Entity Semantic Specification

Predictive Model Markup Language

Filed under: Predictive Model Markup Language (PMML) — Patrick Durusau @ 6:37 pm

Predictive Model Markup Language

From the wiki page:

The Predictive Model Markup Language (PMML) is an XML-based markup language developed by the Data Mining Group (DMG) to provide a way for applications to define models related to predictive analytics and data mining and to share those models between PMML-compliant applications.

PMML provides applications a vendor-independent method of defining models so that proprietary issues and incompatibilities are no longer a barrier to the exchange of models between applications. It allows users to develop models within one vendor’s application and use other vendors’ applications to visualize, analyze, evaluate or otherwise use the models. Previously, this was very difficult, but with PMML, the exchange of models between compliant applications is straightforward.

Since PMML is an XML-based standard, the specification comes in the form of an XML schema.

Curious if anyone has experience with PMML with or without topic maps?

August 22, 2011

Bio-recipes (Bioinformatics recipes) in Darwin

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 7:43 pm

Bio-recipes (Bioinformatics recipes) in Darwin

If you are working on topic maps and bioinformatics, you are likely to find this a useful resource.

From the webpage:

Bio-recipes are a collection of Darwin example programs. They show how to solve standard problems in Bioinformatics. Each bio-recipe consists of an introduction, explanations, graphs, figures, and most importantly, Darwin commands (the input commands and the output that they produce) that solve the given problem.

Darwin is an interactive language of the same lineage as Maple designed to solve problems in Bioinformatics. It relies on a simple language for the interactive user, plus the infrastructure necessary for writing object oriented libraries, plus very efficient primitive operations. The primitive operations of Darwin are the most common and time consuming operations typical of bioinformatics, including linear algebra operations.

The reasons behind this particular format are the following.

  1. It is much easier to understand an algorithm or a procedure or even a theorem, when it is illustrated with a running example.
  2. The procedures, as written, may be run on different data and hence serve a useful purpose.
  3. It is an order of magnitude easier to modify a correct, existing program, than to write a new one from scratch. This is particularly true for non-computer scientists.
  4. The full examples show some features of the language and of the system that may not known to the casual user of the Darwin, hence they serve a tutorial purpose.

BTW, see also:

DARWIN – A Genetic Algorithm Programming Language

The Darwin Manual

Finite State Transducers in Lucene

Filed under: Indexing,Software — Patrick Durusau @ 7:42 pm

I found part 1 of this series at DZone, but there was no reference to part 2. Tracing the article back to its original blog post, I saw that part 2 followed it.

Using Finite State Transducers in Lucene

Finite State Transducers, Part 2

I won’t try to summarize the posts; they are short and heavy on links to more material. I will, however, quote this comment from the second article:

To test this, I indexed the first 10 million 1KB documents derived from Wikipedia’s English database download. The resulting RAM required for the FST was ~38% – 52% smaller (larger segments see more gains, as the FST “scales up” well). Not only is the RAM required much lower, but term lookups are also faster: the FuzzyQuery united~2 was ~22% faster.

If using less RAM and faster lookups are of interest to you, these posts should be on your reading list.
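
The construction itself is compact. The sketch below is adapted from the example in the first post: map a few sorted terms to integer outputs and look one up. The FST classes live in org.apache.lucene.util.fst, but their signatures have changed across Lucene releases (the Builder class was later renamed), so check them against the version you are on.

  import org.apache.lucene.util.BytesRef;
  import org.apache.lucene.util.IntsRef;
  import org.apache.lucene.util.fst.Builder;
  import org.apache.lucene.util.fst.FST;
  import org.apache.lucene.util.fst.PositiveIntOutputs;
  import org.apache.lucene.util.fst.Util;

  public class FstSketch {
      public static void main(String[] args) throws Exception {
          // Input terms must be added in sorted order.
          String[] terms = {"cat", "dog", "dogs"};
          long[] outputs = {5, 7, 12};

          PositiveIntOutputs outs = PositiveIntOutputs.getSingleton();
          Builder<Long> builder = new Builder<>(FST.INPUT_TYPE.BYTE1, outs);
          IntsRef scratch = new IntsRef();
          for (int i = 0; i < terms.length; i++) {
              builder.add(Util.toIntsRef(new BytesRef(terms[i]), scratch), outputs[i]);
          }
          FST<Long> fst = builder.finish();

          // Shared prefixes ("dog", "dogs") and suffixes are stored once -- hence the RAM savings.
          System.out.println(Util.get(fst, new BytesRef("dog"))); // 7
      }
  }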

Public Dataset Catalogs Faceted Browser

Filed under: Dataset,Facets,Linked Data,RDF — Patrick Durusau @ 7:42 pm

Public Dataset Catalogs Faceted Browser

A faceted browser for the catalogs, not their content.

Filter on coverage, location, country (not sure how location and country usefully differ), catalog status (seems to mix status and data type), and managed by.

Do be aware that as the little green balloons disappear with your selection, more of the coloring of the map itself appears.

I mention that because at first it seemed the map was being colored based on the facets I chose. For example, Europe suddenly appeared dark green when I chose the United States in the filter. That was confusing at first and makes me wonder: why use a map with underlying coloration anyway? A white map with borders would be a better display background for the green balloons indicating catalog locations.

BTW, if you visit a catalog and then use the back button, all your filters are reset. Not a problem now with a small set of filters and only 100 catalogs but should this resource continue to grow, that could become a usability issue.

Graph Theory – Tutorial

Filed under: Graphs,Social Graphs — Patrick Durusau @ 7:41 pm

A series of three graph theory tutorials by Jesse Farmer.

Graph Theory: Part 1 (Introduction)

Graph Theory: Part II (Linear Algebra)

Graph Theory: Part III (Facebook)

Brings you up to speed on common measures in social graphs.
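
Degree is the simplest of those measures: count each node's direct connections. A small sketch with toy data, in plain Java with no graph library:

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  public class DegreeSketch {
      public static void main(String[] args) {
          // Undirected friendship edges in a tiny social graph.
          List<String[]> edges = List.of(
                  new String[] {"alice", "bob"},
                  new String[] {"alice", "carol"},
                  new String[] {"bob", "carol"},
                  new String[] {"carol", "dave"});

          Map<String, Integer> degree = new HashMap<>();
          for (String[] e : edges) {
              degree.merge(e[0], 1, Integer::sum);   // count each endpoint once per edge
              degree.merge(e[1], 1, Integer::sum);
          }

          // carol has the most direct connections (degree 3).
          degree.forEach((node, d) -> System.out.println(node + ": " + d));
      }
  }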

Scaling Up Machine Learning, the Tutorial

Filed under: BigData,Machine Learning — Patrick Durusau @ 7:41 pm

Scaling Up Machine Learning, the Tutorial, KDD 2011 by Ron Bekkerman, Misha Bilenko and John Langford.

From the webpage:

This tutorial gives a broad view of modern approaches for scaling up machine learning and data mining methods on parallel/distributed platforms. Demand for scaling up machine learning is task-specific: for some tasks it is driven by the enormous dataset sizes, for others by model complexity or by the requirement for real-time prediction. Selecting a task-appropriate parallelization platform and algorithm requires understanding their benefits, trade-offs and constraints. This tutorial focuses on providing an integrated overview of state-of-the-art platforms and algorithm choices. These span a range of hardware options (from FPGAs and GPUs to multi-core systems and commodity clusters), programming frameworks (including CUDA, MPI, MapReduce, and DryadLINQ), and learning settings (e.g., semi-supervised and online learning). The tutorial is example-driven, covering a number of popular algorithms (e.g., boosted trees, spectral clustering, belief propagation) and diverse applications (e.g., speech recognition and object recognition in vision).

The tutorial is based on (but not limited to) the material from our upcoming Cambridge U. Press edited book which is currently in production and will be available in December 2011.

The slides are informative and entertaining. Interested in seeing if the book is the same.

OrientDB v1.0rc5 New

Filed under: NoSQL,OrientDB — Patrick Durusau @ 7:40 pm

OrientDB v1.0rc5: improved index and transactions, better crossing of trees and graphs

Just quickly:

  • SQL engine: new [] operator to extract items from lists, sets, maps and arrays
  • SQL engine: ORDER BY works with projection alias
  • SQL engine: Cross trees and graphs in projections
  • SQL engine: IN operator uses Index when available
  • Fixed all known bugs on transaction recovery
  • Rewritten the memory management of MVRB-Tree: now it’s faster and uses much less RAM
  • Java 5 compatibility of common and core subprojects
  • 16 issues fixed in total

Full list: http://code.google.com/p/orient/issues/list?can=1&q=label%3Av1.0rc5
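
For context, here is a sketch of issuing one of those SQL queries through OrientDB's Java document API, in the style of the 1.x documentation. The Person class, its fields and the database path are placeholders; the point is that with 1.0rc5 an IN predicate like this can be served from an index on the queried field.

  import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
  import com.orientechnologies.orient.core.record.impl.ODocument;
  import com.orientechnologies.orient.core.sql.query.OSQLSynchQuery;

  import java.util.List;

  public class OrientDbQuerySketch {
      public static void main(String[] args) {
          ODatabaseDocumentTx db = new ODatabaseDocumentTx("local:/tmp/demo").open("admin", "admin");
          try {
              // IN can now be answered from an index on Person.city instead of a full scan.
              List<ODocument> result = db.query(
                      new OSQLSynchQuery<ODocument>("select from Person where city in ['Oakland', 'Berlin']"));
              for (ODocument doc : result) {
                  System.out.println(doc.field("name") + " lives in " + doc.field("city"));
              }
          } finally {
              db.close();
          }
      }
  }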

August 21, 2011

Trouble with 1 Trillion Triples?

Filed under: AllegroGraph,RDF — Patrick Durusau @ 7:09 pm

Franz’s AllegroGraph® Sets New Record – 1 Trillion RDF Triples

From the post:

OAKLAND, Calif. — August 16, 2011 — Franz Inc., a leading supplier of Graph Database technology, with critical support from Stillwater SuperComputing Inc. and Intel, today announced it has achieved its goal of being the first to load and query a NoSQL database with a trillion RDF statements. RDF (also known as triples or quads), the cornerstone of the Semantic Web, provides a more flexible way to represent data than relational database and is at the heart of the W3C push for the Semantic Web.

A trillion RDF Statements eclipses the current state of the art for the Semantic Web data management but is a primary interest for companies like Amdocs that use triples to represent real-time knowledge about telecom customers. Per-customer, Amdocs uses about 4,000 triples, so a large telecom like China Mobile would easily need 2 trillion triples to have detailed knowledge about each single customer.

Impressive milestone for a NoSQL solution and the Semantic Web.

The unanswered Semantic Web management question is:

What to do with inconsistent semantics spread over 1 trillion (or more) triples?

NoSQL Patterns

Filed under: NoSQL — Patrick Durusau @ 7:08 pm

NoSQL Patterns by Ricky Ho.

Ricky has put together a great summary of what NoSQL solutions have in common, ranging from the API model to consistent hashing and other NoSQL earmarks. Recommended if you are new to the area.

YeSQL?

Filed under: NoSQL,SQL — Patrick Durusau @ 7:07 pm

Perspectives on NoSQL by Gavin M. Roy.

I don’t remember how I found this presentation but it is quite interesting.

Starts with a review of NoSQL database options, one slide summaries.

Compares them to PostgreSQL 9.0b1 using KVPBench, http://github.com/gmr/kvpbench.

Concludes that SQL databases perform as well as, if not better than, NoSQL databases.

Really it depends on the benchmark or, more importantly, the use case at hand. Use the most appropriate technology, SQL or not.

Still, I like the slide with database administrators running with scissors. I have always wondered what that would look like. Now I know. It isn’t pretty.

Microdata and RDFa Living Together in Harmony

Filed under: Microdata,RDFa — Patrick Durusau @ 7:07 pm

Microdata and RDFa Living Together in Harmony by Jeni Tennison.

From the post:

One of the options that the TAG put forward when it asked the W3C to put together task force on embedded data in HTML was the co-existence of RDFa and microdata. If that’s what we’re headed for, what might make things easier for consumers and publishers who have to live in that world?

In a situation where there are two competing standards, I think that developers — both on the publication and consumption sides — are going to want to hedge their bets. They will want to avoid being tied to one syntax in case it turns out that that syntax isn’t supported by the majority of publishers/consumers in the long term and they have to switch.

Publishers like us at legislation.gov.uk who are aiming to share their data to whoever is interested in it (rather than having a particular consumer in mind) are also likely to want to publish in both microdata and RDFa, rather than force potential consumers to adopt a particular processing model, and will therefore need to mix the syntaxes within their pages.

Interesting and detailed analysis of the issues of reconciling microdata and RDFa.

Jeni asks if this type of analysis is worthy of something more official than a blog post.

I would say yes. I think this sort of mapping analysis should be published along with any competing format.

You would not frequent a software project that lacks version control.

Why use a data format/annotation that doesn’t provide a mapping to “competing” formats? (The emphasis being on “competing” formats. Not mappings to any possible format but to those in direct competition with the proposed format/annotation system.)

I have no objection to new formats but if there is an existing format, document its shortcomings and a mapping to the new format, along with where the mapping fails.

Doesn’t save us from competing formats but it may ease the evaluation and/or co-existence of formats.

From a topic map perspective, such a mapping is just more grist for the mill.

Functional Programming Is Hard, That’s Why It’s Good

Filed under: Functional Programming — Patrick Durusau @ 7:06 pm

Functional Programming Is Hard, That’s Why It’s Good by David Fayrum.

From the post:

Odds are, you don’t use a functional programming language every day. You probably aren’t getting paid to write code in Scala, Haskell, Erlang, F#, or a Lisp Dialect. The vast majority of people in the industry use OO languages like Python, Ruby, Java or C#–and they’re happy with them. Sure, they might occasionally use a “functional feature” like “blocks” now and then, but they aren’t writing functional code.

And yet, for years we’ve been told that functional languages are awesome. I still remember how confused I was when I first read ESR’s famous essay about learning Lisp. Most people are probably more familiar with Paul Graham’s “Beating The Averages” which makes the case that:

But with Lisp our development cycle was so fast that we could sometimes duplicate a new feature within a day or two of a competitor announcing it in a press release. By the time journalists covering the press release got round to calling us, we would have the new feature too.

A common thread among people proselytizing functional programming is that learning this new, functional language is “good for you”; almost like someone prescribing 30m in the gym a day will “make you fit,” but it also implies difficulty and dedication. Haskell, Ocaml, and Scala are different from Lisp in that they have a certain notoriety for being very hard to learn. Polite people call this “being broad & deep”. Less polite people call it “mental masturbation” or “academic wankery” or just plain “unnecessary.” I submit that this difficulty is a familiar situation, and it’s a strong indicator that learning one of these languages will make you more productive and competent at writing software.

I will leave you to read the rest for yourself.

GraphGL, network visualization with WebGL

Filed under: GraphGL,WebGL — Patrick Durusau @ 7:05 pm

GraphGL, network visualization with WebGL.

From the introduction:

GraphGL is a network visualization library designed for rendering (massive) graphs in web browsers and puts dynamic graph exploration on the web another step forward. In short, it calculates the layout of the graph in real time and is therefore suitable for static files (exported GraphML/GEXF files) and for dynamic files (LinkedIn InMaps would be one such example).

As such, it is both a replacement for Gephi and a complimentary tool, similar to Seadragon, providing another method for displaying graphs in a Web browser.

A very good article that also covers technical issues of graph rendering in a web context.

A project to watch, or better yet, to help advance.

Spring Data Graph 1.1.0 with Neo4j support released

Filed under: Cypher,Graphs,Gremlin,Neo4j,Spring Data — Patrick Durusau @ 7:05 pm

Spring Data Graph 1.1.0 with Neo4j support released

From the wiki:

We are pleased to announce that the second release (1.1.0.RELEASE) of the Spring Data Graph project with Neo4j support is now available!

After the first public release of Spring Data Graph in April 2011 we mainly focused on user feedback.

With the improved documentation around the tooling and an upgraded AspectJ version we addressed many of the AspectJ issues that where reported by users. With the latest STS and Eclipse and hopefully with Idea11 it is possible to develop Spring Data Graph applications without the red wiggles. To further ease the development we also provided sample build scripts for ant/ivy and a plugin for gradle.

Of course we kept pace with development of Neo4j, currently using the latest stable release of Neo4j (1.4.1).

During the last months of Neo4j development the improved querying (Cypher, Gremlin) support was one of the important aspects.

So we strove to support it on all levels. Now, it is possible to execute Cypher queries from Spring Data Graph Repositories, from the Neo4j-Template but also as part of dynamic field annotations and via the introduced entity methods. The same goes for Gremlin scripts. What’s possible with this new expressive power? Let’s take a look. …

OK, better? Worse? About the same? Projects can’t improve without your feedback. Issues discussed only around water coolers can’t be addressed. Yes?

There’s some famous so-and-so’s Law about non-reported comments but I can’t find the reference. You?

Public Policy by Bayesian Model?

Filed under: Bayesian Models — Patrick Durusau @ 7:04 pm

Discussion of “Bayesian Models and Methods in Public Policy and Government Settings” by S. E. Fienberg, written by David J. Hand.

Abstract:

Fienberg convincingly demonstrates that Bayesian models and methods represent a powerful approach to squeezing illumination from data in public policy settings. However, no school of inference is without its weaknesses, and, in the face of the ambiguities, uncertainties, and poorly posed questions of the real world, perhaps we should not expect to find a formally correct inferential strategy which can be universally applied, whatever the nature of the question: we should not expect to be able to identify a “norm” approach. An analogy is made between George Box’s “no models are right, but some are useful,” and inferential systems.

A cautionary tale that reaches beyond Bayesian models. It is very often (always?) the case that models find the object of investigation.

August 20, 2011

WordNet Data > 10.3 Billion Unique Values

Filed under: Dataset,Linguistics,WordNet — Patrick Durusau @ 8:08 pm

WordNet Data > 10.3 Billion Unique Values

Wanted to draw your attention to some WordNet data files.

From the readme.TXT file in the directory:

As of August 19, 2011 pairwise measures for all nouns using the path measure are available. This file is named WordNet-noun-noun-path-pairs.tar. It is approximately 120 GB compressed. In this file you will find 146,312 files, one for each noun sense. Each file consists of 146,313 lines, where each line (except the first) contains a WordNet noun sense and the similarity to the sense featured in that particular file. Doing the math here, you find that each .tar file contains about 21,000,000,000 pairwise similarity values. Note that these are symmetric (sim (A,B) = sim (B,A)) so you have around 10 billion unique values.

We are currently running wup, res, and lesk, but do not have an estimated date of availability yet.

BTW, on verb data:

These files were created with WordNet::Similarity version 2.05 using WordNet 3.0. They show all the pairwise verb-verb similarities found in WordNet according to the path, wup, lch, lin, res, and jcn measures. The path, wup, and lch are path-based, while res, lin, and jcn are based on information content.

As of March 15, 2011 pairwise measures for all verbs using the six measures above are available, each in their own .tar file. Each *.tar file is named as WordNet-verb-verb-MEASURE-pairs.tar, and is approx 2.0 – 2.4 GB compressed. In each of these .tar files you will find 25,047 files, one for each verb sense. Each file consists of 25,048 lines, where each line (except the first) contains a WordNet verb sense and the similarity to the sense featured in that particular file. Doing the math here, you find that each .tar file contains about 625,000,000 pairwise similarity values. Note that these are symmetric (sim (A,B) = sim (B,A)) so you have a bit more than 300 million unique values.
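
To make the readme's arithmetic explicit, here is a throwaway check of both counts. The sense counts are the ones quoted above; n senses give roughly n squared stored values and n(n-1)/2 unique pairs.

  public class WordNetPairCounts {
      public static void main(String[] args) {
          long nouns = 146_312L;   // noun senses: one file per sense, one value per line after the header
          long verbs = 25_047L;    // verb senses

          System.out.printf("noun values stored: %,d  unique pairs: %,d%n",
                  nouns * nouns, nouns * (nouns - 1) / 2);
          System.out.printf("verb values stored: %,d  unique pairs: %,d%n",
                  verbs * verbs, verbs * (verbs - 1) / 2);
          // noun values stored: 21,407,201,344  unique pairs: 10,703,527,516
          // verb values stored:    627,352,209  unique pairs:    313,663,581
      }
  }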

B.A.D. Data Is Not Always Bad…If You Have a Data Scientist

Filed under: Data Analysis,Marketing — Patrick Durusau @ 8:07 pm

B.A.D. Data Is Not Always Bad…If You Have a Data Scientist by Frank Coleman.

From the post:

How many times have you heard, “Bad data means bad decisions”? Starting with the Best Available Data (B.A.D.) is a great approach because it gets the inspection process moving. The best way to engage key stakeholders is to show them their numbers, even if you have low confidence in the results. If done well, you will be speaking with a group of passionate colleagues!

People are often afraid to start measuring a project or initiative because they have low confidence in the quality of the data they are accessing. But there is a great deal you can do with B.A.D. data; start by looking for trends. Many times the trend is all you really need to get going. Make sure you also understand what the distribution of this data looks like. You don’t have to be a Six Sigma black belt (though it helps) to know if the data has a normal distribution. From there you can “geek out” if you want, but your time will be better served by keeping it simple – especially at this stage.

A bit “practical” for my tastes, ;-), but worth your attention.

The secrets to successful data visualization

Filed under: Data Analysis,Visualization — Patrick Durusau @ 8:07 pm

The secrets to successful data visualization by Reena Jana.

From the post:

Effective data visualization is about more than designing an eye-catching graphic. It’s about telling a clear and accurate story that draws readers in via powerful choices of shapes and colors. These are some of the observations you’ll find in the insightful new book Visualize This: The Flowing Data Guide to Design, Visualization, and Statistics (Wiley) by Nathan Yau, the blogger behind the popular site Flowing Data. On his blog, Yau analyzes a wide variety of graphs and charts from around the world–and often sparks online discussions and debates among designers.

You have seen the Flowing Data site mentioned here more than once or twice.

If you don’t read another post this weekend, go to Reena’s post and read it. You will get something from it.

The Open Data Manual

Filed under: Open Data,Topic Maps — Patrick Durusau @ 8:06 pm

The Open Data Manual

From the website:

This report discusses legal, social and technical aspects of open data. The manual can be used by anyone but is especially designed for those seeking to open up data. It discusses the why, what and how of open data – why to go open, what open is, and the how to ‘open’ data.

From the introduction:

Do you know exactly how much of your tax money is spent on street lights or on cancer research? What is the shortest, safest and most scenic bicycle route from your home to your work? And what is in the air you breathe along the way? Where in your region will you find the best job opportunities and the highest number of fruit trees per capita? When can you influence decisions about topics you deeply care about, and whom should you talk to?

New technologies now make it possible to build the services to answer these questions automatically. Much of the data you would need to answer these questions is generated by public bodies. However, often the data required is not yet available in a form that makes it easy to use. This book is about how to unlock the potential of official and other information to enable new services, to improve the lives of citizens and make government and society work better.

The more data that is available, the more data topic maps can integrate and deliver for your purposes.

May the Index be with you!

Filed under: MySQL,Query Language,SQL — Patrick Durusau @ 8:06 pm

May the Index be with you! by Lawrence Schwartz.

From the post:

The summer’s end is rapidly approaching — in the next two weeks or so, most people will be settling back into work. Time to change your mindset, re-evaluate your skills and see if you are ready to go back from the picnic table to the database table.

With this in mind, let’s see how much folks can remember from the recent indexing talks my colleague Zardosht Kasheff gave (O’Reilly Conference, Boston, and SF MySQL Meetups). Markus Winand’s site “Use the Index, Luke!” (not to be confused with my favorite Star Wars parody, “Use the Schwartz, Lone Starr!”), has a nice, quick 5 question indexing quiz that can help with this.

Interesting enough that I requested an account so I could download TokuDB v5.0. It uses fractal trees for indexing speed. Could be interesting. More on that later.

Clojure Toolbox

Filed under: Clojure — Patrick Durusau @ 8:05 pm

Clojure Toolbox

From the website:

A categorised directory of libraries and tools for Clojure.

Linked Data Patterns – New Draft

Filed under: Linked Data,Semantic Web — Patrick Durusau @ 8:04 pm

Linked Data Patterns – New Draft – Leigh Dodds and Ian Davis have released a new draft.

From the website:

A pattern catalogue for modelling, publishing, and consuming Linked Data.

Think of it as Linked Data without all the “put your hand on your computer and feel the power of URI” stuff you hear in some quarters.

For example, the solution for “How do we publish non-global identifiers in RDF?” is:

Create a custom property, as a sub-class of the dc:identifier property for relating the existing literal key value with the resource.

And the discussion reads:

While hackable URIs are a useful short-cut they don’t address all common circumstances. For example different departments within an organization may have different non-global identifiers for a resource; or the process and format for those identifiers may change over time. The ability to algorithmically derive a URI is useful but limiting in a global sense as knowledge of the algorithm has to be published separately to the data.

By publishing the original “raw” identifier as a literal property of the resource we allow systems to look-up the URI for the associated resource using a simple SPARQL query. If multiple identifiers have been created for a resource, or additional identifiers assigned over time, then these can be added as additional repeated properties.

For systems that may need to bridge between the Linked Data and non-Linked Data views of the world, e.g. integrating with legacy applications and databases that do not store the URI, then the ability to find the identifier for the resource provides a useful integration step.

If I aggregate the non-Linked Data identifiers as sub-classes of dc:identifier, isn’t that a useful integration step whether I am using Linked Data or not?

The act of aggregating identifiers is a useful integration step, by whatever syntax. Yes?

My principal disagreement with Linked Data and other “universal” identification systems is that none of them are truly universal or long lasting. Rhetoric to the contrary notwithstanding.
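
For concreteness, here is a sketch of the pattern with Apache Jena (my choice of toolkit, not the book's): publish the legacy key as a literal via a custom property related to dc:identifier (modelled here with rdfs:subPropertyOf), then look the URI back up with a SPARQL query. The namespace, property name and resource URI are invented for the example.

  import org.apache.jena.query.QueryExecution;
  import org.apache.jena.query.QueryExecutionFactory;
  import org.apache.jena.query.ResultSet;
  import org.apache.jena.rdf.model.Model;
  import org.apache.jena.rdf.model.ModelFactory;
  import org.apache.jena.rdf.model.Property;
  import org.apache.jena.rdf.model.Resource;
  import org.apache.jena.vocabulary.DCTerms;
  import org.apache.jena.vocabulary.RDFS;

  public class LiteralKeySketch {
      public static void main(String[] args) {
          String ex = "http://example.org/ns#";
          Model model = ModelFactory.createDefaultModel();

          // A custom property declared as a specialization of dc:identifier.
          Property legacyId = model.createProperty(ex, "legacyId");
          legacyId.addProperty(RDFS.subPropertyOf, DCTerms.identifier);

          // Publish the non-global key as a plain literal on the resource.
          Resource act = model.createResource("http://example.org/legislation/1985/67");
          act.addProperty(legacyId, "ACT-1985-67");

          // A consumer that only knows the legacy key can look the URI up.
          String query =
                  "PREFIX ex: <" + ex + "> " +
                  "SELECT ?resource WHERE { ?resource ex:legacyId \"ACT-1985-67\" }";
          try (QueryExecution qe = QueryExecutionFactory.create(query, model)) {
              ResultSet results = qe.execSelect();
              while (results.hasNext()) {
                  System.out.println(results.next().getResource("resource"));
              }
          }
      }
  }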
