Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

January 16, 2013

Optimizing TM Queries?

Filed under: Query Language,TMQL,XML,XQuery — Patrick Durusau @ 7:56 pm

A recent paper by V. Benzaken, G. Castagna, D. Colazzo, and K. Nguyễn, Optimizing XML querying using type-based document projection, suggests some interesting avenues for optimizing topic map queries.

Abstract:

XML data projection (or pruning) is a natural optimization for main memory query engines: given a query Q over a document D, the subtrees of D that are not necessary to evaluate Q are pruned, thus producing a smaller document D′; the query Q is then executed on D′, hence avoiding to allocate and process nodes that will never be reached by Q.

In this article, we propose a new approach, based on types, that greatly improves current solutions. Besides providing comparable or greater precision and far lesser pruning overhead, our solution ―unlike current approaches― takes into account backward axes, predicates, and can be applied to multiple queries rather than just to single ones. A side contribution is a new type system for XPath able to handle backward axes. The soundness of our approach is formally proved. Furthermore, we prove that the approach is also complete (i.e., yields the best possible type-driven pruning) for a relevant class of queries and Schemas. We further validate our approach using the XMark and XPathMark benchmarks and show that pruning not only improves the main memory query engine’s performances (as expected) but also those of state of the art native XML databases.

The abstract is phrased in traditional XML terms, but imagine pruning a topic map by topic or association type, for example, before executing a query.

While it is true enough that a query could include a topic type, there remains the matter of examining all the instances of that topic type before proceeding to the rest of the query.

For common query sub-maps, as it were, I suspect that pruning once and storing the results could be a viable alternative.
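To make the pruning idea concrete, here is a minimal sketch in Python over a toy in-memory representation. The data layout, type names and query are my assumptions for illustration, not any particular topic map engine or TMQL.

```python
# Toy illustration: prune an in-memory "topic map" down to the topic types a
# query touches, then run the query against the smaller pruned map.
# The data layout here is an assumption for illustration, not a TMQL engine.

def prune_by_type(topics, query_types):
    """Keep only topics whose 'type' appears in the set of types the query uses."""
    return {tid: t for tid, t in topics.items() if t["type"] in query_types}

def run_query(topics, predicate):
    """Stand-in for query evaluation: return topic ids matching a predicate."""
    return [tid for tid, t in topics.items() if predicate(t)]

topics = {
    "t1": {"type": "person",  "name": "Ada Lovelace"},
    "t2": {"type": "person",  "name": "Charles Babbage"},
    "t3": {"type": "machine", "name": "Analytical Engine"},
}

# A query that only cares about topics of type 'person'...
pruned = prune_by_type(topics, {"person"})              # 2 topics instead of 3
print(run_query(pruned, lambda t: "Ada" in t["name"]))  # ['t1']

# "Prune once, store the result": the pruned map could be cached and reused
# for every query known to touch only the 'person' type.
```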

Processing millions or billions of nodes may make for more impressive charts, but processing the right set of nodes and producing a useful answer has its supporters.

Tombstones in Topic Map Future?

Watching the What’s New in Cassandra 1.2 (Notes) webcast, I encountered an unfamiliar term: “tombstones.”

If you are already familiar with the concept, skip to another post.

If you’re not, the concept is used in distributed systems that maintain “eventual” consistency by having the nodes replicate their content. That works if all nodes are available, but what if you delete data while a node is unavailable? When it comes back, the other nodes appear to be “missing” data that needs to be replicated.

From the description at the Cassandra wiki, DistributedDeletes, not an easy problem to solve.

So, Cassandra turns it into a solvable problem.

Deletes are implemented with a special value known as a tombstone. The tombstone is propagated to nodes that missed the initial delete.

Since you will eventually want to delete the tombstones as well, a grace period can be set, which is slightly longer than the period needed to replace a non-responding node.
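A minimal sketch of the idea in Python (a toy simulation, not Cassandra’s actual implementation) may help: a delete writes a timestamped tombstone, replicas converge by last-write-wins replication, and tombstones older than a grace period are eventually purged. The class names, timestamps and grace period value are illustrative assumptions.

```python
import time

GRACE_PERIOD = 10 * 24 * 3600  # purge tombstones after ~10 days (illustrative)

class Replica:
    """Toy last-write-wins replica; not Cassandra, just the tombstone idea."""
    def __init__(self):
        self.store = {}  # key -> (timestamp, value or "TOMBSTONE")

    def put(self, key, value, ts):
        if key not in self.store or ts > self.store[key][0]:
            self.store[key] = (ts, value)

    def delete(self, key, ts):
        # A delete is just a write of a special marker value.
        self.put(key, "TOMBSTONE", ts)

    def merge(self, other):
        # Replication: take the newer write for every key (tombstones included),
        # so a replica that missed the delete learns about it here.
        for key, (ts, value) in other.store.items():
            self.put(key, value, ts)

    def purge(self, now):
        # Drop tombstones older than the grace period.
        self.store = {k: (ts, v) for k, (ts, v) in self.store.items()
                      if not (v == "TOMBSTONE" and now - ts > GRACE_PERIOD)}

a, b = Replica(), Replica()
a.put("row1", "data", ts=1)
b.merge(a)                # both replicas now have row1
a.delete("row1", ts=2)    # b is "down" and misses the delete
b.merge(a)                # when b is back, the tombstone propagates
b.purge(now=time.time())  # much later, the tombstone itself is removed
```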

Distributed topic maps will face the same issue.

The issue is complicated by imperative programming models of merging, in which changes to the properties that drive merging are difficult to manage.

Perhaps functional models of merging, as with other forms of distributed processing, will carry the day.

Free Datascience books

Filed under: Data,Data Mining,Data Science — Patrick Durusau @ 7:55 pm

Free Datascience books by Carl Anderson

From the post:

I’ve been impressed in recent months by the number and quality of free datascience/machine learning books available online. I don’t mean free as in some guy paid for a PDF version of an O’Reilly book and then posted it online for others to use/steal, but I mean genuine published books with a free online version sanctioned by the publisher. That is, “the publisher has graciously agreed to allow a full, free version of my book to be available on this site.”
Here are a few in my collection:

Any you would like to add to the list?

I first saw this in Four short links: 1 January 2013 by Nat Torkington.

ANN: HBase 0.94.4 is available for download

Filed under: Hadoop,HBase — Patrick Durusau @ 7:55 pm

ANN: HBase 0.94.4 is available for download by Lars Hofhansl

Bug fix release with 81 issues resolved plus performance enhancements!

First seen in a tweet by Stack.

Packetpig Finding Zero Day Attacks

Filed under: Pig,Security — Patrick Durusau @ 7:55 pm

Packetpig Finding Zero Day Attacks by Michael Baker.

From the post:

When Russell Jurney and I first teamed up to write these posts we wanted to do something that no one had done before to demonstrate the power of Big Data, the simplicity of Pig and the kind of Big Data Security Analytics we perform at Packetloop. Packetpig was modified to support Amazon’s Elastic Map Reduce (EMR) so that we could process a 600GB set of full packet captures. All that we needed was a canonical Zero Day attack to analyse. We were in luck!

In August 2012 a vulnerability in Oracle JRE 1.7 created huge publicity when it was disclosed that a number of Zero Day attacks had been reported to Oracle in April but had still not been addressed in late August 2012. To make matters worse, Oracle’s scheduled patch for JRE was months away (October 16). This position subsequently changed and a number of out-of-band patches for JRE were released for what became known as CVE-2012-4681 on the 30th of August.

The vulnerability exposed around 1 Billion systems to exploitation and the exploit was 100% effective on Windows, Mac OSX and Linux. A number of security researchers were already seeing the exploit in the wild as it was incorporated into exploit packs for the delivery of malware.

Interesting tool for packet analysis as well as insight on using Amazon’s EMR to process 600 GB of packets.

Packetpig could be an interesting source of data for creating maps or adding content to maps, based on packet traffic content.

January 15, 2013

Importing RDF into Faunus

Filed under: Faunus,RDF — Patrick Durusau @ 8:32 pm

RDF Format

A description of the RDFInputFormat, which converts RDF’s edge list format into the adjacency list format used by Faunus.

Currently supports:

  • rdf-xml
  • n-triples
  • turtle
  • n3
  • trix
  • trig
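As a rough illustration of what such a conversion does, here is a plain-Python sketch (not the actual RDFInputFormat) that reshapes N-Triples-style edges, grouped by subject, into an adjacency list; the example triples are made up.

```python
from collections import defaultdict

# Toy N-Triples-like edges: (subject, predicate, object).
# The real RDFInputFormat parses actual RDF serializations; this only shows
# the edge-list -> adjacency-list reshaping.
triples = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:name",  '"Alice"'),
    ("ex:bob",   "foaf:knows", "ex:carol"),
]

adjacency = defaultdict(list)
for s, p, o in triples:
    adjacency[s].append((p, o))   # each vertex carries its outgoing edges

for vertex, edges in adjacency.items():
    print(vertex, edges)
# ex:alice [('foaf:knows', 'ex:bob'), ('foaf:name', '"Alice"')]
# ex:bob [('foaf:knows', 'ex:carol')]
```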

The converter won’t help with the lack of specified identification properties.

But, format conversion can’t increase the amount of information stored in a format.

At best it can be lossless.

XQuery 3.0: An XML Query Language [Subject Identity Equivalence Language?]

Filed under: Identity,XML,XQuery — Patrick Durusau @ 8:32 pm

XQuery 3.0: An XML Query Language – W3C Candidate Recommendation

Abstract:

XML is a versatile markup language, capable of labeling the information content of diverse data sources including structured and semi-structured documents, relational databases, and object repositories. A query language that uses the structure of XML intelligently can express queries across all these kinds of data, whether physically stored in XML or viewed as XML via middleware. This specification describes a query language called XQuery, which is designed to be broadly applicable across many types of XML data sources.

Just starting to read the XQuery CR but the thought occurred to me that it could be a basis for a “subject identity equivalence language.”

Rather than duplicating the work on expressions, paths, data types, operators, etc., why not take all that as given?

It would suffice to define a “subject equivalence function,” the variables of which are XQuery statements that identify values (or value expressions) as required, optional or forbidden, together with a definition of the function’s results.
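A sketch of what such a function might look like, written in Python rather than XQuery purely to show the shape of the idea; the property names and the required/optional/forbidden handling are my assumptions, not anything in the Candidate Recommendation.

```python
def subjects_equivalent(a, b, required, optional=(), forbidden=()):
    """Toy subject-equivalence test over property dictionaries.

    In the proposal sketched above, each constraint would be an XQuery
    expression evaluated against the two representations; here plain
    property lookups stand in for those expressions.
    """
    # Every required property must be present in both and agree.
    for prop in required:
        if a.get(prop) is None or a.get(prop) != b.get(prop):
            return False
    # A match on a forbidden property blocks equivalence.
    for prop in forbidden:
        if a.get(prop) is not None and a.get(prop) == b.get(prop):
            return False
    # Optional properties, if present on both sides, must not conflict.
    for prop in optional:
        if a.get(prop) and b.get(prop) and a.get(prop) != b.get(prop):
            return False
    return True

book_a = {"isbn": "978-3-540-75196-0", "title": "Foundations of Rule Learning"}
book_b = {"isbn": "978-3-540-75196-0", "publisher": "Springer"}
print(subjects_equivalent(book_a, book_b, required=["isbn"]))  # True
```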

Reusing a well-tested query language seems preferable to writing an entirely new one from scratch.

Suggestions?

I first saw this in a tweet by Michael Kay.

R Is Not So Hard! A Tutorial

Filed under: Programming,R — Patrick Durusau @ 8:32 pm

David Lillis is writing a tutorial on R under the title: R Is Not So Hard! A Tutorial.

So far:

Part 1: Basic steps with R.

Part 2: Creation of two variables and what can be done with them.

Part 3: Covers using a regression model.

Edd Dumbill calls out R by name in The future of programming.

Poorly Researched Infographics [Adaptation for Topic Maps?]

Filed under: Advertising,Marketing,Topic Maps — Patrick Durusau @ 8:31 pm

Phillip Price posted this at When you SHARE poorly researched infographics….

[Image: “Ride with Hitler” WWII poster]

Two questions:

  1. Your suggestions for a line about topic maps (same image)?
  2. What other “classic” posters merit re-casting to promote topic maps?

I am not sure how to adapt the Scot Towel poster that headlines:

Is your washroom breeding Bolsheviks?

Comments/suggestions?

Symbolab

Filed under: Mathematics,Mathematics Indexing,Search Engines,Searching — Patrick Durusau @ 8:31 pm

Symbolab

Described as:

Symbolab is a search engine for students, mathematicians, scientists and anyone else looking for answers in the mathematical and scientific realm. Other search engines that do equation search use LaTeX, the document markup language for mathematical symbols which is the same as keywords, which unfortunately gives poor results.

Symbolab uses proprietary machine learning algorithms to provide the most relevant search results that are theoretically and semantically similar, rather than visually similar. In other words, it does a semantic search, understanding the query behind the symbols, to get results.

The nice thing about math and science is that it’s universal – there’s no need for translation in order to understand an equation. This means scale can come much quicker than other search engines that are limited by language.

From: The guys from The Big Bang Theory will love mathematical search engine Symbolab by Shira Abel (includes an interview with Michael Avny, the CEO of Symbolab).

Limited to web content at the moment but a “scholar” option is in the works. I assume that will extend into academic journals.

Focused now on mathematics, physics and chemistry, but in principle should be extensible to related areas. I am particularly anxious to hear they are indexing CS publications!

Would be really nice if Springer, Elsevier, the AMS and others would permit indexing of their equations.

That presumes publishers would realize that shutting out users not at institutions is a bad marketing plan. With a marginal delivery cost of near zero and sunk costs from publication already fixed, every user a publisher gains at $200/year for their entire collection is $200 they did not have before.

Not to mention the citation and use of their publication, which just drives more people to publish there. A virtuous circle if you will.

The only concern I have is the comment:

The nice thing about math and science is that it’s universal – there’s no need for translation in order to understand an equation.

Which is directly contrary to what Michael is quoted as saying in the interview:

You say “Each symbol can mean different things within and across disciplines, order and position of elements matter, priority of features, etc.” Can you give an example of this?

The authors of the Foundations of Rule Learning spent five years attempting to reconcile notations used in rule making. Some symbols had different meanings. They resorted to inventing yet another notation as a solution.

Why the popular press perpetuates the myth of a universal language isn’t clear.

It isn’t useful and in some cases, such as national security, it leads to waste of time and resources on attempts to invent a universal language.

The phrase “myth of a universal language” should be a clue. Universal languages don’t exist. They are myths, by definition.

Anyone who says differently is trying to sell you something, Something that is in their interest and perhaps not yours.

I first saw this at Introducing Symbolab: Search for Equations by Angela Guess.

Is Linked Data the future of data integration in the enterprise?

Filed under: Linked Data,LOD — Patrick Durusau @ 8:31 pm

Is Linked Data the future of data integration in the enterprise? by John Walker.

From the post:

Following the basic Linked Data principles we have assigned HTTP URIs as names for things (resources) providing an unambiguous identifier. Next up we have converted data from a variety of sources (XML, CSV, RDBMS) into RDF.

One of the key features of RDF is the ability to easily merge data about a single resource from multiple sources into a single “supergraph” providing a more complete description of the resource. By loading the RDF into a graph database, it is possible to make an endpoint available which can be queried using the SPARQL query language. We are currently using Dydra as their cloud-based database-as-a-service model provides an easy entry route to using RDF without requiring a steep learning curve (basically load your RDF and you’re away), but there are plenty of other options like Apache Jena and OpenRDF Sesame. This has made it very easy for us to answer complex questions requiring data from multiple sources, moreover we can stand up APIs providing access to this data in minutes.

By using a Linked Data Platform such as Graphity we can make our identifiers (HTTP URIs) dereferencable. In layman’s terms, when someone plugs the URI into a browser, we provide a description of the resource in HTML. Using content negotiation we are able to provide this data in one of the standard machine-readable XML, JSON or Turtle formats. Graphity uses Java and XSLT 2.0 which our developers already have loads of experience with and provides powerful mechanisms with which we will be able to develop some great web apps.

What do you make of:

One of the key features of RDF is the ability to easily merge data about a single resource from multiple sources into a single “supergraph” providing a more complete description of the resource.

???

I suppose if by some accident we all use the same URI as an identifier, that would be the case. But that hardly requires URIs, Linked Data or RDF.
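For readers who want to see the mechanics being claimed, a minimal sketch using the rdflib Python library shows how completely the “supergraph” merge depends on both sources using an identical URI; the URIs and data below are invented.

```python
from rdflib import Graph

source_a = """
@prefix ex: <http://example.org/> .
ex:resource42 ex:name "Widget" .
"""

source_b = """
@prefix ex: <http://example.org/> .
ex:resource42 ex:weight "3.2" .
"""

merged = Graph()
for data in (source_a, source_b):
    merged.parse(data=data, format="turtle")

# Both sources happened to use http://example.org/resource42, so their
# statements land on the same node -- two triples about one resource.
for s, p, o in merged:
    print(s, p, o)

# Had source_b used ex:item42 instead, nothing would "merge": the graph would
# simply contain statements about two different resources.
```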

Scientific conferences on digital retrieval in the 1950s worried about diversity of nomenclature being a barrier to the discovery of resources. If we haven’t addressed the semantic diversity issue in sixty (60) years of talking about it, it isn’t clear how creating another set of diverse names is going to help.

There may be other reasons for using URIs but seamless merging doesn’t appear to be one of them.

Moreover, how do I know what you have identified with a URI?

You can return one or more properties for a URI, but which ones matter for the identity of the subject it identifies?

I first saw this at Linked Data: The Future of Data Integration by Angela Guess.

The future of programming [A Cacophony of Semantic Primitives]

Filed under: Identity,Programming — Patrick Durusau @ 8:30 pm

The future of programming by Edd Dumbill.

You need to read Edd’s post on the future of programming in full, but there are two points I would like to pull out for your attention:

  1. Expansion of people engaged in programming:

    In our age of exploding data, the ability to do some kind of programming is increasingly important to every job, and programming is no longer the sole preserve of an engineering priesthood.

  2. Data as first class citizen

    As data and its analysis grow in importance, there’s a corresponding rise in use and popularity of languages that treat data as a first class citizen. Obviously, statistical languages such as R are rising on this tide, but within general purpose programming there’s a bias to languages such as Python or Clojure, which make data easier to manipulate.

The most famous occasion when a priesthood lost the power of sole interpretation was the Protestant Reformation.

Although there was already a wide range of interpretations, as the priesthood of believers grew over the centuries, so did the diversity of interpretation and semantics.

Even though there is already a wide range of semantics in programming, the broader participation becomes, the more the semantics of programming will diversify. Not in terms of the formal semantics defined by language designers, but in terms of the semantics as used by programmers.

Semantics being the province of usage, I am betting on semantics as used being the clear winner.

Data being treated as a first class citizen carries with it the seeds of even more semantic diversity. Data, after all, originates with users and is only meaningful when some user interprets it.

Users are going to “see” data as having the semantics they attribute to it, not the semantics as defined by other programmers or sources.

To use another analogy from religion, the Old Testament/Hebrew Bible can be read in the context of Ancient Near Eastern religions and practices or taken as a day by day calendar from the point of creation. And several variations in between. All relying on the same text.

For decades programmers have pretended programming was based on semantic primitives. Semantic primitives that could be reliably interchanged, albeit sometimes with difficulty, with other systems. But users and their data are shattering the illusion of semantic primitives.

More accurately they are putting other notions of semantic primitives into play.

A cacophony of semantic primitives bodes poorly for a future of distributed, device, data and democratized computing.

Avoidable to the degree that we choose to not silently rely upon others “knowing what we meant.”

I first saw this at The four D’s of programming’s future: data, distributed, device, democratized by David Smith.

Graphs as a New Way of Thinking [Really?]

Filed under: Graphs,Neo4j,Networks — Patrick Durusau @ 8:30 pm

Graphs as a New Way of Thinking by Emil Eifrem.

From the post:

Faced with the need to generate ever-greater insight and end-user value, some of the world’s most innovative companies — Google, Facebook, Twitter, Adobe and American Express among them — have turned to graph technologies to tackle the complexity at the heart of their data.

To understand how graphs address data complexity, we need first to understand the nature of the complexity itself. In practical terms, data gets more complex as it gets bigger, more semi-structured, and more densely connected.

We all know about big data. The volume of net new data being created each year is growing exponentially — a trend that is set to continue for the foreseeable future. But increased volume isn’t the only force we have to contend with today: On top of this staggering growth in the volume of data, we are also seeing an increase in both the amount of semi-structure and the degree of connectedness present in that data.

He later concludes with:

Graphs are a new way of thinking for explicitly modeling the factors that make today’s big data so complex: Semi-structure and connectedness. As more and more organizations recognize the value of modeling data with a graph, they are turning to the use of graph databases to extend this powerful modeling capability to the storage and querying of complex, densely connected structures. The result is the opening up of new opportunities for generating critical insight and end-user value, which can make all the difference in keeping up with today’s competitive business environment.

I know it is popular rhetoric to say that X technology is a “new way of thinking.” Fashionable perhaps but also false.

People have always written about “connections” between people, institutions, events, etc. If you don’t believe me, find an online version of Plutarch.

Where I do think Emil has a good point is when he says: “Graphs are…for explicitly modeling the factors…,” which is no mean feat.

The key to disentangling big data isn’t “new thinking” or navel gazing about its complexity.

One key step is making connections between data (big or otherwise), explicit. Unless it is explicit, we can’t know for sure if we are talking about the same connection or not.

Another key step is identifying the data we are talking about (in topic maps terms, the subject of conversation) and how we identify it.

It isn’t rocket science nor does it require a spiritual or intellectual re-birth.

It does require some effort to make explicit what we usually elide over in conversation or writing.

For example, earlier in this post I used the term “Emil” and you instantly knew who I meant. A mechanical servant reading the same post might not be so lucky. Nor would it supply the connection to Neo4j.
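To show what “explicit” could look like at its most modest, here is a toy Python structure (not a topic map syntax or any particular graph model; the identifier and association type are invented):

```python
# The prose said "Emil" and relied on the reader to supply who is meant and
# why Neo4j is relevant. Making both explicit, even crudely, might look like:
emil = {
    "name": "Emil Eifrem",
    "identifiers": ["http://example.org/people/emil-eifrem"],  # invented identifier
}

connection = {
    "type": "works-on",
    "members": {"person": emil["name"], "project": "Neo4j"},
    "source": "Graphs as a New Way of Thinking (blog post)",
}

# A "mechanical servant" can now follow the connection without guessing.
print(connection["members"]["project"])  # Neo4j
```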

A low effort barrier to making those explicit would go a long way to managing big data, with no “new way of thinking” being required.

I first saw this at Thinking Differently with Graph Databases by Angela Guess.

Chinese Rock Music

Filed under: Music,OWL,RDF,Semantic Web — Patrick Durusau @ 8:30 pm

Experiences on semantifying a Mediawiki for the biggest recource about Chinese rock music: rockinchina.com by René Pickhardt.

From the post:

During my trip in China I was visiting Beijing on two weekends and Maceau on another weekend. These trips have been mainly motivated to meet old friends. Especially the heads behind the biggest English resource of Chinese Rock music Rock in China who are Max-Leonhard von Schaper and the founder of the biggest Chinese Rock Print Magazin Yang Yu. After looking at their wiki which is pure gold in terms of content but consists mainly of plain text I introduced them the idea of putting semantics inside the project. While consulting them a little bit and pointing them to the right resources Max did basically the entire work (by taking a one month holiday from his job. Boy this is passion!).

I am very happy to announce that the data of rock in china is published as linked open data and the process of semantifying the website is in great shape. In the following you can read about Max’s experiences doing the work. This is particularly interesting because Max has no scientific background in semantic technologies. So we can learn a lot on how to improve these technologies to be ready to be used by everybody:

Good to see that René hasn’t lost his touch for long blog titles. 😉

A very valuable lesson in the difficulties posed by current “semantic” technologies.

Max and company succeed, but only after heroic efforts.

Maps in R: Plotting data points on a map

Filed under: Geographic Data,Geography,Mapping,Maps,R — Patrick Durusau @ 8:30 pm

Maps in R: Plotting data points on a map by Max Marchi.

From the post:

In the introductory post of this series I showed how to plot empty maps in R.

Today I’ll begin to show how to add data to R maps. The topic of this post is the visualization of data points on a map.

Max continues this series with datasets from airports in Europe and demonstrates how to map the airports to geographic locations. He also represents the airports with icons that correspond to their traffic statistics.

Useful principles for any data set with events that can be plotted against geographic locations.

Parades, patrols, convoys, that sort of thing.

January 14, 2013

Using R with Hadoop [Webinar]

Filed under: Hadoop,R,RHadoop — Patrick Durusau @ 8:39 pm

Using R with Hadoop by David Smith.

From the post:

In two weeks (on January 24), Think Big Analytics' Jeffrey Breen will present a new webinar on using R with Hadoop. Here's the webinar description:

R and Hadoop are changing the way organizations manage and utilize big data. Think Big Analytics and Revolution Analytics are helping clients plan, build, test and implement innovative solutions based on the two technologies that allow clients to analyze data in new ways; exposing new insights for the business. Join us as Jeffrey Breen explains the core technology concepts and illustrates how to utilize R and Revolution Analytics’ RevoR in Hadoop environments.

Topics include:

  • How to use R and Hadoop
  • Hadoop streaming
  • Various R packages and RHadoop
  • Hive via JDBC/ODBC
  • Using Revolution’s RHadoop
  • Big data warehousing with R and Hive

You can register for the webinar at the link below. If you do plan to attend the live session (where you can ask Jeffrey questions), be sure to sign in early — we're limited to 1000 participants and there are already more than 1000 registrants. If you can't join the live session (or it's just not at a convenient time for you), signing up will also get you a link to the recorded replay and a download link for the slides as soon as they're available after the webinar.

Definitely one for the calendar!

Intelligent Content:…

Filed under: eBooks,Information Reuse,Publishing — Patrick Durusau @ 8:39 pm

Intelligent Content: How APIs Can Supply the Right Content to the Right Reader by Adam DuVander.

From the post:

When you buy a car, it comes with a thick manual that probably sits in your glove box for the life of the car. The experience with a new luxury car may be much different. That printed, bound manual may only contain the information relevant to your car. No leather seats, no two page spread on caring for the hide. That’s intelligent content. And it’s an opportunity for APIs to help publishers go way beyond the cookie cutter printed book. It also happens to be an exciting conference coming to San Francisco in February.

It takes effort to segment content, especially when it was originally written as one piece. There are many benefits to those that put in the effort to think of their content as a platform. Publisher Pearson did this with a number of its titles, most notably with its Pearson Eyewitness Guides API. Using the API, developers can take what was a standalone travel book–say, the Eyewitness Guide to London–and query individual locations. One can imagine travel apps using the content to display great restaurants or landmarks that are nearby, for example.

Traditional publishing is a market that is ripe for disruption, characterized by Berkeley professor Robert Glushko co-creating a new approach to academic textbooks with his students in the Future of E-books. Glushko is one of the speakers at the Intelligent Content Conference, which will bring together content creators, technologists and publishers to discuss the many opportunities. Also speaking is Netflix’s Daniel Jacobson, who architected a large redesign of the Netflix API in order to support hundreds of devices. And yes, I will discuss the opportunities for content-as-a-service via APIs.

ProgrammableWeb readers can still get in on the early bird discount to attend Intelligent Content, which takes place February 7-8 in San Francisco.

San Francisco in February sounds like a good idea. Particularly if the future of publishing is on the agenda.

I would observe that “intelligent content” implies that someone, that is, a person, has both authored the content and designed the API. It doesn’t happen auto-magically.

And with people involved, our old friend semantic diversity is going to be in the midst of the discussions, proposals and projects.

Reliable collation of data from different publishers (universities with multiple subscriptions should be pushing for this now) could make access seamless to end users.

1 Billion Videos = No Reruns

Filed under: Data,Entertainment,Social Media,Social Networks — Patrick Durusau @ 8:38 pm

Viki Video: 1 Billion Videos in 150 languages Means Never Having to Say Rerun by Greg Bates.

From the post:

Tired of American TV? Tired of TV in English? Escape to Viki, the leading global TV and movie network, which provides videos with crowd sourced translations in 150 languages. The Viki API allows your users to browse more than 1 billion videos by genre, country, and language, plus search across the entire database. The API uses OAuth2.0 authentication, REST, with responses in either JSON or XML.

The Viki Platform Google Group.

Now this looks like a promising data set!

A couple of use cases for topic maps come to mind:

  • An entry in an OPAC points a patron from the catalog to videos in this database.
  • An entry returned from the database maps to a book in the local library collection (via WorldCat) (more likely to appeal to me).

What use cases do you see?

Why you should try UserTesting.com

Filed under: Design,Interface Research/Design,Usability,Users — Patrick Durusau @ 8:37 pm

Why you should try UserTesting.com by Pete Warden.

From the post:

If you’re building a website or app you need to be using UserTesting.com, a service that crowd-sources QA. I don’t say that about many services, and I have no connection with the company (a co-worker actually discovered them) but they’ve transformed how we do testing. We used to have to stalk coffee shops and pester friends-of-friends to find people who’d never seen Jetpac before and were willing to spend half an hour of their life being recorded while they checked it out. It meant the whole process took a lot of valuable time, so we’d only do it a few times a month. This made life tough for the engineering team as the app grew more complex. We have unit tests, automated Selenium tests, and QA internally, but because we’re so dependent on data caching and crunching, a lot of things only go wrong when a completely new user first logs into the system.

Another approach to user testing of your website or interface design.

howdoi

Filed under: Programming,Searching — Patrick Durusau @ 8:37 pm

howdoi

A command line tool for instant coding answers.

Command line searching of stackoverflow.

I first saw this at Four short links: 9 January 2013 by Nat Torkington.

How To Make That One Thing Go Viral

Filed under: Advertising,Marketing,Topic Maps — Patrick Durusau @ 8:37 pm

How To Make That One Thing Go Viral (Slideshare)

From the description:

Everyone wants to know how to make that one thing go viral. Especially bosses. Here’s the answer. So now maybe they will stop asking you. See the Upworthy version of this here: http://www.upworthy.com/how-to-make-that-one-thing-go-viral-just-kidding?c=slideshare.

Worth reviewing every week or so until it becomes second nature.

Somehow I doubt: “Topic Maps: Reliable Sharing of Content Across Semantic Domains” is ever going viral.

Well, one down, 24 more to go.

😉

I first saw this at Four short links: 10 January 2013 by Nat Torkington.

How to implement an algorithm from a scientific paper

Filed under: Algorithms,Programming — Patrick Durusau @ 8:36 pm

How to implement an algorithm from a scientific paper by Emmanuel Goossaert.

From the post:

This article is a short guide to implementing an algorithm from a scientific paper. I have implemented many complex algorithms from books and scientific publications, and this article sums up what I have learned while searching, reading, coding and debugging. This is obviously limited to publications in domains related to the field of Computer Science. Nevertheless, you should be able to apply the guidelines and good practices presented below to any kind of paper or implementation.

Seven (7) rules, some with sub-parts, to follow when trying to implement an algorithm from a paper.

With the growth of research on parallel programming, likely the most important skill you will pick up this year.

I first saw this in Four short links: 11 January 2013 by Nat Torkington.

Fun with Beer – and Graphs

Filed under: Graphs,Networks — Patrick Durusau @ 8:36 pm

Fun with Beer – and Graphs by Rik van Bruggen.

From the post:

I make no excuses: My name is Rik van Bruggen and I am a salesperson. I think it is one of the finest and nicest professions in the world, and I love what I do. I love it specifically, because I get to sell great, awesome, fantastic products really – and I get to work with fantastic people along the way. But the point is I am not a technical person – at all. But, I do have a passion for technology, and feel the urge to understand and taste the products that I sell. And that’s exactly what happened a couple of months ago when I joined Neo Technology, the makers and maintainers of the popular Neo4j open source graph database.

So I decided to get my hands dirty and dive in head first. But also, to have some fun along the way.

The fun part would be coming from something that I thoroughly enjoy: Belgian beer. Some of you may know that Stella Artois, Hoegaerden, Leffe and the likes come from Belgium, but few of you know that this tiny little country in the lowlands around Brussels actually produces several thousand beers.

Belgian Beer

You can read about it on the Wikipedia page: Belgian beers are good, and numerous. So how would I go about putting Belgian beers into Neo4j? Interesting challenge.

Very useful post if you:

  • Don’t want to miss any Belgian beers as you drink your way through them.
  • Want to drink your way up or down in terms of alcohol percentage.
  • Want to memorize the names of all Belgian beers.
  • Oh, want to gain experience with Neo4j. 😉
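If you want a taste of the modeling before firing up Neo4j, a few lines of Python with networkx sketch the same shape of data. This is a stand-in for illustration, not Rik’s actual Neo4j model, and the beers, properties and relationship names are example choices.

```python
import networkx as nx

G = nx.Graph()

# Nodes for beers, breweries and a beer type (illustrative sample, not Rik's data).
G.add_node("Duvel", kind="beer", abv=8.5)
G.add_node("Westmalle Tripel", kind="beer", abv=9.5)
G.add_node("Duvel Moortgat", kind="brewery")
G.add_node("Brouwerij Westmalle", kind="brewery")
G.add_node("Tripel", kind="beertype")

G.add_edge("Duvel", "Duvel Moortgat", rel="brewed_by")
G.add_edge("Westmalle Tripel", "Brouwerij Westmalle", rel="brewed_by")
G.add_edge("Westmalle Tripel", "Tripel", rel="is_type")

# "Drink your way up in alcohol percentage": sort the beer nodes by abv.
beers = [(n, d["abv"]) for n, d in G.nodes(data=True) if d["kind"] == "beer"]
print(sorted(beers, key=lambda x: x[1]))
```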

On Graph Computing [Shared Vertices/Merging]

Filed under: Graphs,Merging,Networks — Patrick Durusau @ 8:35 pm

On Graph Computing by Marko A. Rodriguez.

Marko writes elegantly about graphs and I was about to put this down as another graph advocacy post. Interesting but graph followers have heard this story before.

But I do read the materials I cite, and Marko proceeds to define three separate graphs: software, discussion and concept, each of which has some vertices in common with one or both of the others.

Then he has this section:

A Multi-Domain Graph

The three previous scenarios (software, discussion, and concept) are representations of real-world systems (e.g. GitHub, Google Groups, and Wikipedia). These seemingly disparate models can be seamlessly integrated into a single atomic graph structure by means of shared vertices. For instance, in the associated diagram, Gremlin is a Titan dependency, Titan is developed by Matthias, and Matthias writes messages on Aurelius’ mailing list (software merges with discussion). Next, Blueprints is a Titan dependency and Titan is tagged graph (software merges with concept). The dotted lines identify other such cross-domain linkages that demonstrate how a universal model is created when vertices are shared across domains. The integrated, universal model can be subjected to processes that provide richer (perhaps, more intelligent) services than what any individual model could provide alone.

Shared vertices sounds a lot like merging in the topic map sense to me.

It isn’t clear from the post what requirements may or may not exist for vertices to be “shared.”

Or how would you state the requirements for sharing vertices?

Or how to treat edges that become duplicates when the separate vertices they connect become the same vertex?
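A small sketch of what “shared vertices” buys you, and of the duplicate-edge question, using Python and networkx as a neutral stand-in; the graphs below are invented for illustration and are not Marko’s diagram.

```python
import networkx as nx

# Two "domain" graphs that happen to share the vertex "Titan".
software = nx.Graph()
software.add_edge("Gremlin", "Titan", rel="dependency")

discussion = nx.Graph()
discussion.add_edge("Matthias", "Titan", rel="develops")
discussion.add_edge("Gremlin", "Titan", rel="dependency")   # same fact, restated

# compose() merges purely on vertex identifiers: "Titan" becomes one vertex.
merged = nx.compose(software, discussion)
print(sorted(merged.nodes()))        # ['Gremlin', 'Matthias', 'Titan']

# The restated Gremlin-Titan edge silently collapses into a single edge.
# Whether identically labeled edges *should* be treated as the same edge, and
# what to do when their attributes conflict, is exactly the merging rule that
# "shared vertices" by itself leaves unstated.
print(merged.number_of_edges())      # 2, not 3
```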

If “shared vertices” support what we call merging in topic maps, perhaps there are graph installations waiting to wake up as topic maps.

Connecting the Dots with Data Mashups (Webinar – 15th Jan. 2013)

Filed under: Data,Graphics,Mashups,Visualization — Patrick Durusau @ 1:53 pm

Connecting the Dots with Data Mashups (Webinar – 15th Jan. 2013)

From the webpage:

The Briefing Room with Lyndsay Wise and Tableau Software

While Big Data continues to grab headlines, most information managers know there are many more “small” data sets that are becoming more valuable for gaining insights. That’s partly because business users are getting savvier at mixing and matching all kinds of data, big and small. One key success factor is the ability to create compelling visualizations that clearly show patterns in the data.

Register for this episode of The Briefing Room to hear Analyst Lyndsay Wise share insights about best practices for designing data visualization mashups. She’ll be briefed by Ellie Fields of Tableau Software who will demonstrate several different business use cases in which such mashups have proven critical for generating significant business value.

I am particularly interested in the use cases part of the presentation.

Topic maps, after all, are re-usable and reliable mashups.

Finding places that like mashups+ (aka, topic maps) is a good marketing move.

PS: It took several minutes to discover a link for the webinar that did not have lots of tracking garbage attached to it. I am considering not listing events without clean URLs to registration materials. What do you think?

January 13, 2013

Principles for effective risk data aggregation and risk reporting

Filed under: Finance Services,Marketing,Topic Maps — Patrick Durusau @ 8:16 pm

Basel Committee issues “Principles for effective risk data aggregation and risk reporting – final document”

Not a very inviting title is it? 😉

Still, the report is important for banks, enterprises in general (if you take out the “r” word) and illustrates the need for topic maps.

From the post:

The Basel Committee on Banking Supervision today issued Principles for effective risk data aggregation and risk reporting.

The financial crisis that began in 2007 revealed that many banks, including global systemically important banks (G-SIBs), were unable to aggregate risk exposures and identify concentrations fully, quickly and accurately. This meant that banks’ ability to take risk decisions in a timely fashion was seriously impaired with wide-ranging consequences for the banks themselves and for the stability of the financial system as a whole.

The report goes into detail but the crux of the problem is contained in: “…were unable to aggregate risk exposures and identify concentrations fully, quickly and accurately.”

Easier said than fixed, but the critical failure was the inability to reliably aggregate data. (Where have you heard that before?)

Principles for effective risk data aggregation and risk reporting (full text) is only twenty-eight (28) pages and worth reading in full.

Of the fourteen (14) principles, seven (7) of them could be directly advanced by the use of topic maps:

Principle 2 Data architecture and IT infrastructure – A bank should design, build and maintain data architecture and IT infrastructure which fully supports its risk data aggregation capabilities and risk reporting practices not only in normal times but also during times of stress or crisis, while still meeting the other Principles….

33. A bank should establish integrated data taxonomies and architecture across the banking group, which includes information on the characteristics of the data (metadata), as well as use of single identifiers and/or unified naming conventions for data including legal entities, counterparties, customers and accounts. [footnote 16]

Footnote 16: Banks do not necessarily need to have one data model; rather, there should be robust automated reconciliation procedures where multiple models are in use.

Principle 3 Accuracy and Integrity – A bank should be able to generate accurate and reliable risk data to meet normal and stress/crisis reporting accuracy requirements. Data should be aggregated on a largely automated basis so as to minimise the probability of errors….

As a precondition, a bank should have a “dictionary” of the concepts used, such that data is defined consistently across an organisation. [What about across banks/sources?]
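Principles 2 and 3 are easier to picture with a toy example: two internal systems refer to the same counterparty by different local keys, and a shared “dictionary” mapping local keys to a single identifier is what lets exposures aggregate. Everything below, including the LEI-style identifiers and the numbers, is an illustrative assumption rather than anything prescribed by the Basel text.

```python
# Exposures as recorded by two bank systems, each with its own local key.
trading_system = [("CPTY-001", 125.0), ("CPTY-007", 40.0)]
loan_system    = [("LN-ACME",  300.0), ("LN-OTHER", 75.0)]

# The "dictionary" / single-identifier layer: local keys -> one shared id
# (made-up LEI-style codes, purely for illustration).
identifier_map = {
    "CPTY-001": "LEI:5493001KJTIIGC8Y1R12",   # ACME Corp in the trading system
    "LN-ACME":  "LEI:5493001KJTIIGC8Y1R12",   # the same ACME Corp in the loan book
    "CPTY-007": "LEI:9999000AAAABBBBCCCC11",
    "LN-OTHER": "LEI:8888000DDDDEEEEFFFF22",
}

aggregated = {}
for local_key, exposure in trading_system + loan_system:
    shared_id = identifier_map[local_key]
    aggregated[shared_id] = aggregated.get(shared_id, 0.0) + exposure

print(aggregated["LEI:5493001KJTIIGC8Y1R12"])   # 425.0 -- the concentration
                                                # neither system saw on its own
```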

Principle 4 Completeness – A bank should be able to capture and aggregate all material risk data across the banking group. Data should be available by business line, legal entity, asset type, industry, region and other groupings, as relevant for the risk in question, that permit identifying and reporting risk exposures, concentrations and emerging risks….

A banking organisation is not required to express all forms of risk in a common metric or basis, but risk data aggregation capabilities should be the same regardless of the choice of risk aggregation systems implemented. However, each system should make clear the specific approach used to aggregate exposures for any given risk measure, in order to allow the board and senior management to assess the results properly.

Principle 5 Timeliness – A bank should be able to generate aggregate and up-to-date risk data in a timely manner while also meeting the principles relating to accuracy and integrity, completeness and adaptability. The precise timing will depend upon the nature and potential volatility of the risk being measured as well as its criticality to the overall risk profile of the bank. The precise timing will also depend on the bank-specific frequency requirements for risk management reporting, under both normal and stress/crisis situations, set based on the characteristics and overall risk profile of the bank….

The Basel Committee acknowledges that different types of data will be required at different speeds, depending on the type of risk, and that certain risk data may be needed faster in a stress/crisis situation. Banks need to build their risk systems to be capable of producing aggregated risk data rapidly during times of stress/crisis for all critical risks.

Principle 6 Adaptability – A bank should be able to generate aggregate risk data to meet a broad range of on-demand, ad hoc risk management reporting requests, including requests during stress/crisis situations, requests due to changing internal needs and requests to meet supervisory queries….

(a) Data aggregation processes that are flexible and enable risk data to be aggregated for assessment and quick decision-making;

(b) Capabilities for data customisation to users’ needs (eg dashboards, key takeaways, anomalies), to drill down as needed, and to produce quick summary reports;

[Flexible merging and tracking sources through merging.]

Principle 7 Accuracy – Risk management reports should accurately and precisely convey aggregated risk data and reflect risk in an exact manner. Reports should be reconciled and validated….

(b) Automated and manual edit and reasonableness checks, including an inventory of the validation rules that are applied to quantitative information. The inventory should include explanations of the conventions used to describe any mathematical or logical relationships that should be verified through these validations or checks; and

(c) Integrated procedures for identifying, reporting and explaining data errors or weaknesses in data integrity via exceptions reports.

Principle 8 Comprehensiveness – Risk management reports should cover all material risk areas within the organisation. The depth and scope of these reports should be consistent with the size and complexity of the bank’s operations and risk profile, as well as the requirements of the recipients….

Risk management reports should include exposure and position information for all significant risk areas (eg credit risk, market risk, liquidity risk, operational risk) and all significant components of those risk areas (eg single name, country and industry sector for credit risk). Risk management reports should also cover risk-related measures (eg regulatory and economic capital).

You have heard Willie Sutton’s answer to “Why do you rob banks, Mr. Sutton?”: “Because that’s where the money is.”

Same answer for: “Why write topic maps for banks?”

I first saw this at Basel Committee issues “Principles for effective risk data aggregation and risk reporting – final document” by Ken O’Connor.

U.S. GPO releases House bills in bulk XML

Filed under: Government Data,Law,Law - Sources — Patrick Durusau @ 8:15 pm

U.S. GPO releases House bills in bulk XML

Bills from the current Congress but for bulk download in XML.

Users guide.

GPO press release.

Bulk House Bills Download.

Another bulk data source from the U.S. Congress.

Integration of the legislative sources will be non-trivial but it has been done before, manually.

What will be more interesting will be tracking the more complex interpersonal relationships that underlie the surface of legislative sources.

Outlier Analysis

Filed under: Data Analysis,Outlier Detection,Probability,Statistics — Patrick Durusau @ 8:15 pm

Outlier Analysis by Charu Aggarwal (Springer, January 2013). Post by Gregory Piatetsky.

From the post:

This is an authored text book on outlier analysis. The book can be considered a first comprehensive text book in this area from a data mining and computer science perspective. Most of the earlier books in outlier detection were written from a statistical perspective, and precede the emergence of the data mining field over the last 15-20 years.

Each chapter contains carefully organized content on the topic, case studies, extensive bibliographic notes and the future direction of research in this field. Thus, the book can also be used as a reference aid. Emphasis was placed on simplifying the content, so that the material is relatively easy to assimilate. The book assumes relatively little prior background, other than a very basic understanding of probability and statistical concepts. Therefore, in spite of its deep coverage, it can also provide a good introduction to the beginner. The book includes exercises as well, so that it can be used as a teaching aid.

Table of Contents and Introduction. Includes exercises and a 500+ reference bibliography.

Definitely a volume for the short reading list.

Caveat: As an outlier by any measure, my opinions here may be biased. 😉

Foundations of Rule Learning [A Topic Map Parable]

Filed under: Data Mining,Machine Learning,Topic Maps — Patrick Durusau @ 8:14 pm

Foundations of Rule Learning by Johannes Fürnkranz, Dragan Gamberger, and Nada Lavrač. ISBN: 978-3-540-75196-0 (Print), 978-3-540-75197-7 (Online).

From the Introduction:

Rule learning is not only one of the oldest but also one of the most intensively investigated, most frequently used, and best developed fields of machine learning. In more than 30 years of intensive research, many rule learning systems have been developed for propositional and relational learning, and have been successfully used in numerous applications. Rule learning is particularly useful in intelligent data analysis and knowledge discovery tasks, where the compactness of the representation of the discovered knowledge, its interpretability, and the actionability of the learned rules are of utmost importance for successful data analysis.

The aim of this book is to give a comprehensive overview of modern rule learning techniques in a unifying framework which can serve as a basis for future research and development. The book provides an introduction to rule learning in the context of other machine learning and data mining approaches, describes all the essential steps of the rule induction process, and provides an overview of practical systems and their applications. It also introduces a feature-based framework for rule learning algorithms which enables the integration of propositional and relational rule learning concepts.

The topic map parable comes near the end of the introduction where the authors note:

The book is written by authors who have been working in the field of rule learning for many years and who themselves developed several of the algorithms and approaches presented in the book. Although rule learning is assumed to be a well-established field with clearly defined concepts, it turned out that finding a unifying approach to present and integrate these concepts was a surprisingly difficult task. This is one of the reasons why the preparation of this book took more than 5 years of joint work.

A good deal of discussion went into the notation to use. The main challenge was to define a consistent notational convention to be used throughout the book because there is no generally accepted notation in the literature. The used notation is gently introduced throughout the book, and is summarized in Table I in a section on notational conventions immediately following this preface (pp. xi–xiii). We strongly believe that the proposed notation is intuitive. Its use enabled us to present different rule learning approaches in a unifying notation and terminology, hence advancing the theory and understanding of the area of rule learning.

Semantic diversity in rule learning was discovered and took five years to resolve.

Where n = all prior notations/terminologies, the solution was to create the n + 1 notation/terminology.

Understandable and certainly a major service to the rule learning community. The problem remains, how does one use the n + 1 notation/terminology to access prior (and forthcoming) literature in rule learning?

In its present form, the resolution of the prior notations and terminologies into the n + 1 terminology isn’t accessible to search, data, bibliographic engines.

Not to mention that on the next survey of rule learning, its authors will have to duplicate the work already accomplished by these authors.

Something about the inability to re-use the valuable work done by these authors, either for improvement of current information systems or to avoid duplication of effort in the future seems wrong.

Particularly since it is avoidable through the use of topic maps.


The link at the top of this post is the “new and improved site,” which has less sample content than Foundations for Rule Learning, apparently an old and not improved site.

I first saw this in a post by Gregory Piatetsky.

PubChem3D: conformer ensemble accuracy

Filed under: Cheminformatics,Similarity — Patrick Durusau @ 8:13 pm

PubChem3D: conformer ensemble accuracy by Sunghwan Kim, Evan E Bolton and Stephen H Bryant. (Journal of Cheminformatics 2013, 5:1 doi:10.1186/1758-2946-5-1)

Abstract:

Background

PubChem is a free and publicly available resource containing substance descriptions and their associated biological activity information. PubChem3D is an extension to PubChem containing computationally-derived three-dimensional (3-D) structures of small molecules. All the tools and services that are a part of PubChem3D rely upon the quality of the 3-D conformer models. Construction of the conformer models currently available in PubChem3D involves a clustering stage to sample the conformational space spanned by the molecule. While this stage allows one to downsize the conformer models to more manageable size, it may result in a loss of the ability to reproduce experimentally determined “bioactive” conformations, for example, found for PDB ligands. This study examines the extent of this accuracy loss and considers its effect on the 3-D similarity analysis of molecules.

Results

The conformer models consisting of up to 100,000 conformers per compound were generated for 47,123 small molecules whose structures were experimentally determined, and the conformers in each conformer model were clustered to reduce the size of the conformer model to a maximum of 500 conformers per molecule. The accuracy of the conformer models before and after clustering was evaluated using five different measures: root-mean-square distance (RMSD), shape-optimized shape-Tanimoto (ST^ST-opt) and combo-Tanimoto (ComboT^ST-opt), and color-optimized color-Tanimoto (CT^CT-opt) and combo-Tanimoto (ComboT^CT-opt). On average, the effect of clustering decreased the conformer model accuracy, increasing the conformer ensemble’s RMSD to the bioactive conformer (by 0.18 ± 0.12 Å), and decreasing the ST^ST-opt, ComboT^ST-opt, CT^CT-opt, and ComboT^CT-opt scores (by 0.04 ± 0.03, 0.16 ± 0.09, 0.09 ± 0.05, and 0.15 ± 0.09, respectively).

Conclusion

This study shows the RMSD accuracy performance of the PubChem3D conformer models is operating as designed. In addition, the effect of PubChem3D sampling on 3-D similarity measures shows that there is a linear degradation of average accuracy with respect to molecular size and flexibility. Generally speaking, one can likely expect the worst-case minimum accuracy of 90% or more of the PubChem3D ensembles to be 0.75, 1.09, 0.43, and 1.13, in terms of ST^ST-opt, ComboT^ST-opt, CT^CT-opt, and ComboT^CT-opt, respectively. This expected accuracy improves linearly as the molecule becomes smaller or less flexible.
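For readers unfamiliar with the first of those measures, RMSD over corresponding atom coordinates is simple to state. The numpy sketch below assumes already-aligned conformers and made-up coordinates; the production measure involves an optimal superposition step (e.g., Kabsch alignment) that is omitted here.

```python
import numpy as np

def rmsd(conf_a, conf_b):
    """Root-mean-square distance between two aligned conformers.

    conf_a, conf_b: (n_atoms, 3) arrays of corresponding atom coordinates.
    Assumes the conformers are already superimposed; real pipelines first
    find the optimal rigid-body alignment.
    """
    diff = conf_a - conf_b
    return np.sqrt((diff ** 2).sum(axis=1).mean())

# Two made-up 3-atom "conformers" of the same molecule, coordinates in angstroms.
a = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [2.3, 1.1, 0.0]])
b = np.array([[0.1, 0.0, 0.0], [1.4, 0.1, 0.0], [2.5, 1.0, 0.0]])
print(rmsd(a, b))   # small value -> the two shapes are close
```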

If I were to say, the potential shapes of a subject, would that make the importance of this work clearer?

Wikipedia has this two-liner that may also help:

A macromolecule is usually flexible and dynamic. It can change its shape in response to changes in its environment or other factors; each possible shape is called a conformation, and a transition between them is called a conformational change. A macromolecular conformational change may be induced by many factors such as a change in temperature, pH, voltage, ion concentration, phosphorylation, or the binding of a ligand.

Subjects and the manner of their identification is a very deep and rewarding field of study.

An identification method in isolation is no better or worse than any other identification method.

Only your requirements (which are also subjects) can help with the process of choosing one or more identification methods over others.
