Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

November 21, 2011

Creating a DSL for Cypher graph queries

Filed under: Cypher,DSL — Patrick Durusau @ 7:32 pm

Creating a DSL for Cypher graph queries

From the post:

My first assignment at Neo4j was to create a Java DSL for the Cypher query language, that is used to access data from the Neo4j database in a graphy way.

First off, why a DSL? There’s a ton of reasons why using a DSL instead of strings is a good idea. From a practical point of view a DSL and a decent IDE will make creating queries so much easier, as you can use code completion to build the query. No need to refer to manuals and cheat sheets if you forget the syntax. Second, I have found it useful to create queries iteratively in a layered architecture, whereby the domain model can create a base query that describes some concept, like “all messages in my inbox”, and then the application layer can take this and enhance with filtering, like “all messages in my inbox that are sent from XYZ”, and then finally the UI can add the order by and paging. Doing something like this would be extremely difficult without a DSL.

You get to learn something about DSLs and Cypher at the same time!

How cool is that?
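Just to make the layered idea concrete, here is a very rough sketch of what such a fluent builder could look like. The names below are my own invention, not the actual Neo4j Java DSL API, so treat it as an illustration of the pattern rather than documentation:

```java
// Hypothetical sketch of a layered, fluent Cypher query builder.
// Names (CypherQuery, start, match, where, ...) are illustrative only;
// the real Neo4j Java DSL may look quite different.
public final class CypherQuery {
    private final StringBuilder query = new StringBuilder();

    private CypherQuery(String start) { query.append("START ").append(start); }

    public static CypherQuery start(String start) { return new CypherQuery(start); }

    public CypherQuery match(String pattern)  { query.append(" MATCH ").append(pattern);    return this; }
    public CypherQuery where(String cond)     { query.append(" WHERE ").append(cond);       return this; }
    public CypherQuery returns(String expr)   { query.append(" RETURN ").append(expr);      return this; }
    public CypherQuery orderBy(String expr)   { query.append(" ORDER BY ").append(expr);    return this; }
    public CypherQuery skip(int n)            { query.append(" SKIP ").append(n);           return this; }
    public CypherQuery limit(int n)           { query.append(" LIMIT ").append(n);          return this; }

    @Override public String toString() { return query.toString(); }

    public static void main(String[] args) {
        // Domain layer: "all messages in my inbox"
        CypherQuery inbox = CypherQuery.start("me = node(42)")
                                       .match("me<-[:IN_INBOX_OF]-msg");
        // Application layer: add filtering
        inbox.where("msg.sender = 'XYZ'");
        // UI layer: ordering and paging
        inbox.returns("msg").orderBy("msg.sent DESC").skip(0).limit(20);

        System.out.println(inbox);
    }
}
```

The point is that each layer adds only the clauses it cares about, something that is painful to do with raw query strings.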

TextMinr

Filed under: Data Mining,Language,Text Analytics — Patrick Durusau @ 7:31 pm

TextMinr

Still in pre-beta (you can signal interest now), but:

Text Mining As A Service – Coming Soon!

What if you could incorporate state-of-the-art text mining, language processing & analytics into your apps and systems without having to learn the science or pay an arm and a leg for the software?

Soon you will be able to!

We aim to provide our text mining technology as a simple, affordable pay-as-you-go service, available through a web dashboard and a set of REST API’s.

If you are already familiar with these tools and your data sets, this could be a useful convenience.

If you aren’t familiar with these tools and your data sets, this could be a recipe for disaster.

Like SurveyMonkey.

In the hands of a survey construction expert, with testing of the questions, etc., I am sure SurveyMonkey can be a very useful tool.

In the hands of management, who want to justify decisions where surveys can be used, SurveyMonkey is positively dangerous.

Ask yourself this: Why in an age of SurveyMonkey, do politicians pay pollsters big bucks?

Do you suspect there is a difference between a professional pollster and SurveyMonkey?

The same distance lies between TextMinr and professional text analysis.

Or perhaps better, you get what you pay for.

Gephi adds Neo4j graph database support

Filed under: Gephi,Neo4j,Visualization — Patrick Durusau @ 7:30 pm

Gephi adds Neo4j graph database support (screencast)

From the webpage:

Neo4j is a powerful, award-winning graph database written in Java. It can store billions of nodes and relationships and allows very fast query/traversal. We release today a new version of the Neo4j Plugin supporting the latest 1.5 version of Neo4j. In Gephi, go to Tools > Plugins to install the plug-in.

The plugin lets you visualize a graph stored in a Neo4j database and play with it. Features include full import, traversal, filter, export and lazy loading.

Warning: A real time sink! 😉

Seriously, very cool plugin that will enhance your use of Neo4j!

Enjoy!

Visualizing RDF Schema inferencing through Neo4J, Tinkerpop, Sail and Gephi

Filed under: Gephi,Neo4j,Sail,TinkerPop — Patrick Durusau @ 7:28 pm

Visualizing RDF Schema inferencing through Neo4J, Tinkerpop, Sail and Gephi by Dave Suvee.

From the post:

Last week, the Neo4J plugin for Gephi was released. Gephi is an open-source visualization and manipulation tool that allows users to interactively browse and explore graphs. The graphs themselves can be loaded through a variety of file formats. Thanks to Martin Škurla, it is now possible to load and lazily explore graphs that are stored in a Neo4J data store.

In one of my previous articles, I explained how Neo4J and the Tinkerpop framework can be used to load and query RDF triples. The newly released Neo4J plugin now allows you to visually browse these RDF triples and perform some more fancy operations such as finding patterns and executing social network analysis algorithms from within Gephi itself. Tinkerpop’s Sail implementation also supports the notion of RDF Schema inferencing. Inferencing is the process where new (RDF) data is automatically deduced from existing (RDF) data through reasoning. Unfortunately, the Sail reasoner cannot easily be integrated within Gephi, as the Gephi plugin grabs a lock on the Neo4J store and no RDF data can be added, except through the plugin itself.

Being able to visualize the RDF Schema reasoning process and graphically indicate which RDF triples were added manually and which RDF data was automatically inferred would be a nice to have. To implement this feature, we should be able to push graph changes from Tinkerpop and Neo4J to Gephi. Luckily, the Gephi graph streaming plugin allows us to do just that. In the rest of this article, I will detail how to setup the required Gephi environment and how we can stream (inferred) RDF data from Neo4J to Gephi.

Visual is good!

Visual display and exploration of graphs is better!

Visual display and exploration of Neo4j data stores from within Gephi is the best!

Dave concludes:

With just a few lines of code we are able to stream (inferred) RDF triples to Gephi and make use of its powerful visualization and analysis tools to explore and inspect our datasets. As always, the complete source code can be found on the Datablend public GitHub repository. Make sure to surf the internet to find some other nice Gephi streaming examples, the coolest one probably being the visualization of the Egyptian revolution on Twitter.

Other suggestions for Gephi streaming examples?
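Here is a minimal one to get started with. It is my own sketch, and it assumes you have started the streaming “Master Server” in Gephi and that it is listening at the default address used below; check the graph streaming plugin documentation for the authoritative endpoint and event format:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Minimal sketch: push graph events to a running Gephi instance through the
// graph streaming plugin. Assumes the "Master Server" is started in Gephi and
// listening at the default URL below -- adjust to your setup, and treat the
// event format as an assumption to verify against the plugin documentation.
public class GephiStreamer {

    private static final String MASTER =
            "http://localhost:8080/workspace0?operation=updateGraph";

    // POST one JSON event (one line) to the streaming master.
    static void send(String jsonEvent) throws Exception {
        HttpURLConnection con = (HttpURLConnection) new URL(MASTER).openConnection();
        con.setRequestMethod("POST");
        con.setDoOutput(true);
        try (OutputStream out = con.getOutputStream()) {
            out.write((jsonEvent + "\r\n").getBytes(StandardCharsets.UTF_8));
        }
        con.getResponseCode(); // force the request; ignore the body here
        con.disconnect();
    }

    public static void main(String[] args) throws Exception {
        // A manually asserted triple ...
        send("{\"an\":{\"s1\":{\"label\":\"ex:Dog\",\"inferred\":false}}}");
        send("{\"an\":{\"s2\":{\"label\":\"ex:Animal\",\"inferred\":false}}}");
        send("{\"ae\":{\"e1\":{\"source\":\"s1\",\"target\":\"s2\","
                + "\"label\":\"rdfs:subClassOf\",\"directed\":true}}}");
        // ... and one we pretend the reasoner inferred, flagged so Gephi
        // can colour it differently.
        send("{\"an\":{\"s3\":{\"label\":\"ex:Rex\",\"inferred\":true}}}");
    }
}
```

From there it is a small step to have a listener emit one of these events for every asserted or inferred triple, which is essentially what Dave’s post walks through in detail.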

Comparing High Level MapReduce Query Languages

Filed under: Hadoop,Hive,JAQL,MapReduce,Pig — Patrick Durusau @ 7:27 pm

Comparing High Level MapReduce Query Languages by R.J. Stewart, P.W. Trinder, and H-W. Loidl.

Abstract:

The MapReduce parallel computational model is of increasing importance. A number of High Level Query Languages (HLQLs) have been constructed on top of the Hadoop MapReduce realization, primarily Pig, Hive, and JAQL. This paper makes a systematic performance comparison of these three HLQLs, focusing on scale up, scale out and runtime metrics. We further make a language comparison of the HLQLs focusing on conciseness and computational power. The HLQL development communities are engaged in the study, which revealed technical bottlenecks and limitations described in this document, and it is impacting their development.

A good starting place for watching these three HLQLs as they continue to develop. One expects them to be joined by other candidates, so familiarity with this paper may speed their evaluation as well.

November 20, 2011

Scaling MySQL with TokuDB Webinar

Filed under: MySQL,TokuDB — Patrick Durusau @ 4:23 pm

Scaling MySQL with TokuDB Webinar – Video and Slides Now Available

From the post:

Thanks to everyone who signed up and attended the webinar I gave this week with Tim Callaghan on Scaling MySQL. For those who missed it and are interested, the video and slides are now posted here.

[snip]

MySQL implementations are often kept relatively small, often just a few hundred GB or less. Anything beyond this quickly leads to painful operational problems such as poor insertion rates, slow queries, hours to days offline for schema changes, prolonged downtime for dump/reload, etc. The promise of scalable MySQL has remained largely unfulfilled, until TokuDB.

TokuDB v5.0 delivers

  • Exceptional Agility — Hot Schema Changes allow read/write operations during index creation or column/field addition
  • Unmatched Speed — Fractal Tree indexes perform 20x to 80x better on write intensive workloads
  • Maximum Scalability — Fractal Tree index performance scales even as the primary index exceeds available RAM

This webinar covers TokuDB v5.0 features, latest performance results, and typical use cases.

I haven’t run TokuDB, but it is advertised as a drop-in, high-performance replacement for MySQL.

Comments/suggestions? (I need to pre-order High-Performance MySQL, 2nd ed., 2012. Ignore the scams with the 1st edition copies still in stock at some sellers.)

Jeff Hammerbacher on Experiences Evolving a New Analytical Platform

Filed under: Crunch,Dremel,Dryad,Flume,Giraph,HBase,HDFS,Hive,JDBC,MapReduce,ODBC,Oozie,Pregel — Patrick Durusau @ 4:21 pm

Jeff Hammerbacher on Experiences Evolving a New Analytical Platform

Slides from Jeff’s presentation and numerous references, including a live-blogging summary by Jeff Dalton.

In terms of the new analytical platform, I would strongly suggest that you consider Cloudera’s substrate:

Cloudera starts with a substrate architecture of Open Compute commodity Linux servers configured using Puppet and Chef and coordinated using ZooKeeper. Naturally this entire stack is open-source. They use HDFS and Ceph to provide distributed, schema-less storage. They offer append-only table storage and metadata using Avro, RCFile, and HCatalog; and mutable table storage and metadata using HBase. For computation, they offer YARN (inter-job scheduling, like Grid Engine, for data intensive computing) and Mesos for cluster resource management; MapReduce, Hamster (MPI), Spark, Dryad / DryadLINQ, Pregel (Giraph), and Dremel as processing frameworks; and Crunch (like Google’s FlumeJava), PigLatin, HiveQL, and Oozie as high-level interfaces. Finally, Cloudera offers tool access through FUSE, JDBC, and ODBC; and data ingest through Sqoop and Flume.

Rather than asking the usual questions, how to make this faster, more storage, etc., all of which are important, ask the more difficult questions:

  1. In or between which of these elements, would human analysis/judgment have the greatest impact?
  2. Would human analysis/judgment be best made by experts or crowds?
  3. What sort of interface would elicit the best human analysis/judgment? (visual/aural; contest/game/virtual)
  4. Performance with feedback or homeostasis mechanisms?

That is a very crude and uninformed starter set of questions.

Putting higher speed access to more data with better tools at our fingertips expands the questions we can ask of interfaces and our interaction with the data. (Before we ever ask questions of the data.)

On Data and Jargon

Filed under: Crowd Sourcing,Data,Jargon — Patrick Durusau @ 4:19 pm

On Data and Jargon by Phil Simon.

From the post:

I was recently viewing an online presentation from my friend Scott Berkun. In it, Scott uses austere slides like the one with this simple bromide:

Whoever uses the most jargon has the least confidence in their ideas.

I really like that.

Are we hiding from crowds behind our jargon?

If yes, why? What do we have to lose? What do we have to gain by not hiding?

Triggers are coming! Triggers are coming!

Filed under: Topic Map Software,Topincs — Patrick Durusau @ 4:18 pm

Triggers are coming! Triggers are coming!

Triggers are going to appear in Topincs 5.6.0 (to be released).

From the webpage:

Issue description
It should be possible to define triggers (or event handlers?). Code should be held in a directory triggers next to the directories domain and services.

Comment
Triggers are created on the command line with the new create-trigger command. This creates a topic of type Topincs trigger with id ID and writes a file into STORE_DIR/php/triggers/ID.php. The user has to 1) specify when the trigger is run and 2) code what the trigger should do.

This looks very useful, whether you have streaming input or not.

Topincs homepage

Processing json data with apache velocity

Filed under: Apache Velocity,JSON,Velocity — Patrick Durusau @ 4:17 pm

Processing json data with apache velocity from Pierre Lindenbaum.

From the post:

I’ve written a tool named “apache velocity” which parses json data and processes it with “Apache velocity” (a template engine). The (javacc) source code is available here:

Just in case you run across some data in JSON format and could use an example of processing it with Apache Velocity. Just in case. 😉
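If you want a feel for what that looks like without digging into the javacc source, here is a small sketch of my own (not Pierre’s tool) that parses a bit of JSON with org.json and pushes it through a Velocity template:

```java
import java.io.StringWriter;
import org.apache.velocity.VelocityContext;
import org.apache.velocity.app.VelocityEngine;
import org.json.JSONArray;
import org.json.JSONObject;

// Not Pierre's tool -- just a rough illustration of the same idea:
// parse some JSON and feed it to a Velocity template.
public class JsonToVelocity {
    public static void main(String[] args) throws Exception {
        String json = "{\"articles\":[{\"title\":\"Graphity\",\"year\":2011},"
                    + "{\"title\":\"Spectral Evolution\",\"year\":2011}]}";

        // Turn the JSON into plain Java objects Velocity can navigate.
        JSONArray articles = new JSONObject(json).getJSONArray("articles");
        java.util.List<java.util.Map<String, Object>> rows = new java.util.ArrayList<>();
        for (int i = 0; i < articles.length(); i++) {
            JSONObject a = articles.getJSONObject(i);
            java.util.Map<String, Object> row = new java.util.HashMap<>();
            row.put("title", a.getString("title"));
            row.put("year", a.getInt("year"));
            rows.add(row);
        }

        VelocityEngine engine = new VelocityEngine();
        engine.init();
        VelocityContext ctx = new VelocityContext();
        ctx.put("articles", rows);

        String template = "#foreach($a in $articles)$a.title ($a.year)\n#end";
        StringWriter out = new StringWriter();
        engine.evaluate(ctx, out, "json-demo", template);
        System.out.println(out);   // Graphity (2011) ...
    }
}
```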

FantasySCOTUS

Filed under: Contest,Legal Informatics — Patrick Durusau @ 4:16 pm

Fantasy SCOTUS

Another example of imaginative use of technology to interest people in what is often seen as “boring” material. Supreme court cases have outcomes that impact real people. I haven’t played, but I suspect participants gain a lot of knowledge about the facts and law in each case.

Not to mention that there are monthly drawings for $200 Amazon gift certificates. See the site for details.

From the about page:

FantasySCOTUS is the Internet’s Premier Supreme Court Fantasy League. Last year, over 5,000 attorneys, law students, and other avid Supreme Court followers made predictions about all cases that the Supreme Court decided. On average, members of the league correctly predicted the cases nearly 60% of the time, and accurately predicted that Elena Kagan would be nominated as the 100th Associate Justice of the Supreme Court. Justin Donoho, who received the highest score out of 5,000+ members, was nominated and confirmed as the inaugural Chief Justice of FantasySCOTUS.

FantasySCOTUS is brought to you by the Harlan Institute. The Harlan Institute’s mission is to bring a stylized law school experience into the high school classroom to ensure that our next generation of leaders has a proper understanding of our most fundamental laws. By utilizing the expertise of leading legal scholars and the interactivity of online games, Harlan will introduce students to our Constitution, the cases of the United States Supreme Court, and our system of justice. Harlan’s long term strategic goal is to develop condensed law school courses that can be taught at no cost in high schools across the country using engaging online programs.

This and the Crowdsourcing Scientific Research I mentioned yesterday make me think that perhaps TREC in 2012 should have a crowdsourced component. Where the data set is available over the WWW and interfaces are proposed and tested to interest the general public in participating. What was that they said about all bugs being shallow if you just had enough eyes?

Up to now, TREC has had a small set of eyes with very powerful machines and algorithms. It would be interesting to see what a crowd, plus an imaginative interface and fast interaction, could do. It could be a path toward a distributed knowledge economy where users log onto tasks/interfaces that interest them.

Jérôme Kunegis

Filed under: Graphs,Spectral Evolution Model — Patrick Durusau @ 4:14 pm

Jérôme Kunegis

When I read things like René Pickhardt saying: “One of the first things I did @ my Institute when starting my PhD program was reading the PhD thesis of Jérôme Kunegis.” with no reference or link, it just drives me crazy! A little bibliographic instinct is a good thing, but I think the breeding went a bit far in my case. 😉

Anyway, your gain. This is the homepage for Jérôme Kunegis where you will find:

PhD Thesis

I wrote my PhD thesis On the Spectral Evolution of Large Networks under supervision of Prof. Dr. Steffen Staab, Prof. Dr. Klaus Obermayer and Prof. Dr. Christian Bauckhage in 2011.

Download: “On the Spectral Evolution of Large Networks” (print version)

In my PhD thesis, I studied the spectral characteristics of large dynamic networks and formulate the spectral evolution model. The spectral evolution model applies to networks that evolve over time, and describes their spectral decompositions such as the eigenvalue and singular value decomposition. My main result is an interpretation of the spectrum and eigenvectors of networks in terms of global and local effects. I show empirically that the spectrum describes a network on the global level, whereas eigenvectors describe a network at the local level, and derive from this several new link prediction methods.

From page 2 of the thesis:

In this thesis, spectral graph theory is used to predict which nodes of a network will be connected in the future. This problem is called the link prediction problem, and is known under many different names depending on the network to which it is applied. For instance, finding new friends in a social network, recommending movies and predicting which scientists will publish a paper together in the future are all instances of the link prediction problem. Several link prediction algorithms based on spectral graph theory are already known in the literature. However there is no general theory that predicts which spectral link prediction algorithms work best, and under what circumstances certain spectral link prediction algorithms work better than others. To solve these problems, this thesis proposes the spectral evolution model: A model that describes in detail how networks change over time in terms of their eigenvectors and eigenvalues. By observing certain growth patterns in actual networks, we are able to predict the growth of networks accurately, and thus can implement relevant recommender systems for all types of network datasets.

With “This problem is called the link prediction problem, and is known under many different names depending on the network to which it is applied,” Jérôme steps over the subject identity problem to solve a problem that may itself be useful for subject identity.

What if, instead of thinking of “link prediction” as in a recommender system, we think of it in terms of “this representative” representing the same subject as “that representative”?

You will also find all of his publications with download links.

Be mindful that this work underlies the approach in Graphity, which retrieves 10,000 news feeds per second from social networks.
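If you want to play with the core spectral idea, here is a toy sketch of my own (not Jérôme’s code, and leaning on Apache Commons Math for the eigendecomposition): decompose the adjacency matrix, transform only the eigenvalues, and read link prediction scores off the reconstructed matrix. The spectral evolution model goes further by learning that transformation from how the spectrum actually changed over time.

```java
import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.EigenDecomposition;
import org.apache.commons.math3.linear.RealMatrix;

// Toy illustration of spectral link prediction (not Kunegis' code):
// A = U diag(lambda) U^T, apply a function to the eigenvalues only --
// here exp(alpha * lambda), i.e. a weighted count of paths of all lengths --
// and read predicted link scores from the reconstructed matrix.
public class SpectralLinkPrediction {
    public static void main(String[] args) {
        double[][] a = {            // small undirected graph, 0/1 adjacency
            {0, 1, 1, 0, 0},
            {1, 0, 1, 0, 0},
            {1, 1, 0, 1, 0},
            {0, 0, 1, 0, 1},
            {0, 0, 0, 1, 0}
        };
        RealMatrix A = new Array2DRowRealMatrix(a);
        EigenDecomposition eig = new EigenDecomposition(A);

        double alpha = 0.5;
        double[] lambda = eig.getRealEigenvalues();
        RealMatrix U = eig.getV();                              // eigenvectors as columns
        RealMatrix F = new Array2DRowRealMatrix(lambda.length, lambda.length);
        for (int i = 0; i < lambda.length; i++) {
            F.setEntry(i, i, Math.exp(alpha * lambda[i]));      // spectral transformation
        }
        RealMatrix scores = U.multiply(F).multiply(U.transpose());

        // Highest-scoring non-edges are the predicted links.
        for (int i = 0; i < a.length; i++) {
            for (int j = i + 1; j < a.length; j++) {
                if (a[i][j] == 0) {
                    System.out.printf("score(%d,%d) = %.3f%n", i, j, scores.getEntry(i, j));
                }
            }
        }
    }
}
```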

Download network graph data sets from Konect – the koblenz network collection

Filed under: Dataset,Graphs — Patrick Durusau @ 4:12 pm

Download network graph data sets from Konect – the koblenz network collection

From the post:

One of the first things I did @ my Institute when starting my PhD program was reading the PhD thesis of Jérôme Kunegis. For a mathematician a nice piece of work to read. For his thesis he analyzed the evolution of networks. Over the last years Jérôme has collected several (119 !) data sets with network graphs. All have different properties.

He provides the data sets and some basic statistics @ http://konect.uni-koblenz.de/networks

Sometimes edges are directed, sometimes they have timestamps, sometimes even content. Some graphs are bipartite and the graphs come from different application domains such as trust, social networks, web graphs, co citation, semantic, features, ratings and communications…

I was browsing René Pickhardt’s blog when I ran across this entry. Pure gold.

Graphity: An efficient Graph Model for Retrieving the Top-k News Feeds for users in social networks

Filed under: Graphity,Graphs,Neo4j,Networks,Social Media,Social Networks — Patrick Durusau @ 4:11 pm

Graphity: An efficient Graph Model for Retrieving the Top-k News Feeds for users in social networks by Rene Pickhardt.

From the post:

I already said that my first research results have been submitted to SIGMOD conference to the social networks and graph databases track. Time to sum up the results and blog about them.

I created a data model to make retrieval of social news feeds in social networks very efficient. It is able to dynamically retrieve more than 10’000 temporal ordered news feeds per second in social networks with millions of users like Facebook and Twitter by using graph data bases (like neo4j)

10,000 temporally ordered news feeds per second? I can imagine any number of use cases that fit comfortably within those performance numbers!

How about you?

Looking forward to the paper (and source code)!

These Aren’t the Sites You’re Looking For: Building Modern Web Apps

Filed under: HTML,Interface Research/Design — Patrick Durusau @ 4:09 pm

These Aren’t the Sites You’re Looking For: Building Modern Web Apps

Interesting promo for HTML5, an emerging way to deliver interaction with a topic map.

The presentation does not focus on use of user feedback, the absence of which can leave you with a “really cool” interface that no one outside the development team really likes. To no small degree, it is good interface design with users that tells the tale, not how the interface is seen to work on the “other” side of the screen.

BTW, the slides go out of their way to promote the Chrome browser. Browser usage statistics, you do the math. Marketing is a matter of numbers, not religion.

If you are experimenting with HTML5 as a means to interact with a topic map engine, I would appreciate a note when you are ready to go public.

November 19, 2011

Crowdsourcing Scientific Research: Leveraging the Crowd for Scientific Discovery

Filed under: Authoring Topic Maps,Crowd Sourcing — Patrick Durusau @ 10:25 pm

Crowdsourcing Scientific Research: Leveraging the Crowd for Scientific Discovery by Dave Oleson.

From the post:

Lab scientists spend countless hours manually reviewing and annotating cells. What if we could give these hours back, and replace the tedious parts of science with a hands-off, fast, cheap, and scalable solution?

That’s exactly what we did when we used the crowd to count neurons, an activity that computer vision can’t yet solve. Building on the work we recently did with the Harvard Tuberculosis lab, we were able to take untrained people all over the world (people who might never have learned that DNA Helicase unzips genes…), turn them into image analysts with our task design and quality control, and get results comparable to those provided by trained lab workers.

So, do you think authoring your topic map is more difficult than reliable identification of neurons? Really?

Maybe the lesson of crowd sourcing is that we need to be as smart at coming up with new ways to do old tasks as we think we are.

What do you think?

A Look Into Linking Government Data

Filed under: Linked Data,LOD — Patrick Durusau @ 10:24 pm

A Look Into Linking Government Data

From the post:

Due out next month from Springer Publishing is Linking Government Data, a book that highlights some of the leading-edge applications of Linked Data to problems of government operations and transparency. David Wood, CTO of 3RoundStones and co-chair of the W3C RDF Working Group, writes and edits the volume, which includes contributions from others exploring the intersection of government and the Semantic Web.

….

Some of the hot spots for this are down under, in Australia and New Zealand. The U.K., of course, also has done extremely well, with the data.gov.uk portal an acknowledged leader in LOD efforts – and certainly comfortably ahead of the U.S. data.gov site.

He also thinks it’s worth noting that, just because you might not see a country openly publishing its data as Linked Data, it doesn’t mean that it’s not there. Often someone, somewhere – even if it’s just at one government agency — is using Linked Data principles, experimentally or in real projects. “Like commercial organizations, governments often use them internally and not publish externally,” he notes. “The spectrum of adoption can be individual or a trans-government mandate or everything in between.”

OK, but you would think if there were some major adoption, it would be mentioned in a post promoting the book. Australia, New Zealand and Nixon’s “Silent Majority” in the U.S. are using linked data. Can’t see them but they are there. That’s like RIAA music piracy estimates, just sheer fiction for all but true believers.

But as far as the U.S.A. is concerned, the rhetoric shifts from tangible benefit to “can lead to,” “can save money,” etc.:

The economics of the Linked Data approach, Wood says, show unambiguous benefit. Think of how it can save money in the U.S. on current expenditures for data warehousing. And think of the time- and productivity-savings, for example, of having government information freely available on the web in a standard format in a way that can be reused and recombined with other data. In the U.S., “government employees wouldn’t have to divert their daily work to answer Freedom of Information requests because the information is proactively published,” he says. It can lead to better policy decisions because government researchers wouldn’t have to spend enormous amounts of time trying to integrate data from multiple agencies in varying formats to combine it and find connections between, for example, places where people live and certain kinds of health problems that may be prevalent there.

And democracy and society also profit when it’s easy for citizens to access published information on where the government is spending their money, or when it’s easy for scientists and researchers to get data the government collects around scientific efforts so that it can be reused for purposes not originally envisioned.

“Unambiguous benefit” means that we have two systems, one using practice Y and the other using practice X, and when compared (assuming the systems are comparable): there is a clear difference of Z% in some measurable metric that can be attributed to the different practices.

Yes?

Personally I think linked data can be beneficial but that is subject to measurement and demonstration in some particular context.

As soon as this work is released, I would appreciate pointers to unambiguous benefit shown by comparison of agencies in the U.S.A. doing comparable work with some metric that makes that demonstration. But that has to be more than speculation or “can.”

Percona Server 5.5.17-22.1 released

Filed under: MySQL,Percona Server,SQL — Patrick Durusau @ 10:23 pm

Percona Server 5.5.17-22.1 released

From the webpage:

Percona is glad to announce the release of Percona Server 5.5.17-22.1 on November 19th, 2011 (Downloads are available here and from the Percona Software Repositories).

Based on MySQL 5.5.17, including all the bug fixes in it, Percona Server 5.5.17-22.1 is now the current stable release in the 5.5 series. All of Percona’s software is open-source and free, all the details of the release can be found in the 5.5.17-22.1 milestone at Launchpad or in the Release Notes for 5.5.17-22.1 in the documentation.

I haven’t installed or run a Percona Server, but the reported performance numbers are good enough to merit a closer look.

If you have run a Percona server, please comment.

Updates to d8taplex News Stream

Filed under: Clustering,News,Visualization — Patrick Durusau @ 10:22 pm

Updates to d8taplex News Stream by Matthew Hurst.

From the post:

I’ve made some updates to the news stream experiment on d8taplex. The page now displays a stream of news articles with clusters of related articles being shown with emboldened titles. Each article also shows the number of clicks tracked by bit.ly. The bit.ly data is coloured according to how hot it is (red represents the most clicked articles, orange represents ‘warm’ stories and gray represents stories that haven’t yet received any significant attention).

With two views of news (the clusters represent what the media companies are interested in and the bit.ly data represents what the weberati are interested in) the stream provides a slightly different overview of the news cycle. One can find clusters for which there is no significant bit.ly data, yet also find stories that are getting a lot of clicks but which don’t belong to any clusters. In addition, as each story is annotated with the source (CNN, BBC, etc.) you can also find clusters generated by only one source.

Matthew continues his experiments with news streams.

I wonder if media company clusters are related to their advertisers? Or owners? Or is the relationship more subtle? That is, are the Wall Street Journal’s clusters driven more by the self-selection of people who work at the Journal than by its mainstream financial backers? Not sure how you would even start to tease that apart.

Representing word meaning and order information in a composite holographic lexicon

Filed under: High Dimensionality,Holographic Lexicon,Word Meaning — Patrick Durusau @ 10:21 pm

Representing word meaning and order information in a composite holographic lexicon by Michael N. Jones and Douglas J. K. Mewhort.

Abstract:

The authors present a computational model that builds a holographic lexicon representing both word meaning and word order from unsupervised experience with natural language. The model uses simple convolution and superposition mechanisms (cf. B. B. Murdock, 1982) to learn distributed holographic representations for words. The structure of the resulting lexicon can account for empirical data from classic experiments studying semantic typicality, categorization, priming, and semantic constraint in sentence completions. Furthermore, order information can be retrieved from the holographic representations, allowing the model to account for limited word transitions without the need for built-in transition rules. The model demonstrates that a broad range of psychological data can be accounted for directly from the structure of lexical representations learned in this way, without the need for complexity to be built into either the processing mechanisms or the representations. The holographic representations are an appropriate knowledge representation to be used by higher order models of language comprehension, relieving the complexity required at the higher level.

More reading along the lines of higher-dimensional representation techniques. Almost six (6) pages of references to run backwards and forwards so this is going to take a while.
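If the abstract sounds opaque, the binding machinery itself is small enough to sketch. The following is my own toy version of circular convolution binding and superposition (the BEAGLE model and Plate’s HRRs add a great deal on top of this), just to show that a single summed trace can still be decoded:

```java
import java.util.Random;

// Bare-bones sketch of the binding machinery behind holographic lexicons:
// words are random high-dimensional vectors, pairs are bound by circular
// convolution, and bindings are superposed (summed) into a single memory
// trace that can still be (approximately) decoded.
public class Holographic {
    static final int D = 2048;
    static final Random RNG = new Random(42);

    static double[] randomVector() {
        double[] v = new double[D];
        for (int i = 0; i < D; i++) v[i] = RNG.nextGaussian() / Math.sqrt(D);
        return v;
    }

    // Circular convolution: (a * b)[k] = sum_i a[i] * b[(k - i) mod D]
    static double[] convolve(double[] a, double[] b) {
        double[] c = new double[D];
        for (int k = 0; k < D; k++)
            for (int i = 0; i < D; i++)
                c[k] += a[i] * b[Math.floorMod(k - i, D)];
        return c;
    }

    // Approximate inverse of a vector under circular convolution.
    static double[] involution(double[] a) {
        double[] inv = new double[D];
        for (int i = 0; i < D; i++) inv[i] = a[Math.floorMod(-i, D)];
        return inv;
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < D; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
        return dot / Math.sqrt(na * nb);
    }

    public static void main(String[] args) {
        double[] eat = randomVector(), pizza = randomVector(), salad = randomVector();

        // One memory trace holding two bindings at once.
        double[] trace = new double[D];
        double[] b1 = convolve(eat, pizza), b2 = convolve(eat, salad);
        for (int i = 0; i < D; i++) trace[i] = b1[i] + b2[i];

        // "What goes with eat?" -- decode and compare against the lexicon.
        double[] probe = convolve(involution(eat), trace);
        System.out.printf("pizza: %.2f  salad: %.2f  eat: %.2f%n",
                cosine(probe, pizza), cosine(probe, salad), cosine(probe, eat));
        // pizza and salad score well above the unrelated vector.
    }
}
```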

Hyperdimensional Computing: An introduction…

Filed under: High Dimensionality,Hyperdimensional Computing — Patrick Durusau @ 10:21 pm

Hyperdimensional Computing: An introduction to computing in distributed representation with high-dimensional random vectors by Pentti Kanerva (Cognitive Computation 1(2): 139-159)

You know it is going to be a Jack Park sort of day when the morning email has a notice about a presentation entitled: “Hyperdimensional Computing for Modeling How Brains Compute”. With the link you see above in the email. 😉

What’s a Jack Park sort of day like? Well, it certainly throws you off the track you were on before you sat down at the monitor. And pretty soon you are adding things like: Holographic Reduced Representation: Distributed Representation for Cognitive Structures by Tony Plate to your Amazon wish list. And adding to an already desk busting pile of papers that need your attention.

In short: A day that suits me just fine!

Suggest you read the paper, whether you add Tony’s book to your wish list or not. (Just in case you are interested, Patrick’s wish list. Or related items. If I get/have duplicates I can donate them to the library.)

I think the hyperdimensional computing approach may be one way to talk about the present need to make representations unnaturally precise so that our computing systems can deal with them. I say “unnaturally precise” because the descriptions or definitions needed by our computers aren’t necessary outside their context. If I call Jack up on the phone, I don’t say: “http://www.durusau.net/general/about.html (identifier for the user who wishes to speak) to (identifier for Jack), etc.” No, I say: “This is Patrick.” Trusting that if Jack is awake and I don’t have a cold, Jack will recognize my voice.

There are any number of reasons why Jack will recognize it is me and not some other Patrick he knows. I will be calling from Georgia, which has a particular area code, the time of day will be right, I have a Southern accent, any number of clues, even before we get to my self-identification.

To increase the usefulness of our information systems, they need to become a lot more like us and not have us flattening our worlds to become a lot more like them.

Jack’s “Hyperdimensional Computing” may be one path in that direction.
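To tie the phone call example to the paper, here is a toy sketch (mine, not Kanerva’s) of how those weak clues could be combined as high-dimensional vectors: bind each role to its filler, sum the results, and notice that a query missing one clue still lands nearest the right person:

```java
import java.util.Random;

// Sketch of Kanerva-style hyperdimensional computing applied to the phone
// call example: each clue (area code, accent, time of day) is a random
// bipolar hypervector, role-filler pairs are bound by element-wise
// multiplication, and a caller profile is just their (thresholded) sum.
public class HyperdimensionalClues {
    static final int D = 10000;
    static final Random RNG = new Random(7);

    static int[] hv() {                      // random bipolar vector {-1,+1}^D
        int[] v = new int[D];
        for (int i = 0; i < D; i++) v[i] = RNG.nextBoolean() ? 1 : -1;
        return v;
    }

    static int[] bind(int[] a, int[] b) {    // role (x) filler
        int[] c = new int[D];
        for (int i = 0; i < D; i++) c[i] = a[i] * b[i];
        return c;
    }

    static int[] bundle(int[]... vs) {       // superpose clues (sign of the sum)
        int[] s = new int[D];
        for (int[] v : vs) for (int i = 0; i < D; i++) s[i] += v[i];
        for (int i = 0; i < D; i++) s[i] = s[i] >= 0 ? 1 : -1;
        return s;
    }

    static double sim(int[] a, int[] b) {    // normalized dot product
        long dot = 0;
        for (int i = 0; i < D; i++) dot += (long) a[i] * b[i];
        return dot / (double) D;
    }

    public static void main(String[] args) {
        int[] areaCode = hv(), accent = hv(), timeOfDay = hv();            // roles
        int[] georgia = hv(), southern = hv(), evening = hv(), other = hv();

        int[] patrick = bundle(bind(areaCode, georgia),
                               bind(accent, southern),
                               bind(timeOfDay, evening));
        int[] somebodyElse = bundle(bind(areaCode, other),
                                    bind(accent, other),
                                    bind(timeOfDay, evening));

        // Caller supplies only two of the three clues.
        int[] query = bundle(bind(areaCode, georgia), bind(accent, southern));

        System.out.printf("Patrick: %.2f  somebody else: %.2f%n",
                sim(query, patrick), sim(query, somebodyElse));
        // Patrick wins comfortably despite the missing clue.
    }
}
```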

Dashboard Designs People Love – A Refresher

Filed under: Dashboard,Visualization — Patrick Durusau @ 10:21 pm

Dashboard Designs People Love – A Refresher

From Juice Analytics:

Using proper dashboard design techniques is a topic that struck a chord when we released our white paper series “A Guide to Creating Dashboards People Love to Use” a few years ago, and still seems to resonate as people regularly download the content from the Juice Analytics website, and we receive ongoing requests to speak around these key principles.

Since we can all use a little calibration every once in a while to stay in tune, we thought we’d post it again, with a reminder that you can access this oldie but goody, along with other materials on the Juice Analytics resources page anytime — and that all of these materials are available to you gratis.

Other resources you will find at Juice Analytics:

Designing a Better ‘Federal IT Dashboard’

Dashboard Design – Flow

Chart Makeovers, Fed IT Dashboard edition

The nearly fifty page “Guide” is worth registering for future updates.

I don’t know if your topic map application will have an interface that “presents as” a dashboard, but if it does, these resources are a good start towards a good one.

Not to mention that most design lessons are generally applicable to all interface design.

(I would say something about interfaces or websites where design challenged CEOs/COOs were allowed input or approval but that would be adding insult to injury wouldn’t it?)

Microsoft drops Dryad; bets on Hadoop

Filed under: Dryad,Hadoop — Patrick Durusau @ 10:21 pm

Microsoft drops Dryad; bets on Hadoop

In a November 11 post on the Windows HPC Team Blog, officials said that Microsoft had provided a minor update to the latest test build of the Dryad code as part of Windows High Performance Computing (HPC) Pack 2008 R2 Service Pack (SP) 3. But they also noted that “this will be the final (Dryad) preview and we do not plan to move forward with a production release.”

….

But it now appears Microsoft is putting all its big-data eggs in the Hadoop framework basket. Microsoft officials said a month ago that Microsoft was working with Hortonworks to develop both a Windows Azure and a Windows Server distribution of Hadoop. A Community Technology Preview (CTP) of the Windows Azure version is due out before the end of this calendar year; the Windows Server test build of Hadoop is due some time in 2012.

It might be a good time for the Hadoop community, which now includes MS, to talk about which parts of the syntax and semantics of the Hadoop ecosystem can be standardized.

It would be nice to see competition between Hadoop products on the basis of performance and features, not learning the oddities of particular implementations. The public versions could set a baseline and commercial versions would be pressed to better that.

After all, there are those who contend that commercial code is measurably better than other types of code. Perhaps it is time to put their faith to the test.

November 18, 2011

Our second $5000 information design challenge is on!

Filed under: Contest — Patrick Durusau @ 9:38 pm

Our second $5000 information design challenge is on!

From the post:

We’ve got another pot of info-design gold to give away – and this time your work might land you on the Guardian Datablog.

Last month we ran our first visualization challenge. And, boy, did you peeps really rise to it.

And here’s our second challenge: MON€Y PANIC$!

The financial system, debt crises, recession fears, Wall St occupation, currency devaluation, collapse of the markets, the END OF THE WORLD! It’s all getting rather mind-boggling.

So we and the Guardian have found some juicy datasets we want you to use to explain what in the world is going on. Clearly. Understandably. Visibly.

What are you waiting here for? Go see about the contest and then come back to read more about topic maps!

You might even learn something about visual design that you can apply to your next topic map project!

New Features in Scala 2.10

Filed under: Scala — Patrick Durusau @ 9:38 pm

New Features in Scala 2.10

From the post:

Today was the most awaited (by me) talk of Devoxx. Martin Odersky gave a presentation and announced new features in Scala 2.10. I just want to quickly go through all of them:

1. New reflection framework – it looks very nice (see photo) and 100% Scala. No need for Java for reflection API in order to work with Scala classes anymore!
2. Reification – it would be limited
3. type Dynamic – something similar to .NET 3
4. IDE improvements
5. Faster builds
6. SIPs: string interpolation and simpler implicits

At the moment it’s not clear whether mentioned SIPs would be really included in the release, but the chances are pretty high! So yes, we will finally get string interpolation!

Important for two reasons:

First, news about the upcoming features of Scala.

Second, we learn there is another expansion for SIPs. (I really didn’t plan it that way but it was nice how it worked out.)

Identifying duplicate content using statistically improbable phrases

Filed under: Duplicates,Statistically Improbable Phrases (SIPs) — Patrick Durusau @ 9:38 pm

Identifying duplicate content using statistically improbable phrases by Mounir Errami, Zhaohui Sun, Angela C. George, Tara C. Long, Michael A. Skinner, Jonathan D. Wren and Harold R. Garner.

Abstract:

Motivation: Document similarity metrics such as PubMed’s ‘Find related articles’ feature, which have been primarily used to identify studies with similar topics, can now also be used to detect duplicated or potentially plagiarized papers within literature reference databases. However, the CPU-intensive nature of document comparison has limited MEDLINE text similarity studies to the comparison of abstracts, which constitute only a small fraction of a publication’s total text. Extending searches to include text archived by online search engines would drastically increase comparison ability. For large-scale studies, submitting short phrases encased in direct quotes to search engines for exact matches would be optimal for both individual queries and programmatic interfaces. We have derived a method of analyzing statistically improbable phrases (SIPs) for assistance in identifying duplicate content.

Results: When applied to MEDLINE citations, this method substantially improves upon previous algorithms in the detection of duplication citations, yielding a precision and recall of 78.9% (versus 50.3% for eTBLAST) and 99.6% (versus 99.8% for eTBLAST), respectively.

Availability: Similar citations identified by this work are freely accessible in the Déjà vu database, under the SIP discovery method category at http://dejavu.vbi.vt.edu/dejavu/

I ran across this article today while looking for other material on the Déjà vu database.

Why should Amazon have all the fun? 😉

Depending on the breadth of the search, I can imagine creating graphs of search data that display more than one SIP per article, allowing researchers to choose paths through the literature. Well, that is beyond what the authors intend here, but adaptation of their work to search and refinement of research data seems like a natural extension.

And depending on how finely data from sensors or other automatic sources is segmented, it isn’t hard to imagine something similar for sensor data. Not really plagiarism, but duplication that might warrant further investigation.
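For the curious, here is a back-of-the-envelope sketch of the SIP idea (mine, not the authors’ method or code): keep the phrases of a document that a background corpus has never seen, then submit them as quoted searches:

```java
import java.util.*;

// Back-of-the-envelope take on statistically improbable phrases: extract the
// word trigrams of a document and keep those that never occur in a background
// corpus. Submitting the survivors as quoted searches is the cheap
// duplicate-detection trick the paper builds on. Real systems would use far
// larger corpora and a proper probability model, of course.
public class Sips {

    static List<String> trigrams(String text) {
        String[] w = text.toLowerCase().replaceAll("[^a-z ]", " ").trim().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 2 < w.length; i++) out.add(w[i] + " " + w[i + 1] + " " + w[i + 2]);
        return out;
    }

    public static void main(String[] args) {
        String[] background = {
            "the cells were incubated at room temperature for two hours",
            "samples were analyzed using standard statistical methods",
            "the results were statistically significant in all cases"
        };
        String document =
            "the cells were incubated with fluorescently labelled helicase probes "
          + "and the results were statistically significant in all cases";

        // Frequency of every trigram in the background corpus.
        Map<String, Integer> freq = new HashMap<>();
        for (String doc : background)
            for (String t : trigrams(doc)) freq.merge(t, 1, Integer::sum);

        // Trigrams of the target document that the background has never seen.
        Set<String> sips = new LinkedHashSet<>();
        for (String t : trigrams(document))
            if (freq.getOrDefault(t, 0) == 0) sips.add(t);

        sips.forEach(s -> System.out.println("candidate SIP: \"" + s + "\""));
    }
}
```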

Improved Call Graph Comparison Using Simulated Annealing

Filed under: Graphs,Malware,Simulated Annealing — Patrick Durusau @ 9:38 pm

Improved Call Graph Comparison Using Simulated Annealing by Orestis Kostakis, Joris Kinable, Hamed Mahmoudi, Kimmo Mustonen.

Abstract:

The amount of suspicious binary executables submitted to Anti-Virus (AV) companies are in the order of tens of thousands per day. Current hash-based signature methods are easy to deceive and are inefficient for identifying known malware that have undergone minor changes. Examining malware executables using their call graphs view is a suitable approach for overcoming the weaknesses of hash-based signatures. Unfortunately, many operations on graphs are of high computational complexity. One of these is the Graph Edit Distance (GED) between pairs of graphs, which seems a natural choice for static comparison of malware. We demonstrate how Simulated Annealing can be used to approximate the graph edit distance of call graphs, while outperforming previous approaches both in execution time and solution quality. Additionally, we experiment with opcode mnemonic vectors to reduce the problem size and examine how Simulated Annealing is affected.

From the introduction:

To facilitate the recognition of highly similar executables or commonalities among multiple executables which have been subject to modification, a high-level structure, i.e. an abstraction, of the samples is required. One such abstraction is the call graph which is a graphical representation of a binary executable, where functions are modelled as vertices and calls between those functions as directed edges. Minor changes in the body of the code are not reflected in the structure of the graph.

Can you say subject identity? 😉

How you judge subject identity depends on the circumstances and requirements of any given situation.

Very recent and I suspect important work on the detection of malware.
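To get a feel for the simulated annealing part, here is a toy sketch of my own (far simpler than the paper’s method): the state is a vertex mapping between two call graphs, the cost just counts edge mismatches, and annealing searches for a low-cost mapping:

```java
import java.util.Random;

// Toy version of the idea in the paper: approximate how well one call graph
// maps onto another by letting simulated annealing search over vertex
// mappings. The cost counts edge mismatches under the mapping (a crude
// stand-in for graph edit distance), and both graphs are assumed to have the
// same number of vertices to keep the sketch short.
public class SaGraphMatch {
    static final Random RNG = new Random(1);

    static int cost(boolean[][] g1, boolean[][] g2, int[] map) {
        int n = g1.length, mismatches = 0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (g1[i][j] != g2[map[i]][map[j]]) mismatches++;
        return mismatches;
    }

    static int anneal(boolean[][] g1, boolean[][] g2) {
        int n = g1.length;
        int[] map = new int[n];
        for (int i = 0; i < n; i++) map[i] = i;        // start with the identity mapping

        int current = cost(g1, g2, map), best = current;
        double t = 10.0;                                // initial temperature
        for (int step = 0; step < 100000; step++, t *= 0.9999) {
            int a = RNG.nextInt(n), b = RNG.nextInt(n); // propose: swap two images
            int tmp = map[a]; map[a] = map[b]; map[b] = tmp;
            int next = cost(g1, g2, map);
            if (next <= current || RNG.nextDouble() < Math.exp((current - next) / t)) {
                current = next;                         // accept
                best = Math.min(best, current);
            } else {
                tmp = map[a]; map[a] = map[b]; map[b] = tmp; // undo swap
            }
        }
        return best;
    }

    static boolean[][] edges(int n, int[][] e) {
        boolean[][] g = new boolean[n][n];
        for (int[] p : e) g[p[0]][p[1]] = true;
        return g;
    }

    public static void main(String[] args) {
        // Two tiny "call graphs": directed edges function -> called function.
        boolean[][] g1 = edges(4, new int[][]{{0,1},{1,2},{2,3},{0,3}});
        boolean[][] g2 = edges(4, new int[][]{{3,2},{2,1},{1,0},{3,0}}); // same shape, relabelled
        System.out.println("approx. edge mismatch distance: " + anneal(g1, g2)); // expect 0
    }
}
```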

Big Data (Computing in Science and Engineering November/December 2011)

Filed under: BigData — Patrick Durusau @ 9:38 pm

Big Data (Computing in Science and Engineering November/December 2011 (vol. 13 no. 6) ISSN: 1521-9615)

It looks like a theme issue on “big data.”

I say “looks like” because the “abstracts” for the articles are less informative than the titles of the articles on their own. I happened to be working while not logged into an account where I would just “see” the article with its actual abstract.

Let’s compare:

Data-Intensive Science in the US DOE: Case Studies and Future Challenges

Abstract not logged in:

Given its leading role in high-performance computing for modeling and simulation and its many experimental facilities, the US Department of Energy has a tremendous need for data-intensive science. Locating the challenges and commonalities among three case studies illuminates, in detail, the technical challenges involved in realizing data-intensive science.

You’re kidding, right? You are going to take ten pages to tell me the DOE needs “data-intensive science”?

Abstract of the article when logged in:

Given its leading role in high-performance computing for modeling and simulation and its many experimental facilities, the US Department of Energy has a tremendous need for data-intensive science. Locating the challenges and commonalities among three case studies illuminates, in detail, the technical challenges involved in realizing data-intensive science.

Fortunately, I read the article after launching into this rant which allows me to make two points:

  1. The abstract sucks!
  2. The article is important!

How often does that happen?

Seriously. Three use cases are used to leave you with the feeling that current practices are going to look quite primitive in our lifetimes. Remember the first “bug” that was pasted onto a notebook page?

That’s the distance between now and the future of computing needed by the DOE.

Here is how I would do the abstract for the DOE article:

The DOE says: “Now for really Big Data!” Simulation data too big to store for analysis; what to keep, compress, throw away? Climate change, data about a world to share, analyze, and perhaps save. Real time experiments demand real time analysis, integrity and storage. Details inside.

OK, so mine is only two words (46 vs. 48) shorter than the one in the zine.

The DOE article really merits your attention, as do the others in this issue.

Information retrieval model based on graph comparison

Filed under: Graphs,Information Retrieval — Patrick Durusau @ 9:38 pm

Information retrieval model based on graph comparison (pdf) by Quoc-Dinh Truong, Taoufiq Dkaki, Josiane Mothe, and Pierre-Jean Charrel.

We propose a new method for Information Retrieval (IR) based on graph vertices comparison. The main goal of this method is to enhance the core IR-process of finding relevant documents in a collection of documents according to a user’s needs. The method we propose is based on graph comparison and involves recursive computation of similarity. In the framework of this approach, documents, queries and indexing terms are viewed as vertices of a bipartite graph where edges go from a document or a query – first node type – to an indexing term – second node type. Edges reflect the link that exists between documents or queries on the one hand and indexing terms on the other hand. In our model, graph edge settings reflect the tf-idf paradigm. The proposed similarity measure instantiates and extends this principle, stipulating that the resemblance of two items or objects can be computed using the similarities of the items to which they are related. Our method also takes into account the concept of similarity propagation over graph edges.

Experiments conducted using four small sized IR test collections (TREC 2004 Novelty Track, CISI, Cranfield & Medline) demonstrate the effectiveness of our approach and its feasibility as long as the graph size does not exceed a few thousand nodes. The experiment’s results show that our method outperforms the vector-based model. Our method actually highly outperforms the vector-based cosine model, sometimes by more than doubling the precision, up to the top sixty returned documents. The computational complexity issue is resituated in the context of MAC-FAC approaches – many are called but few are chosen. More precisely, we suggest that our method can be successfully used as a FAC stage combined with a fast and computationally cheap method used as a MAC stage.

Very interesting article. Perhaps more so because searches of DBLP and Citeseer show no other publications by this author. A singularity that appears in 2008. I haven’t taken the time to look more deeply but commend the paper to your attention.

If you have pointers to later (earlier?) work by the same author, email or comments would be appreciated.
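For readers who want the flavor of the approach, here is a toy sketch of recursive similarity on a document-term bipartite graph. It is my own simplification, not the authors’ model (their edge weights follow the tf-idf paradigm and the normalization differs), but it shows how similarity propagates across the two node types:

```java
// Rough sketch of recursive similarity on a document-term bipartite graph:
// documents are similar when they contain similar terms, terms are similar
// when they occur in similar documents, and the two definitions are iterated
// until they settle. Not the authors' code; weighting is deliberately naive.
public class BipartiteSim {

    // out[i][j] = decay * sum_{t,u} w[i][t] * sim[t][u] * w[j][u] / norm,
    // with the diagonal forced to 1.
    static double[][] propagate(double[][] w, double[][] sim, double decay) {
        int n = w.length, m = w[0].length;
        double[][] out = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                if (i == j) { out[i][j] = 1.0; continue; }
                double s = 0, norm = 0;
                for (int t = 0; t < m; t++)
                    for (int u = 0; u < m; u++) {
                        s += w[i][t] * sim[t][u] * w[j][u];
                        norm += w[i][t] * w[j][u];
                    }
                out[i][j] = norm > 0 ? decay * s / norm : 0;
            }
        return out;
    }

    static double[][] identity(int n) {
        double[][] id = new double[n][n];
        for (int i = 0; i < n; i++) id[i][i] = 1.0;
        return id;
    }

    static double[][] transpose(double[][] w) {
        double[][] t = new double[w[0].length][w.length];
        for (int i = 0; i < w.length; i++)
            for (int j = 0; j < w[0].length; j++) t[j][i] = w[i][j];
        return t;
    }

    public static void main(String[] args) {
        // rows = documents (2 docs + 1 query), columns = indexing terms
        double[][] w = {
            {1, 1, 0, 0},   // doc A: terms 0,1
            {0, 0, 1, 1},   // doc B: terms 2,3
            {0, 1, 1, 0}    // query: terms 1,2 -- overlaps both
        };
        double[][] simDocs = identity(3), simTerms = identity(4);
        for (int iter = 0; iter < 5; iter++) {
            simDocs = propagate(w, simTerms, 0.8);
            simTerms = propagate(transpose(w), simDocs, 0.8);
        }
        System.out.printf("query~A: %.3f  query~B: %.3f  A~B: %.3f%n",
                simDocs[2][0], simDocs[2][1], simDocs[0][1]);
        // A and B share no terms directly, yet pick up similarity through the query.
    }
}
```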

eTBLAST: a text-similarity based search engine

Filed under: eTBLAST,Search Engines,Similarity — Patrick Durusau @ 9:37 pm

eTBLAST: a text-similarity based search engine

eTBLAST from the Virginia Bioinformatics Institute at Virginia Tech lies at the heart of the Deja vu search service.

Unlike Deja vu, the eTBLAST interface is not limited to citations in PubMed, but includes:

  • MEDLINE
  • CRISP
  • NASA
  • Medical Cases
  • PMC Full Text
  • PMC METHODS
  • PMC INTRODUCTION
  • PMC RESULTS
  • PMC (paragraphs)
  • PMC Medical Cases
  • Clinical Trials
  • Arxiv
  • Wikipedia
  • VT Courses

There is a set of APIs, limited to some of the medical material.

While I was glad to see Arxiv included, given my research interests, CiteSeerX, or the ACM, IEEE, ALA, or other information science/CS collections would be of greater interest.

Among other things, I would like to have the ability to create a map of synonyms (can you say “topic map?”) that could be substituted during the comparison of the literature.

But, in the meantime, I will definitely try the interface on Arxiv to see how it impacts my research on techniques relevant to topic maps.
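Here is a toy sketch of the synonym substitution I have in mind (hand-made map and Jaccard similarity; a real system would draw the synonym sets from a topic map and use a better similarity measure):

```java
import java.util.*;

// Toy illustration of the synonym-map wish above: before comparing two texts,
// rewrite every term to a canonical form using a (here hand-made) synonym map,
// then compute a simple Jaccard similarity. With a topic map supplying the
// synonym sets, "carcinoma" and "tumour" would no longer look like different subjects.
public class SynonymAwareSimilarity {

    static final Map<String, String> CANON = Map.of(
            "neoplasm", "tumor",
            "tumour", "tumor",
            "carcinoma", "tumor",
            "kidney", "renal");

    static Set<String> normalize(String text) {
        Set<String> terms = new HashSet<>();
        for (String w : text.toLowerCase().split("\\W+"))
            if (!w.isEmpty()) terms.add(CANON.getOrDefault(w, w));
        return terms;
    }

    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a); inter.retainAll(b);
        Set<String> union = new HashSet<>(a); union.addAll(b);
        return union.isEmpty() ? 0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        String s1 = "renal carcinoma detected in the left kidney";
        String s2 = "a tumour was detected in the kidney";
        System.out.printf("similarity: %.2f%n", jaccard(normalize(s1), normalize(s2)));
        // Without the synonym map, "carcinoma"/"tumour" and "renal"/"kidney"
        // would not match at all.
    }
}
```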

