Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

July 9, 2013

…Recursive Neural Networks

Filed under: Natural Language Processing,Neural Networks — Patrick Durusau @ 1:37 pm

Parsing Natural Scenes and Natural Language with Recursive Neural Networks by Richard Socher; Cliff Chiung-Yu Lin; Andrew Ng; and Chris Manning.

Description:

Recursive structure is commonly found in the inputs of different modalities such as natural scene images or natural language sentences. Discovering this recursive structure helps us to not only identify the units that an image or sentence contains but also how they interact to form a whole. We introduce a max-margin structure prediction architecture based on recursive neural networks that can successfully recover such structure both in complex scene images as well as sentences. The same algorithm can be used both to provide a competitive syntactic parser for natural language sentences from the Penn Treebank and to outperform alternative approaches for semantic scene segmentation, annotation and classification. For segmentation and annotation our algorithm obtains a new level of state-of-the-art performance on the Stanford background dataset (78.1%). The features from the image parse tree outperform Gist descriptors for scene classification by 4%.

Video of Richard Socher’s presentation at ICML 2011.

PDF of the paper: http://nlp.stanford.edu/pubs/SocherLinNgManning_ICML2011.pdf

According to one popular search engine the paper has 51 citations (as of today).

What caught my attention was the mapping of phrases into vector spaces, which makes it possible to calculate nearest neighbors on phrases.

Both for syntactic and semantic similarity.

If you need more than a Boolean test for similarity (Yes/No), then you are likely to be interested in this work.

Later work by Socher at his homepage.
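
To make the nearest-neighbor idea concrete, here is a minimal sketch that ranks toy phrase vectors by cosine similarity. In the paper the vectors come from the learned recursive network; the ones below are invented purely for illustration.

import numpy as np

# Toy phrase vectors; in the paper these are the recursive network's
# learned representations, here they are made up.
phrase_vectors = {
    "the stock market fell": np.array([0.9, 0.1, 0.3]),
    "shares dropped sharply": np.array([0.8, 0.2, 0.4]),
    "the cat sat on the mat": np.array([0.1, 0.9, 0.7]),
}

def nearest_neighbors(query_vec, vectors, k=2):
    """Rank phrases by cosine similarity to the query vector."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    scored = [(phrase, cosine(query_vec, vec)) for phrase, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

query = phrase_vectors["the stock market fell"]
for phrase, score in nearest_neighbors(query, phrase_vectors):
    print("%.3f  %s" % (score, phrase))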

AAAI – Weblogs and Social Media

Filed under: Artificial Intelligence,Blogs,Social Media,Tweets — Patrick Durusau @ 12:34 pm

Seventh International AAAI Conference on Weblogs and Social Media

Abstracts and papers from the Seventh International AAAI Conference on Weblogs and Social Media.

Much to consider:

Frontmatter: Six (6) entries.

Full Papers: Sixty-nine (69) entries.

Poster Papers: Eighteen (18) entries.

Demonstration Papers: Five (5) entries.

Computational Personality Recognition: Ten (10) entries.

Social Computing for Workforce 2.0: Seven (7) entries.

Social Media Visualization: Four (4) entries.

When the City Meets the Citizen: Nine (9) entries.

Be aware that the links for tutorials and workshops only give you the abstracts describing the tutorials and workshops.

There is the obligatory “blind men and the elephant” paper:

Blind Men and the Elephant: Detecting Evolving Groups in Social News

Abstract:

We propose an automated and unsupervised methodology for a novel summarization of group behavior based on content preference. We show that graph theoretical community evolution (based on similarity of user preference for content) is effective in indexing these dynamics. Combined with text analysis that targets automatically-identified representative content for each community, our method produces a novel multi-layered representation of evolving group behavior. We demonstrate this methodology in the context of political discourse on a social news site with data that spans more than four years and find coexisting political leanings over extended periods and a disruptive external event that lead to a significant reorganization of existing patterns. Finally, where there exists no ground truth, we propose a new evaluation approach by using entropy measures as evidence of coherence along the evolution path of these groups. This methodology is valuable to designers and managers of online forums in need of granular analytics of user activity, as well as to researchers in social and political sciences who wish to extend their inquiries to large-scale data available on the web.

It is a great paper but commits a common error when it notes:

Like the parable of Blind Men and the Elephant, these techniques provide us with disjoint, specific pieces of information.

Yes, the parable is oft told to make a point about partial knowledge, but the careful observer will ask:

How are we different from the blind men trying to determine the nature of an elephant?

Aren’t we also blind men trying to determine the nature of blind men who are examining an elephant?

And so on?

Not that being blind men should keep us from having opinions, but it should make us wary of how deeply we are attached to them.

Not only are there elephants all the way down, there are blind men before us, with us (including ourselves), and around us.

July 8, 2013

100 Search Engines For Academic Research

Filed under: Search Engines,Searching — Patrick Durusau @ 7:43 pm

100 Search Engines For Academic Research

From the post:

Back in 2010, we shared with you 100 awesome search engines and research resources in our post: 100 Time-Saving Search Engines for Serious Scholars. It’s been an incredible resource, but now, it’s time for an update. Some services have moved on, others have been created, and we’ve found some new discoveries, too. Many of our original 100 are still going strong, but we’ve updated where necessary and added some of our new favorites, too. Check out our new, up-to-date collection to discover the very best search engine for finding the academic results you’re looking for.

(…)

When I saw the title of this post I assumed it was about source code for search engines. 😉

Not so!

But don’t despair!

Consider all of them as possible comparisons for your topic map interface.

Or should I say the results delivered by your topic map interface?

Some are better than others but I am sure you can do better with a curated topic map.

Querying ElasticSearch – A Tutorial and Guide

Filed under: ElasticSearch,Lucene,Search Engines — Patrick Durusau @ 6:59 pm

Querying ElasticSearch – A Tutorial and Guide by Rufus Pollock.

From the post:

ElasticSearch is a great open-source search tool that’s built on Lucene (like SOLR) but is natively JSON + RESTful. Its been used quite a bit at the Open Knowledge Foundation over the last few years. Plus, as its easy to setup locally its an attractive option for digging into data on your local machine.

While its general interface is pretty natural, I must confess I’ve sometimes struggled to find my way around ElasticSearch’s powerful, but also quite complex, query system and the associated JSON-based “query DSL” (domain specific language).

This post therefore provides a simple introduction and guide to querying ElasticSearch that provides a short overview of how it all works together with a good set of examples of some of the most standard queries.

(…)

This is a very nice introduction to ElasticSearch.

Read, bookmark and pass it along!
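
If you just want a taste of the JSON “query DSL” before reading, here is a minimal sketch of a match query issued from Python. It assumes a local ElasticSearch node and an index named articles with a title field; both names are mine, not from the post.

import json
import requests

# A basic match query against a hypothetical "articles" index.
query = {
    "query": {
        "match": {
            "title": "open data"
        }
    },
    "size": 5
}

resp = requests.post(
    "http://localhost:9200/articles/_search",
    data=json.dumps(query),
    headers={"Content-Type": "application/json"},
)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))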

Advanced autocomplete with Solr Ngrams

Filed under: AutoComplete,AutoSuggestion,Solr — Patrick Durusau @ 6:54 pm

Advanced autocomplete with Solr Ngrams by Peter Tyrrell.

From the post:

The following approach is a good one if you require:

  • phrase suggestions, not just words
  • the ability to match user input against multiple fields
  • multiple fields returned
  • multiple field values to make up a unique suggestion
  • suggestion results collapsed (grouped) on a field or fields
  • the ability to filter the query
  • images with suggestions

I needed a typeahead suggestion (autocomplete) solution for a textbox that searches titles. In my case, I have a lot of magazines that are broken down so that each page is a document in the Solr index, and has metadata that describes its parentage. For example, page 1 of Dungeon Magazine 100 has a title: “Dungeon 100”; a collection; “Dungeon Magazine”; and a universe: “Dungeons and Dragons”. (Yes, all the material in my index is related to RPG in some way.) A magazine like this might consist of 70 pages or so, whereas a sourcebook like the Core Rulebook for Pathfinder, a D&D variant, boasts 578, so title suggestions have to group on title and ignore counts. Further, the Warhammer 40k game Dark Heresy also has a Core Rulebook, so title suggestions have to differentiate between them.

(…)

Topic map interfaces with autosuggest/complete could ease users into searching and authoring topic maps.
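
As a rough sketch of the kind of grouped suggestion query that approach ends up issuing: the core name, field names and the edge-ngram analysis behind title_ac are my assumptions, not Tyrrell's actual schema.

import requests

# Hypothetical "pages" core: an edge-ngram analyzed "title_ac" field
# is matched against the partial user input, and results are grouped
# on "title" so a 70-page magazine shows up as one suggestion.
params = {
    "q": "title_ac:dunge",
    "group": "true",
    "group.field": "title",
    "group.limit": 1,
    "fl": "title,collection,universe,thumbnail",
    "wt": "json",
    "rows": 10,
}

resp = requests.get("http://localhost:8983/solr/pages/select", params=params)
for group in resp.json()["grouped"]["title"]["groups"]:
    doc = group["doclist"]["docs"][0]
    print(doc.get("title"), "-", doc.get("collection"))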

Better synonym handling in Solr

Filed under: Search Engines,Solr,Synonymy — Patrick Durusau @ 6:45 pm

Better synonym handling in Solr by Nolan Lawson.

From the post:

It’s a pretty common scenario when working with a Solr-powered search engine: you have a list of synonyms, and you want user queries to match documents with synonymous terms. Sounds easy, right? Why shouldn’t queries for “dog” also match documents containing “hound” and “pooch”? Or even “Rover” and “canis familiaris”?

(image omitted)

As it turns out, though, Solr doesn’t make synonym expansion as easy as you might like. And there are lots of good ways to shoot yourself in the foot.

A deep review of synonym handling in Solr, along with a patch to improve it.

The issue is now SOLR-4381 and is slated for Solr 4.4.

Interesting discussion continues under the issue.
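
To see why the multi-word case is where things go wrong, here is a toy query-time expansion. This is not Solr's SynonymFilterFactory, just an illustration of the idea: a phrase synonym like "canis familiaris" has to be kept together as a unit, which is exactly what naive term-by-term substitution gets wrong.

# Toy query-time synonym expansion (illustration only).
SYNONYMS = {
    "dog": ["dog", "hound", "pooch", "canis familiaris"],
}

def expand(query):
    clauses = []
    for term in query.split():
        variants = SYNONYMS.get(term, [term])
        quoted = ['"%s"' % v if " " in v else v for v in variants]
        clauses.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(clauses)

print(expand("dog collar"))
# (dog OR hound OR pooch OR "canis familiaris") AND (collar)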

Online College Courses (Academic Earth)

Filed under: Data,Education — Patrick Durusau @ 4:04 pm

Online College Courses (Academic Earth)

No new material, but a useful aggregation of online course materials from forty-nine (49) institutions (as of today).

Not that hard to imagine topic map-based value-add services that link up materials and discussions with course materials.

Most courses are offered on a regular cycle and knowing what has helped before may be useful to you.

…Creating Content That Gets Shared

Filed under: Marketing,Topic Maps — Patrick Durusau @ 3:44 pm

Social Media and Storytelling, Part 3: Creating Content That Gets Shared by Cameron Uganec.

From the post:

In the previous posts, I explored how social media and storytelling can be used by marketers to engage with an audience and build relationships. It turns out that there is another benefit to following a brand storytelling approach; it can increase the shareability of your content. In fact the potential to build relationships coupled with the “viral effect” is what makes storytelling and social media powerful tools for marketers.

My team creates a lot of content. Our content marketing takes many forms: Tweets, Facebook posts, contributed articles, infographics, videos, blog posts etc. In order to unlock the potential value of the ‘earned media’ component of social media we endeavour to make every piece of content shareable. So it’s important that we understand why people share content.

Why People Share

The NYTimes Insights Group published a study that looked at the key factors that influence people to share content. Unsurprisingly, they discovered that sharing is all about relationships. They outlined these key motivations for people to share:

  • To bring valuable and entertaining content to others.
  • To define ourselves to others.
  • To grow and nourish relationships.
  • To get the word out about causes and brands I care about.

When you are creating content it’s important to be mindful of what the motivation of your audience is. When planning each piece of content our team answers these questions:

  • How does this add value for our audience?
  • How will this help or entertain them?
  • Why will they share it?

(…)

I am going to pick up the prior posts in this series and suggest that you do the same.

At least if you are interested in other people marketing your products, services, or topic maps for you.

Sounds like a good deal to me.

Other posts in this series:

Social Media and Storytelling, Part 1: Why Storytelling?

Social Media and Storytelling, Part 2: Back to the Future

Social Media and Storytelling, Part 3: Creating Content That Gets Shared (subject of this post)

Social Media + Storytelling = Awesomesauce (presentation by Cameron at Marketo’s 2013 Summit Conference in San Francisco)

Mahout – Unofficial 0.8 Release

Filed under: Machine Learning,Mahout — Patrick Durusau @ 3:29 pm

Mahout – Unofficial 0.8 Release (email from Grant Ingersoll).

From the post:

A _preview_ of release artifacts for 0.8 are at https://repository.apache.org/content/repositories/orgapachemahout-113/org/apache/mahout/.

This is not an official release. I will call a vote in a day or two, pending feedback on this thread, so please review/test.

A _preview_ of the release notes are at https://cwiki.apache.org/confluence/display/MAHOUT/Release+0.8

In case you are interested in contributing comments pre-release.

Game Theory, n.

Filed under: Game Theory — Patrick Durusau @ 3:06 pm

I happened across the index to the Economist, which sported this entry: game theory.

Some of the economists I have known were interested in game theory (which originated with John von Neumann).

Back at the Economist, I was surprised that the first entry read:

Are international football tournaments curse or boon?

Then I noticed a definition of game theory that reads:

Reporting and analysis on the politics, economics, science and statistics of the games we play and watch

Just having an index isn’t enough. 😉

Detecting Semantic Overlap and Discovering Precedents…

Detecting Semantic Overlap and Discovering Precedents in the Biodiversity Research Literature by Graeme Hirst, Nadia Talent, and Sara Scharf.

Abstract:

Scientific literature on biodiversity is longevous, but even when legacy publications are available online, researchers often fail to search it adequately or effectively for prior publications; consequently, new research may replicate, or fail to adequately take into account, previously published research. The mechanisms of the Semantic Web and methods developed in contemporary research in natural language processing could be used, in the near-term future, as the basis for a precedent-finding system that would take the text of an author’s early draft (or a submitted manuscript) and find potentially related ideas in published work. Methods would include text-similarity metrics that take different terminologies, synonymy, paraphrase, discourse relations, and structure of argumentation into account.

Footnote one (1) of the paper gives an idea of the problem the authors face:

Natural history scientists work in fragmented, highly distributed and parochial communities, each with domain specific requirements and methodologies [Scoble 2008]. “Their output is heterogeneous, high volume and typically of low impact, but with a citation half-life that may run into centuries” (Smith et al. 2009). “The cited half-life of publications in taxonomy is longer than in any other scientific discipline, and the decay rate is longer than in any scientific discipline” (Moritz 2005). Unfortunately, we have been unable to identify the study that is the basis for Moritz’s remark.

The paper explores in detail issues that have daunted various search techniques, when the material is available in electronic format at all.

The authors make a general proposal for addressing these issues, with mention of the Semantic Web, but they deliberately omit one piece from their plan:

The other omission is semantic interpretation into a logical form, represented in XML, that draws on ontologies in the style of the original Berners-Lee, Hendler, and Lassila (2001) proposal for the Semantic Web. The problem with logical-form representation is that it implies a degree of precision in meaning that is not appropriate for the kind of matching we are proposing here. This is not to say that logical forms would be useless. On the contrary, they are employed by some approaches to paraphrase and textual entailment (section 4.1 above) and hence might appear in the system if only for that reason; but even so, they would form only one component of a broader and somewhat looser kind of semantic representation.

That’s the problem with the Semantic Web in a nutshell:

The problem with logical-form representation is that it implies a degree of precision in meaning that is not appropriate for the kind of matching we are proposing here.

What if I want to be logically precise sometimes but not others?

What if I want to be more precise in some places and less precise in others?

What if I want to have different degrees or types of imprecision?

With topic maps the question is: How im/precise do you want to be?

July 7, 2013

Proceedings of the 3rd Workshop on Semantic Publishing

Filed under: Publishing,Semantics — Patrick Durusau @ 7:54 pm

Proceedings of the 3rd Workshop on Semantic Publishing edited by: Alexander García Castro, Christoph Lange, Phillip Lord, and Robert Stevens.

Table of Contents

Research Papers

  1. Twenty-Five Shades of Greycite: Semantics for Referencing and Preservation Phillip Lord
  2. Systematic Reviews as an Interface to the Web of (Trial) Data: using PICO as an Ontology for Knowledge Synthesis in Evidence-based Healthcare Research Chris Mavergames
  3. Towards Linked Research Data: an Institutional Approach Najko Jahn, Florian Lier, Thilo Paul-Stueve, Christian Pietsch, Philipp Cimiano
  4. Repurposing Benchmark Corpora for Reconstructing Provenance Sara Magliacane.
  5. Connections across Scientific Publications based on Semantic Annotations Leyla Jael García Castro, Rafael Berlanga, Dietrich Rebholz-Schuhmann, Alexander Garcia.
  6. Towards the Automatic Identification of the Nature of Citations Angelo Di Iorio, Andrea Giovanni Nuzzolese, Silvio Peroni.
  7. How Reliable is Your Workflow: Monitoring Decay in Scholarly Publications José Manuel Gómez-Pérez, Esteban García-Cuesta, Jun Zhao, Aleix Garrido, José Enrique Ruiz.

Polemics (published externally)

  1. Flash Mob Science, Open Innovation and Semantic Publishing Hal Warren, Bryan Dennis, Eva Winer.
  2. Science, Semantic Web and Excuses Idafen Santana Pérez, Daniel Garijo, Oscar Corcho.
  3. Polemic on Future of Scholarly Publishing/Semantic Publishing Chris Mavergames.
  4. Linked Research Sarven Capadisli.

The whole proceedings can also be downloaded as a single file (PDF, including title pages, preface, and table of contents).

Some reading to start your week!

Most data isn’t “big,”…

Filed under: BigData — Patrick Durusau @ 4:35 pm

Most data isn’t “big,” and businesses are wasting money pretending it is by Christopher Mims.

From the post:

Big data! If you don’t have it, you better get yourself some. Your competition has it, after all. Bottom line: If your data is little, your rivals are going to kick sand in your face and steal your girlfriend.

There are many problems with the assumptions behind the “big data” narrative (above, in a reductive form) being pushed, primarily, by consultants and IT firms that want to sell businesses the next big thing. Fortunately, honest practitioners of big data—aka data scientists—are by nature highly skeptical, and they’ve provided us with a litany of reasons to be weary of many of the claims made for this field. Here they are:

Christopher makes good points:

  • Even web giants like Facebook and Yahoo generally aren’t dealing with big data, and the application of Google-style tools is inappropriate.
  • Big data has become a synonym for “data analysis,” which is confusing and counter-productive.
  • Supersizing your data is going to cost you and may yield very little.
  • In some cases, big data is as likely to confuse as it is to enlighten.
  • So what’s better—big data or small?

But I like his closer most of all:

Remember: Gregor Mendel uncovered the secrets of genetic inheritance with just enough data to fill a notebook. The important thing is gathering the right data, not gathering some arbitrary quantity of it.

Should I forward this to the NSA?

import.io

Filed under: Data,ETL,WWW — Patrick Durusau @ 4:18 pm

import.io

The steps listed by import.io on its “How it works” page:

Find: Find an online source for your data, whether it’s a single web page or a search engine within a site. Import•io doesn’t discriminate; it works with any web source.

Extract: When you have identified the data you want, you can begin to extract it. The first stage is to highlight the data that you want. You can do this by giving us a few examples and our algorithms will identify the rest. The next stage is to organise your data. This is as simple as creating columns to sort parts of the data into, much like you would do in a spreadsheet. Once you have done that we will extract the data into rows and columns.

If you want to use the data once, or infrequently, you can stop here. However, if you would like a live connection to the data or want to be able to access it programatically, the next step will create a real-time connection to the data.

Connect: This stage will allow you to create a real-time connection to the data. First you have to record how you obtained the data you extracted. Second, give us a couple of test cases so we can ensure that, if the website changes, your connection to the data will remain live.

Mix: One of the most powerful features of the platform is the ability to mix data from many sources to form a single data set. This allows you to create incredibly rich data sets by combing hundred of underlying data points from many different websites and access them via the application or API as a single source. Mixing is as easy a clicking the sources you want to mix together and saving that mix as a new real-time data set.

Use: Simply copy your data into your favourite spreadsheet software or use our APIs to access it in an application.

Developer preview but interesting for a couple of reasons.

First simply as an import service. I haven’t tried it (yet) so your mileage may vary. Reports welcome.

Second, I like the (presented) ease of use approach.

Imagine a topic map application for some specific domain that was as matter-of-fact as what I quote above.

Something to think about.

Nasty data corruption getting exponentially worse…

Filed under: BigData,Data Quality — Patrick Durusau @ 3:52 pm

Nasty data corruption getting exponentially worse with the size of your data by Vincent Granville.

From the post:

The issue with truly big data is that you will end up with field separators that are actually data values (text data). What are the chances to find a double tab in a one GB file? Not that high. In an 100 TB file, the chance is very high. Now the question is: is it a big issue, or maybe it’s fine as long as less than 0.01% of the data is impacted. In some cases, once the glitch occurs, ALL the data after the glitch is corrupted, because it is not read correctly – this is especially true when a data value contains text that is identical to a row or field separator, such as CR / LF (carriage return / line feed). The problem gets worse when data is exported from UNIX or MAC to WINDOWS, or even from ACCESS to EXCEL.

Vincent has a number of suggestions for checking data.
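
As one example of the flavor of check he has in mind, here is a minimal sketch that flags rows whose field count disagrees with the majority; the file name and delimiter are placeholders.

import csv
from collections import Counter

# Count fields per row and flag rows that disagree with the most
# common count. Stray separators or embedded newlines usually show
# up here before they corrupt downstream analysis.
def field_count_report(path, delimiter="\t"):
    counts = Counter()
    rows = []
    with open(path, newline="") as f:
        for lineno, row in enumerate(csv.reader(f, delimiter=delimiter), 1):
            counts[len(row)] += 1
            rows.append((lineno, len(row)))
    expected = counts.most_common(1)[0][0]
    return expected, [(n, c) for n, c in rows if c != expected]

expected, bad_rows = field_count_report("export.tsv")
print("expected %d fields; %d suspect rows" % (expected, len(bad_rows)))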

What would you add to his list?

wxHaskell

Filed under: Functional Programming,Graphics,Haskell — Patrick Durusau @ 3:21 pm

wxHaskell: A Portable and Concise GUI Library for Haskell by Daan Leijen.

Abstract:

wxHaskell is a graphical user interface (GUI) library for Haskell that is built on wxWidgets: a free industrial strength GUI library for C++ that has been ported to all major platforms, including Windows, Gtk, and MacOS X. In contrast with many other libraries, wxWidgets retains the native look-and-feel of each particular platform. We show how distinctive features of Haskell, like parametric polymorphism, higher-order functions, and first-class computations, can be used to present a concise and elegant monadic interface for portable GUI programs.

Complete your Haskell topic map app with a Haskell-based GUI!

BPM Engine With Neo4j

Filed under: Graphs,Neo4j — Patrick Durusau @ 2:58 pm

NoSQL experimentations with Activiti: A (very simple) Neo4J prototype

From the post:

I’ve got this itch for a long time now to try and see how easy it is to write a BPM engine on a graph database such as Neo4J. After all, the data model fits perfectly, as business processes are graphs and executions of those processes basically boil down to keeping pointers to where you are in that graph. It just feels as a very natural fit.

So I spend some time implementing a prototype, which you can find on https://github.com/jbarrez/activiti-neo4j

The prototype contains some unit tests which execute simple BPMN 2.0 processes. I tried to be as close as possible to the Activiti concepts of services, commands, executions, JavaDelegates, etc. Currently covered:

(…)

If BPMN is unfamiliar:

A standard Business Process Model and Notation (BPMN) will provide businesses with the capability of understanding their internal business procedures in a graphical notation and will give organizations the ability to communicate these procedures in a standard manner. Furthermore, the graphical notation will facilitate the understanding of the performance collaborations and business transactions between the organizations. This will ensure that businesses will understand themselves and participants in their business and will enable organizations to adjust to new internal and B2B business circumstances quickly.

BPMN homepage, includes links to version 2.0 and other materials.

Processes, business and otherwise, seem like naturals to model as graphs.

Spatial Search With Apache Solr and Google Maps

Filed under: Google Maps,JQuery,Solr — Patrick Durusau @ 2:48 pm

Spatial Search With Apache Solr and Google Maps by Wern Ancheta.

From the post:

In this tutorial I’m going to show you how to setup spatial search in Apache Solr then were going to create an application which uses Spatial searching with the use of Google Maps.

You will also learn about geocoding and JQuery as part of this tutorial.

For the purposes of this tutorial were going to use Spatial search to find the locations which are near the place that we specify.

If you have cellphone contract or geolocation data, you can find out who lives nearby. 😉

Assuming you have that kind of data.
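
For a feel of the Solr side before working through the tutorial, here is a minimal sketch of a radius query. It assumes a Solr 4.x core named places with a LatLonType field called location; those names are mine, not Ancheta's.

import requests

# Everything within 10 km of a point, nearest first.
params = {
    "q": "*:*",
    "fq": "{!geofilt}",
    "sfield": "location",
    "pt": "14.6760,121.0437",   # lat,lon of the reference point
    "d": 10,                    # radius in kilometers
    "sort": "geodist() asc",
    "fl": "name,location",
    "wt": "json",
}

resp = requests.get("http://localhost:8983/solr/places/select", params=params)
for doc in resp.json()["response"]["docs"]:
    print(doc.get("name"), doc.get("location"))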

Physics, Topology, Logic and Computation:…

Filed under: Category Theory,Computation,Logic,Topology — Patrick Durusau @ 2:43 pm

Physics, Topology, Logic and Computation: A Rosetta Stone by John C. Baez and Mike Stay.

Abstract:

In physics, Feynman diagrams are used to reason about quantum processes. In the 1980s, it became clear that underlying these diagrams is a powerful analogy between quantum physics and topology: namely, a linear operator behaves very much like a “cobordism”. Similar diagrams can be used to reason about logic, where they represent proofs, and computation, where they represent programs. With the rise of interest in quantum cryptography and quantum computation, it became clear that there is extensive network of analogies between physics, topology, logic and computation. In this expository paper, we make some of these analogies precise using the concept of “closed symmetric monoidal category”. We assume no prior knowledge of category theory, proof theory or computer science.

The authors set out to create a Rosetta stone connecting physics, topology, logic and computation through the language of categories.

Seventy (70)+ pages of heavy reading but worth the effort (at least so far)!

Mini Search Engine…

Filed under: Graphs,Neo4j,Search Engines,Searching — Patrick Durusau @ 1:13 pm

Mini Search Engine – Just the basics, using Neo4j, Crawler4j, Graphstream and Encog by Brian Du Preez.

From the post:

Continuing to chapter 4 of Programming Collection Intelligence (PCI) which is implementing a search engine.

I may have bitten off a little more than I should of in 1 exercise. Instead of using the normal relational database construct as used in the book, I figured, I always wanted to have a look at Neo4J so now was the time. Just to say, this isn’t necessarily the ideal use case for a graph db, but how hard could to be to kill 3 birds with 1 stone.

Working through the tutorials trying to reset my SQL Server, Oracle mindset took a little longer than expected, but thankfully there are some great resources around Neo4j.

Just a couple:
neo4j – learn
Graph theory for busy developers
Graphdatabases

Since I just wanted to run this as a little exercise, I decided to go for a in memory implementation and not run it as a service on my machine. In hindsight this was probably a mistake and the tools and web interface would have helped me visualise my data graph quicker in the beginning.

The general search space is filled by major contenders.

But that leaves open opportunities for domain specific search services.

Law and medicine have specialized search engines. What commercially viable areas are missing them?

July 6, 2013

Norch- a search engine for node.js

Filed under: Indexing,node-js,Search Engines — Patrick Durusau @ 4:30 pm

Norch- a search engine for node.js by Fergus McDowall.

From the post:

Norch is a search engine written for Node.js. Norch uses the Node search-index module which is in turn written using the super fast levelDB library that Google open-sourced in 2011.

The aim of Norch is to make a simple, fast search server, that requires minimal configuration to set up. Norch sacrifices complex functionality for a limited robust feature set, that can be used to set up a freetext search engine for most enterprise scenarios.

Currently Norch features

  • Full text search
  • Stopword removal
  • Faceting
  • Filtering
  • Relevance weighting (tf-idf)
  • Field weighting
  • Paging (offset and resultset length)

Norch can index any data that is marked up in the appropriate JSON format.

Download the first release of Norch (0.2.1) here

See: https://github.com/fergiemcdowall/norch for various details and instructions.

Interesting, but I am curious what advantage Norch offers over Solr or Elasticsearch, for example?

July 5, 2013

On Importing Data into Neo4j

Filed under: Graphs,Neo4j — Patrick Durusau @ 12:26 pm

On Importing Data into Neo4j (Blog Series) by Michael Hunger.

From the post:

Being able to run interesting queries against a graph database like Neo4j requires the data to be in there in the first place. As many users have questions in this area, I thought a series on importing data into Neo4j would be helpful to get started. This series covers both importing small and moderate data volumes for examples and demonstrations, but also large scale data ingestion.

For operations where massive amounts of data flow in or out of a Neo4j database, the interaction with the available APIs should be more considerate than with your usual, ad-hoc, local graph queries.

This blog series will discuss several ways of importing data into Neo4j and the considerations you should make when choosing one or the other. There is a dedicated page about importing data on neo4j.org which will help you getting started but needs feedback to improve.

Basically Neo4j offers several APIs to import data. Preferably the Cypher query language should be used as it is easiest to use from any programming language. The Neo4j Server’s REST API is not suited for massive data import, only the batch-operation endpoint and the Cypher REST and transactional HTTP (from Neo4j 2.0) endpoints are of interest there. Neo4j’s Java Core APIs provide a way of avoiding network overhead and driving data import directly from a programmatic source of data and also allow to drop down to the lowest API levels for high speed ingestion.

Great overview of importing data into Neo4j, with the promise of more posts to follow on importing data into Neo4j 2.0.

I first saw this at Alex Popescu’s On Importing Data into Neo4j.
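
For a taste of the Cypher-first approach the series recommends, here is a minimal sketch against the classic Cypher REST endpoint on a local Neo4j 2.0-era server. The endpoint path matches that era; the label, properties and relationship type are made up for illustration.

import requests

CYPHER_URL = "http://localhost:7474/db/data/cypher"

def run_cypher(query, params=None):
    """POST a parameterized Cypher statement to the server."""
    resp = requests.post(CYPHER_URL, json={"query": query, "params": params or {}})
    resp.raise_for_status()
    return resp.json()

# Parameterized statements avoid string concatenation and stay
# reusable when looping over rows from an import file.
row = {"name": "Michael Hunger", "title": "On Importing Data into Neo4j"}
run_cypher(
    "CREATE (p:Person {name: {name}})-[:WROTE]->(b:Post {title: {title}})",
    row,
)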

July 4, 2013

Re-designing Topic Maps (July 4, 2013)

Filed under: Topic Maps — Patrick Durusau @ 3:16 pm

There is an ongoing thread at LinkedIn, started by Steve Pepper, on what we would do differently today in designing topic maps.

I posted this as a starting point on that redesign today:

Just a sketch but suppose the components were:

Topics: Represent subjects (anything you want to talk about).

Associations: Relationships between subjects (which are represented by topics)

A relationship is a subject but it is useful to have a simple way to talk about relationships between subjects.

Occurrences: Where you find a subject in a topic map or some other information resource. Quite like page numbers in an index entry.

An occurrence is a relationship between a subject and a location in an information resource. Also a subject but common enough to merit a simple way to talk about it.

Some other necessary mechanics but how is that so far?

I think the only way to judge complexity is to start enumerating the parts; someone can speak up when the complexity barrier is crossed.
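
For anyone who prefers to count parts in code, here is a minimal sketch of those three components as plain data structures. It is one possible rendering of the sketch above, not a proposed syntax or standard.

from dataclasses import dataclass, field

@dataclass
class Topic:
    """Represents a subject: anything you want to talk about."""
    identifier: str
    names: list = field(default_factory=list)

@dataclass
class Association:
    """A relationship between subjects, each represented by a topic."""
    relation_type: Topic
    members: list = field(default_factory=list)

@dataclass
class Occurrence:
    """Where a subject turns up in an information resource,
    like a page number in an index entry."""
    topic: Topic
    resource: str   # e.g. a URL or document identifier
    location: str   # e.g. a fragment, page, or offset

einstein = Topic("t-einstein", names=["Albert Einstein"])
physics = Topic("t-physics", names=["Physics"])
field_of_study = Topic("t-field-of-study", names=["field of study"])

studies = Association(relation_type=field_of_study, members=[einstein, physics])
cite = Occurrence(einstein, "https://en.wikipedia.org/wiki/Albert_Einstein", "#intro")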

Clustering Search with Carrot2

Filed under: Clustering,Solr — Patrick Durusau @ 2:43 pm

Clustering Search with Carrot2 by Ian Milligan.

From the post:

My work is taking me to larger and larger datasets, so finding relevant information has become a real challenge – I’ve dealt with this before, noting DevonTHINK as an alternative to something slow and cumbersome like OS X’s Spotlight. As datasets scale, keyword searching and n-gram counting has also shown some limitations.

One approach that I’ve been taking is to try to implement a clustering algorithm on my sources, as well as indexing them for easy retrieval. I wanted to give you a quick sense of my workflow in this post.

Brief but useful tutorial on using Solr and Carrot2.

July 3, 2013

CHD@ZJU…

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 9:37 am

CHD@ZJU: a knowledgebase providing network-based research platform on coronary heart disease by Leihong Wu, Xiang Li, Jihong Yang, Yufeng Liu, Xiaohui Fan and Yiyu Cheng. (Database (2013) 2013 : bat047 doi: 10.1093/database/bat047)

From the webpage:

Abstract:

Coronary heart disease (CHD), the leading cause of global morbidity and mortality in adults, has been reported to be associated with hundreds of genes. A comprehensive understanding of the CHD-related genes and their corresponding interactions is essential to advance the translational research on CHD. Accordingly, we construct this knowledgebase, CHD@ZJU, which records CHD-related information (genes, pathways, drugs and references) collected from different resources and through text-mining method followed by manual confirmation. In current release, CHD@ZJU contains 660 CHD-related genes, 45 common pathways and 1405 drugs accompanied with >8000 supporting references. Almost half of the genes collected in CHD@ZJU were novel to other publicly available CHD databases. Additionally, CHD@ZJU incorporated the protein–protein interactions to investigate the cross-talk within the pathways from a multi-layer network view. These functions offered by CHD@ZJU would allow researchers to dissect the molecular mechanism of CHD in a systematic manner and therefore facilitate the research on CHD-related multi-target therapeutic discovery.

Database URL: http://tcm.zju.edu.cn/chd/

The article outlines the construction of CHD@ZJU as follows:

(image omitted)

Figure 1.
Procedure for CHD@ZJU construction. CHD-related genes were extracted with text-mining technique and manual confirmation. PPI, pathway and drugs information were then collected from public resources such as KEGG and HPRD. Interactome network of every pathway was constructed based on their corresponding genes and related PPIs, and the whole CHD diseasome network was then constructed with all CHD-related genes. With CHD@ZJU, users could find information related to CHD from gene, pathway and the whole biological network level.

While assisted by computer technology, there is a manual confirmation step that binds all the information together.

Triggers for Apache HBase

Filed under: HBase,Indexing,Triggers — Patrick Durusau @ 9:00 am

Cloudera Search over Apache HBase: A Story of Collaboration by Steven Noels.

Great background story on the development of triggers and indexing updates for Apache HBase by NGDATA (for their Lily product), which now underlie Cloudera Search.

From the post:

In this most recent edition, we introduced an order of magnitude performance improvement: a cleaner, more efficient, and fault-tolerant code path with no write performance penalty on HBase. In the interest of modularity, we decoupled the trigger and indexing component from Lily, making it into a stand-alone, collaborative open source project that is now underpinning both Cloudera Search HBase support as well as Lily.

This made sense for us, not just because we believe in HBase and its community but because our customers in Banking, Media, Pharma and Telecom have unqualified expectations for both the scalability and resilience of Lily. Outsourcing some part of that responsibility towards the infrastructure tier is efficient for us. We are very pleased with the collaboration, innovation, and quality that Cloudera has produced by working with us and look forward to a continued relationship that combines joint development in a community oriented way with responsible stewardship of the infrastructure code base we build upon.

Our HBase Triggering and Indexing software can be found on GitHub at:

https://github.com/NGDATA/hbase-sep
https://github.com/NGDATA/hbase-indexer

Do you have any indexing or update side-effect needs for HBase? Tell us your thoughts on this solution.

July 2, 2013

Finding relationships in Trademark Data

Filed under: Graphs,Neo4j — Patrick Durusau @ 7:18 pm

Finding relationships in Trademark Data by Matt Overstreet.

From the post:

At the recent National Day of Civic hacking here at OSC we dug into a few ways to find relationships between Trademarks files with the USPTO.

If you’ve ever played with the US trademark data you’ll know that it’s both plentiful and scarce. There are lot’s of trademark fillings, each with the minimum possible data to make them uniquely identifiable.

That’s great for streamlined government and citizen anonymity, but no fun for finding the relationships between filings. We needed to suss out more information about the graph of trademarks. That’s when we Eric and Wes tripped over the translations included in many of the patent filings. We wondered if the term space for these translations might be smaller and more consistent then the space defined by the actual trademarks. Translations were less likely to play games with spelling or grammar the way one might with the actual mark.

Some Hacking with the data and Neo4j resulted in an intriguing dataset that we are still unpacking. Want to play with the data? Neo4J loaded with data is at this url: http://rosetta.bloom.sh:7474/webadmin/

Curious to know what you make of the theory:

Translations were less likely to play games with spelling or grammar the way one might with the actual mark.

I’m not sure that is a useful assumption:

Marks consisting of or including foreign words or terms from common, modern languages are translated into English to determine genericness, descriptiveness, likelihood of confusion, and other similar issues. See Palm Bay ,396 F.3d at 1377, 73 USPQ2d at 1696. With respect to likelihood of confusion, “[i]t is well established that foreign words or terms are not entitled to be registered if the English language equivalent has been previously used on or registered for products which might reasonably be assumed to come from the same source.” Mary Kay Cosmetics, Inc. v. Dorian Fragrances, Ltd. , 180 USPQ 406, 407 (TTAB 1973).[Examination Guide 1-08]

Use of the English translation to judge “…genericness, descriptiveness, likelihood of confusion, and other similar issues” creates an incentive for “…play[ing] games with spelling or grammar….”

If you are interested in the data set, you may find the resources at: Trademarks Home useful.

Caution: Legal terminology may not have the semantics you expect.

Running Python and R inside Emacs

Filed under: Editor,Programming,Python,R — Patrick Durusau @ 2:45 pm

Running Python and R inside Emacs by John D. Cook.

From the post:

Emacs org-mode lets you manage blocks of source code inside a text file. You can execute these blocks and have the output display in your text file. Or you could export the file, say to HTML or PDF, and show the code and/or the results of executing the code.

Here I’ll show some of the most basic possibilities. For much more information, see orgmode.org. And for the use of org-mode in research, see A Multi-Language Computing Environment for Literate Programming and Reproducible Research.

Not recent (2012) but looks quite interesting.

Well, you have to already like Emacs! 😉

Follow John’s post for basic usage and, if you like it, check out orgmode.org.
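
For a taste of what John describes, here is the kind of source block org-babel evaluates, assuming Python support has been enabled with org-babel-do-load-languages. Put it in an .org file and press C-c C-c inside the block; org-mode inserts the captured output in a #+RESULTS: section below it.

#+begin_src python :results output
  # A throwaway computation whose output org-mode captures below.
  squares = [n * n for n in range(1, 6)]
  print(squares)
#+end_src

#+RESULTS:
: [1, 4, 9, 16, 25]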

Balisage Reservations?

Filed under: Conferences — Patrick Durusau @ 1:57 pm

Canada Travel advises for Montreal in August:

Average August temperature: 21°C / 68°F
August average high: 28°C / 83°F
August average low: 16°C / 60°F

Visitors can expect rain about 9 days out of 31 in August.

I mention that because Tommie Usdin advises it is time to:

Register for Balisage: The Markup Conference at: http://www.balisage.net/registration.html

Reserve your room at the conference hotel. To receive the group rate, you MUST either:

call the hotel at 514-866-6492 (or from Canada or the US: 1-888-535-2808)

or

send email to info@hoteleuropa.com with a copy to mbouchaibi@hoteleuropa.com

and

specify that you are making a reservation for Balisage 2013. [Penalty if you fail to say Balisage 2013] Rates cannot be changed at check-in/check-out times for people who fail to identify their affiliation at the time of reservation.

Start thinking about what you want to talk about in Balisage Bluff: http://www.balisage.net/2013/Program.html#w115p

Decide what you want to donate to the Silent Auction: http://www.balisage.net/2013/Auction.html

Questions? write to info@balisage.net

The weather should be nice but being wet at 16°C / 60°F isn’t much fun. Get a room.

A resource for fully calibrated NASA data

Filed under: Astroinformatics,Data — Patrick Durusau @ 12:46 pm

A resource for fully calibrated NASA data by Scott Fleming, an astronomer at Space Telescope Science Institute.

From the post:

The Mikulski Archive for Space Telescopes (MAST) maintains, among other things, a database of fully calibrated, community-contributed spectra, catalogs, models, and images from UV and optical NASA missions. These High Level Science Products (HLSPs) range from individual objects to wide-field surveys from MAST missions such as Hubble, Kepler, GALEX, FUSE, and Swift UVOT. Some well-known surveys archived as HLSPs include CANDELS, CLASH, GEMS, GOODS, PHAT, the Hubble Ultra Deep Fields, the ACS Survey of Galactic Globular Clusters. (Acronym help here: DOOFAS). And it’s not just Hubble projects: we have HLSPs from GALEX, FUSE, and IUE, to name a few, and some of the HLSPs include data from other missions or ground-based observations. A complete listing can be found on our HLSP main page.

How do I navigate the HLSP webpages?

Each HLSP has a webpage that, in most cases, includes a description of the project, relevant documentation, and previews of data. For example, the GOODS HLSP page has links to the current calibrated and mosaiced FITS data files, the multi-band source catalog, a Cutout Tool for use with images, a Browse page where you can view multi-color, drizzled images, and a collection of external links related to the GOODS survey.

You can search many HLSPs based on target name or coordinates. If you’ve ever used the MAST search forms to access HST, Kepler, or GALEX data, this will look familiar. The search form is great for checking whether your favorite object is part of a MAST HLSP. You can also upload a list of objects through the “File Upload Form” link if you want to check multiple targets. You may also access several of the Hubble-based HLSPs through the Hubble Legacy Archive (HLA). Click on “advanced search” in green, then in the “Proposal ID” field, enter the name of the HLSP product to search for, e.g., “CLASH”. A complete list of HLSPs available through the HLA may be found here where you can also click on the links in the Project Name column to initiate a search within that HLSP.

(…)

More details follow on how to contribute your data.

I suggest following @MAST_News for updates on data and software!

