Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

June 5, 2011

Your Data, Your Search

Filed under: ElasticSearch,Search Engines,Searching — Patrick Durusau @ 3:21 pm

Your Data, Your Search by Karel Minařík.

A slide deck, but a very interesting one. It covers the shortcomings of search, gives an overview of inverted indexing, and ends with ElasticSearch. Along the way he observes that the “analysis” step is often more important than the “search” step. I suspect that analysis is nearly always more important than searching. And certainly harder.
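
To make the point concrete, here is a minimal sketch in Python (my language choice for illustration; the documents and stopword list are invented) of both steps: an analyzer that tokenizes, lowercases, and drops stopwords, and an inverted index built from its output. The analyzer decides what is findable; change it (add stemming, say) and the same query returns different documents.

    from collections import defaultdict

    STOPWORDS = {"the", "a", "an", "of", "and"}

    def analyze(text):
        # Analysis step: tokenize, lowercase, drop stopwords.
        return [t for t in text.lower().split() if t not in STOPWORDS]

    def build_index(docs):
        # Inverted index: term -> set of document ids.
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for term in analyze(text):
                index[term].add(doc_id)
        return index

    def search(index, query):
        # Search step: intersect the posting sets of the query terms.
        terms = analyze(query)
        if not terms:
            return set()
        return set.intersection(*(index.get(t, set()) for t in terms))

    docs = {1: "The Art of Search", 2: "Searching and Analysis"}
    index = build_index(docs)
    print(search(index, "the search"))  # {1} -- without stemming, "searching" is missed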

Bio4j includes RefSeq data now!

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 3:20 pm

Bio4j includes RefSeq data now!

A word about the RefSeq data (I haven’t reproduced all the hyperlinks, which are many):

NCBI’s Reference Sequence (RefSeq) database is a collection of taxonomically diverse, non-redundant and richly annotated sequences representing naturally occurring molecules of DNA, RNA, and protein. Included are sequences from plasmids, organelles, viruses, archaea, bacteria, and eukaryotes. Each RefSeq is constructed wholly from sequence data submitted to the International Nucleotide Sequence Database Collaboration (INSDC). Similar to a review article, a RefSeq is a synthesis of information integrated across multiple sources at a given time. RefSeqs provide a foundation for uniting sequence data with genetic and functional information. They are generated to provide reference standards for multiple purposes ranging from genome annotation to reporting locations of sequence variation in medical records. The RefSeq collection is available without restriction and can be retrieved in several different ways, such as by searching or by available links in NCBI resources, including PubMed, Nucleotide, Protein, Gene, and Map Viewer, searching with a sequence via BLAST, and downloading from the RefSeq FTP site.

Source: http://www.ncbi.nlm.nih.gov/books/NBK21091/

BTW, note that the RefSeq information is stored in the Bio4j DB but the sequences are held as separate files on S3. See the blog post for details. (Thanks to @pablopareja for the correction on the storage of RefSeq information in the Bio4j DB.)

neo4jrestclient 1.3.2

Filed under: Neo4j,Python — Patrick Durusau @ 3:19 pm

neo4jrestclient 1.3.2

From the website:

Library to interact with Neo4j standalone REST server

The first objective of the Neo4j Python REST Client is to make the use of a local database through neo4j.py, or a remote database via the Neo4j REST Server, transparent for Python programmers. So, the syntax of this API is fully compatible with neo4j.py. However, a new syntax is introduced in order to allow a more pythonic style.
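
A minimal sketch of what that looks like, assuming a Neo4j standalone REST server at the default local URL (the node properties and relationship type are invented):

    from neo4jrestclient.client import GraphDatabase

    # Connect to a Neo4j standalone REST server (default local URL).
    gdb = GraphDatabase("http://localhost:7474/db/data/")

    # Create nodes and a typed relationship, neo4j.py-style.
    alice = gdb.nodes.create(name="Alice")
    bob = gdb.nodes.create(name="Bob")
    alice.relationships.create("Knows", bob, since=2011)

    # Dict-style property access is part of the more pythonic syntax.
    print(alice["name"], "knows", bob["name"])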

June 4, 2011

XQuery Guestbook

Filed under: XQuery — Patrick Durusau @ 7:14 pm

XQuery Guestbook

A guestbook written entirely in XQuery.

Is XQuery a good model for the capabilities people will expect from TMQL? I am thinking that offering less than the “average” set of capabilities is going to make TMQL look lame. Thoughts?

gmap3

Filed under: JQuery,Maps — Patrick Durusau @ 7:14 pm

gmap3

From the website:

A jQuery plugin to create Google Maps with advanced features (overlays, clusters, callbacks, events…)

Is this a pathway to marry topic maps and Google Maps?

June 3, 2011

IBM InfoSphere BigInsights

Filed under: Avro,BigInsights,Hadoop,HBase,Lucene,Pig,Zookeeper — Patrick Durusau @ 2:32 pm

IBM InfoSphere BigInsights

Two items stand out from the usual laundry list of “easy administration” and “IBM supports open source” claims:

The Jaql query language. Jaql, a Query Language for JavaScript Object Notation (JSON), provides the capability to process both structured and non-traditional data. Its SQL-like interface is well suited for quick ramp-up by developers familiar with the SQL language and makes it easier to integrate with relational databases.

….

Integrated installation. BigInsights includes IBM value-added technologies, as well as open source components, such as Hadoop, Lucene, Hive, Pig, ZooKeeper, HBase, and Avro, to name a few.

I guess it must include a “few” things, since the 64-bit Linux download is 398 MB.

Just pointing out its availability. More commentary to follow.

brain – Javascript for supervised machine learning

Filed under: Javascript,Machine Learning — Patrick Durusau @ 2:31 pm

brain – Javascript for supervised machine learning

From the website:

brain is a JavaScript library for neural networks and Bayesian classifiers.

The documentation reports that it stores data in memory by default, but it also has a Redis backend.

Reported to be used for spam filtering. Filtering is predicated on recognizing some basis for filtering, dare I say subject recognition? Some subjects are classes, such as spam, which are composed of included subjects, such as individual spam senders or messages. How fine-grained subject recognition needs to be depends upon the purpose of the recognition.
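
brain itself is JavaScript, but the Bayesian classification it offers is easy to sketch. Here is a toy naive Bayes spam classifier in Python (the training data and tokenization are invented) of the kind such libraries implement:

    from collections import Counter
    import math

    def train(labeled_docs):
        # Count token frequencies per class and class priors.
        counts = {"spam": Counter(), "ham": Counter()}
        totals = Counter()
        for text, label in labeled_docs:
            counts[label].update(text.lower().split())
            totals[label] += 1
        return counts, totals

    def classify(text, counts, totals):
        # Pick the class maximizing log P(class) + sum of log P(token|class).
        vocab = len(set(counts["spam"]) | set(counts["ham"]))
        scores = {}
        for label, ctr in counts.items():
            n = sum(ctr.values())
            score = math.log(totals[label] / sum(totals.values()))
            for tok in text.lower().split():
                score += math.log((ctr[tok] + 1) / (n + vocab))  # Laplace smoothing
            scores[label] = score
        return max(scores, key=scores.get)

    docs = [("buy cheap meds now", "spam"), ("cheap viagra offer", "spam"),
            ("lunch meeting tomorrow", "ham"), ("project meeting notes", "ham")]
    counts, totals = train(docs)
    print(classify("cheap meds offer", counts, totals))  # spam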

What subjects are you filtering for?

Red-R – Pipeline Visual Editor for Doing Stats With R

Filed under: R,Statistics — Patrick Durusau @ 2:31 pm

Red-R – Pipeline Visual Editor for Doing Stats With R

An overview of Red-R, a visual editor for R.

From the post:

To ease my way into R, I’ve started using R-Studio, an in-development IDE. But the other day, I was also tipped off about Red-R, a visual programming environment for R that seems to be built around the same tooling as the Orange data analysis tool I wrote about last year.

It’s still pretty ropey at the moment (on a Mac at least), but works enough to be going on with…

The metaphor is based on pipeline processing of data, chaining together functional blocks with wires in the order you want the functions to be executed. Getting data in is currently from a file (it would be nice to see hooks into online datasources supported too), with a range of options for getting the data into the environment in a structured way:….
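
The wiring metaphor is easy to state in code. A minimal sketch in Python, with hypothetical blocks standing in for Red-R widgets:

    from functools import reduce

    def pipeline(*blocks):
        # Wire blocks together: each block's output feeds the next.
        return lambda data: reduce(lambda d, block: block(d), blocks, data)

    # Hypothetical blocks standing in for Red-R widgets.
    parse = lambda rows: [float(r) for r in rows]
    drop_outliers = lambda xs: [x for x in xs if x < 100]
    mean = lambda xs: sum(xs) / len(xs)

    analyze = pipeline(parse, drop_outliers, mean)
    print(analyze(["1.5", "2.5", "250"]))  # 2.0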

How would you visualize topic map processing as topics, etc., encounter constraints or are formed by queries? What of items that are discarded or not selected? I am thinking of something along the lines of interactive creation/destruction of topics, along with merging of the same.

June 2, 2011

InterMine

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 7:45 pm

InterMine

From the website:

InterMine is a powerful open source data warehouse system. Using InterMine, you can create databases of biological data accessed by sophisticated web query tools. Parsers are provided for integrating data from several common biological formats and there is a framework for adding your own data. InterMine includes an attractive, user-friendly web interface that works ‘out of the box’ and can be easily customised for your specific needs, as well as a powerful, scriptable web-service API to allow programmatic access to your data.

InterMine is biological data integration software, the uses of which provide a window into the complexities of data integration.

Powered by InterMine:

Definitely a project where topic mappers can see what has been done already for integration of biological data as well as find places where topic maps can contribute to further solutions.
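
The scriptable web-service API mentioned above is a natural entry point. A minimal sketch, assuming the InterMine Python client and the public FlyMine instance (the class and field names are illustrative; consult the InterMine documentation):

    from intermine.webservice import Service

    # FlyMine is one of the public InterMine instances.
    service = Service("http://www.flymine.org/query/service")

    # Build a query over the Gene class and constrain it by symbol.
    query = service.new_query("Gene")
    query.add_view("Gene.symbol", "Gene.organism.name")
    query.add_constraint("Gene.symbol", "=", "eve")

    for row in query.rows():
        print(row["Gene.symbol"], row["Gene.organism.name"])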

An introduction to Category Theory for Software Engineers

Filed under: Category Theory,Software — Patrick Durusau @ 7:45 pm

An introduction to Category Theory for Software Engineers

Dr. Steve Easterbrook of the University of Toronto introduces category theory and covers these topics:

  • What is Category Theory?
  • Why should we be interested in Category Theory?
  • How much Category Theory is it useful to know?
  • What kinds of things can you do with Category Theory in Software Engineering?
  • Does Category Theory help us to automate things? (for the ASE audience)

One of the more approachable introductions/overviews of category theory that I have seen.
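
For a taste of the connection to software: objects map to types and morphisms to functions. A minimal sketch in Python of composition and the identity law (my example, not from the talk):

    def compose(g, f):
        # Composition of morphisms: (g . f)(x) = g(f(x)).
        return lambda x: g(f(x))

    identity = lambda x: x

    # Objects as types, morphisms as functions between them.
    f = lambda n: str(n)   # int -> str
    g = lambda s: len(s)   # str -> int
    h = compose(g, f)      # int -> int

    assert h(12345) == 5
    # Category law: identity is a unit for composition.
    assert compose(f, identity)(7) == compose(identity, f)(7) == f(7)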

Annotation Ontology and the SWAN Annotation Tool Webinar – 15 June 2011 10 AM PT

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 7:44 pm

Linking science and semantics with Annotation Ontology and the SWAN Annotation Tool

From the website:

ABSTRACT:

Annotation Ontology (AO) is an open ontology in OWL for annotating scientific documents on the web. AO supports both human and algorithmic content annotation. It enables “stand-off” or independent metadata anchored to specific positions in a web document by any one of several methods. In AO, the document may be annotated but is not required to be under update control of the annotator. AO contains a provenance model to support versioning, and a set model for specifying groups and containers of annotation.

The SWAN Annotation Tool, recently renamed DOMEO (Document Metadata Exchange Organizer), is an extensible web application enabling users to visually and efficiently create and share ontology-based stand-off annotation metadata on HTML or XML document targets, using the Annotation Ontology RDF model. The tool supports manual, fully automated, and semi-automated annotation with complete provenance records, as well as personal or community annotation with access authorization and control.

[AO] http://code.google.com/p/annotation-ontology/
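
The “stand-off” idea is simple to state in code: the annotation lives apart from the document and is anchored to it by position. A minimal sketch in Python (the class and its fields are illustrative, not AO’s actual schema):

    from dataclasses import dataclass

    @dataclass
    class StandoffAnnotation:
        # Annotation kept apart from the document, anchored by position.
        target_url: str   # the document being annotated (not under our control)
        start: int        # character offset where the anchor begins
        end: int          # character offset where the anchor ends
        body: str         # the annotation content
        author: str       # provenance, which AO's provenance model records

    doc = "Amyloid beta accumulates in Alzheimer's disease."
    note = StandoffAnnotation(
        target_url="http://example.org/paper.html",  # hypothetical target
        start=0, end=12,
        body="Gene/protein mention",
        author="pciccarese",
    )
    assert doc[note.start:note.end] == "Amyloid beta"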

SPEAKER BIO:

Paolo Ciccarese is an Instructor at Harvard Medical School and an Assistant in Neurology at the Massachusetts General Hospital. After obtaining his MS in Computer Science, he started his career as a freelance consultant in knowledge management software development. Soon after, Paolo received a PhD in Bioengineering and Bioinformatics from the University of Pavia, Italy, where he was also a teaching assistant for five years in courses on artificial intelligence in medicine and object-oriented programming. Outside of his doctorate work, Paolo co-developed the RDF visualizer Welkin for the SIMILE project and founded the JDPF (Java Data Processing Framework) project, a modular and extendable open source infrastructure for processing large quantities of heterogeneous data.

Immediately following the completion of his PhD, Paolo became a research fellow in the Department of Neurology at Massachusetts General Hospital, where he co-developed the SWAN (Semantic Web Applications in Neuromedicine) platform. Paolo has authored several ontologies, including the SWAN Ontology, the PAV (Provenance, Authoring and Versioning) ontology, and the Annotation Ontology (AO). For the past three years he has also served as coordinator of several subtasks of the Scientific Discourse task force of the W3C HCLS Interest Group. Currently, Paolo is focusing on the design and development of knowledge management tools that leverage Semantic Web technologies for integrating the annotation of online resources.

Extra credit: What does “stand-off” annotation have in common with HyTime? 😉

Second International Workshop on Consuming Linked Data (COLD2011)

Filed under: Linked Data,LOD — Patrick Durusau @ 7:42 pm

Second International Workshop on Consuming Linked Data (COLD2011)

Important Dates:

Paper submission deadline: August 15, 2011, 23:59 Hawaii time
Acceptance notification: September 6, 2011
Camera-ready versions of accepted papers: September 15, 2011
Workshop date: October 23 or 24, 2011

From the website:

Abstract:

The quantity of published Linked Data is increasing dramatically. However, applications that consume Linked Data are not yet widespread. Current approaches lack methods for seamless integration of Linked Data from multiple sources, dynamic discovery of available data and data sources, provenance and information quality assessment, application development environments, and appropriate end user interfaces. Addressing these issues requires well-founded research, including the development and investigation of concepts that can be applied in systems which consume Linked Data from the Web. Following the success of the 1st International Workshop on Consuming Linked Data, we organize the second edition of this workshop in order to provide a platform for discussion and work on these open research problems. The main objective is to provide a venue for scientific discourse (including systematic analysis and rigorous evaluation) of concepts, algorithms and approaches for consuming Linked Data.

Err “…lack methods for seamless integration of Linked Data from multiple sources…” has topic maps written all over it.

June 1, 2011

Silk – A Link Discovery Framework for the Web of Data

Filed under: Linked Data,LOD,RDF,Semantic Web — Patrick Durusau @ 6:52 pm

Silk – A Link Discovery Framework for the Web of Data

From the website:

The Web of Data is built upon two simple ideas: First, to employ the RDF data model to publish structured data on the Web. Second, to set explicit RDF links between data items within different data sources. Background information about the Web of Data is found at the wiki pages of the W3C Linking Open Data community effort, in the overview article Linked Data – The Story So Far and in the tutorial on How to publish Linked Data on the Web.

The Silk Link Discovery Framework supports data publishers in accomplishing the second task. Using the declarative Silk – Link Specification Language (Silk-LSL), developers can specify which types of RDF links should be discovered between data sources as well as which conditions data items must fulfill in order to be interlinked. These link conditions may combine various similarity metrics and can take the graph around a data item into account, which is addressed using an RDF path language. Silk accesses the data sources that should be interlinked via the SPARQL protocol and can thus be used against local as well as remote SPARQL endpoints.

Of particular interest are the comparison operators:

A comparison operator evaluates two inputs and computes their similarity based on a user-defined metric.
The Silk framework currently supports the following similarity metrics, each of which returns a similarity value between 0 (lowest similarity) and 1 (highest similarity):

  • levenshtein([float maxDistance], [float minValue], [float maxValue]): String similarity based on the Levenshtein metric.
  • jaro: String similarity based on the Jaro distance metric.
  • jaroWinkler: String similarity based on the Jaro-Winkler metric.
  • qGrams(int q): String similarity based on q-grams (by default q=2).
  • equality: Returns 1 if the strings are equal, 0 otherwise.
  • inequality: Returns 0 if the strings are equal, 1 otherwise.
  • num(float maxDistance, float minValue, float maxValue): Computes the numeric distance between two numbers and normalizes it using the threshold. Parameters: maxDistance (the similarity score is 0.0 if the distance is greater than maxDistance); minValue and maxValue (the minimum and maximum values which occur in the data source).
  • date(int maxDays): Computes the similarity between two dates (“YYYY-MM-DD” format). At a difference of maxDays the metric evaluates to 0, progressing towards 1 as the difference decreases.
  • wgs84(string unit, float threshold, string curveStyle): Computes the geographical distance between two points. Parameters: unit (the unit in which the distance is measured; allowed values: “meter” or “m” (default), “kilometer” or “km”); threshold (values greater than the threshold give 0, values below it vary with curveStyle); curveStyle (“linear” gives a linear transition; “logistic” uses the logistic function f(x)=1/(1+e^(x)), a softer curve with a slow slope at the start and end and a steep one in the middle).

Author: Konrad Höffner (MOLE subgroup of Research Group AKSW, University of Leipzig)

(better formatting is available at the original page but I thought the operators important enough to report in full here)

Definitely a step towards more than opaque mapping between links. Note, for example, that the Silk – Link Specification Language declares why two or more items are linked together. More could be said, but this is a start in the right direction.
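
For concreteness, here is one plausible reading of the levenshtein operator in Python: compute the edit distance, then normalize into [0, 1] using maxDistance (Silk’s exact normalization may differ):

    def levenshtein(a, b):
        # Classic dynamic-programming edit distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
            prev = curr
        return prev[-1]

    def levenshtein_similarity(a, b, max_distance=3):
        # Map edit distance into [0, 1]: 1.0 identical, 0.0 at or beyond maxDistance.
        d = levenshtein(a, b)
        return 0.0 if d >= max_distance else 1.0 - d / max_distance

    print(levenshtein_similarity("Leipzig", "Liepzig"))  # 2 edits -> 0.33...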

Workshop on Entity-Oriented Search (EOS) – Beijing

Filed under: Conferences,Entity Extraction,Entity Resolution,Searching — Patrick Durusau @ 6:51 pm

The First International Workshop on Entity-Oriented Search (EOS)

Important Dates

Submissions due: June 10, 2011
Notification of acceptance: June 25, 2011
Camera-ready submission: July 1, 2011 (provisional, awaiting confirmation)
Workshop date: July 28, 2011

From the website:

Workshop Theme

Many user information needs concern entities: people, organizations, locations, products, etc. These are better answered by returning specific objects instead of just any type of documents. Both commercial systems and the research community are displaying an increased interest in returning “objects”, “entities”, or their properties in response to a user’s query. While major search engines are capable of recognizing specific types of objects (e.g., locations, events, celebrities), true entity search still has a long way to go.

Entity retrieval is challenging as “objects,” unlike documents, are not directly represented and need to be identified and recognized in the mixed space of structured and unstructured Web data. While standard document retrieval methods applied to textual representations of entities do seem to provide reasonable performance, a big open question remains: how much influence should the entity type have on the ranking algorithms developed?

Avoiding repeated document searching by successive users will require entity identification, as suggested here. Sub-document addressing and retrieval of portions of documents are another aspect of the entity issue.

Elsevier/Tetherless World Health and Life Sciences Hackathon (27-28 June 2011)

Filed under: Conferences,Linked Data — Patrick Durusau @ 6:50 pm

Elsevier/Tetherless World Health and Life Sciences Hackathon (27-28 June 2011)

From the announcement:

The Tetherless World Constellation at RPI is excited to announce that TWC and the SciVerse team at Elsevier are planning a 24-hour Health and Life Sciences Semantic Web Hackathon to be held 27-28 June 2011. The Elsevier-sponsored event will be held at the beautiful Pat’s Barn, on the campus of the Rensselaer Technology Park.

Participants will compete against each other to develop apps using linked data from TWC and other sources, web APIs from Elsevier SciVerse, and visualization and other resources from around the Web.

Registration at: http://twcsciverse2011.eventbrite.com/

You won’t see much other than Pat’s Barn but it is a 24-hour hackathon and there are prizes!

Using topic maps to make linked data links less semantically opaque comes to mind.

Nuxeo World 2011

Filed under: Conferences — Patrick Durusau @ 6:50 pm

Nuxeo World 2011

Paris: 20-21 October 2011

From the website:

We are pleased to invite you to Nuxeo World 2011, the second edition of our international conference on Nuxeo technology and solutions.

Whether you’re a partner or a customer, Nuxeo World provides multiple opportunities to discuss ECM topics and share experiences. We would love to hear how Nuxeo solutions have impacted your business, and give you a chance to share your knowledge and expertise with others in the industry.

Here’s your chance to show how topic maps enhance Enterprise Content Management (ECM) solutions.

Structr

Filed under: Neo4j — Patrick Durusau @ 6:49 pm

Structr

From the website:

structr is a free, open-source CMS under the GPLv3, written in Java, based on the fantastic NoSQL graph database Neo4j.

If you are curious about a new way of creating interactive websites, you should give it a try. But maybe you can even use it for your personal or company website … if you dare.

structr is not yet stable, so please be patient and look out for bugs and minor (or even major) pitfalls.

As the site says, “not yet stable,” but very interesting!

groonga

Filed under: groonga,Search Engines — Patrick Durusau @ 6:49 pm

groonga

From the website:

Groonga is an open-source fulltext search engine and column store. It lets you write high-performance applications that require fulltext search.

Be aware that most of the documentation is written in Japanese.

Consider it an incentive to learn Japanese, to practice Japanese if you already know it but are rusty, or to develop documentation in another language.
