Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

April 4, 2011

Lucene Scoring API

Filed under: Indexing,Lucene — Patrick Durusau @ 6:34 pm

Lucene Scoring API

The documentation for the Lucene scoring API makes for very interesting reading.

In more ways than one.

Important for anyone who wants to understand how Lucene scores documents, since that scoring will influence the usefulness of searches in your particular domain.
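At its heart, Lucene's default scoring is a TF-IDF variant: rare query terms count for more, repeated terms count for more (with damping), and long documents are penalized. A toy sketch of the idea in plain Python, not Lucene's exact formula:

```python
import math

def score(query_terms, doc, corpus):
    """Toy TF-IDF score: sum over query terms of tf * idf,
    normalized by document length. Not Lucene's exact formula."""
    n_docs = len(corpus)
    total = 0.0
    for term in query_terms:
        tf = doc.count(term)                      # term frequency in this doc
        df = sum(1 for d in corpus if term in d)  # docs containing the term
        if tf == 0 or df == 0:
            continue
        idf = math.log(n_docs / df) + 1.0         # rarer terms weigh more
        total += math.sqrt(tf) * idf              # damped term frequency
    return total / math.sqrt(len(doc))            # crude length normalization

corpus = [
    ["lucene", "scores", "documents"],
    ["topic", "maps", "score", "subjects"],
    ["lucene", "indexes", "text"],
]
```

Lucene layers boosts, coord factors, and query normalization on top of this, but the TF-IDF core is the same shape.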

But I think it is also important because it emphasizes that the scoring is for documents and not subjects.

Scoring documents is a very useful thing, because it (hopefully) puts the ones most relevant to a search at or near the top of the results.

But isn’t that similar to the last mile problem with high-speed internet delivery?

That is, it is one thing to get high-speed internet service to the local switching office. It is quite another to get it to each home, hence the last mile problem.

An indexing solution like Lucene can, maybe, get you to the right document for a search, but that leaves you to go the last mile: finding the subject of interest in the document itself.

And, just as importantly, relating that subject to other information about the same subject.

True enough, I have been doing that very thing with print indexes and hard copy long before the arrival of widespread full text indexes and on-demand versions of texts.

It seems like a terrible waste of time and resources for everyone interested in a particular subject to have to dig information out of documents and then that cycle is repeated every time someone looks up that subject and finds a particular document.

We all keep running the last semantic mile.

The question is: what would motivate us to shorten that to, say, the last half of a semantic mile, or less?

Tinkerpop – New Releases

Filed under: Blueprints,Frames,Graphs,Gremlin,Pipes — Patrick Durusau @ 6:32 pm

Tinkerpop – New Releases

From the release notes:

Blueprints 0.6 Oscar: https://github.com/tinkerpop/blueprints/wiki/Release-Notes
Pipes 0.4 Spigot: https://github.com/tinkerpop/pipes/wiki/Release-Notes
Frames 0.1 Brick-by-Brick: https://github.com/tinkerpop/frames/wiki/Release-Notes
Gremlin 0.9 Gremlin the Grouch: https://github.com/tinkerpop/gremlin/wiki/Release-Notes

Topincs 5.4.1 – Enhancements/Bug Fix

Filed under: Topic Map Software,Topincs — Patrick Durusau @ 6:32 pm

Topincs 5.4.1 – Enhancements/Bug Fix

Enhancements/bug fix for Topincs 5.4.0.

Rare to see both enhancements and bug fixes arrive this quickly.

Kudos to Robert Cerny!

Apache Hive 0.7.0 Released!

Filed under: Hadoop,Hive — Patrick Durusau @ 6:31 pm

Apache Hive 0.7.0 Released!

I count thirty-four (34) new features, so I am not going to list them here. There are improvements as well.

April 3, 2011

Hama

Filed under: Bulk Synchronous Parallel (BSP),Pregel — Patrick Durusau @ 6:39 pm

Hama

An Apache Incubator project that describes itself as:

Hama is a distributed computing framework based on BSP (Bulk Synchronous Parallel) computing techniques for massive scientific computations.

A somewhat better explanation appears on the Hama blog, in answer to the question “How will Hama BSP [be] different from Pregel?”:

Hama BSP is a computing engine, based on BSP model, like a Pregel, and it’ll be compatible with existing HDFS cluster, or any FileSystem and Database in the future. However, we believe that the BSP computing model is not limited to a problems of graph; it can be used for widely distributed software such as Map/Reduce. In addition to a field of graph, there are many other algorithms, which have similar problems with graph processing using Map/Reduce. Actually, the BSP model has been researched for many years in the field of matrix computation, too.

(Source: http://blogs.apache.org/hama/)

Wikipedia has a short article on Bulk synchronous parallel (BSP) computing techniques with some references.
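The BSP model structures a computation as a sequence of supersteps: local computation, message exchange, then a global barrier before the next superstep. A toy single-process simulation in Python (illustrative only, not Hama's API):

```python
def bsp_run(values, supersteps):
    """Toy BSP: each worker holds one value; in every superstep each
    worker sends its value to all others, then takes the max of what
    it received. A global barrier separates supersteps."""
    workers = list(values)
    for _ in range(supersteps):
        # local compute + send: every worker broadcasts its value
        inboxes = [[v for v in workers] for _ in workers]
        # barrier: all messages are delivered before the next superstep
        workers = [max(inbox) for inbox in inboxes]
    return workers
```

Here one superstep suffices because every worker messages every other; Pregel-style graph algorithms typically need several supersteps, messaging only along edges.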

Introduction to Hadoop: Map Reduce

Filed under: Hadoop,MapReduce — Patrick Durusau @ 6:39 pm

Introduction to Hadoop: Map Reduce

Introduction to Hadoop by Steve Watt.

I may not have the first principle verbatim correct, but I liked:

Data must remain at rest.

The principle being that the work is moved to the data (due to the overhead of moving large amounts of data).

That would also seem to point in the direction of functional programming principles.
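The connection is natural because map and reduce are side-effect-free functions applied to the data wherever it rests. A toy word count in plain Python, mimicking the two phases (no Hadoop involved):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # map: emit a (word, 1) pair for every word, independently per line
    return [(word, 1) for line in lines for word in line.split()]

def reduce_phase(pairs):
    # shuffle: group pairs by key; reduce: sum the counts per key
    pairs = sorted(pairs, key=itemgetter(0))
    return {key: sum(n for _, n in group)
            for key, group in groupby(pairs, key=itemgetter(0))}
```

In Hadoop the map tasks run on the nodes that already hold the data blocks, and only the intermediate pairs move across the network.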

Octave

Filed under: Mathematics,Visualization — Patrick Durusau @ 6:38 pm

Octave

From the website:

GNU Octave is a high-level interpreted language, primarily intended for numerical computations. It provides capabilities for the numerical solution of linear and nonlinear problems, and for performing other numerical experiments. It also provides extensive graphics capabilities for data visualization and manipulation. Octave is normally used through its interactive command line interface, but it can also be used to write non-interactive programs. The Octave language is quite similar to Matlab so that most programs are easily portable.

A new version, 3.4.0, was released February 8, 2011.

Shogun – Google Summer of Code 2011

Filed under: Hidden Markov Model,Kernel Methods,Machine Learning,Vectors — Patrick Durusau @ 6:38 pm

Shogun – Google Summer of Code 2011

Students! Here is your chance to work on a cutting edge software library for machine learning!

Pick from the posted ideas, or submit your own.

From the website:

SHOGUN is a machine learning toolbox, which is designed for unified large-scale learning for a broad range of feature types and learning settings. It offers a considerable number of machine learning models such as support vector machines for classification and regression, hidden Markov models, multiple kernel learning, linear discriminant analysis, linear programming machines, and perceptrons. Most of the specific algorithms are able to deal with several different data classes, including dense and sparse vectors and sequences using floating point or discrete data types. We have used this toolbox in several applications from computational biology, some of them coming with no less than 10 million training examples and others with 7 billion test examples. With more than a thousand installations worldwide, SHOGUN is already widely adopted in the machine learning community and beyond.

SHOGUN is implemented in C++ and interfaces to MATLAB, R, Octave, Python, and has a stand-alone command line interface. The source code is freely available under the GNU General Public License, Version 3 at http://www.shogun-toolbox.org.

This summer we are looking to extend the library in four different ways: Improving interfaces to other machine learning libraries or integrating them when appropriate, improved i/o support, framework improvements and new machine algorithms. Here is listed a set of suggestions for projects.
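Of the model families listed, the perceptron is the simplest to show; here is a minimal sketch in plain Python (illustrative only, not SHOGUN's interface):

```python
def perceptron_train(samples, labels, epochs=10):
    """Train a perceptron: for each misclassified sample, nudge the
    weights toward the correct side. Labels are +1 / -1."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified (or on the boundary)
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

def perceptron_predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Toy separable data: positives upper-right, negatives lower-left.
w, b = perceptron_train([[2, 2], [3, 3], [-2, -1], [-3, -2]], [1, 1, -1, -1])
```

SHOGUN's value is doing this kind of thing at scale, with kernels, sparse features, and millions of examples; the learning rule above is just the conceptual core.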

A prior post on Shogun.

April 2, 2011

Beyond MapReduce – Large Scale Graph Processing with GoldenOrb

Filed under: Graphs,Hadoop,MapReduce — Patrick Durusau @ 5:36 pm

Beyond MapReduce – Large Scale Graph Processing with GoldenOrb

While waiting for the open source release, I ran across this presentation about GoldenOrb by Zach Richardson, Co-Founder of Ravel Data.

Covers typical use cases, such as mining social networks and molecule modeling.

Pathogen Portal

Filed under: Bioinformatics,Biomedical,Dataset — Patrick Durusau @ 5:34 pm

Pathogen Portal, The Bioinformatics Resource Centers Portal.

From the website:

Pathogen Portal is a repository linking to the Bioinformatics Resource Centers (BRCs) sponsored by the National Institute of Allergy and Infectious Diseases (NIAID) and maintained by The Virginia Bioinformatics Institute. The BRCs are providing web-based resources to scientific community conducting basic and applied research on organisms considered potential agents of biowarfare or bioterrorism or causing emerging or re-emerging diseases.

Motherlode of resources and datasets on “…potential agents of biowarfare or bioterrorism….”

I read an article years ago in Popular Science about smearing punji stakes with water buffalo excrement. A primitive, but effective, form of biowarfare.

I suppose that would fall in the realm of applied research for purposes of a topic map.

EuPathDB

Filed under: Bioinformatics,Biomedical,TMQL — Patrick Durusau @ 5:29 pm

EuPathDB

From the website:

EuPathDB Bioinformatics Resource Center for Biodefense and Emerging/Re-emerging Infectious Diseases is a portal for accessing genomic-scale datasets associated with the eukaryotic pathogens (Cryptosporidium, Encephalitozoon, Entamoeba, Enterocytozoon, Giardia, Leishmania, Neospora, Plasmodium, Toxoplasma, Trichomonas and Trypanosoma).

OK, other than being cited in the previous post about integration using ontologies, why is this relevant to topic maps?

Check out the web tutorial on search strategies.

Now imagine being able to select/revise/view results for a TMQL query.

It would take some work and no doubt be domain specific, but I thought the example would be worth bringing to your attention.

Not to mention that these are data sets where improved access using topic maps could attract attention.

Using ontologies in integrative tools for protozoan parasite research

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 5:26 pm

Using ontologies in integrative tools for protozoan parasite research

Abstract:

Protozoan parasites such as those that cause malaria and toxoplasmosis remain major threats to global health, and a significant biodefense concern. Current treatments are limited and sometimes compromised by acquired resistance. Solutions will come from the integration and mining of ongoing research. The need for data integration is common among research communities tackling complex topics such as the biology of eukaryotic pathogens, their interaction with hosts, and the search for druggable targets and vaccine candidates. Biomedical researchers have greatly benefited from the Gene Ontology (GO) that provides standardized terms for annotating protein function, location, and participation in processes. GO and other relevant ontologies have largely been developed to support human and model organism biology with only limited representation of protozoan parasite biology. In addition, the availability and use of standard terms is also very limited for the inputs and outputs of bioinformatic tools that are commonly used to analyze protozoan parasite datasets and is a barrier for linking these tools together. In the Integrative Tools for Protozoan Parasite Research (ITPPR) project, we have started addressing these areas by developing tools needed by the communities served by EuPathDB (http://eupathdb.org/). We are using ontology-based models as part of our process to build tools for collecting information on isolates, describing phenotypic outcomes of transgenic parasites, and for joining web services running sequence similarity and alignment analysis. Ontologies are drawn from the OBO Foundry and include the Infectious Disease Ontology (IDO) and OBI (Ontology for Biomedical Investigations).

Topic: NCBO Webinar Series
Date: Wednesday, April 6, 2011
Time: 10:00 am, Pacific Daylight Time (San Francisco, GMT-07:00)

That’s the Wednesday following this post.

An area where integration of data can make a difference.

April 1, 2011

SEISA: set expansion by iterative similarity aggregation

Filed under: Aggregation,Sets — Patrick Durusau @ 4:11 pm

SEISA: set expansion by iterative similarity aggregation by Yeye He, University of Wisconsin-Madison, Madison, WI, USA, and Dong Xin, Microsoft Research, Redmond, WA, USA.

In this paper, we study the problem of expanding a set of given seed entities into a more complete set by discovering other entities that also belong to the same concept set. A typical example is to use “Canon” and “Nikon” as seed entities, and derive other entities (e.g., “Olympus”) in the same concept set of camera brands. In order to discover such relevant entities, we exploit several web data sources, including lists extracted from web pages and user queries from a web search engine. While these web data are highly diverse with rich information that usually cover a wide range of the domains of interest, they tend to be very noisy. We observe that previously proposed random walk based approaches do not perform very well on these noisy data sources. Accordingly, we propose a new general framework based on iterative similarity aggregation, and present detailed experimental results to show that, when using general-purpose web data for set expansion, our approach outperforms previous techniques in terms of both precision and recall.

To the uses of set expansion mentioned by the authors:

Set expansion systems are of practical importance and can be used in various applications. For instance, web search engines may use the set expansion tools to create a comprehensive entity repository (for, say, brand names of each product category), in order to deliver better results to entity-oriented queries. As another example, the task of named entity recognition can also leverage the results generated by set expansion tools [13].

I would add:

  • augmented authoring of navigation tools for text corpora
  • discovery of related entities (for associations)

While the authors concentrate on web-based documents, which for the most part are freely available, the techniques shown here could be just as easily applied to commercial texts or used to generate pay-for-view results.

It would really have to be a step up to get people to pay a premium for navigation of free content, but given the noisy nature of most information sites, that is certainly possible.
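A toy illustration of the general idea behind list-based set expansion (not the paper's SEISA algorithm): candidates that co-occur with the seeds in many source lists are likely members of the same concept set.

```python
def expand(seeds, lists, top_k=1):
    """Score each non-seed entity by how many source lists it shares
    with at least one seed; return the top-scoring candidates."""
    scores = {}
    for entities in lists:
        if any(s in entities for s in seeds):
            for e in entities:
                if e not in seeds:
                    scores[e] = scores.get(e, 0) + 1
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]

# Hypothetical lists scraped from web pages:
lists = [
    ["Canon", "Nikon", "Olympus"],
    ["Canon", "Olympus", "Pentax"],
    ["Nikon", "Olympus"],
    ["apple", "banana"],
]
```

The paper's contribution is making this robust on noisy web data, where simple co-occurrence counting like the above breaks down.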

A word at a time: computing word relatedness using temporal semantic analysis

Filed under: Semantics,Temporal Semantic Analysis — Patrick Durusau @ 4:11 pm

A word at a time: computing word relatedness using temporal semantic analysis by Kira Radinsky, Technion-Israel Institute of Technology, Haifa, Israel; Eugene Agichtein, Emory University, Atlanta, GA, USA; Evgeniy Gabrilovich, Yahoo! Research, Santa Clara, CA, USA; Shaul Markovitch, Technion-Israel Institute of Technology, Haifa, Israel.

Computing the degree of semantic relatedness of words is a key functionality of many language applications such as search, clustering, and disambiguation. Previous approaches to computing semantic relatedness mostly used static language resources, while essentially ignoring their temporal aspects. We believe that a considerable amount of relatedness information can also be found in studying patterns of word usage over time. Consider, for instance, a newspaper archive spanning many years. Two words such as “war” and “peace” might rarely co-occur in the same articles, yet their patterns of use over time might be similar. In this paper, we propose a new semantic relatedness model, Temporal Semantic Analysis (TSA), which captures this temporal information. The previous state of the art method, Explicit Semantic Analysis (ESA), represented word semantics as a vector of concepts. TSA uses a more refined representation, where each concept is no longer scalar, but is instead represented as time series over a corpus of temporally-ordered documents. To the best of our knowledge, this is the first attempt to incorporate temporal evidence into models of semantic relatedness. Empirical evaluation shows that TSA provides consistent improvements over the state of the art ESA results on multiple benchmarks.

The discovery of “related” terms may lead to discovery of synonyms for a subject, associations with a subject and other grist for your topic map mill.

This is interesting work and should be considered whenever you are topic mapping material recorded over time. Historical government archives come to mind.
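A toy sketch of the temporal intuition (not the paper's TSA model): words whose usage-over-time profiles correlate are treated as related, even if they rarely co-occur.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length usage time series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical yearly frequencies in a newspaper archive:
war   = [10, 50, 45, 8, 60, 55]
peace = [12, 48, 40, 10, 58, 50]
cat   = [30, 28, 31, 29, 30, 28]
# "war" and "peace" spike together, so their profiles correlate
# strongly; "cat" is flat and unrelated.
```

TSA itself represents each concept as a time series over a temporally ordered corpus; this just shows why the temporal signal carries relatedness information at all.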

Solr 3.1 (Lucene 3.1) Released!

Filed under: Lucene,Search Engines,Searching,Solr — Patrick Durusau @ 4:10 pm

Solr 3.1 (Lucene 3.1) Released!

Solr 3.1, which contains Lucene 3.1, was released on 31 March 2011.

See the release announcement for the full list of new features and quick links.

matplotlib

Filed under: Graphics — Patrick Durusau @ 4:09 pm

matplotlib

From the website:

matplotlib is a python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. matplotlib can be used in python scripts, the python and ipython shell (ala MATLAB®* or Mathematica®†), web application servers, and six graphical user interface toolkits.

If the name sounds familiar, it should: it is mentioned in scikits.learn – machine learning in Python, but I didn’t have a link to the package.
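A minimal example of the library in use (headless backend so it runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render to a file, no GUI
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2, 3], [0, 1, 4, 9], marker="o", label="y = x^2")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
fig.savefig("squares.png")  # hardcopy output; PDF, SVG, etc. also work
```

The same figure code runs unchanged in scripts, the ipython shell, or behind a web application server, which is the portability the project description is pointing at.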
