Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

December 21, 2012

New Public-Access Source With 3-D Information for Protein Interactions

Filed under: Bioinformatics,Biomedical,Genome,Genomics — Patrick Durusau @ 5:24 pm

New Public-Access Source With 3-D Information for Protein Interactions

From the post:

Researchers have developed a platform that compiles all the atomic data, previously stored in diverse databases, on protein structures and protein interactions for eight organisms of relevance. They apply a singular homology-based modelling procedure.

The scientists Roberto Mosca, Arnaud Ceol and Patrick Aloy provide the international biomedical community with Interactome3D (interactome3d.irbbarcelona.org), an open-access and free web platform developed entirely by the Institute for Research in Biomedicine (IRB Barcelona). Interactome 3D offers for the first time the possibility to anonymously access and add molecular details of protein interactions and to obtain the information in 3D models. For researchers, atomic level details about the reactions are fundamental to unravel the bases of biology, disease development, and the design of experiments and drugs to combat diseases.

Interactome 3D provides reliable information about more than 12,000 protein interactions for eight model organisms, namely the plant Arabidopsis thaliana, the worm Caenorhabditis elegans, the fly Drosophila melanogaster, the bacteria Escherichia coli and Helicobacter pylori, the brewer’s yeast Saccharomyces cerevisiae, the mouse Mus musculus, and Homo sapiens. These models are considered the most relevant in biomedical research and genetic studies. The journal Nature Methods presents the research results and accredits the platform on the basis of its high reliability and precision in modelling interactions, which reaches an average of 75%.

Further details can be found at:

Interactome3D: adding structural details to protein networks by Roberto Mosca, Arnaud Céol and Patrick Aloy. (Nature Methods (2012) doi:10.1038/nmeth.2289)

Abstract:

Network-centered approaches are increasingly used to understand the fundamentals of biology. However, the molecular details contained in the interaction networks, often necessary to understand cellular processes, are very limited, and the experimental difficulties surrounding the determination of protein complex structures make computational modeling techniques paramount. Here we present Interactome3D, a resource for the structural annotation and modeling of protein-protein interactions. Through the integration of interaction data from the main pathway repositories, we provide structural details at atomic resolution for over 12,000 protein-protein interactions in eight model organisms. Unlike static databases, Interactome3D also allows biologists to upload newly discovered interactions and pathways in any species, select the best combination of structural templates and build three-dimensional models in a fully automated manner. Finally, we illustrate the value of Interactome3D through the structural annotation of the complement cascade pathway, rationalizing a potential common mechanism of action suggested for several disease-causing mutations.

Interesting not only for its implications for bioinformatics but also for the development of homology modeling (superficially, similar proteins have similar interaction sites) to assist in that work.

The topic map analogy would be to show that, within a subject domain, different identifications of the same subject tend to have the same associations or to fall into other patterns.

A subject identity test could then be constructed from a template of associations or other values.
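
For a concrete (if toy) illustration, here is a minimal Python sketch of that idea. The association types and the threshold are invented for the example; a real topic map would draw them from the domain being modelled.

    # Toy sketch: score a candidate identification against a template of
    # associations that representatives of the subject are expected to have.
    # All names and the threshold are invented for illustration.

    TEMPLATE = {"interacts-with", "located-in", "member-of-pathway"}

    def identity_score(candidate_associations):
        """Fraction of template association types the candidate exhibits."""
        types = {assoc_type for assoc_type, _ in candidate_associations}
        return len(TEMPLATE & types) / len(TEMPLATE)

    def same_subject(candidate_associations, threshold=0.75):
        """Treat the candidate as the same subject if it matches enough of the template."""
        return identity_score(candidate_associations) >= threshold

    candidate = [("interacts-with", "P53"), ("located-in", "nucleus"),
                 ("member-of-pathway", "apoptosis")]
    print(same_subject(candidate))  # True: all three template associations present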

D3 3.0 [New Features Illustrated]

Filed under: D3,Graphics,Visualization — Patrick Durusau @ 4:17 pm

D3 3.0

Illustrations of a new geographic projection system for D3, new geographic plugins, transitions, requests and other fixes and enhancements.

If the images don’t spark your interest in D3, you must not have any graphics issues. 😉

How-to: Use a SerDe in Apache Hive

Filed under: Hive — Patrick Durusau @ 4:04 pm

How-to: Use a SerDe in Apache Hive by Jonathan Natkins.

From the post:

Apache Hive is a fantastic tool for performing SQL-style queries across data that is often not appropriate for a relational database. For example, semistructured and unstructured data can be queried gracefully via Hive, due to two core features: The first is Hive’s support of complex data types, such as structs, arrays, and unions, in addition to many of the common data types found in most relational databases. The second feature is the SerDe.

What is a SerDe?

The SerDe interface allows you to instruct Hive as to how a record should be processed. A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De). The Deserializer interface takes a string or binary representation of a record, and translates it into a Java object that Hive can manipulate. The Serializer, however, will take a Java object that Hive has been working with, and turn it into something that Hive can write to HDFS or another supported system. Commonly, Deserializers are used at query time to execute SELECT statements, and Serializers are used when writing data, such as through an INSERT-SELECT statement.

In this article, we will examine a SerDe for processing JSON data, which can be used to transform a JSON record into something that Hive can process.
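
To make that concrete, here is a minimal sketch of the kind of table definition involved, issued from Python via the PyHive client (any Hive client, or the Hive CLI, would do). The SerDe class name, table layout and connection details are assumptions for illustration, not taken from the post.

    # Sketch: declare a Hive table over JSON records with a JSON SerDe, then query it.
    # Assumes a local HiveServer and a JSON SerDe jar already on Hive's classpath;
    # the class name below is one commonly used JSON SerDe, yours may differ.
    from pyhive import hive

    conn = hive.Connection(host="localhost", port=10000)
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE IF NOT EXISTS tweets (
            author   STRUCT<name: STRING, followers: INT>,
            text     STRING,
            hashtags ARRAY<STRING>
        )
        ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    """)

    # The Deserializer turns each JSON record into Hive's complex types at query time.
    cur.execute("SELECT author.name, size(hashtags) FROM tweets LIMIT 10")
    for row in cur.fetchall():
        print(row)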

You may be too busy to notice if you have any presents under the tree. 😉

SPARQL end-point of data.europeana.eu

Filed under: Library,Museums,RDF,SPARQL — Patrick Durusau @ 3:22 pm

SPARQL end-point of data.europeana.eu

From the webpage:

Welcome on the SPARQL end-point of data.europeana.eu!

data.europeana.eu currently contains open metadata on 20 million texts, images, videos and sounds gathered by Europeana. Data is following the terms of the Creative Commons CC0 public domain dedication. Data is described using the Resource Description Framework (RDF) format, and structured using the Europeana Data Model (EDM). We give more detail on the EDM data we publish on the technical details page.

Please take the time to check out the list of collections currently included in the pilot.

The terms of use and external data sources appearing at data.europeana.eu are provided on the Europeana Data sources page.

Sample queries are available on the sparql page.

At first I wondered why this was news, because “Europeana opens up data on 20 million cultural items” appeared in the Guardian on 12 September 2012.

I assume the data has been in use since its release last September.

If you have been using it, can you comment on how your use will change now that the data is available as a SPARQL end-point?
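
If you want to poke at the end-point, here is a minimal sketch using the SPARQLWrapper library. Treat the endpoint URL and the EDM class used as assumptions; check the sparql and technical-details pages at data.europeana.eu for the current address and vocabulary.

    # Sketch: ask the Europeana SPARQL endpoint for a handful of provided objects.
    # Endpoint URL and query details are assumptions for illustration.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://sparql.europeana.eu/")  # assumed endpoint address
    sparql.setQuery("""
        PREFIX edm: <http://www.europeana.eu/schemas/edm/>
        SELECT ?object WHERE {
            ?object a edm:ProvidedCHO .
        } LIMIT 10
    """)
    sparql.setReturnFormat(JSON)

    results = sparql.query().convert()
    for binding in results["results"]["bindings"]:
        print(binding["object"]["value"])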

Intro to Cypher Console [Live Party Friend of Friend Graph?]

Filed under: Cypher,Humor,Neo4j — Patrick Durusau @ 2:54 pm

Intro to Cypher Console by Peter Neubauer.

Peter has posted a 5 minute video introduction to the Cypher console.

Imagine a dynamic friend-of-a-friend graph for a Christmas or New Year’s party, updated every 5 minutes and projected on a big screen.
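
Here is a sketch of the kind of query behind such a display, sent from Python with the Neo4j driver. The labels, relationship type and connection details are invented for the example; the Cypher could just as easily be typed into the console Peter demonstrates.

    # Sketch: friends-of-friends of one party guest, suitable for refreshing
    # every few minutes and feeding a projected display. Labels, relationship
    # type and credentials are assumptions for illustration.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    FOF_QUERY = """
    MATCH (g:Guest {name: $name})-[:KNOWS]-(friend)-[:KNOWS]-(fof)
    WHERE fof <> g AND NOT (g)-[:KNOWS]-(fof)
    RETURN DISTINCT fof.name AS suggestion
    """

    with driver.session() as session:
        for record in session.run(FOF_QUERY, name="Ada"):
            print(record["suggestion"])

    driver.close()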

Or you could allow guests to attach comments to the nodes/edges.

Rife with opportunities for humor. 😉

BaseX. The XML Database. [XPath/XQuery]

Filed under: Editor,XML,XQuery — Patrick Durusau @ 11:08 am

BaseX. The XML Database.

From the webpage:

News: BaseX 7.5 has just been released…

BaseX is a very light-weight, high-performance and scalable XML Database engine and XPath/XQuery 3.0 Processor, including full support for the W3C Update and Full Text extensions. An interactive and user-friendly GUI frontend gives you great insight into your XML documents.

Another XML editor but I mention it for its support of XQuery more than as an editor per se.

We continue to lack a standard query language for topic maps and experience with XQuery may prove informative.

Not to mention its possible role in gathering diverse data for presentation in a merged state to users.
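
As a small illustration of that last point, here is a sketch of an XQuery that pulls records from two differently structured documents and returns them in a merged form, run through the Python client module that ships with BaseX. The document and element names, port and credentials are assumptions.

    # Sketch: merge person records from two XML sources on a shared email key
    # and return one combined view. BaseXClient.py ships with the BaseX
    # distribution; place it next to this script. Details are assumptions.
    import BaseXClient

    MERGE_QUERY = """
    for $a in doc('crm.xml')//person
    let $b := doc('mail.xml')//contact[email = $a/email]
    return <person email="{ $a/email }">
             { $a/name, $b/phone }
           </person>
    """

    session = BaseXClient.Session("localhost", 1984, "admin", "admin")
    try:
        print(session.execute("xquery " + MERGE_QUERY))
    finally:
        session.close()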

<ANGLES>

Filed under: Editor,Software,XML — Patrick Durusau @ 10:33 am

<ANGLES>

From the homepage:

ANGLES is a research project aimed at developing a lightweight, online XML editor tuned to the needs of the scholarly text encoding community. By combining the model of intensive code development (the “code sprint”) with participatory design exercises, testing, and feedback from domain experts gathered at disciplinary conferences, ANGLES will contribute not only a working prototype of a new software tool but also another model for tool building in the digital humanities (the “community roadshow”).

Work on ANGLES began in November 2012.

We’ll have something to share very soon!

<ANGLES> is an extension of ACE:

ACE is an embeddable code editor written in JavaScript. It matches the features and performance of native editors such as Sublime, Vim and TextMate. It can be easily embedded in any web page and JavaScript application. ACE is maintained as the primary editor for Cloud9 IDE and is the successor of the Mozilla Skywriter (Bespin) project.

<ANGLES> code at Sourceforge.

I will be interested to see how ACE is extended. Just glancing at it this morning, it appears to be the traditional “display angle bang syntax” editor we all know so well.

What puzzles me is that we have been to the mountain of teaching users to be comfortable with raw XML markup and the results have not been promising.

Contrast that with the experience of OpenOffice, MS Office, etc., which have proven that creating documents that are then expressed in XML is within the range of ordinary users.

<ANGLES> looks like an interesting project but whether it brings XML editing within the reach of ordinary users is an open question.

If the XML editing puzzle is solved, perhaps it will have lessons for topic map editors.

Connecting Splunk and Hadoop

Filed under: Hadoop,Splunk — Patrick Durusau @ 6:23 am

Connecting Splunk and Hadoop by Ledion Bitincka.

From the post:

Finally I am getting some time to write about some cool features of one of the projects that I’ve been working on – Splunk Hadoop Connect. This app is our first step in integrating Splunk and Hadoop. In this post I will cover three tips on how this app can help you, all of them are based on the new search command included in the app: hdfs. Before diving into the tips I would encourage you to download, install and configure the app first. I’ve also put together two screencast videos to walk you through the installation process:

Installation and Configuration for Hadoop Connect
Kerberos Configuration

You can also find the full documentation for the app here

Cool!

Is it just me or is sharing data across applications becoming more common?

Thinking the greater the sharing, the greater the need for mapping data semantics for integration.

Sense

Filed under: ElasticSearch — Patrick Durusau @ 6:09 am

Sense by Boaz Leskes.

From the webpage:

A JSON aware, web based interface to ElasticSearch. Comes with handy machinery such as syntax highlighting, autocomplete, formatting and code folding.

If you are using ElasticSearch, certainly worth a look!

December 20, 2012

edX – Spring 2013

Filed under: CS Lectures,Law — Patrick Durusau @ 8:34 pm

edX – Spring 2013

Of particular interest:

This spring also features Harvard’s Copyright, taught by Harvard Law School professor William Fisher III, former law clerk to Justice Thurgood Marshall and expert on the hotly debated U.S. copyright system, which will explore the current law of copyright and the ongoing debates concerning how that law should be reformed. Copyright will be offered as an experimental course, taking advantage of different combinations and uses of teaching materials, educational technologies, and the edX platform. 500 learners will be selected through an open application process that will run through January 3rd 2013.

An opportunity to use a topic map with complex legal issues and sources.

But CS topics are not being neglected:

In addition to these new courses, edX is bringing back several courses from the popular fall 2012 semester: Introduction to Computer Science and Programming; Introduction to Solid State Chemistry; Introduction to Artificial Intelligence; Software as a Service I; Software as a Service II; Foundations of Computer Graphics.

Node.js, Neo4j, and usefulness of hacking ugly code [Normalization as Presentation?]

Filed under: Graphs,Neo4j,Networks,node-js — Patrick Durusau @ 8:16 pm

Node.js, Neo4j, and usefulness of hacking ugly code by Justin Mandzik.

From the post:

My primary application has a ton of data, even in its infancy. Hundreds of millions of distinct entities (and growing fast), each with many properties, and many relationships. Numbers in the billions start to be really easy to hit, and then that’s still not accounting for organic growth. Most of the data is hierarchical for now, but there’s a need in the near term for arbitrary relationships and the quick traversing thereof. Vanilla MySQL in particular is annoying to work with when it comes to hierarchical data. Moving to Oracle gets us some nicer toys to play with (CONNECT_BY_ROOT and such), but ultimately, the need for a complementary database solution emerges.

NOSQL bake-off

While my non-relational db experience is limited to MongoDB (which I love dearly), a graph db seemed to be the better theoretical fit. Requirements: Manage dense, interconnected data that has to be traversed fast, a query language that supports a root cause analysis use case, and some kind of H.A. plan of attack. Signals of Neo4j, OrientDB, and Titan started emerging from the noise. Randomly, I started in with Neo4j with the intent of repeating the test cases on the other contenders assuming any of the 3 met the requirements (in theory, at least). Neo4j has a GREAT “2 minutes to get up and running” experience. Untar, bin/neo4j start, and go to localhost:7474 and you’re off and running. A decent interface waits for you and you can dive right in.

Proof of concept code for testing Neo4j with project data.

The presumption of normalization in Neo4j continues to nag at me.

The broader the reach for data, the less likely normalization is going to be possible, or affordable if possible in some theoretical sense.

It may be that normalization is a presentation aspect of results. Will have to think about that over the holidays.
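
A rough sketch of what "normalization as presentation" might look like: store records under whatever identifiers they arrive with, and only collapse them into one view when rendering results. All data and identifiers here are invented for the example.

    # Sketch: defer normalization to presentation time. Raw records are kept
    # as-is under their native identifiers; a view merges them on demand.
    RAW = [
        {"id": "mysql:42", "name": "ACME Corp", "phone": "555-0100"},
        {"id": "crm:a-17", "name": "Acme Corporation", "fax": "555-0199"},
    ]

    SAME_AS = {"mysql:42": "acme", "crm:a-17": "acme"}  # mapping discovered elsewhere

    def present(subject):
        """Build a merged, 'normalized' view only when someone asks for it."""
        view = {}
        for record in RAW:
            if SAME_AS.get(record["id"]) == subject:
                for key, value in record.items():
                    if key != "id":
                        view.setdefault(key, value)  # first value wins; keep it simple
        return view

    print(present("acme"))  # {'name': 'ACME Corp', 'phone': '555-0100', 'fax': '555-0199'}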

Apache Hadoop: Seven Predictions for 2013 [Topic Maps in 2013 Predictions?]

Filed under: Hadoop — Patrick Durusau @ 8:03 pm

Apache Hadoop: Seven Predictions for 2013 by Herb Cunitz.

From the post:

At Thanksgiving we took a moment to reflect on the past and give thanks for all that has happened to Hortonworks the past year. With the New Year approaching we now take time to look forward and provide our predictions for the Hadoop community in 2013. To compile this list, we queried and collected big data from our team of Hadoop committers and members of the community.

We asked a few luminaries as well and we surfaced many expert opinions and while we had our hearts set on five predictions, we ended up with SEVEN. So, without further ado, here are the Top 7 Predictions for Hadoop in 2013.

These are just the first predictions I have seen for 2013. I am sure there have been others and there will be lots between now and year’s end.

Assuming we all make it past the 21st of December, 2012, ;-), any suggestions for topic maps in 2013?

Maps in R: Introduction – Drawing the map of Europe

Filed under: Mapping,Maps,R — Patrick Durusau @ 7:36 pm

Maps in R: Introduction – Drawing the map of Europe by Max Marchi.

How many lines of R code to draw a map of Europe?

Write down your answer.

Now follow the link to the original post.

Close? Far away?

Splunk SDKs for Java and Python

Filed under: BigData,Splunk — Patrick Durusau @ 7:07 pm

Splunk Announces New Development Tools to Extend the Power of Splunk Software

From the post:

Splunk Inc. (NASDAQ: SPLK), the leading software platform for real-time operational intelligence, today announced the general availability (GA) of new software development kits (SDKs) for Java and Python. SDKs make it easier for developers to customize and extend the power of Splunk® Enterprise, enabling real-time big data insights across the organization. Splunk previously released the GA version of the Splunk SDK for JavaScript for Splunk Enterprise 5. The Splunk SDK for PHP is in public preview.

“Our mission at Splunk is to lower the barriers for organizations to gain operational intelligence from machine data,” said Paul Sanford, general manager of developer platform, Splunk. “We want to empower developers to build big data applications on the Splunk platform and to understand that you don’t need large-scale development efforts to get big value. That’s a key driver behind the development of these SDKs, helping developers quickly get started with Splunk software, leveraging their existing language skills and driving rapid time to value.”

“Building a developer community around a software platform requires a strong commitment to a low barrier to entry. This applies to every step of the adoption process, from download to documentation to development. Splunk’s focus on SDKs for some of the most popular programming languages, with underlying REST-based APIs, supports its commitment to enabling software developers to easily build applications,” said Donnie Berkholz, Ph.D., IT industry analyst, RedMonk.

Just in time for the holidays!
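
If you want to kick the tires on the Python SDK, something like this minimal sketch should be close. Host, port and credentials are assumptions; the one-shot search against the _internal index is just a convenient smoke test.

    # Sketch: connect with the Splunk SDK for Python and run a one-shot search.
    # Connection details below are assumptions for a local Splunk Enterprise install.
    import splunklib.client as client
    import splunklib.results as results

    service = client.connect(
        host="localhost", port=8089,
        username="admin", password="changeme")

    stream = service.jobs.oneshot("search index=_internal | head 5")
    for event in results.ResultsReader(stream):
        print(event)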

Downloads:

Splunk Enterprise

Splunk SDK for Java

Splunk SDK for JavaScript

Splunk SDK for PHP

Splunk SDK for Python

I first saw this in a tweet by David Fauth.

Integration of Information Workbench with Stanbol: Public Demo Available

Filed under: CMS,Stanbol — Patrick Durusau @ 6:57 pm

Integration of Information Workbench with Stanbol: Public Demo Available

From the post:

We are happy to announce a public demo showcasing the integration of Apache Stanbol into the Information Workbench platform by fluid Operations AG. This integration is the result of fluid Operations’ participation in the IKS Early Adopters program.

With the help of Apache Stanbol enhancement engines, the Information Workbench is able to enrich free-text content with references to semantic data instances. This enables advanced data management capabilities to Information Workbench users in the area of semantic content management and publishing: both making it easier to organize internal data by linking structured and free-text content as well as assisting in content authoring and publishing.

Our public demo illustrates these capabilities by presenting a competitive intelligence scenario. In this demo, a collection of documents is annotated with relevant DBpedia entities representing companies, people, and locations mentioned in these documents. These annotations are used to browse the document collection and visualize it using different widgets: e.g., presenting mentioned locations using Google Maps and number of entity mentions with charts. In addition, Stanbol content enhancement is used to enrich information imported from external Web sources: particularly, abstracts of relevant news articles accessed via the New York Times Article Search API.

A promising demonstration of Apache Stanbol.
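
For anyone who wants to try the same enhancement step locally, the sketch below posts a sentence to a Stanbol enhancer endpoint with the requests library. The URL assumes a default local Stanbol install; adjust to your setup.

    # Sketch: ask a (local) Apache Stanbol enhancer to annotate free text with
    # entity references. The endpoint URL is an assumption for a default install.
    import requests

    text = "Luca Cordero di Montezemolo is the president of Ferrari."

    resp = requests.post(
        "http://localhost:8080/enhancer",
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain", "Accept": "application/json"},
    )
    resp.raise_for_status()
    print(resp.json())  # enhancement structure with entity annotations (e.g. DBpedia URIs)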

I was less impressed with the content.

Take one of the top ten companies being tracked, Ferrari.

In the Information Workbench, the Ferrari entry displays a Google map to your right, marking a location in Italy. I suspect I know the meaning of that location on the map but some reassurance on that score would be nice.

The “relevant news” includes “Italy’s Premier Refuses To Commit to Running,” rather puzzling for Ferrari until you read more of the story to find: “Luca Cordero di Montezemolo, the president of Ferrari who started a civic movement last month and said it would endorse Mr. Monti.”

On the other hand, DBpedia may be so coarse that searches based upon it are on par with the average search engine.

I applaud the early use of Stanbol but stronger data sources are going to be required for interesting results.

Best Buy Product Catalog via Semantic Endpoints

Filed under: Linked Data,RDF — Patrick Durusau @ 2:31 pm

Announcing BBYOpen Metis Alpha: Best Buy Product Catalog via Semantic Endpoints

From the post:

Announcing BBYOpen Metis Alpha: Best Buy Product Catalog via Semantic Endpoints

These days, consumers have a rich variety of products available at their fingertips. A massive product landscape has evolved, but sadly products in this enormous and rich landscape often get flattened to just a price tag. Over time, it seems the product value proposition, variety, descriptions, specifics, and details that make up products have all but disappeared. This presents consumers with a "paradox of choice" where misinformed decisions can lead to poor product selections, and ultimately product returns and customer remorse.

To solve this problem, BBY Open is excited to announce the first phase Alpha release of Metis, our semantically-driven product insight engine. As part of a phased release approach, this first release consists of publishing all 500K+ of our active Best Buy products with reviews as RDF-enabled endpoints for public consumption.

This alpha release is the first phase in solving this product ambiguity. With the publishing of structured product data in RDF format using industry accepted product ontologies like GoodRelations, standards from the Semantic Web group at the W3C, and the NetKernel platform, the Metis Alpha gives developers the ability to consume and query structured data via SPARQL (get up to speed with Learning SPARQL by Bob DuCharme), enabling the discovery of insight hidden deep inside the product catalog.

Comments?

D3: IED Attacks in Afghanistan Explained

Filed under: D3,Graphics,Visualization — Patrick Durusau @ 11:47 am

D3: IED Attacks in Afghanistan Explained

A walk through the use of D3, jQuery and jQuery UI to create an interactive display of IED attacks in Afghanistan.

Adaptable to other time/location based data sets.

The working example: IED Attacks in Afghanistan (2004-2009)

The Department of Political Science, Ohio State University, hosts several D3 tutorials:

Data Visualization: Learning D3 (https://secure.polisci.ohio-state.edu/faq/d3.php)

I first saw this in a tweet by Christophe Viau.

Web-Based Visualization Part 1: The D3.js Key Concept

Filed under: D3,Graphics,Visualization — Patrick Durusau @ 11:24 am

Web-Based Visualization Part 1: The D3.js Key Concept by Stephen Thomas.

From the post:

Those of us in the user interface community often obsess over the smallest details: shifting an element one pixel to the left or finding the perfect transparency setting for our drop shadows. Such an obsession is well and good; it can lead to a superior user experience. But sometimes perhaps we should take a step back from the details. In many cases, our interfaces exist to provide understanding. If the user doesn’t understand the information, then it doesn’t matter how pretty the interface looks. Fortunately, there’s an entire science devoted to understanding information, the science of data visualization. And on the web there is one particular tool that its practitioners rely on far more than any other—the excellent D3.js library from Mike Bostock. In a series of posts, we’ll take a close look at this library and its enormous power.

Stephen disagrees with the view that D3.js is difficult to learn and sets a goal to get readers on the way to using D3.js after fifteen minutes of reading.

How did you fare?

I first saw this in a tweet by Christophe Viau.

December 19, 2012

Is Your Information System “Sticky?”

Filed under: Citation Analysis,Citation Indexing,Citation Practices,Marketing,Topic Maps — Patrick Durusau @ 11:41 am

In “Put This On My List…” Michael Mitzenmacher writes:

Put this on my list of papers I wish I had written: Manipulating Google Scholar Citations and Google Scholar Metrics: simple, easy and tempting. I think the title is sufficiently descriptive of the content, but the idea was they created a fake researcher and posted fake papers on a real university web site to inflate citation counts for some papers. (Apparently, Google scholar is pretty “sticky”; even after the papers came down, the citation counts stayed up…)

The traditional way to boost citations is to re-arrange the order of the authors on the same paper, then re-publish it.

Gaming citation systems isn’t news, although the Google Scholar Citations paper demonstrates that it has become easier.

For me the “news” part was the “sticky” behavior of Google’s information system, retaining the citation counts even after the fake documents were removed.

Is your information system “sticky?” That is, does it store information as “static” data that isn’t dependent on other data?

If it does, you and anyone who uses your data is running the risk of using stale or even incorrect data. The potential cost of that risk depends on your industry.

For legal, medical, banking and similar industries, the potential liability argues against assuming recorded data is current and valid data.

Representing critical data as a topic with constrained (TMCL) occurrences that must be present is one way to address this problem with a topic map.

If a constrained occurrence is absent, the topic in question fails the TMCL constraint and so can be reported as an error.

I suspect you could duplicate that behavior in a graph database.

When you query for a particular node (read “fact”), check to see if all the required links are present. Not as elegant as invalidation by constraint but should work.
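
A back-of-the-envelope version of that check, with the graph held as a simple adjacency structure. All node and link names are invented for the example.

    # Sketch: flag a node as suspect if any required link type is missing.
    REQUIRED_LINKS = {"citation": {"cited-by", "published-in", "authored-by"}}

    graph = {
        "paper-42": {"type": "citation",
                     "links": {"cited-by": ["paper-7"], "published-in": ["venue-3"]}},
    }

    def validate(node_id):
        node = graph[node_id]
        required = REQUIRED_LINKS.get(node["type"], set())
        missing = required - set(node["links"])
        return missing  # empty set means the "fact" still has all its support

    print(validate("paper-42"))  # {'authored-by'} -> report as stale/suspect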

TSD 2013: 16th International Conference on Text, Speech and Dialogue

Filed under: Conferences,Natural Language Processing,Texts — Patrick Durusau @ 10:50 am

TSD 2013: 16th International Conference on Text, Speech and Dialogue

Important Dates:

When Sep 1, 2013 – Sep 5, 2013
Where Plzen (Pilsen), Czech Republic
Submission Deadline Mar 31, 2013
Notification Due May 12, 2013
Final Version Due Jun 9, 2013

Subjects for submissions:

  • Speech Recognition
    —multilingual, continuous, emotional speech, handicapped speaker, out-of-vocabulary words, alternative way of feature extraction, new models for acoustic and language modelling,
  • Corpora and Language Resources
    —monolingual, multilingual, text, and spoken corpora, large web corpora, disambiguation, specialized lexicons, dictionaries,
  • Speech and Spoken Language Generation
    —multilingual, high fidelity speech synthesis, computer singing,
  • Tagging, Classification and Parsing of Text and Speech
    —multilingual processing, sentiment analysis, credibility analysis, automatic text labeling, summarization, authorship attribution,
  • Semantic Processing of Text and Speech
    —information extraction, information retrieval, data mining, semantic web, knowledge representation, inference, ontologies, sense disambiguation, plagiarism detection,
  • Integrating Applications of Text and Speech Processing
    —machine translation, natural language understanding, question-answering strategies, assistive technologies,
  • Automatic Dialogue Systems
    —self-learning, multilingual, question-answering systems, dialogue strategies, prosody in dialogues,
  • Multimodal Techniques and Modelling
    —video processing, facial animation, visual speech synthesis, user modelling, emotion and personality modelling.

It was at TSD 2012 that I found Ruslan Mitkov’s presentation: Coreference Resolution: to What Extent Does it Help NLP Applications? So, highly recommended!

December 18, 2012

Google Imagines a Real World That’s as Irritating as the Internet

Filed under: Design,Humor,Interface Research/Design — Patrick Durusau @ 5:43 pm

Google Imagines a Real World That’s as Irritating as the Internet by Rebecca Cullers.

From the post:

Google Analytics has put together a series of videos demonstrating what poor web design can do to an online commerce site—crap we’d never put up with in a brick-and-mortar store. There’s unintuitive search and site design that prevents you from finding the item you’re looking for—in this case, it’s a grocery store that makes it impossible to find an everyday item as simple as milk. There’s the obnoxious online checkout, where you’re forced to log in, agree to terms and prove you’re a real person before you get timed out, forcing you to start all over again. Then there’s a misplaced dig at Amazon’s highly successful, often copied suggestion of other items you might like. Produced in-house by Google Creative Lab, all the spots have the absurdity of a Monty Python skit. It seems weird for Google to be dissing online search and e-commerce, but here it serves the greater goal of telling people to learn more about their customers via Analytics. And in this case, it’s funny cause it’s true.

I won’t even attempt to describe the videos.

You will have to hold onto your chair to remain upright.

Seriously, they capture the essence of bad online shopping experiences.

Or should I say user interfaces?

Bio-Linux 7 – Released November 2012

Filed under: Bio-Linux,Bioinformatics,Biomedical,Linux OS — Patrick Durusau @ 5:24 pm

Bio-Linux 7 – Released November 2012

From the webpage:

Bio-Linux 7 is a fully featured, powerful, configurable and easy to maintain bioinformatics workstation. Bio-Linux provides more than 500 bioinformatics programs on an Ubuntu Linux 12.04 LTS base. There is a graphical menu for bioinformatics programs, as well as easy access to the Bio-Linux bioinformatics documentation system and sample data useful for testing programs.

Bio-Linux 7 adds many improvements over previous versions, including the Galaxy analysis environment.  There are also various packages to handle new generation sequence data types.

You can install Bio-Linux on your machine, either as the only operating system, or as part of a dual-boot setup which allows you to use your current system and Bio-Linux on the same hardware.

Bio-Linux also runs Live from the DVD or a USB stick. This runs in the memory of your machine and does not involve installing anything. This is a great, no-hassle way to try out Bio-Linux, demonstrate or teach with it, or to work with when you are on the move.

Bio-Linux is built on open source systems and software, and so is free to install and use. See What’s new on Bio-Linux 7. Also, check out the 2006 paper on Bio-Linux and open source systems for biologists.

Useful for exploring bioinformatics tools for Ubuntu.

But useful as well for considering how those tools could be used in data/text mining for other domains.

Not to mention the packaging for installation to DVD or USB stick.

Are there any topic map engines that are set up for burning to DVD or USB stick?

Packaging them that way with more than a minimal set of maps and/or data sets might be a useful avenue to explore.

Superlinear Indexes

Filed under: Indexing — Patrick Durusau @ 4:33 pm

Superlinear Indexes

From the webpage:

Multidimensional and String Indexes for Streaming Data

The Superlinear Index project is investigating data structures and algorithms for maintaining superlinear indexes on out-of-core storage (such as disk drives) with high incoming data rates. To understand what a superlinear index is, consider a linear index, which provides a total order on keys. A superlinear index is more complex than a total order. Examples of superlinear indexes include multidimensional indexes and full-text indexes.

A number of publications but none in 2012.

I will be checking to see if the project is still alive.

The value of typing code

Filed under: Education,Programming,Teaching,Topic Maps — Patrick Durusau @ 3:53 pm

The value of typing code by John D. Cook.

John points to a blog post by Tommy Nicholas that reads in part:

When Hunter S. Thompson was working as a copy boy at Time Magazine in 1959, he spent his spare time typing out the entire Great Gatsby by F. Scott Fitzgerald and A Farewell to Arms by Ernest Hemingway in order to better understand what it feels like to write a great book. To be able to feel the author’s turns in logic and storytelling weren’t possible from reading the books alone, you had to feel what it feels like to actually create the thing. And so I have found it to be with coding.

Thompson’s first book, Hell’s Angels: a strange and terrible saga was almost a bible to me in middle school, but I don’t know that he ever captured writing “a great book.” There or in subsequent books. Including the scene where he describes himself as clawing at the legs of Edmund Muskie before Muskie breaks down in tears. Funny, amusing, etc. but too period bound to be “great.”

On the other hand, as an instructional technique, what do you think about disabling cut-n-paste in a window so students have to re-type a topic map and perhaps add material to it at the same time?

Something beyond toy examples although with choices so students could pick one with enough interest for them to do the work.

The Flâneur Approach to User Experience Design

Filed under: Interface Research/Design,Usability — Patrick Durusau @ 3:28 pm

The Flâneur Approach to User Experience Design by Sarah Doody.

The entire article is a delight but Sarah’s take on how to prepare ourselves for random insights resonates with me:

So, how can we prepare our minds to recognize and respond to moments of random insight? Turns out the French may have an answer: flâner, a verb that means “to stroll.” Derived from this verb is the noun flâneur, a person who would stroll, lounge, or saunter about on the streets of Paris.

Throughout the 16th and 17th centuries, a flâneur was regarded as somewhat lazy, mindless, and loafing. However, in the 19th century a new definition of the word emerged that captures the essence of what I believe makes a great user experience designer.

By this definition, a flâneur is more than just an aimless wanderer. The flâneur’s mind is always in a state of observation. He or she connects the dots through each experience and encounter that comes his or her way. The flâneur is in constant awe of his surroundings. In the article “In Search Of Serendipity” for The Economist’s Intelligent Life Magazine, Ian Leslie writes that a flâneur is someone who “wanders the streets with purpose, but without a map.”

I rather like that image, “wanders the streets with purpose, but without a map.”

I always start the day with things I would like to blog about but some (most?) days the keyboard just gets away from me. 😉

I haven’t kept score but my gut feeling is that I have discovered more things while looking for something else than following a straight and narrow path.

You?

(See Sarah’s post for the qualities needed to have a prepared mind.)

Coreference Resolution Tools : A first look

Filed under: Coreference Resolution,Natural Language Processing — Patrick Durusau @ 2:10 pm

Coreference Resolution Tools : A first look by Sharmila G Sivakumar.

From the post:

Coreference is where two or more noun phrases refer to the same entity. This is an integral part of natural languages to avoid repetition, demonstrate possession/relation etc.

Eg: Harry wouldn’t bother to read “Hogwarts: A History” as long as Hermione is around. He knows she knows the book by heart.

The different types of coreference includes:
Noun phrases: Hogwarts A history <- the book
Pronouns : Harry <- He
Possessives : her, his, their
Demonstratives: This boy

Coreference resolution or anaphor resolution is determining what an entity is referring to. This has profound applications in nlp tasks such as semantic analysis, text summarisation, sentiment analysis etc.

In spite of extensive research, the number of tools available for CR and level of their maturity is much less compared to more established nlp tasks such as parsing. This is due to the inherent ambiguities in resolution.

A bit dated (2010) now but a useful starting point for updating. (For medical records specifically, see: Evaluating the state of the art in coreference resolution for electronic medical records.) Other references you would recommend?

Sharmila goes on to compare the results of using the tools on a set text so you can get a feel for the tools.

Capturing/Defining/Interchanging Coreference Resolutions (Topic Maps!)

Filed under: Coreference Resolution,Natural Language Processing — Patrick Durusau @ 1:46 pm

While listening to Ruslan Mitkov’s presentation, Coreference Resolution: to What Extent Does it Help NLP Applications?, the thought occurred to me that coreference resolution lies at the core of topic maps.

A topic map can:

  • Capture a coreference resolution in one representative by merging it with another representative that “pick out the same referent.”
  • Define a coreference resolution by defining representatives that “pick out the same referent.”
  • Interchange coreference resolutions by defining the representation of referents that “pick out the same referent.”

Not to denigrate associations or occurrences, but they depend upon the presence of topics, that is, representatives that “pick out a referent.”

Merged topics being two or more topics that individually “picked out the same referent,” perhaps using different means of identification.

Rather than starting every coreference resolution application at zero, to test its algorithmic prowess, a topic map could easily prime the pump, as it were, with known coreference resolutions.

Enabling coreference resolution systems to accumulate resolutions, much as human users do.*

*This may be useful because coreference resolution is a recognized area of research in computational linguistics, unlike topic maps.
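
A toy sketch of that "pump priming": topics are merged whenever they share an identifier, and the merged result answers coreference questions as known resolutions. All names and identifiers are invented for the example.

    # Sketch: accumulate known coreference resolutions as merged topics,
    # keyed by whatever identifiers "pick out the same referent".
    from collections import defaultdict

    class TopicStore:
        def __init__(self):
            self._by_identifier = {}
            self._members = defaultdict(set)
            self._next = 0

        def add(self, identifiers):
            """Merge a representative into any topic sharing an identifier.
            (A fuller version would also fold together two existing topics
            that a new representative bridges.)"""
            hit = next((self._by_identifier[i] for i in identifiers
                        if i in self._by_identifier), None)
            if hit is None:
                hit = self._next
                self._next += 1
            for i in identifiers:
                self._by_identifier[i] = hit
                self._members[hit].add(i)
            return hit

        def corefer(self, a, b):
            """Known resolution: do these two mentions pick out the same referent?"""
            ta, tb = self._by_identifier.get(a), self._by_identifier.get(b)
            return ta is not None and ta == tb

    store = TopicStore()
    store.add({"Hermione", "Hermione Granger"})
    store.add({"Hermione Granger", "the brightest witch of her age"})
    print(store.corefer("Hermione", "the brightest witch of her age"))  # True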

Coreference Resolution: to What Extent Does it Help NLP Applications?

Coreference Resolution: to What Extent Does it Help NLP Applications? by Ruslan Mitkov. (presentation – audio only)

The paper from the same conference:

Coreference Resolution: To What Extent Does It Help NLP Applications? by Ruslan Mitkov, Richard Evans, Constantin Orăsan, Iustin Dornescu, Miguel Rios. (Text, Speech and Dialogue, 15th International Conference, TSD 2012, Brno, Czech Republic, September 3-7, 2012. Proceedings, pp. 16-27)

Abstract:

This paper describes a study of the impact of coreference resolution on NLP applications. Further to our previous study [1], in which we investigated whether anaphora resolution could be beneficial to NLP applications, we now seek to establish whether a different, but related task—that of coreference resolution, could improve the performance of three NLP applications: text summarisation, recognising textual entailment and text classification. The study discusses experiments in which the aforementioned applications were implemented in two versions, one in which the BART coreference resolution system was integrated and one in which it was not, and then tested in processing input text. The paper discusses the results obtained.

In the presentation and in the paper, Mitkov distinguishes between anaphora and coreference resolution (from the paper):

While some authors use the terms coreference (resolution) and anaphora (resolution) interchangeably, it is worth noting that they are completely distinct terms or tasks [3]. Anaphora is cohesion which points back to some previous item, with the ‘pointing back’ word or phrase called an anaphor, and the entity to which it refers, or for which it stands, its antecedent. Coreference is the act of picking out the same referent in the real world. A specific anaphor and more than one of the preceding (or following) noun phrases may be coreferential, thus forming a coreferential chain of entities which have the same referent.

I am not sure why the “real world” is necessary in: “Coreference is the act of picking out the same referent in the real world.”

For topic maps, I would shorten it to: Coreference is the act of picking out the same referent. (full stop)

The paper is a useful review of coreference systems and quite unusually, reports a negative result:

This study sought to establish whether or not coreference resolution could have a positive impact on NLP applications, in particular on text summarisation, recognising textual entailment, and text categorisation. The evaluation results presented in Section 6 are in line with previous experiments conducted both by the present authors and other researchers: there is no statistically significant benefit brought by automatic coreference resolution to these applications. In this specific study, the employment of the coreference resolution system distributed in the BART toolkit generally evokes slight but not significant increases in performance and in some cases it even evokes a slight deterioration in the performance results of these applications. We conjecture that the lack of a positive impact is due to the success rate of the BART coreference resolution system which appears to be insufficient to boost performance of the aforementioned applications.

My conjecture is that topic maps can boost coreference resolution enough to improve performance of NLP applications, including text summarisation, recognising textual entailment, and text categorisation.

What do you think?

How would you suggest testing that conjecture?

A Python Compiler for Big Data

Filed under: Arrays,Python — Patrick Durusau @ 6:41 am

A Python Compiler for Big Data by Stephen Diehl.

From the post:

Blaze is the next generation of NumPy, Python’s extremely popular array library. At Continuum Analytics we aim to tackle some of the hardest problems in large data analytics with our Python stack of Numba and Blaze, which together will form the basis of a distributed computation and storage system that is simultaneously able to generate optimized machine code specialized to the data being operated on.

Blaze aims to extend the structural properties of NumPy arrays to a wider variety of table and array-like structures that support commonly requested features such as missing values, type heterogeneity, and labeled arrays.

(images omitted)

Unlike NumPy, Blaze is designed to handle out-of-core computations on large datasets that exceed the system memory capacity, as well as on distributed and streaming data. Blaze is able to operate on datasets transparently as if they behaved like in-memory NumPy arrays.

We aim to allow analysts and scientists to productively write robust and efficient code, without getting bogged down in the details of how to distribute computation, or worse, how to transport and convert data between databases, formats, proprietary data warehouses, and other silos.

Just a thumbnail sketch but enough to get you interested in learning more.
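
Blaze itself is young, so rather than guess at its API, here is the NumPy-level problem it generalizes: with numpy.memmap you can already operate on an on-disk array in chunks without loading it all into memory, which is exactly the bookkeeping Blaze aims to make transparent. The file name and sizes are arbitrary.

    # Sketch: chunked, out-of-core mean over an on-disk array via numpy.memmap.
    # Blaze's goal (as described above) is to hide this kind of manual chunking.
    import numpy as np

    # Create a modest on-disk array for the example; imagine it far exceeding RAM.
    data = np.memmap("big.dat", dtype="float64", mode="w+", shape=(10_000_000,))
    data[:] = np.random.random(data.shape)
    data.flush()

    chunk = 1_000_000
    total, count = 0.0, 0
    for start in range(0, data.shape[0], chunk):
        block = data[start:start + chunk]      # only this slice is touched on disk
        total += block.sum()
        count += block.size

    print("mean:", total / count)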

…Efficient Approximate Data De-Duplication in Streams [Approximate Merging?]

Filed under: Bloom Filters,Deduplication,Stream Analytics — Patrick Durusau @ 6:35 am

Advanced Bloom Filter Based Algorithms for Efficient Approximate Data De-Duplication in Streams by Suman K. Bera, Sourav Dutta, Ankur Narang, Souvik Bhattacherjee.

Abstract:

Applications involving telecommunication call data records, web pages, online transactions, medical records, stock markets, climate warning systems, etc., necessitate efficient management and processing of such massively exponential amount of data from diverse sources. De-duplication or Intelligent Compression in streaming scenarios for approximate identification and elimination of duplicates from such unbounded data stream is a greater challenge given the real-time nature of data arrival. Stable Bloom Filters (SBF) addresses this problem to a certain extent.

In this work, we present several novel algorithms for the problem of approximate detection of duplicates in data streams. We propose the Reservoir Sampling based Bloom Filter (RSBF) combining the working principle of reservoir sampling and Bloom Filters. We also present variants of the novel Biased Sampling based Bloom Filter (BSBF) based on biased sampling concepts. We also propose a randomized load balanced variant of the sampling Bloom Filter approach to efficiently tackle the duplicate detection. In this work, we thus provide a generic framework for de-duplication using Bloom Filters. Using detailed theoretical analysis we prove analytical bounds on the false positive rate, false negative rate and convergence rate of the proposed structures. We exhibit that our models clearly outperform the existing methods. We also demonstrate empirical analysis of the structures using real-world datasets (3 million records) and also with synthetic datasets (1 billion records) capturing various input distributions.

If you think of more than one representative for a subject as “duplication,” then merging is a special class of “deduplication.”

Deduplication that discards redundant information but that preserves unique additional information and relationships to other subjects.

As you move away from static and towards transient topic maps, representations of subjects in real time data streams, this and similar techniques will become very important.
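
For a feel of the underlying machinery, here is a bare-bones Bloom filter used to screen a stream for probable duplicates. The sizes and hash counts are arbitrary; the paper's RSBF/BSBF variants add sampling and ageing on top of this basic idea.

    # Sketch: minimal Bloom filter for approximate duplicate detection in a stream.
    import hashlib

    class BloomFilter:
        def __init__(self, size_bits=1 << 20, num_hashes=4):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, item):
            for i in range(self.num_hashes):
                digest = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.size

        def seen(self, item):
            """Return True if item was probably seen before, then record it."""
            hit = True
            for pos in self._positions(item):
                byte, bit = divmod(pos, 8)
                if not self.bits[byte] & (1 << bit):
                    hit = False
                    self.bits[byte] |= 1 << bit
            return hit

    bf = BloomFilter()
    stream = ["rec-1", "rec-2", "rec-1", "rec-3"]
    print([rec for rec in stream if not bf.seen(rec)])  # ['rec-1', 'rec-2', 'rec-3']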

I first saw this in a tweet from Stefano Bertolo.

PS: A new equivalent term (to me) for deduplication: “intelligent compression.” Pulls about 46K+ “hits” in a popular search engine today. May want to add it to your routine search queries.

