Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

July 7, 2012

On the origin of long-range correlations in texts

Filed under: Natural Language Processing,Text Analytics — Patrick Durusau @ 2:53 pm

On the origin of long-range correlations in texts by Eduardo G. Altmann, Giampaolo Cristadoro, and Mirko Degli Esposti.

Abstract:

The complexity of human interactions with social and natural phenomena is mirrored in the way we describe our experiences through natural language. In order to retain and convey such a high dimensional information, the statistical properties of our linguistic output has to be highly correlated in time. An example are the robust observations, still largely not understood, of correlations on arbitrary long scales in literary texts. In this paper we explain how long-range correlations flow from highly structured linguistic levels down to the building blocks of a text (words, letters, etc..). By combining calculations and data analysis we show that correlations take form of a bursty sequence of events once we approach the semantically relevant topics of the text. The mechanisms we identify are fairly general and can be equally applied to other hierarchical settings.

Another area of arXiv.org, Physics > Data Analysis, Statistics and Probability, to monitor. 😉

The authors used ten (10) novels from Project Gutenberg:

  • Alice’s Adventures in Wonderland
  • The Adventures of Tom Sawyer
  • Pride and Prejudice
  • Life on the Mississippi
  • The Jungle
  • The Voyage of the Beagle
  • Moby Dick; or The Whale
  • Ulysses
  • Don Quixote
  • War and Peace

Interesting research that will take a while to digest, but I have to wonder: why these ten (10) novels?

Or perhaps better, in an age of “big data,” why only ten (10)?

Why not the entire corpus of Project Gutenberg?

Or perhaps the texts of Wikipedia in its multitude of languages?

My reasoning: if the results represent an insight about natural language, they should be applicable beyond English. Yes?
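To make “bursty” concrete, here is a toy sketch, mine and not the authors’ method, that scores how bursty a single word’s occurrences are using the Goh–Barabási coefficient over inter-occurrence gaps. The file name is a hypothetical placeholder; any Project Gutenberg plain-text novel will do.

```python
# Toy sketch: how "bursty" are a word's occurrences in a text?
# NOT the paper's analysis, just an illustration of the idea that
# semantically relevant words cluster in time. Burstiness follows
# Goh & Barabasi: B = (sigma - mu) / (sigma + mu) over the gaps
# between occurrences. B near 1 ~ bursty, near 0 ~ Poisson-random,
# near -1 ~ perfectly regular.

import re
import statistics

def burstiness(text: str, word: str) -> float:
    tokens = re.findall(r"[a-z']+", text.lower())
    positions = [i for i, t in enumerate(tokens) if t == word]
    if len(positions) < 3:
        raise ValueError("too few occurrences to estimate burstiness")
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    mu = statistics.mean(gaps)
    sigma = statistics.pstdev(gaps)
    return (sigma - mu) / (sigma + mu)

with open("moby_dick.txt", encoding="utf-8") as f:  # hypothetical file name
    text = f.read()

print("whale:", burstiness(text, "whale"))  # topical word: expect bursty
print("the:  ", burstiness(text, "the"))    # function word: expect near 0
```

Topical words like “whale” should score noticeably burstier than function words like “the,” which is the intuition behind the paper’s semantic claim.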

If this is your area, comments and suggestions would be most welcome.

Subverting Ossified Departments [Moving beyond name calling]

Filed under: Analytics,Business Intelligence,Marketing,Topic Maps — Patrick Durusau @ 10:21 am

Brian Sommer has written on why analytics will not lead to new revenue streams, improved customer service, better stock options or other signs of salvation:

The Ossified Organization Won’t ‘Get’ Analytics (part 1 of 3)

How Tough Will Analytics Be in Ossified Firms? (Part 2 of 3)

Analytics and the Nimble Organization (part 3 of 3)

Why most firms won’t profit from analytics:

… Every day, companies already get thousands of ideas for new products, process innovations, customer interaction improvements, etc. and they fail to act on them. The rationale for this lack of movement can be:

– That’s not the way we do things here

– It’s a good idea but it’s just not us

– It’s too big of an idea

– It will be too disruptive

– We’d have to change so many things

– I don’t know who would be responsible for such a change

And, of course,

– It’s not my job

So if companies don’t act on the numerous, free suggestions from current customers and suppliers, why are they so deluded into thinking that IT-generated, analytic insights will actually fare better? They’re kidding themselves.

[part 1]

What Brian describes in amusing and great detail are failures that no amount of IT, analytics or otherwise, can address. This is not a technology problem. It is not even an organization (as in form) issue.

It is a personnel issue. You can either retrain (which I find unlikely to succeed) or you can get new personnel. It really is that simple. And with a glutted IT market, now would be the time to recruit an IT department not wedded to current practices. But you would need to do the same in accounting, marketing, management, etc.

But calling a department “ossified” is just name calling. You have to move beyond name calling to establish a bottom-line reason for change.

Assuming you have access, topic maps can help you integrate data across departments that don’t usually interchange data. So you can make the case for particular changes in terms of bottom-line expenses.

Here is a true story with the names omitted and the context changed a bit:

Assume you are a publisher of journals, with both institutional and personal subscriptions. One of the things that all periodical publishers have to address is claims for “missing” issues. It happens: mail room mistakes, postal system errors, copies simply lost in transit, etc. Subscribers send in claims for those missing issues.

Some publishers maintain records of all subscriptions, including any correspondence, which are consulted by some full-time staffer who answers all “claim” requests. One argument being that there is a moral obligation to make sure non-subscribers don’t get an issue to which they are not entitled. Seriously, I have heard that argument made.

Analytics and topic maps could combine the subscription records with claim records and expenses for running the claims operation to show the expense of detailed claim service. Versus the cost of having the mail room toss another copy back to the requester. (Our printing cost was $3.00/copy so the math wasn’t the hard part.)
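If you want to see how short the math is, here is a back-of-the-envelope sketch. Every number except the $3.00/copy printing cost is a hypothetical placeholder; plug in your own.

```python
# Back-of-the-envelope comparison of claims policies. All figures except
# the $3.00/copy printing cost (from the story above) are hypothetical.

claims_per_year = 2_000          # hypothetical claim volume
staffer_cost = 45_000.00         # hypothetical salary + overhead, claims staffer
records_overhead = 5_000.00      # hypothetical cost of the record-keeping system
print_cost_per_copy = 3.00       # from the story above

verify_every_claim = staffer_cost + records_overhead
just_resend = claims_per_year * print_cost_per_copy  # mail room tosses a copy back

print(f"verify every claim: ${verify_every_claim:,.2f}/year")
print(f"just resend a copy: ${just_resend:,.2f}/year")
print(f"savings:            ${verify_every_claim - just_resend:,.2f}/year")
```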

Topic maps help integrate the data you “obtain” from other departments. Just enough to make your point. You don’t have to integrate all the data, just enough to win the argument. Until the next argument comes along and you take a bit bigger bite of the apple.

Agile organizations are run by people agile enough to take control of them.

You can wait for permission from an ossified organization or you can use topic maps to take the first “bite.”

Your move.

PS: If you have investments in journal publishing you might want to check on claims handling.

Hash Tables: Introduction

Filed under: Hashing,Teaching — Patrick Durusau @ 9:21 am

Hash Tables: Introduction

Randy Gaul has written a nice introduction to hash tables, in part to learn about hash tables.
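If you want the flavor without leaving this page, here is a minimal separate-chaining hash table in Python. A sketch of the general technique, not Randy’s code:

```python
# Minimal separate-chaining hash table: an illustrative sketch, not
# Randy Gaul's implementation. Each bucket is a list of (key, value)
# pairs; the table doubles its capacity when the load factor climbs.

class HashTable:
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.size = 0
        self.buckets = [[] for _ in range(capacity)]

    def _bucket(self, key):
        return self.buckets[hash(key) % self.capacity]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # overwrite an existing key
                return
        bucket.append((key, value))
        self.size += 1
        if self.size > 2 * self.capacity:  # crude load-factor check
            self._grow()

    def get(self, key, default=None):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return default

    def _grow(self):
        pairs = [p for bucket in self.buckets for p in bucket]
        self.capacity *= 2
        self.size = 0
        self.buckets = [[] for _ in range(self.capacity)]
        for k, v in pairs:
            self.put(k, v)

t = HashTable()
t.put("topic", "map")
print(t.get("topic"))  # -> map
```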

In the next iteration of the topic maps course, I should have only a topic map (no papers) as the main project. Require draft maps to be posted on a weekly basis.

So that design choices can be made, discussed and debated as the maps develop.

So that the students are teaching each other about the domains they have chosen as they are constructing their maps.

Teach Data Science [website + book]

Filed under: Data Science,R — Patrick Durusau @ 5:44 am

Teach Data Science

From the webpage:

This is the companion site to the electronic textbook, Introduction to Data Science, by Jeffrey Stanton. This book provides non-technical readers with a gentle introduction to essential concepts and activities of data science. For more technical readers, the book provides explanations and code for a range of interesting applications using the open source R language for statistical computing and graphics.

This book was developed for the Certificate of Data Science program at Syracuse University’s School of Information Studies. If you find errors or omissions, please contact the author, Jeffrey Stanton, at jmstanto@syr.edu. The book is suitable for an introductory course in data science where students have a varied background or as a supplement to an advanced analytics course where students would benefit from an introduction to R.

The book, Introduction to Data Science, is available:

iTunes: http://itunes.apple.com/us/book/introduction-to-data-science/id529088127?ls=1 Interactive version.

or

PDF http://jsresearch.net/groups/teachdatascience/wiki/welcome/attachments/72f24/DataScienceBook1_1.pdf (19 MB file)

Nothing groundbreaking, but a useful introduction to the area.

Genome-scale analysis of interaction dynamics reveals organization of biological networks

Filed under: Bioinformatics,Biomedical,Genome,Graphs,Networks — Patrick Durusau @ 5:25 am

Genome-scale analysis of interaction dynamics reveals organization of biological networks by Jishnu Das, Jaaved Mohammed, and Haiyuan Yu. (Bioinformatics (2012) 28 (14): 1873-1878. doi: 10.1093/bioinformatics/bts283)

Summary:

Analyzing large-scale interaction networks has generated numerous insights in systems biology. However, such studies have primarily been focused on highly co-expressed, stable interactions. Most transient interactions that carry out equally important functions, especially in signal transduction pathways, are yet to be elucidated and are often wrongly discarded as false positives. Here, we revisit a previously described Smith–Waterman-like dynamic programming algorithm and use it to distinguish stable and transient interactions on a genomic scale in human and yeast. We find that in biological networks, transient interactions are key links topologically connecting tightly regulated functional modules formed by stable interactions and are essential to maintaining the integrity of cellular networks. We also perform a systematic analysis of interaction dynamics across different technologies and find that high-throughput yeast two-hybrid is the only available technology for detecting transient interactions on a large scale.

Research of obvious importance to anyone investigating biological networks, but I mention it for the problem it raises: how do we represent transient relationships/interactions in a network?

Assuming a graph/network topology, how does a transient relationship impact a path traversal?

Assuming a graph/network topology, do we ignore the transience for graph theoretical properties such as shortest path?

Do we need graph theoretical queries versus biological network queries? Are the results always the same?

Can transient relationships result in transient properties? How do we record those?

Better yet, how do we ignore transient properties and under what conditions? (Leaving to one side how we would formally/computationally accomplish that ignorance.) What are the theoretical issues?
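To make the traversal question concrete, here is one possible representation, my sketch and not the paper’s: tag each edge as transient or stable and let each query decide whether transient edges count. The attribute name is my invention; the graph uses networkx.

```python
# Sketch: let the query decide whether transient interactions count.
# My illustration, not the paper's method; the "transient" edge
# attribute is an invented name.

import networkx as nx

G = nx.Graph()
G.add_edge("A", "B", transient=False)  # stable interaction
G.add_edge("B", "C", transient=True)   # transient interaction (e.g., signaling)
G.add_edge("A", "D", transient=False)
G.add_edge("D", "E", transient=False)
G.add_edge("E", "C", transient=False)

def shortest_path(G, source, target, include_transient=True):
    """Shortest path that optionally ignores transient edges."""
    H = nx.Graph(
        (u, v) for u, v, d in G.edges(data=True)
        if include_transient or not d["transient"]
    )
    return nx.shortest_path(H, source, target)

print(shortest_path(G, "A", "C"))                           # ['A', 'B', 'C']
print(shortest_path(G, "A", "C", include_transient=False))  # ['A', 'D', 'E', 'C']
```

The two answers differ, which is the point: “shortest path” is no longer one question once edges can be transient.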

You can find the full text of this article at Professor Yu’s site: http://yulab.icmb.cornell.edu/PDF/Das_B2012.pdf

Measurement = Meaningful?

Filed under: Data Science,Education,Measurement — Patrick Durusau @ 4:37 am

A two-part series of posts on data and education has started up at Hortonworks: Data in Education (Part I) by James Locus.

From the post:

The education industry is transforming into a 21st century data-driven enterprise. Metrics based assessment has been a powerful force that has swept the national education community in response to widespread policy reform. Passed in 2001, the No-Child-Left-Behind Act pushed the idea of standards-based education whereby schoolteachers and administrators are held accountable for the performance of their students. The law elevated standardized tests and dropout rates as the primary way officials measure student outcomes and achievement. Underperforming schools can be placed on probation, and if no improvement is seen after 3-4 years, the entire staff of the school can be replaced.

The political ramifications of the law inspire much debate amongst policy analysts. However, from a data perspective, it is more informative to understand how advances in technology can help educators both meet the policy’s guidelines and work to create better student outcomes.

How data measurement can drive poor management practices is captured in:

whereby schoolteachers and administrators are held accountable for the performance of their students.

Really? The only people who are responsible for the performance of students are schoolteachers and administrators?

Recall that schoolteachers don’t see a child until they are at least four or five years old, by which time most of their learning and behavior patterns have been well established. By their parents, by advertisers, by TV shows, by poor diets, by poor health care, etc.

And when they do see children, it is only for seven hours out of twenty-four.

Schoolteachers and administrators are in a testable situation, which isn’t the same thing as a situation where tests are meaningful.

As data “scientists” we can crunch the numbers given to us and serve the industry’s voracious appetite for more numbers.

Or we can point out that better measurement design could result in different policy choices.

Depends on your definition of “scientist.”

There are people who worked for Big Tobacco who still call themselves “scientists.”

What do you think?

July 6, 2012

Updated: List of Legal Informatics Courses, Programs, Departments, and Research Centers

Filed under: Legal Informatics — Patrick Durusau @ 6:51 pm

Updated: List of Legal Informatics Courses, Programs, Departments, and Research Centers.

Legal Informatics has updated its listing of legal informatics resources. Just in case you are spending a summer weekend updating your resource lists. 😉

UCR Time Series Classification/Clustering Page

Filed under: Classification,Clustering,Time Series — Patrick Durusau @ 6:48 pm

UCR Time Series Classification/Clustering Page

I encountered this while hunting down references on the insect identification contest.

How does your thinking about topic maps or other semantic solutions fare against:

Machine learning research has, to a great extent, ignored an important aspect of many real world applications: time. Existing concept learners predominantly operate on a static set of attributes; for example, classifying flowers described by leaf size, petal colour and petal count. The values of these attributes is assumed to be unchanging — the flower never grows or loses leaves.

However, many real datasets are not “static”; they cannot sensibly be represented as a fixed set of attributes. Rather, the examples are expressed as features that vary temporally, and it is the temporal variation itself that is used for classification. Consider a simple gesture recognition domain, in which the temporal features are the position of the hands, finger bends, and so on. Looking at the position of the hand at one point in time is not likely to lead to a successful classification; it is only by analysing changes in position that recognition is possible.

(Temporal Classification: Extending the Classification Paradigm to Multivariate Time Series by Mohammed Waleed Kadous (2002))

A decade old now but still a nice summary of the issue.
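For a concrete taste of classifying by temporal variation rather than by a static attribute vector, here is a minimal dynamic time warping (DTW) distance, a measure long associated with the UCR time series work. A sketch only; production use adds windowing and normalization.

```python
# Minimal dynamic time warping: a sketch, not production code.
# DTW aligns two sequences in time before measuring distance, so two
# identical shapes shifted in time come out close together.

def dtw(a, b):
    """Classic O(len(a) * len(b)) DTW with squared point distance."""
    n, m = len(a), len(b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]

# Same shape, shifted one step in time.
x = [0, 0, 1, 2, 1, 0, 0, 0]
y = [0, 0, 0, 1, 2, 1, 0, 0]
print(dtw(x, y))                                # 0.0 -- DTW aligns the shift away
print(sum((p - q) ** 2 for p, q in zip(x, y)))  # 4 -- pointwise comparison does not
```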

Can we substitute “identification” for “machine learning research?”

Are you relying “…on a static set of attributes” for identity purposes?

UCR Insect Classification Contest [Classification by Ear]

Filed under: Classification,Identification — Patrick Durusau @ 5:16 pm

UCR Insect Classification Contest Ends November 16, 2012

As I have said before, subject identity is everywhere! 😉

From the details PDF file:

Phase I: July to November 16th 2012 (this contest)

  • The task is to produce the best distance (similarity) measure for insect flight sounds.
  • The contest will be scored by 1-nearest neighbor classification.
  • The prizes include $500 cash and engraved trophies.
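The second bullet is worth unpacking: your distance measure is scored by how often the single nearest training sound carries the correct label. A minimal sketch of that scoring loop, with toy data standing in for the insect recordings:

```python
# Sketch of 1-nearest-neighbor scoring: each test sequence gets the
# label of its closest training sequence under your distance measure.
# Data and distance here are toy stand-ins, not contest data.

def one_nn_accuracy(train, test, distance):
    """train/test: lists of (label, sequence) pairs."""
    correct = 0
    for true_label, query in test:
        nearest = min(train, key=lambda item: distance(item[1], query))
        if nearest[0] == true_label:
            correct += 1
    return correct / len(test)

train = [("bee", [1, 2, 1]), ("mosquito", [5, 6, 5])]
test = [("bee", [1, 2, 2]), ("mosquito", [6, 6, 5])]
euclid = lambda a, b: sum((p - q) ** 2 for p, q in zip(a, b))
print(one_nn_accuracy(train, test, euclid))  # 1.0 on this toy data
```

Swap in whatever distance you are proposing (a DTW-style measure, say) and the scoring machinery does not change; that is what makes this a pure distance-measure design problem.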

I was amused to read in the FAQ:

Note that the “sound” is measured with an optical sensor, rather than an acoustic one. This is done for various pragmatic reasons, however we don’t believe it makes any difference to the task at hand. The sampling rate is 16000 Hz

If you have a bee keeper nearby, can you do an empirical comparison of optical versus acoustic sensors for capturing the “sound” of insects?

That seems like a first step in establishing computational entomology. BTW, use a range of frequencies, from subsonic to supersonic. (You are aware they have discovered that subsonic sounds from whales can travel thousands of miles? Unlikely with insects, but just because our ears can’t hear something doesn’t mean other ears can’t.)

I first saw this at KDNuggets.

Apache Camel at 5 [2.10 release]

Filed under: Apache Camel,Integration,Tweets — Patrick Durusau @ 4:54 pm

Apache Camel celebrates 5 years in development with 2.10 release by Chris Mayer.

Chris writes:

Off the back of celebrating its fifth birthday at CamelOne 2012, the Apache Camel team have put the finishing touches to their next release, Apache Camel 2.10, adding in an array of new components to the Apache enterprise application integration platform.

No less than 483 issues have been resolved this time round, but the real draw is the 18 components added to the package, including Websocket and Twitter, allowing for deeper cohesive messaging for users. With the Twitter component, based on the Twitter4J library, users may obtain direct, polling, or event-driven consumption of timelines, users, trends, and direct messages. An example of combining the two can be found here.

Other additions to the component catalogue include support for HBase, CDI, MongoDB, Apache Avro, DynamoDB on AWS, Google GSON and Guava. Java 7 support is much more thorough now, as is support for Spring 3.1.x and Netty. A full list of all resolved issues can be found here.

The Twitter Websocket example reminds me of something I have been meaning to write about Twitter, topic maps and public data streams.

But more on that next week.

SparQLed…Writing SPARQL Queries [Less ZERO-result queries]

Filed under: RDF,Semantic Web,SPARQL — Patrick Durusau @ 4:36 pm

SindiceTech Releases SparQLed As Open Source Project To Simplify Writing SPARQL Queries by Jennifer Zaino.

From the post:

SindiceTech today released SparQLed, the SindiceTech Assisted SPARQL Editor, as an open source project. SindiceTech, a spinoff company from the DERI Institute, commercializes large-scale, Big Data infrastructures for enterprises dealing with semantic data. It has roots in the semantic web index Sindice, which lets users collect, search, and query semantically marked-up web data (see our story here).

SparQLed also is one of the components of the commercial Sindice Suite for helping large enterprises build private linked data clouds. It is designed to give users all the help they need to write SPARQL queries to extract information from interconnected datasets.

“SPARQL is exciting but it’s difficult to develop and work with,” says Giovanni Tummarello, who led the efforts around the Sindice search and analysis engine and is founder and CEO of SindiceTech.

SparQLed Project page.

Maybe we have become spoiled by search engines that always return results, even bad ones:

With SQL, the advantage lies in having a schema which users can look at and understand how to write a query. RDF, on the other hand, has the advantage of providing great power and freedom, because information in RDF can be interconnected freely. But, Tummarello says, “with RDF there is no schema because there is all sorts of information from everywhere.” Without knowing which properties are available specifically for a certain URI and in what context, users can wind up writing queries that return no results and get frustrated by the constant iterating needed to achieve their ends.
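One way to see the problem, and the kind of assistance SparQLed automates, is to ask the data which predicates actually exist for a resource before writing the real query. A sketch using rdflib; the DBpedia URI is my example, not from the article.

```python
# Sketch: discover usable predicates before writing the "real" SPARQL
# query, to avoid the zero-result trap. The DBpedia resource is just an
# illustrative example; assumes the endpoint is reachable.

from rdflib import Graph

g = Graph()
g.parse("http://dbpedia.org/data/Tim_Berners-Lee.rdf", format="xml")

discover = """
    SELECT DISTINCT ?p
    WHERE { <http://dbpedia.org/resource/Tim_Berners-Lee> ?p ?o }
"""
for row in g.query(discover):
    print(row[0])  # each predicate is safe to use in a follow-up query
```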

I am not encouraged by a features list that promises:

Less ZERO-result queries

Lucene Tutorial updated for Lucene 3.6

Filed under: Lucene — Patrick Durusau @ 4:15 pm

Lucene Tutorial updated for Lucene 3.6

From LingPipe:

The current Apache Lucene Java version is 3.6, released in April of 2012. We’ve updated the Lucene 3 tutorial and the accompanying source code to bring it in line with the current API so that it doesn’t use any deprecated methods and my, there are a lot of them. Bob blogged about this tutorial back in February 2011, shortly after Lucene Java rolled over to version 3.0.

Like other 3.x minor releases, Lucene 3.6 introduces performance enhancements, bug fixes, new analyzers, and changes that bring the Lucene API in line with Solr. In addition, Lucene 3.6 anticipates Lucene 4, billed as “the next major backwards-incompatible release.”

Excellent news! Although you will need to hurry: Lucene/Solr 4.0 is just around the corner!

Tutorial on biological networks [The Heterogeneity of Nature]

Filed under: Bioinformatics,Biomedical,Graphs,Heterogeneous Data,Networks — Patrick Durusau @ 3:54 pm

Tutorial on biological networks by Francisco G. Vital-Lopez, Vesna Memišević, and Bhaskar Dutta. (Vital-Lopez, F. G., Memišević, V. and Dutta, B. (2012), Tutorial on biological networks. WIREs Data Mining Knowl Discov, 2: 298–325. doi: 10.1002/widm.1061)

Abstract:

Understanding how the functioning of a biological system emerges from the interactions among its components is a long-standing goal of network science. Fomented by developments in high-throughput technologies to characterize biomolecules and their interactions, network science has emerged as one of the fastest growing areas in computational and systems biology research. Although the number of research and review articles on different aspects of network science is increasing, updated resources that provide a broad, yet concise, review of this area in the context of systems biology are few. The objective of this article is to provide an overview of the research on biological networks to a general audience, who have some knowledge of biology and statistics, but are not necessarily familiar with this research field. Based on the different aspects of network science research, the article is broadly divided into four sections: (1) network construction, (2) topological analysis, (3) network and data integration, and (4) visualization tools. We specifically focused on the most widely studied types of biological networks, which are, metabolic, gene regulatory, protein–protein interaction, genetic interaction, and signaling networks. In future, with further developments on experimental and computational methods, we expect that the analysis of biological networks will assume a leading role in basic and translational research.

This article is a frozen artifact in time, so I suggest reading it before it is too badly out of date. It will be sad to see it ravaged by time and pitted by later research that renders entire sections obsolete. Or of interest only to medical literature spelunkers of some future time.

Developers of homogeneous and “correct” models of biological networks should take warning from the closing lines of this survey article:

Currently different types of networks, such as PPI, GRN, or metabolic networks are analyzed separately. These heterogeneous networks have to be integrated systematically to generate comprehensive network, which creates a realistic representation of biological systems.[cite omitted] The integrated networks have to be combined with different types of molecular profiling data that measures different facades of the biological system. A recent multi institutional collaborative project, named The Cancer Genome Atlas,[cite omitted] has already started generating much multi-‘omics’ data for large cancer patient cohorts. Thus, we can expect to witness an exciting and fast paced growth on biological network research in the coming years.

Interesting.

Nature uses heterogeneous networks, with great success.

We can keep building homogeneous networks or we can start building heterogeneous networks (at least to the extent we are capable).

What do you think?

The R Journal 4/1 June, 2012

Filed under: R,Statistics — Patrick Durusau @ 12:47 pm

The R Journal 4/1 June, 2012

I am sure you will find something interesting to read.

BigML 0.3.1 Release

Filed under: Machine Learning,Predictive Analytics,Python — Patrick Durusau @ 9:45 am

BigML 0.3.1 Release

From the webpage:

An open source binding to BigML.io, the public BigML API


BigML makes machine learning easy by taking care of the details required to add data-driven decisions and predictive power to your company. Unlike other machine learning services, BigML creates beautiful predictive models that can be easily understood and interacted with.

These BigML Python bindings allow you to interact with BigML.io, the API for BigML. You can use it to easily create, retrieve, list, update, and delete BigML resources (i.e., sources, datasets, models, and predictions).
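For the curious, the lifecycle in the bindings’ documentation boils down to CSV → source → dataset → model → prediction. A condensed sketch; the file name and input field are placeholders, and credentials come from the BIGML_USERNAME / BIGML_API_KEY environment variables.

```python
# Condensed from the bindings' documentation as I read it; file name
# and input field are placeholders.

from bigml.api import BigML

api = BigML()  # reads BIGML_USERNAME / BIGML_API_KEY from the environment

source = api.create_source("./data/iris.csv")
dataset = api.create_dataset(source)
model = api.create_model(dataset)
prediction = api.create_prediction(model, {"sepal length": 5.0})

api.pprint(prediction)  # pretty-print the predicted value
```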

There’s that phrase again, predictive models.

Don’t people read patent literature anymore? 😉 I don’t care for absurdist fiction so I tend to avoid it. People claim invention for having a patent lawyer write common art up in legal prose. Good for patent lawyers, bad for researchers and true inventors.

Puzzling outcomes in A/B testing

Filed under: Interface Research/Design,Users — Patrick Durusau @ 9:28 am

Puzzling outcomes in A/B testing by Greg Linden.

Greg writes:

“Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained” (PDF), has a lot of great insights into A/B testing and real issues you hit with A/B testing.

I like where Greg quotes the paper as saying:

When Bing had a bug in an experiment, which resulted in very poor results being shown to users, two key organizational metrics improved significantly: distinct queries per user went up over 10%, and revenue per user went up over 30%! …. Degrading algorithmic results shown on a search engine result page gives users an obviously worse search experience but causes users to click more on ads, whose relative relevance increases, which increases short-term revenue … [This shows] it’s critical to understand that long-term goals do not always align with short-term metrics.

I am not really sure what an “obviously worse search experience” would look like. Maybe I don’t want to know. 😉

Anyway, kudos to Greg for finding an amusing and useful paper on testing.

July 5, 2012

Search and Counterfactuals

Filed under: Counterfactual,Indexing,Search Algorithms,Searching — Patrick Durusau @ 6:53 pm

In Let the Children Play, It’s Good for Them! (Smithsonian, July/August 2012) Alison Gopnik writes:

Walk into any preschool and you’ll find toddling superheroes battling imaginary monsters. We take it for granted that young children play and, especially, pretend. Why do they spend so much time in fantasy worlds?

People have suspected that play helps children learn, but until recently there was little research that showed this or explained why it might be true. In my lab at the University of California at Berkeley, we’ve been trying to explain how very young children can learn so much so quickly, and we’ve developed a new scientific approach to children’s learning.

Where does pretending come in? It relates to what philosophers call “counterfactual” thinking, like Einstein wondering what would happen if a train went at the speed of light.

Do our current models for search encourage or discourage counterfactual thinking? Neutral?

There is a place for “factual” queries: Has “Chipper” Jones, who plays for the Atlanta Braves, ever hit safely 5 out of 5 times in a game? 😉

But what of counterfactuals?

Do they lead us to new forms of indexing? By re-imagining how searching could be done, if and only if there were a new indexing structure?

Are advances in algorithms largely due to counterfactuals? Where the “factuals” are the world of processing as previously imagined?

We can search for the “factuals,” prior answers approved by authorities, but how does one search for a counterfactual?

Or search for what triggers a counterfactual?

I don’t have even an inkling of an answer or what an answer might look like, but I thought it would be worth asking the question.

Theory and Techniques for Synthesizing a Family of Graph Algorithms

Filed under: Graph Traversal,Graphs,Search Algorithms — Patrick Durusau @ 6:31 pm

Theory and Techniques for Synthesizing a Family of Graph Algorithms by Srinivas Nedunuri, William R. Cook, and Douglas R. Smith.

Abstract:

Although Breadth-First Search (BFS) has several advantages over Depth-First Search (DFS) its prohibitive space requirements have meant that algorithm designers often pass it over in favor of DFS. To address this shortcoming, we introduce a theory of Efficient BFS (EBFS) along with a simple recursive program schema for carrying out the search. The theory is based on dominance relations, a long standing technique from the field of search algorithms. We show how the theory can be used to systematically derive solutions to two graph algorithms, namely the Single Source Shortest Path problem and the Minimum Spanning Tree problem. The solutions are found by making small systematic changes to the derivation, revealing the connections between the two problems which are often obscured in textbook presentations of them.

I don’t think it would satisfy the formal derivation requirements of the authors, but I am curious: why derive the dominance relations at all? Wouldn’t it be easier to ask a user at each node for a go/no-go call on the dominance relation? That would allow an interactive application of the search algorithms based upon the user’s dominance judgement.

I say that because the derivation strategy depends upon the developer’s interpretation of dominance, from a field that may be unfamiliar or incorrectly communicated by a user. To be sure, the derivation and results may be formally correct but not produce an answer that is “correct” in the view of a user.

Not to mention that once a derivation is formally “correct,” there would be resistance to changing a “correct” derivation.

An interactive, dynamic, symbiotic search experience between users and their search systems is more likely to produce results thought to be “correct” by users. (What other measure of “correctness” comes to mind?)
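Here is a sketch of what I have in mind, and it is emphatically not the authors’ EBFS schema: a BFS that delegates the dominance (prune or keep) decision to a callback, which could as easily be a human as a derived relation.

```python
# Sketch: BFS with the dominance test delegated to a callback, so a
# user (or any oracle) makes the go/no-go pruning call at each node.
# My illustration, not the authors' derivation-based EBFS.

from collections import deque

def interactive_bfs(start, neighbors, is_goal, is_dominated):
    """is_dominated(node, frontier) -> True means 'prune this node'."""
    frontier = deque([start])
    seen = {start}
    while frontier:
        node = frontier.popleft()
        if is_goal(node):
            return node
        for nxt in neighbors(node):
            if nxt in seen or is_dominated(nxt, frontier):
                continue
            seen.add(nxt)
            frontier.append(nxt)
    return None

# Toy usage: search the integers via +1 / *2 steps.
result = interactive_bfs(
    1,
    neighbors=lambda n: [n + 1, n * 2],
    is_goal=lambda n: n == 48,
    is_dominated=lambda n, frontier: n > 100,  # stand-in for a user's judgement
)
print(result)  # 48
```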

PS: Srinivas Nedunuri‘s homepage promises a copy of his PhD dissertation, which formed the basis for this paper, soon!

INSTRUCT: Space-Efficient Structure for Indexing and Complete Query Management of String Databases

Filed under: Indexing,Searching,String Matching — Patrick Durusau @ 3:21 pm

INSTRUCT: Space-Efficient Structure for Indexing and Complete Query Management of String Databases by Sourav Dutta and Arnab Bhattacharya.

Abstract:

The tremendous expanse of search engines, dictionary and thesaurus storage, and other text mining applications, combined with the popularity of readily available scanning devices and optical character recognition tools, has necessitated efficient storage, retrieval and management of massive text databases for various modern applications. For such applications, we propose a novel data structure, INSTRUCT, for efficient storage and management of sequence databases. Our structure uses bit vectors for reusing the storage space for common triplets, and hence, has a very low memory requirement. INSTRUCT efficiently handles prefix and suffix search queries in addition to the exact string search operation by iteratively checking the presence of triplets. We also propose an extension of the structure to handle substring search efficiently, albeit with an increase in the space requirements. This extension is important in the context of trie-based solutions which are unable to handle such queries efficiently. We perform several experiments portraying that INSTRUCT outperforms the existing structures by nearly a factor of two in terms of space requirements, while the query times are better. The ability to handle insertion and deletion of strings in addition to supporting all kinds of queries including exact search, prefix/suffix search and substring search makes INSTRUCT a complete data structure.

From the introduction:

As all strings are composed of a defined set of characters, reusing the storage space for common characters promises to provide the most compressed form of representation. This redundancy linked with the need for extreme space-efficient index structures motivated us to develop INSTRUCT (INdexing STrings by Re-Using Common Triplets).
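To see why triplets are attractive, here is a toy positional 3-gram index. It is emphatically not INSTRUCT, which uses bit vectors and shares storage across strings, just an illustration of answering prefix queries by iteratively checking triplet presence.

```python
# Toy positional-triplet index: NOT the INSTRUCT structure, only an
# illustration of answering prefix queries by checking 3-gram presence
# position by position instead of comparing whole strings.

from collections import defaultdict

class TripletIndex:
    def __init__(self):
        self.index = defaultdict(set)  # (position, triplet) -> string ids
        self.strings = []

    def add(self, s):
        sid = len(self.strings)
        self.strings.append(s)
        padded = s + "$$"              # pad so the tail still yields triplets
        for i in range(len(s)):
            self.index[(i, padded[i:i + 3])].add(sid)
        return sid

    def prefix_search(self, prefix):
        """Ids of strings starting with prefix, via iterative triplet checks."""
        if len(prefix) < 3:            # too short for a triplet; just scan
            return {i for i, s in enumerate(self.strings) if s.startswith(prefix)}
        candidates = None
        for i in range(len(prefix) - 2):
            ids = self.index.get((i, prefix[i:i + 3]), set())
            candidates = ids if candidates is None else candidates & ids
            if not candidates:
                return set()           # a missing triplet rules everyone out
        return candidates

idx = TripletIndex()
for w in ["instruct", "index", "indigo", "string"]:
    idx.add(w)
print(sorted(idx.strings[i] for i in idx.prefix_search("ind")))  # ['index', 'indigo']
```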

By the time of a presentation (below) on the technique, the authors apparently rethought the name, settling on:

“SPACE-EFFICIENT MANAGEMENT OF TEXT USING INDEXED KEYS” (SEManTIKs)

Neither one really “rolls off the tongue,” but I suspect searching for the second may be somewhat easier.

I say that, but “semantiks” turns out to be a women’s clothing line and at least one popular search engine offers to correct to “semantics.” I am sure a “gathered scoop neck” and a “flattering boot-cut silhouette” are all quite interesting but not really on point.

The slide presentation lists some fourteen (14) other approaches that can be compared to the one developed by the authors. (I am assuming the master’s thesis by Dutta has the details on the comparisons with the other techniques. I haven’t found it online but have written to request a copy.)

This work demonstrates that we are nowhere near the end of improvements for indexing and search.

See also the presentation: SPACE-EFFICIENT MANAGEMENT OF STRING DATABASES BY REUSING COMMON CHARACTERS by Sourav Dutta.

Special Volume: Graphical User Interfaces for R (Journal of Statistical Software, Vol. 49)

Filed under: R,Statistics — Patrick Durusau @ 12:38 pm

Special Volume: Graphical User Interfaces for R (Journal of Statistical Software, Vol. 49)

From the table of contents: [contents omitted]

Mosaic: making biological sense of complex networks

Filed under: Bioinformatics,Gene Ontology,Genome,Graphs,Networks — Patrick Durusau @ 12:14 pm

Mosaic: making biological sense of complex networks by Chao Zhang, Kristina Hanspers, Allan Kuchinsky, Nathan Salomonis, Dong Xu, and Alexander R. Pico. (Bioinformatics (2012) 28 (14): 1943-1944. doi: 10.1093/bioinformatics/bts278)

Abstract:

We present a Cytoscape plugin called Mosaic to support interactive network annotation, partitioning, layout and coloring based on gene ontology or other relevant annotations.

From the Introduction:

The increasing throughput and quality of molecular measurements in the domains of genomics, proteomics and metabolomics continue to fuel the understanding of biological processes. Collected per molecule, the scope of these data extends to physical, genetic and biochemical interactions that in turn comprise extensive networks. There are software tools available to visualize and analyze data-derived biological networks (Smoot et al., 2011). One challenge faced by these tools is how to make sense of such networks often represented as massive ‘hairballs’. Many network analysis algorithms filter or partition networks based on topological features, optionally weighted by orthogonal node or edge data (Bader and Hogue, 2003; Royer et al., 2008). Another approach is to mathematically model networks and rely on their statistical properties to make associations with other networks, phenotypes and drug effects, sidestepping the issue of making sense of the network itself altogether (Machado et al., 2011). Acknowledging that there is still great value in engaging the minds of researchers in exploratory data analysis at the level of networks (Kelder et al., 2010), we have produced a Cytoscape plugin called Mosaic to support interactive network annotation and visualization that includes partitioning, layout and coloring based on biologically relevant ontologies (Fig. 1). Mosaic shows slices of a given network in the visual language of biological pathways, which are familiar to any biologist and are ideal frameworks for integrating knowledge.

[Fig. 1 omitted]

Cytoscape is a free and open source network visualization platform that actively supports independent plugin development (Smoot et al., 2011). For annotation, Mosaic relies primarily on the full gene ontology (GO) or simplified ‘slim’ versions (http://www.geneontology.org/GO.slims.shtml). The cellular layout of partitioned subnetworks strictly depends on the cellular component branch of GO, but the other two functions, partitioning and coloring, can be driven by any annotation associated with a major gene or protein identifier system.

You will need the prerequisites listed on the Mosaic project page.

The Mosaic page offers additional documentation, which will take a while to process. I am particularly interested in annotations of the network driving partitioning.

JMLR – Journal of Machine Learning Research

Filed under: Artificial Intelligence,Machine Learning — Patrick Durusau @ 10:37 am

JMLR – Journal of Machine Learning Research

From the webpage:

The Journal of Machine Learning Research (JMLR) provides an international forum for the electronic and paper publication of high-quality scholarly articles in all areas of machine learning. All published papers are freely available online.

Starts with volume 1 in October of 2000 and continues to the present.

Special topics that call out articles from different issues and special issues are also listed.

A first rate collection of machine learning research.

GraphStream 1.1

Filed under: Graphs,Networks — Patrick Durusau @ 10:28 am

GraphStream 1.1 was released in November 2011. Sorry for the late notice: I ran across slides from a recent presentation and thence the updated release.

From the release notes:

We are happy to announce a new minor release of GraphStream stable version, 1.1. We hope it will fulfill your needs and that you will enjoy the new features that come with it. As usual, please do not hesitate to provide us with your comments through the mailing list and to submit bugs on the issue tracking system.

What is new in release 1.1?

  • GraphStream 1.1 supports most of the commonly used graph file formats (DOT, GML, GEXF, Pajek, GraphML, TLP). It can read files in these formats thus making the interface with other graph libraries easier. Some of these parsers (DOT, GML, Pajek, TLP) are (re)written using a JavaCC grammar to reproduce the exact format specifications.
  • There is a new way to access graph elements (nodes and edges) by index in addition to the access by identifier. The access by index is faster and allows easy interfacing with APIs that use arrays.
  • New methods are added to Graph and Node interfaces for more flexibility. In general, there are three ways to pass a graph element to a method: by id, by index and by reference.
  • The Graph implementations (AdjacencyListGraph, SingleGraph and MultiGraph) were completely rewritten. The common code (Sink and Source implementation) was refactored. The new implementations are more stable and provide faster access and iteration (especially breadth-first and depth-first iteration) with almost no memory overhead.
  • Concept of “Camera” has been extracted from the previous implementation. With this new version, each view of a viewer has to return a camera object. This object allows to get informations about the view (view center, zoom, etc …), to control this view and to convert pixels to graphic units and vice-versa.
  • There is a new directive in the DGS specifications. This directive, called “cl”, is linked to the “graphCleared()” event of a sink.
  • Dijkstra’s algorithm was reimplemented. The new implementation is much faster. The API has slightly changed.
  • With the help of our users many bugs were detected and fixed. Special thanks to all of them for their feedback.

The presentation?

Dynamic Graphs… …and a tool to handle them, 6th Complex Systems Summer School, Paris, July 5th, 2012.

On GitHub: github.com/organizations/graphstream

Connect the Stars (Graphs Anyone?)

Filed under: Graphs,Mathematics,Networks — Patrick Durusau @ 7:59 am

Connect the Stars (How papers are like constellations) by KW Regan.

From the post:

Bob Vaughan is a mathematician at Penn State University. He is also a Fellow of the Royal Society—not ours, Ben Franklin helped make it tough for us to have one about 236 years ago this Wednesday. He is a great expert on analytic number theory, especially applied to the prime numbers. His work involves the deep connections between integers and complex numbers that were first charted by Leonhard Euler in the time of Franklin.

Today we examine how connections are made in the literature, and how choosing them influences our later memory of what is known and what is not.

Proved mathematical statements are like stars of various magnitudes: claim, proposition, lemma, theorem… A paper usually connects several of the former kinds to a few bright theorems. Often there are different ways the connections could go, and a lengthened paper may extend them to various corollaries and other theorems. Thus we can get various constellations even from the same stars. Consider the Big Dipper and the larger Ursa Major:

I lack the mathematical chops to follow the substance of the post, but I can read along and see the connections made at different times by different people, connections that contributed to what is reported as the present state of knowledge.

How do we capture that, dare I say it, network/graph of interconnections?

Search seems haphazard and lossy.

Writing it out in prose monographs or articles isn’t much better because you still have to find the monograph or article.

What if there were a dynamic network/graph of connections that is overlaid on publications and grows with them? Capturing not just formal citations but less formal connections, including connections to less than an entire article?

The social life of research as it is read, assimilated, used, revised and extended by members of a discipline.

That is to say that research isn’t separate from us, research is us. It is as much a social phenomenon as prose, plays or poetry. Just written in a different style.

Olympic medal winners: every one since 1896 as open data

Filed under: Data,Data Mining,Data Source — Patrick Durusau @ 5:21 am

Olympic medal winners: every one since 1896 as open data

The Guardian Datablog has posted Olympic medal winner data for download.

Admitting to some preference, I was pleased to see that OpenDocument Format was one of the download choices. 😉

It may just be my ignorance of Olympic events, but doesn’t it seem odd for the gender of competitors to be listed along with the gender of the event?
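If you grab the CSV, a first pass takes only a few lines of pandas. The column names below are my guesses from the spreadsheet; adjust them to whichever download you choose.

```python
# First pass over the Guardian medal data. File and column names are
# assumptions; check them against the download you actually grab.

import pandas as pd

medals = pd.read_csv("olympic_medalists.csv")  # hypothetical local file name

# My puzzle above: athlete gender versus the gender of the event itself.
print(medals.groupby(["Gender", "Event_gender"]).size())

# Medal counts by country (NOC) across all the Games since 1896.
print(medals["NOC"].value_counts().head(10))
```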

A brief history of Olympic Sports (from Wikipedia). Military patrol was a demonstration sport in 1928, 1936 and 1948. Is that likely to make a return in 2016? Or would terrorist spotting be more appropriate?

Here is a Youtube Video Series on How to Write Fast R Code

Filed under: R — Patrick Durusau @ 4:45 am

Here is a Youtube Video Series on How to Write Fast R Code

From the post:

One of my Quant Finance Meetup members, Alon, has put up two videos in a series about speeding up the execution of R. It was a presentation he did a few months ago to the group. Thanks to him for these videos and presenting.

This is an instructional video on increasing the speed with which R code is executing. It is mostly related to tricks in the R syntax that substantially decrease the time it takes for R to execute code. There are 12 different tricks in the tutorial to increasing the efficiency of your code. Specific replicable examples are given so that you can try them at home. The neatest thing about these techniques is that they do not require any additional tools beyond the standard R Build. I have decreased the time it takes for a complex simulation from 15 minutes to less than 2 without using any compiled languages!!!

I was amused by the observation that the difference between 10 and 15 minutes of run time isn’t significant in most contexts. True, but I am not sure how to sell that in an immediate-feedback, Game Boy-trained world. Suggestions welcome!

BTW, you will appreciate the videos as well.

July 4, 2012

A new open journal on Data Science

Filed under: Data Science,R — Patrick Durusau @ 7:41 pm

A new open journal on Data Science

From the post:

Springer has introduced a new open, peer-reviewed journal focused on Data Science: EPJ Data Science.

What makes this a Data Science journal is novel uses of statistics, data analysis, computer techniques and public data sources to research a topic in another domain, rather than methodological research. Here are a few examples of the papers you'll find in the journal: [examples omitted]

Unsurprisingly, many of the articles use the R language for the underlying analysis and data visualization. And because this is an open journal, you're free to read any of the articles at the link below.

Now that’s good news!

JQVMap

Filed under: JQuery,Mapping,Maps,SVG — Patrick Durusau @ 7:35 pm

JQVMap

From the post:

JQVMap is a jQuery plugin that renders Vector Maps. It uses resizable Scalable Vector Graphics (SVG) for modern browsers like Firefox, Safari, Chrome, Opera and Internet Explorer 9. Legacy support for older versions of Internet Explorer 6-8 is provided via VML.

I saw this at Pete Warden’s Five Links, along with the Plane Old Networks post below.

Plane Old Networks

Filed under: Graphs,Networks,Visualization — Patrick Durusau @ 7:29 pm

Plane Old Networks by Skye Bender-deMoll.

From the post:

This is a catchall post to collect together a number of interesting network images I’ve run across in the last few years. The common feature is that they are all networks that are based in or arise from geography or spatial processes. Unlike most of the networks we often have to work with, these are mostly “planar” (or nearly so) meaning that they can usually be drawn in two dimensions with minimal crossing and distortion.

I had to reference this post because the networks are interesting and my hometown appears in one of them. 😉

I do think the author is correct when he speculates:

I have a hunch (but no stats to back it up) that the sorts of networks generated by processes that essentially operate on a flat substrate may be structurally different (have certain specific network properties) than the kinds of networks generated from processes like citations, campaign contributions, ownership relations, or other less-geographic systems.

Assuming there are different network properties, my question would be what underlying cause creates that difference?

Living with Imperfect Data

Filed under: Data,Data Governance,Data Quality,Topic Maps — Patrick Durusau @ 5:00 pm

Living with Imperfect Data by Jim Ericson.

From the post:

In a keynote at our MDM & Data Governance conference in Toronto a few days ago, an executive from a large analytical software company said something interesting that stuck with me. I am paraphrasing from memory, but it was very much to the effect of, “Sometimes it’s better to have everyone agreeing on numbers that aren’t entirely accurate than having everyone off doing their own numbers.”

Let that sink in for a moment.

After I did, the very idea of this comment struck me at a few levels. It might have the same effect on you.

In one sense, admitting there is an acceptable level of shared inaccuracy is anathema to the way we like to describe data governance. It was especially so at a MDM-centric conference where people are pretty single-minded about what constitutes “truth.”

As a decision support philosophy, it wouldn’t fly at a health care conference.

I rather like that: “Sometimes it’s better to have everyone agreeing on numbers that aren’t entirely accurate than having everyone off doing their own numbers.”

I suspect that is because it is the opposite of how I really like to see data. I don’t want rough results, say in a citation network, but rather all the relevant citations. Even if it isn’t possible to review all the relevant citations, I still want it to be complete.

But completeness is the enemy of results, or at least published results. Sure, eventually, assuming a small enough data set, it is possible to map it in its entirety. But that means whatever good would have come from its being available sooner has been lost.

I don’t want to lose the sense of rough agreement posed here, because that is important as well. There are many cases where, despite protests to the contrary from the Fed and economists, the numbers are almost fictional anyway. Pick some; they will be different soon enough. What counts is that we have agreed on numbers for planning purposes. We can always pick new ones.

The same is true for topic maps, perhaps even more so. They are a view into an infoverse, fixed at a moment in time by authoring decisions.

Don’t like the view? Create another one.
