Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

June 27, 2012

An interactive view of star constellations

Filed under: Graphics,Visualization — Patrick Durusau @ 2:14 pm

An interactive view of star constellations by Nathan Yau.

From the post:

When we look up at the night sky to gaze at the stars, we see small, glowing dots that we perceive almost as if they were drawn on a flat surface. However, all these dots vary in distance from us. View of the Sky by visualization developer Santiago Ortiz shows this third dimension of depth.

The constellations are placed on a sphere that you can zoom and rotate. This is an interesting view in itself, but select the perspective for absolute distance and magnitude, and you’ll see something completely different. It’s no longer a network that resembles a globe, and instead it morphs to a cloud of stars and randomness. Also see Ortiz’s first view of the sky that includes stars not part of major constellations.

Ah, the sphere is the Earth, and the default setting is “absolute magnitude, placed on sphere.”

Now try “absolute magnitude, actual positions.”

😉

I could not manage it, but it would be nice to be able to select a star in the actual-positions display and get a popup showing where it sits on the sphere.

To sharpen the contrast between what we “see” and what is “seen” from a different perspective.

I have seen something along these lines before. Suggestions/pointers?

An API for European Union legislation

Filed under: Law,Law - Sources,Legal Informatics — Patrick Durusau @ 1:51 pm

An API for European Union legislation

From the webpage:

The API can help you conduct research, create data visualizations or you can even build applications upon it.

This is an application programming interface (API) that opens up core EU legislative data for further use. The interface uses JSON, meaning that you have easy to use machine-readable access to meta data on European Union legislation. It will be useful if you want to use or analyze European Union legislative data in a way that the official databases are not originally build for. The API extracts, organize and connects data from various official sources.

Among other things we have used the data to conduct research on the decision-making time*, analyze voting patterns*, measure the activity of Commissioners* and visualize the legislative integration process over time*, but you can use the API as you want to. When you use it to create something useful or interesting be sure to let us know, if you want to we can post a link to your project from this site.

For some non-apparent reason, the last paragraph has hyperlinks for the “*” characters. So that is not a typo, that is how it appears in the original text.

The data accessible through this API captures a large number of relationships, exactly the sort of relationships that topic maps excel at handling.
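Since the interface is plain JSON over HTTP, getting started from a script takes only a few lines. Here is a minimal sketch using Python's requests library; the base URL, resource name and query parameters are placeholders I invented for illustration, so check them against the API's own documentation before relying on them.

import requests

# Placeholder host and resource name; substitute the endpoints documented
# on the API's site.
BASE_URL = "https://api.example-eu-legislation.eu"

def fetch_procedures(year, limit=20):
    """Fetch JSON metadata for legislative procedures from a given year."""
    response = requests.get(
        f"{BASE_URL}/procedures",                  # hypothetical resource
        params={"year": year, "limit": limit},     # hypothetical parameters
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    print(fetch_procedures(2011))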

I first saw this at: DZone: An API for European Union legislation

The Scourge of Data Silos

Filed under: Data,Data Silos — Patrick Durusau @ 1:28 pm

The Scourge of Data Silos by Rick Sherman

From the post:

“Those who cannot remember the past are condemned to repeat it.” [1]

Over the years there have been many technology waves related to the design, development and deployment of Business Intelligence (BI). As BI technologies evolved, they have been able to significantly expand their functionality by leveraging the incredible capacity growth of CPUs, storage, disk I/O, memory and network bandwidth. New technologies have emerged as enterprises’ data needs keep expanding in variety, volume and velocity.

Technology waves are occurring more frequently than ever. Current technology waves include Big Data, data virtualization, columnar databases, BI appliances, in-memory analytics, predictive analytics, and self-service BI.

Common Promises

Each wave brings with it the promise of faster, easier to use and cheaper BI solutions. Each wave promises to be the breakthrough that makes the “old ways” archaic, and introduces a new dawn of pervasive BI responsive to business needs. No more spreadsheets or reports needed!

IT and product vendors are ever hopeful that the latest technology wave will be the magic elixir for BI, however, people seem to miss that it is not technology that is the gating factor to pervasive BI. What has held back BI has been the reluctance to address the core issues of establishing enterprise data management, information architecture and data governance. Those core issues are hard and the perpetual hope is that one of these technology waves will be the Holy Grail of BI and allow enterprises to skip the hard work of transforming and managing information. We have discussed these issues many times (and will again), but what I want to discuss is the inevitable result in the blind faith in the latest technology wave.

Rick does a good job of pointing out “the inevitable result in the blind faith in the latest technology wave.”

His cool image of silos at the top is a hint about his conclusion:

[Image: silos]

I have railed about data silos, along with everyone else, for years. But the line of data silos seems to be endless. As indeed I have come to believe it is.

Endless, that is. We can’t build data structures or collections of data without building data silos. Sometimes with enough advantages to justify a new silo, sometimes not.

Rather than “kick against the bricks” of data silos, our time would be better spent making our data silos as transparent as need be.

Not completely, and in some cases not at all. Simply not worth the effort. In those cases, we can always fall back on ETL, or simply ignore the silo altogether.

I posted recently about open data passing the one millionth data set. Data that is trapped in data silos of one sort or another.

We can complain about the data that is trapped inside or we can create mechanisms to free it and data that will inevitably be contained in future data silos.

Even topic map syntaxes and/or models are data silos. But that’s the point isn’t it? We are silo builders and that’s ok.

What we need to add to our skill set is making windows in silos and sharing those windows with others.

neo4j: Handling optional relationships

Filed under: Modeling,Neo4j — Patrick Durusau @ 12:58 pm

neo4j: Handling optional relationships by Mark Needham.

From the post:

On my ThoughtWorks neo4j there are now two different types of relationships between people nodes – they can either be colleagues or one can be the sponsor of the other.

Getting the information/relationships “in” wasn’t a problem. Getting the required information back out, that was a different story.

A useful illustration of how establishing the desired result (output in this case) can clarify what needs to be asked.

Don’t jump to the solution. Read the post and write down how you would get the desired results.
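If you want a head start on a modern answer, current Cypher handles the optional case directly with OPTIONAL MATCH. The sketch below uses the official Neo4j Python driver and is not Mark's solution; the labels, relationship types and connection details are assumptions for illustration only.

from neo4j import GraphDatabase

# Colleague relationships are required, sponsorship is optional.
# Labels, relationship types and credentials are illustrative.
QUERY = """
MATCH (p:Person {name: $name})-[:COLLEAGUE_OF]->(c:Person)
OPTIONAL MATCH (p)-[:SPONSORED_BY]->(s:Person)
RETURN p.name AS person, collect(DISTINCT c.name) AS colleagues, s.name AS sponsor
"""

def colleagues_and_sponsor(uri, user, password, name):
    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            return [record.data() for record in session.run(QUERY, name=name)]

if __name__ == "__main__":
    for row in colleagues_and_sponsor("bolt://localhost:7687", "neo4j", "password", "Mark"):
        print(row)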

I first saw this at DZone’s Neo4j page.

Predictive Coding Patented, E-Discovery World Gets Jealous

Filed under: e-Discovery,Law,Predictive Analytics — Patrick Durusau @ 12:48 pm

Predictive Coding Patented, E-Discovery World Gets Jealous by Christopher Danzig

From the post:

The normally tepid e-discovery world felt a little extra heat of competition yesterday. Recommind, one of the larger e-discovery vendors, announced Wednesday that it was issued a patent on predictive coding (which Gabe Acevedo, writing in these pages, named the Big Legal Technology Buzzword of 2011).

In a nutshell, predictive coding is a relatively new technology that allows large chunks of document review to be automated, a.k.a. done mostly by computers, with less need for human management.

Some of Recommind’s competitors were not happy about the news. See how they responded (grumpily), and check out what Recommind’s General Counsel had to say about what this means for everyone who uses e-discovery products.

Predictive coding has received a lot of coverage recently as a new way to save buckets of money during document review (a seriously expensive endeavor, for anyone who just returned to Earth).

I am always curious why a patent, or even a patent number, will be cited but no link to the patent given.

In case you are curious, it is patent 7,933,859, as a hyperlink.

The abstract reads:

Systems and methods for analyzing documents are provided herein. A plurality of documents and user input are received via a computing device. The user input includes hard coding of a subset of the plurality of documents, based on an identified subject or category. Instructions stored in memory are executed by a processor to generate an initial control set, analyze the initial control set to determine at least one seed set parameter, automatically code a first portion of the plurality of documents based on the initial control set and the seed set parameter associated with the identified subject or category, analyze the first portion of the plurality of documents by applying an adaptive identification cycle, and retrieve a second portion of the plurality of documents based on a result of the application of the adaptive identification cycle test on the first portion of the plurality of documents.

If that sounds familiar to you, you are not alone.

Predictive coding, developed over the last forty years, is an excellent feed into a topic map. As a matter of fact, it isn’t hard to imagine a topic map seeding and being augmented by a predictive coding process.

I also mention it as a caution that the IP in this area, as in many others, is beset by the ordinary being approved as innovation.

A topic map would be ideal for tracing claims and prior art, and for attaching analysis to a patent. I saw several patents assigned to Recommind and some pending applications. When I have a moment I will post a listing with links to those documents.

I first saw this at Beyond Search.

MyMoneyAppUp by U.S. Department of the Treasury – $25,000

Filed under: Contest,Funding — Patrick Durusau @ 10:25 am

MyMoneyAppUp by U.S. Department of the Treasury – $25,000

Submission period: June 27 – August 12, 2012.

Prizes:

1st: $10,000

2nd: $5,000 (2)

3rd: $2,500 (2)

From the webpage:

The MyMoneyAppUp Challenge, launched by the U.S. Treasury Department in partnership with the D2D Fund and Center for Financial Services Innovation, is a contest intended to motivate American entrepreneurs, software developers, the public, and students to propose the best ideas and designs for next-generation mobile tools to help Americans control and shape their financial futures. The Challenge calls for mobile app ideas (IdeaBank) and designs (App Design), with cash prizes awarded to the best submissions. Competitors are encouraged to propose mobile apps that incorporate data to empower consumers, as part of Treasury’s initiative to promote Smart Disclosure. MyMoneyAppUp competitors who want to take their winning ideas to the next step and develop prototypes may enter the FinCapDev Competition, a complementary competition sponsored exclusively by D2D and CFSI at the conclusion of the MyMoneyAppUp Challenge. Support for prizes and the administration of the Challenge by CFSI and D2D for the MyMoneyAppUp Challenge comes from the Ford Foundation, Omidyar Network, and the Citi Foundation.

Sounds like a place where topic maps could play a role.

From something as simple as integrating balances from specified accounts, and drafts on those accounts, to provide users with projected balances. It could even include projected credit card balances with accrued interest.
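As a toy illustration of the projected-balance idea, the arithmetic is nothing more than current balance, minus scheduled drafts, minus the card balance after a month of interest. Every number and field name below is made up.

# Toy projection: checking balance, minus scheduled drafts, minus a credit
# card balance after one month of interest. All figures are illustrative.
def projected_balance(checking, scheduled_drafts, card_balance, apr):
    monthly_rate = apr / 12.0
    card_with_interest = card_balance * (1 + monthly_rate)
    return checking - sum(scheduled_drafts) - card_with_interest

if __name__ == "__main__":
    print(projected_balance(
        checking=2500.00,
        scheduled_drafts=[800.00, 120.50, 60.00],  # rent, utilities, phone
        card_balance=950.00,
        apr=0.1999,                                # 19.99% APR
    ))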

Need a kill switch for the credit card one, at least while you are buying me a book present online. No particular holiday required. 😉

It’s not a lot of money but a good opportunity to build street cred for topic maps.

Heterogeneous data structures are the rule in the finance community.

PS: When some friend of yours says, “Oh, but we can use X to map between heterogeneous data structures,” your response should be: “Sure, and when you move up in management, how do we know why that mapping exists? Or how to add to it?”

Fixed mappings are useful, but also repetitively expensive.

Introducing new Fusion Tables API [Deprecation – SQL API]

Filed under: Database,Fusion Tables,SQL — Patrick Durusau @ 10:03 am

Introducing new Fusion Tables API by Warren Shen.

The post in its entirety:

We are very pleased to announce the public availability of the new Fusion Tables API. The new API includes all of the functionality of the existing SQL API, plus the ability to read and modify table and column metadata as well as the definitions of styles and templates for data visualization. This API is also integrated with the Google APIs console which lets developers manage all their Google APIs in one place and take advantage of built-in reporting and authentication features.

With this launch, we are also announcing a six month deprecation period for the existing SQL API. Since the new API includes all of the functionality of the existing SQL API, developers can easily migrate their applications using our migration guide.

For a detailed description of the features in the new API, please refer to the API documentation.

BTW, if you go to the Migration Guide, be aware that as of 27 June 2012, the following links aren’t working (404):

This Migration Guide documents how to convert existing code using the SQL API to code using the Fusion Tables API 1.0. This information is discussed in more detail in the Getting Started and Using the API developer guides.

I have discovered the error:

https://developers.google.com/fusiontables/docs/v1/v1/getting_started.html – Wrong – note the repeated “/v1”.

https://developers.google.com/fusiontables/docs/v1/getting_started – Correct – From the left side nav. bar.

https://developers.google.com/fusiontables/docs/v1/v1/using.html – Wrong – note the repeated “/v1”.

https://developers.google.com/fusiontables/docs/v1/using – Correct – From the left side nav. bar.

The summary material appears to be useful but you will need the more detailed information as well.

For example, under HTTP Methods (in the Migration Guide), the SQL API is listed as having:

GET for SHOW TABLES, DESCRIBE TABLE, SELECT

And the equivalent in the Fusion API:

GET for SELECT

No equivalent of SHOW TABLES, DESCRIBE TABLE using GET.

If you find and read Using the API, you will see:

Retrieving a list of tables

Listing tables is useful because it provides the table ID and column names of tables that are necessary for other calls. You can retrieve the list of tables a user owns by sending an HTTP GET request to the URI with the following format:

https://www.googleapis.com/fusiontables/v1/tables

Tables are listed along with column ids, names and datatypes.
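In practice that is a single authenticated GET. A minimal sketch in Python follows; the bearer-token handling is simplified for illustration (real code would use an OAuth client from the Google APIs console), and the response fields shown (items, tableId, columns) follow my reading of the v1 documentation, so verify them against the API reference.

import requests

def list_tables(access_token):
    """List Fusion Tables visible to the authenticated user."""
    response = requests.get(
        "https://www.googleapis.com/fusiontables/v1/tables",
        headers={"Authorization": f"Bearer {access_token}"},
        timeout=30,
    )
    response.raise_for_status()
    # Each item carries the table ID plus column names and datatypes.
    return response.json().get("items", [])

if __name__ == "__main__":
    for table in list_tables("YOUR_ACCESS_TOKEN"):
        print(table["tableId"], [c["name"] for c in table["columns"]])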

That may be too much for the migration document but implying that all you have with GET is SELECT is misleading.

Rather: GET for TABLES (SHOW + DESCRIBE), SELECT

Yes?

Become a Google Power Searcher

Filed under: Search Engines,Search Interface,Searching — Patrick Durusau @ 9:15 am

Become a Google Power Searcher by Terry Ednacot.

From the post:

You may already be familiar with some shortcuts for Google Search, like using the search box as a calculator or finding local movie showtimes by typing [movies] and your zip code. But there are many more tips, tricks and tactics you can use to find exactly what you’re looking for, when you most need it.

Today, we’ve opened registration for Power Searching with Google, a free, online, community-based course showcasing these techniques and how you can use them to solve everyday problems. Our course is aimed at empowering you to find what you need faster, no matter how you currently use search. For example, did you know that you can search for and read pages written in languages you’ve never even studied? Identify the location of a picture your friend took during his vacation a few months ago? How about finally identifying that green-covered book about gardening that you’ve been trying to track down for years? You can learn all this and more over six 50-minute classes.

Lessons will be released daily starting on July 10, 2012, and you can take them according to your own schedule during a two-week window, alongside a worldwide community. The lessons include interactive activities to practice new skills, and many opportunities to connect with others using Google tools such as Google Groups, Moderator and Google+, including Hangouts on Air, where world-renowned search experts will answer your questions on how search works. Googlers will also be on hand during the course period to help and answer your questions in case you get stuck.

I know, I know, you are way beyond using Google but you may know some people who are not.

Try to suggest this course in a positive, i.e., non-sneering, sort of way.

Will be a new experience.

You may want to “audit” the course.

Would be unfortunate for someone to ask you a Google search question you can’t answer.

😉

Booting HCatalog on Elastic MapReduce [periodic discovery audits?]

Filed under: Amazon Web Services AWS,HCatalog,Hive,Pig — Patrick Durusau @ 8:06 am

The Data Lifecycle, Part Three: Booting HCatalog on Elastic MapReduce by Russell Jurney.

From the post:

Series Introduction

This is part three of a series of blog posts covering new developments in the Hadoop pantheon that enable productivity throughout the lifecycle of big data. In a series of posts, we’re exploring the full lifecycle of data in the enterprise: Introducing new data sources to the Hadoop filesystem via ETL, processing this data in data-flows with Pig and Python to expose new and interesting properties, consuming this data as an analyst in Hive, and discovering and accessing these resources as analysts and application developers using HCatalog and Templeton.

  • Series Part One: Avroizing the Enron Emails. In that post, we used Pig to extract, transform and load a MySQL database of the Enron emails to document format and serialize them in Avro. The Enron emails are available in Avro format here.
  • Series Part Two: Mining Avros with Pig, Consuming Data with Hive. In part two of the series, we extracted new and interesting properties from our data for consumption by analysts and users, using Pig, EC2 and Hive. Code examples for this post are available here: https://github.com/rjurney/enron-hcatalog.
  • Series Part Three: Booting HCatalog on Elastic MapReduce. Here we will use HCatalog to streamline the sharing of data between Pig and Hive, and to aid data discovery for consumers of processed data.

Russell continues walking the Enron Emails through a full data lifecycle in the Hadoop ecosystem.

Given the current use and foreseeable use of email, these are important lessons for more than one reason.

What about periodic discovery audits on enterprise email archives?

To see what others may find, or to identify poor wording/disclosure practices?

Kiss the Weatherman [Weaponizing Data]

Filed under: BigData,Data,Dataset,Weather Data — Patrick Durusau @ 8:05 am

Kiss the Weatherman by James Locus.

From the post:

Weather Hurts

Catastrophic weather events like the historic 2011 floods in Pakistan or prolonged droughts in the horn of Africa make living conditions unspeakably harsh for tens of millions of families living in these affected areas. In the US, the winter storms of 2009-2010 and 2010-2011 brought record-setting snowfall, forcing mighty metropolises into an icy standstill. Extreme weather can profoundly impact the landscape of the planet.

The effects of extreme weather can send terrible ripples throughout an entire community. Unexpected cold snaps or overly hot summers can devastate crop yields and forcing producers to raise prices. When food prices rise, it becomes more difficult for some people to earn enough money to provide for their families, creating even larger problems for societies as a whole.

The central problem is the inability of current measuring technologies to more accurately predict large-scale weather patterns. Weathermen are good at predicting weather but poor at predicting climate. Weather occurs over a shorter period of time and can be reliability predicted within a 3-day timeframe. Climate stretches many months, years, or even centuries. Matching historical climate data with current weather data to make future weather and climate is a major challenge for scientists.

James has a good survey of both data sources and researchers working on using “big data” (read historical weather data) for both weather (short term) and climate (longer term) prediction.

Weather data by itself is just weather data.

What other data would you combine with it and on what basis to weaponize the data?

No one can control the weather but you can control your plans for particular weather events.

June 26, 2012

Hadoop Beyond MapReduce, Part 1: Introducing Kitten

Filed under: Hadoop,MapReduce — Patrick Durusau @ 7:01 pm

Hadoop Beyond MapReduce, Part 1: Introducing Kitten by Josh Wills

From the post:

This week, a team of researchers at Google will be presenting a paper describing a system they developed that can learn to identify objects, including the faces of humans and cats, from an extremely large corpus of unlabeled training data. It is a remarkable accomplishment, both in terms of the system’s performance (a 70% improvement over the prior state-of-the-art) and its scale: the system runs on over 16,000 CPU cores and was trained on 10 million 200×200 pixel images extracted from YouTube videos.

Doug Cutting has described Apache Hadoop as “the kernel of a distributed operating system.” Until recently, Hadoop has been an operating system that was optimized for running a certain class of applications: the ones that could be structured as a short sequence of MapReduce jobs. Although MapReduce is the workhorse programming framework for distributed data processing, there are many difficult and interesting problems– including combinatorial optimization problems, large-scale graph computations, and machine learning models that identify pictures of cats– that can benefit from a more flexible execution environment.

Hadoop 0.23 introduced a substantial re-design of the core resource scheduling and task tracking system that will allow developers to create entirely new classes of applications for Hadoop. Cloudera’s Ahmed Radwan has written an excellent overview of the architecture of the new resource scheduling system, known as YARN. Hadoop’s open-source foundation and its broad adoption by industry, academia, and government labs means that, for the first time in history, developers can assume that a common platform for distributed computing will be available at organizations all over the world, and that there will be a market for applications that take advantage of that common platform to solve problems at scales that have never been considered before.

I suppose it would not be fair to point out that a fertile human couple could duplicate this feat without 10 million images from YouTube. 😉

And while YARN is a remarkable achievement, in the United States it isn’t possible to get federal agencies to share data, much less time on computing platforms. We may be able to presume a common platform, but access, well, that may be a more difficult issue.

Text Mining & R

Filed under: R,Text Mining — Patrick Durusau @ 6:48 pm

As I dug deeper into Wordcloud of the Arizona et al. v. United States opinion, I ran across several resources on the tm package for text mining in R.

First, if you are in an R shell:


> library("tm")
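> # vignette() opens the package's overview document in your viewer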
> vignette("tm")

Produces an eight (8) page overview of the package.

Next stop should be An Introduction to Text Mining in R (R News volume 8/2, 2008, pages 19-22).

Demonstrations of stylometry using the Wizard of Oz book series and analysis of email archives either as RSS feeds or in mbox format.

If you are still curious, check out Text Mining Infrastructure in R, by Ingo Feinerer, Kurt Hornik and David Meyer. Journal of Statistical Software, March 2008, Volume 25, Issue 5.

Runs a little over fifty (50) pages.

The package is reported to be designed for extension and since this paper was published in 2008, I expect there are extensions not reflected in these resources.

Suggestions/pointers quite welcome!

Implementing Aggregation Functions in MongoDB

Filed under: Aggregation,MapReduce,MongoDB,NoSQL — Patrick Durusau @ 1:51 pm

Implementing Aggregation Functions in MongoDB by Arun Viswanathan and Shruthi Kumar.

From the post:

With the amount of data that organizations generate exploding from gigabytes to terabytes to petabytes, traditional databases are unable to scale up to manage such big data sets. Using these solutions, the cost of storing and processing data will significantly increase as the data grows. This is resulting in organizations looking for other economical solutions such as NoSQL databases that provide the required data storage and processing capabilities, scalability and cost effectiveness. NoSQL databases do not use SQL as the query language. There are different types of these databases such as document stores, key-value stores, graph database, object database, etc.

Typical use cases for NoSQL database includes archiving old logs, event logging, ecommerce application log, gaming data, social data, etc. due to its fast read-write capability. The stored data would then require to be processed to gain useful insights on customers and their usage of the applications.

The NoSQL database we use in this article is MongoDB which is an open source document oriented NoSQL database system written in C++. It provides a high performance document oriented storage as well as support for writing MapReduce programs to process data stored in MongoDB documents. It is easily scalable and supports auto partitioning. Map Reduce can be used for aggregation of data through batch processing. MongoDB stores data in BSON (Binary JSON) format, supports a dynamic schema and allows for dynamic queries. The Mongo Query Language is expressed as JSON and is different from the SQL queries used in an RDBMS. MongoDB provides an Aggregation Framework that includes utility functions such as count, distinct and group. However more advanced aggregation functions such as sum, average, max, min, variance and standard deviation need to be implemented using MapReduce.

This article describes the method of implementing common aggregation functions like sum, average, max, min, variance and standard deviation on a MongoDB document using its MapReduce functionality. Typical applications of aggregations include business reporting of sales data such as calculation of total sales by grouping data across geographical locations, financial reporting, etc.

Not terribly advanced but enough to get you started with creating aggregation functions.

Includes “testing” of the aggregation functions that are written in the article.

If Python is more your cup of tea, see: Aggregation in MongoDB (part 1) and Aggregation in MongoDB (part 2).
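For a Python flavor of the same technique, here is a minimal sketch that computes a sum, count and average with MongoDB's MapReduce through pymongo (releases before 4.0 expose map_reduce directly). The collection and field names are made up, and this mirrors the general approach rather than the article's exact functions.

from pymongo import MongoClient
from bson.code import Code

client = MongoClient("mongodb://localhost:27017/")
db = client["test"]

# Map each document to its region with a running total and count.
mapper = Code("""
    function () {
        emit(this.region, { total: this.amount, count: 1 });
    }
""")

# Fold the emitted values for each region into one total and count.
reducer = Code("""
    function (key, values) {
        var out = { total: 0, count: 0 };
        values.forEach(function (v) { out.total += v.total; out.count += v.count; });
        return out;
    }
""")

# Derive the average once the totals are complete.
finalizer = Code("""
    function (key, value) {
        value.average = value.total / value.count;
        return value;
    }
""")

summary = db.sales.map_reduce(mapper, reducer, "sales_summary", finalize=finalizer)
for doc in summary.find():
    print(doc["_id"], doc["value"])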

Journal of Statistical Software

Filed under: Mathematica,Mathematics,R,Statistics — Patrick Durusau @ 12:53 pm

Journal of Statistical Software

From the homepage:

Established in 1996, the Journal of Statistical Software publishes articles, book reviews, code snippets, and software reviews on the subject of statistical software and algorithms. The contents are freely available on-line. For both articles and code snippets the source code is published along with the paper.

Statistical software is the key link between statistical methods and their application in practice. Software that makes this link is the province of the journal, and may be realized as, for instance, tools for large scale computing, database technology, desktop computing, distributed systems, the World Wide Web, reproducible research, archiving and documentation, and embedded systems.

We attempt to present research that demonstrates the joint evolution of computational and statistical methods and techniques. Implementations can use languages such as C, C++, S, Fortran, Java, PHP, Python and Ruby or environments such as Mathematica, MATLAB, R, S-PLUS, SAS, Stata, and XLISP-STAT.

There are currently 518 articles, 34 code snippets, 104 book reviews, 6 software reviews, and 13 special volumes in our archives. These can be browsed or searched. You can also subscribe for notification of new articles.

I was running down resources used in Wordcloud of the Arizona et al. v. United States opinion when I encountered this wonderful site.

I have only skimmed the surface, looking for an article or two in particular, so I can’t begin to describe the breadth of material you will find here.

I am sure I will be returning to this site time and time again. If you are interested in statistical manipulation of data, I suggest you do the same.

June 25, 2012

Wordcloud of the Arizona et al. v. United States opinion

Filed under: Humor — Patrick Durusau @ 7:54 pm

Wordcloud of the Arizona et al. v. United States opinion by Michael J Bommarito II.

From the post:

Here’s one purely for fun – a wordcloud built from the Supreme Court’s opinion on Arizona et al. v United States. Word clouds, though certainly not the most scientific of visualization techniques, are often engaging and “fun” ways to lead into discussion on NLP or topic modeling.

Includes the code to automate the process. One assumes you can amuse yourself with legal decisions, speeches, etc., as they emerge during an election year in the United States.
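If you would rather roll your own than reuse Michael's code, the recipe is short in Python as well. This sketch uses the third-party wordcloud and matplotlib packages and assumes you have already saved the opinion as a plain-text file; fetching and cleaning the PDF is left out.

import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

# Assumes the opinion text has been saved locally as opinion.txt.
with open("opinion.txt", encoding="utf-8") as f:
    text = f.read()

cloud = WordCloud(
    width=800,
    height=400,
    stopwords=STOPWORDS,         # drop common English words
    background_color="white",
).generate(text)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.savefig("opinion_wordcloud.png", dpi=150, bbox_inches="tight")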

Suggestion: create a word cloud of questions by reporters and a separate word cloud of the responses by candidates.

The number of terms in common would be vanishingly low I suspect.

In the red corner – PubMed and in the blue corner – Google Scholar

Filed under: Bioinformatics,Biomedical,PubMed,Search Engines,Searching — Patrick Durusau @ 7:40 pm

Medical literature searches: a comparison of PubMed and Google Scholar by Eva Nourbakhsh, Rebecca Nugent, Helen Wang, Cihan Cevik and Kenneth Nugent. (Health Information & Libraries Journal, Article first published online: 19 JUN 2012)

From the abstract:

Background

Medical literature searches provide critical information for clinicians. However, the best strategy for identifying relevant high-quality literature is unknown.

Objectives

We compared search results using PubMed and Google Scholar on four clinical questions and analysed these results with respect to article relevance and quality.

Methods

Abstracts from the first 20 citations for each search were classified into three relevance categories. We used the weighted kappa statistic to analyse reviewer agreement and nonparametric rank tests to compare the number of citations for each article and the corresponding journals’ impact factors.

Results

Reviewers ranked 67.6% of PubMed articles and 80% of Google Scholar articles as at least possibly relevant (P = 0.116) with high agreement (all kappa P-values < 0.01). Google Scholar articles had a higher median number of citations (34 vs. 1.5, P < 0.0001) and came from higher impact factor journals (5.17 vs. 3.55, P = 0.036).

Conclusions

PubMed searches and Google Scholar searches often identify different articles. In this study, Google Scholar articles were more likely to be classified as relevant, had higher numbers of citations and were published in higher impact factor journals. The identification of frequently cited articles using Google Scholar for searches probably has value for initial literature searches.
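If you want to run the same agreement statistic on relevance ratings of your own, weighted kappa is a one-liner in scikit-learn. The ratings below are invented purely to show the call; they are not the study's data.

from sklearn.metrics import cohen_kappa_score

# Two reviewers' relevance ratings for ten abstracts:
# 0 = not relevant, 1 = possibly relevant, 2 = relevant. Toy data only.
reviewer_a = [2, 1, 0, 2, 2, 1, 0, 1, 2, 0]
reviewer_b = [2, 1, 1, 2, 1, 1, 0, 1, 2, 0]

kappa = cohen_kappa_score(reviewer_a, reviewer_b, weights="quadratic")
print(f"weighted kappa: {kappa:.3f}")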

I have several concerns that may or may not be allayed by further investigation:

  • Four queries seem like an inadequate basis for evaluation. Not that I expect to see one “winner” and one “loser,” but I am more concerned with what led to the differences in results.
  • It is unclear why a citation from a journal with a higher impact factor is superior to one from a journal with a lesser impact factor. I assume the point of the query is to obtain a useful result (in the sense of medical treatment, not tenure).
  • Neither system enabled users to build upon the query experience of prior users with a similar query.
  • Neither system enabled users to avoid re-reading the same texts as others had read before them.

Thoughts?

Improving links between literature and biological data with text mining: a case study with GEO, PDB and MEDLINE

Filed under: Bioinformatics,Biomedical,Text Mining — Patrick Durusau @ 7:15 pm

Improving links between literature and biological data with text mining: a case study with GEO, PDB and MEDLINE by Neveol, A., Wilbur, W. J., Lu, Z.

Abstract:

High-throughput experiments and bioinformatics techniques are creating an exploding volume of data that are becoming overwhelming to keep track of for biologists and researchers who need to access, analyze and process existing data. Much of the available data are being deposited in specialized databases, such as the Gene Expression Omnibus (GEO) for microarrays or the Protein Data Bank (PDB) for protein structures and coordinates. Data sets are also being described by their authors in publications archived in literature databases such as MEDLINE and PubMed Central. Currently, the curation of links between biological databases and the literature mainly relies on manual labour, which makes it a time-consuming and daunting task. Herein, we analysed the current state of link curation between GEO, PDB and MEDLINE. We found that the link curation is heterogeneous depending on the sources and databases involved, and that overlap between sources is low, <50% for PDB and GEO. Furthermore, we showed that text-mining tools can automatically provide valuable evidence to help curators broaden the scope of articles and database entries that they review. As a result, we made recommendations to improve the coverage of curated links, as well as the consistency of information available from different databases while maintaining high-quality curation.

Database URLs: MEDLINE http://www.ncbi.nlm.nih.gov/PubMed, GEO http://www.ncbi.nlm.nih.gov/geo/, PDB http://www.rcsb.org/pdb/.

A good illustration of the use of automated means to augment the capacity of curators of data links.

Or topic map authors performing the same task.

Scholarly network similarities…

Filed under: Networks,Similarity — Patrick Durusau @ 4:40 pm

Scholarly network similarities: How bibliographic coupling networks, citation networks, cocitation networks, topical networks, coauthorship networks, and coword networks relate to each other by Erjia Yan and Ying Ding. (Journal of the American Society for Information Science and Technology Volume 63, Issue 7, pages 1313–1326, July 2012)

Abstract:

This study explores the similarity among six types of scholarly networks aggregated at the institution level, including bibliographic coupling networks, citation networks, cocitation networks, topical networks, coauthorship networks, and coword networks. Cosine distance is chosen to measure the similarities among the six networks. The authors found that topical networks and coauthorship networks have the lowest similarity; cocitation networks and citation networks have high similarity; bibliographic coupling networks and cocitation networks have high similarity; and coword networks and topical networks have high similarity. In addition, through multidimensional scaling, two dimensions can be identified among the six networks: Dimension 1 can be interpreted as citation-based versus noncitation-based, and Dimension 2 can be interpreted as social versus cognitive. The authors recommend the use of hybrid or heterogeneous networks to study research interaction and scholarly communications.

Interesting that I should come across this article after posting about data sets. See http://info.slis.indiana.edu/~eyan/papers/citation/ for the post-processing data and interactive visualizations that are reproduced as static views in the article.

At page 1323 the authors say:

In addition to social networks versus information networks, another distinction of real connection-based networks versus artificial connection-based networks can be made. Coauthorship networks and citation networks are constructed based on real connections, whereas cocitation, bibliographic coupling, topical, and coword networks are constructed based on artificial connections,5 usually in the form of similarity measurements.

I am not sure about the “real” versus “artificial” connection that comes so easily to hand, in part because authors, in my experience, tend to use terminology similar to that of other scholars with whom they agree. So the connection between the work of scholars isn’t “artificial,” although it isn’t accounted for in this study.

There is much to admire and even imitate in this article but the interaction between scholars is more complex than its representation by the networks here.
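If you want to apply the paper's cosine measure to networks of your own, the computation is only a few lines once each network is expressed as a matrix over the same set of institutions. The matrices below are toy examples, not the authors' data.

import numpy as np

def network_cosine(a, b):
    """Cosine similarity between two networks given as weighted adjacency matrices."""
    va, vb = a.ravel().astype(float), b.ravel().astype(float)
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

# Two tiny made-up networks over the same three institutions.
citation = np.array([[0, 3, 1],
                     [2, 0, 0],
                     [1, 4, 0]])
cocitation = np.array([[0, 2, 1],
                       [2, 0, 1],
                       [1, 1, 0]])

print(f"similarity: {network_cosine(citation, cocitation):.3f}")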

Data citation initiatives and issues

Filed under: Data,Data Citation,Data Management — Patrick Durusau @ 3:57 pm

Data citation initiatives and issues by Matthew S. Mayernik (Bulletin of the American Society for Information Science and Technology Volume 38, Issue 5, pages 23–28, June/July 2012)

Abstract:

The importance of formally citing scientific research data has been recognized for decades but is only recently gaining momentum. Several federal government agencies urge data citation by researchers, DataCite and its digital object identifier registration services promote the practice of citing data, international citation guidelines are in development and a panel at the 2012 ASIS&T Research Data Access and Preservation Summit focused on data citation. Despite strong reasons to support data citation, the lack of individual user incentives and a pervasive cultural inertia in research communities slow progress toward broad acceptance. But the growing demand for data transparency and linked data along with pressure from a variety of stakeholders combine to fuel effective data citation. Efforts promoting data citation must come from recognized institutions, appreciate the special characteristics of data sets and initially emphasize simplicity and manageability.

This is an important and eye-opening article on the state of data citations and issues related to it.

I found it surprising in part because citation of data in radio and optical astronomy has long been commonplace. In part because for decades now, the astronomical community has placed a high value on public archiving of research data as it is acquired, both in raw and processed formats.

As pointed out in this paper, without public archiving, there can be no effective form of data citation. Sad to say, the majority of data never makes it to public archives.

Given the reliance on private and public sources of funding for research, public archiving and access should be guaranteed as a condition of funding. Researchers would be free to continue to not make their data publicly accessible, should they choose to fund their own work.

If that sounds harsh, consider the well deserved amazement at the antics over access to the Dead Sea Scrolls.

If the only way for your opinion/analysis to prevail is to deny others access to the underlying data, that is all the commentary the community needs on your work.

Google search parameters in 2012

Filed under: Search Engines,Search Interface,Searching — Patrick Durusau @ 3:02 pm

Google search parameters in 2012

From the post:

Knowing the parameters Google uses in its search is not only important for SEO geeks. It allow you to use shortcuts and play with the Google filters. The parameters also reveal more juicy things: Is it safe to share your Google search URLs or screenshots of your Google results? This post argues that it is important to be aware of the complicated nature of the Google URL. As we will see later posting your own Google URL can reveal personal information about you that you might not feel too comfortable sharing. So read on to learn more about the Google search parameters used in 2012.

Why do I say “in 2012″? Well, the Google URL changed over time and more parameters were added to keep pace with the increasing complexity of the search product, the Google interface and the integration of verticals. Before looking at the parameter table below, though, I encourage you to quickly perform the following 2 things:

  1. Go directly to Google and search for your name. Look at the URL.
  2. Go directly to DuckDuckGo and perform the same search. Look at the URL.

This little exercise serves well to demonstrate just how simple and how complicated URLs used by search engines can look like. These two cases are at the opposing ends: While DuckDuckGo has only one search parameter, your query, and is therefore quite readable, Google uses a cryptic construct that only IT professionals can try to decipher. What I find interesting is that on my Smartphone, though, the Google search URL is much simpler than on the desktop.

This blog post is primarily aimed at Google’s web search. I will not look at their other verticals such as scholar or images. But because image search is so useful, I encourage you to look at the image section of the Unofficial Google Advanced Search guide

The tables of search parameters are a nice resource.
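If you want to see exactly which parameters your own searches carry before you share a URL or a screenshot, the Python standard library will unpack them. The example URL is contrived, but the parameters in it (q, hl, num, start, safe) are common Google search parameters.

from urllib.parse import urlparse, parse_qs

# A contrived search URL; paste your own to see what you would be sharing.
url = "https://www.google.com/search?q=topic+maps&hl=en&num=20&start=10&safe=off"

for name, values in sorted(parse_qs(urlparse(url).query).items()):
    print(f"{name} = {values[0]}")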

Suggestions of similar information for other search engines?

…a phylogeny-aware graph algorithm

Filed under: Algorithms,Alignment,Genome,Graphs,Sequence Detection — Patrick Durusau @ 2:43 pm

Accurate extension of multiple sequence alignments using a phylogeny-aware graph algorithm by Loytynoja, A., Vilella, A. J., Goldman, N.

From the post:

Motivation: Accurate alignment of large numbers of sequences is demanding and the computational burden is further increased by downstream analyses depending on these alignments. With the abundance of sequence data, an integrative approach of adding new sequences to existing alignments without their full re-computation and maintaining the relative matching of existing sequences is an attractive option. Another current challenge is the extension of reference alignments with fragmented sequences, as those coming from next-generation metagenomics, that contain relatively little information. Widely used methods for alignment extension are based on profile representation of reference sequences. These do not incorporate and use phylogenetic information and are affected by the composition of the reference alignment and the phylogenetic positions of query sequences.

Results: We have developed a method for phylogeny-aware alignment of partial-order sequence graphs and apply it here to the extension of alignments with new data. Our new method, called PAGAN, infers ancestral sequences for the reference alignment and adds new sequences in their phylogenetic context, either to predefined positions or by finding the best placement for sequences of unknown origin. Unlike profile-based alternatives, PAGAN considers the phylogenetic relatedness of the sequences and is not affected by inclusion of more diverged sequences in the reference set. Our analyses show that PAGAN outperforms alternative methods for alignment extension and provides superior accuracy for both DNA and protein data, the improvement being especially large for fragmented sequences. Moreover, PAGAN-generated alignments of noisy next-generation sequencing (NGS) sequences are accurate enough for the use of RNA-seq data in evolutionary analyses.

Availability: PAGAN is written in C++, licensed under the GPL and its source code is available at http://code.google.com/p/pagan-msa.

Contact: ari.loytynoja@helsinki.fi

Does your graph software support “…phylogeny-aware alignment of partial-order sequence graphs…?”

Show Me The Money!

Filed under: Conferences,XBRL,XML,XPath,XQuery — Patrick Durusau @ 2:28 pm

I need to talk to Tommie Usdin about marketing the Balisage conference.

The final program came out today and here is what Tommie had to say:

When the regular (peer-reviewed) part of the Balisage 2012 program was scheduled, a few slots were reserved for presentation of “Late breaking” material. These presentations have now been selected and added to the program.

Topics added include:

  • making robust and multi-platform ebooks
  • creating representative documents from large document collections
  • validating RESTful services using XProc, XSLT, and XSD
  • XML for design-based (e.g. magazine) publishing
  • provenance in XSLT transformation (tracking what XSLT does to documents)
  • literate programming
  • managing the many XML-related standards and specifications
  • leveraging XML for web applications

The program already included talks about adding RDF to TEI documents, compression of XML documents, exploring large XML collections, Schematron, relation of XML to JSON, overlap, higher-order functions in XSLT, the balance between XML and non-XML notations, and many other topics. Now it is a real must for anyone who thinks deeply about markup.

Balisage is the XML Geek-fest; the annual gathering of people who design markup and markup-based applications; who develop XML specifications, standards, and tools; the people who read and write, books about publishing technologies in general and XML in particular; and super-users of XML and related technologies. You can read about the Balisage 2011 conference at http://www.balisage.net.

Yawn. Are we there yet? 😉

Why you should care about XML and Balisage:

  • US government and others are publishing laws and regulations and soon to be legislative material in XML
  • Securities are increasingly using XML for required government reports
  • Texts and online data sets are being made available in XML
  • All the major document formats are based in XML

A $billion here, a $billion there and pretty soon you are talking about real business opportunity.

Your un-Balisaged XML developers have $1,000 bills blowing overhead.

Be smart, make your XML developers imaginative and productive.

Send your XML developers to Balisage.

(http://www.balisage.net/registration.html)

June 24, 2012

Stardog 1.0

Filed under: OWL,RDF,Semantic Web — Patrick Durusau @ 8:19 pm

Stardog 1.0 by Kendall Clark.

From the post:

Today I’m happy to announce the release of Stardog 1.0, the fastest, smartest, and easiest to use RDF database on the planet. Stardog fills a hole in the Semantic Technology (and NoSQL database) market for an RDF database that is fast, zero config, lightweight, and feature-rich.

Speed Kills

RDF and OWL are excellent technologies for building data integration and analysis apps. Those apps invariably require complex query processing, i.e., queries where there are lots of joins, complex logical conditions to evaluate, etc. Stardog is targeted at query performance for complex SPARQL queries. We publish performance data so you can see how we’re doing.

Braindead Simple Deployment

Winners ship. Period.

We care very much about simple deployments. Stardog works out-of-the-box with minimal (none, typically) configuration. You shouldn’t have to fight an RDF database for days to install or tune it for great performance. Because Stardog is pure Java, it will run anywhere. It just works and it’s damn fast. You shouldn’t need to buy and configure a cluster of machines to get blazing fast performance from an RDF database. And now you don’t have to.

One More Thing
OWL Reasoning

Finally, Stardog has the deepest, most comprehensive, and best OWL reasoning support of any commerical RDF database available.

Stardog 1.0 supports RDFS, OWL 2 QL, EL, and RL, as well as OWL 2 DL schema-reasoning. It’s also the only RDF database to support closed-world integrity constraint validation and automatic explanations of integrity constraint violations.

If you care about data quality, Stardog 1.0 is worth a hard look.

OK, so I have signed up for an evaluation version, key, etc. Email just arrived.

Downloaded software and license key.

With all the open data lying around, it should not be hard to find test data.
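The quickest smoke test from Python is a SPARQL query over Stardog's HTTP endpoint. A sketch using the third-party SPARQLWrapper package follows; the host, port, database name, path and credentials are placeholders, so adjust them to whatever you chose at install time.

from SPARQLWrapper import SPARQLWrapper, JSON

# Placeholder endpoint: adjust host, port, database name and credentials.
sparql = SPARQLWrapper("http://localhost:5820/myDb/query")
sparql.setCredentials("admin", "admin")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    SELECT ?s ?p ?o
    WHERE { ?s ?p ?o }
    LIMIT 10
""")

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["s"]["value"], row["p"]["value"], row["o"]["value"])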

More to follow. Comments welcome.

The Turing Digital Archive

Filed under: Computer Science,Semantics,Turing Machines — Patrick Durusau @ 8:18 pm

The Turing Digital Archive

From the webpage:

Alan Turing (1912-54) is best-known for helping decipher the code created by German Enigma machines in the Second World War, and for being one of the founders of computer science and artificial intelligence.

This archive contains many of Turing’s letters, talks, photographs and unpublished papers, as well as memoirs and obituaries written about him. It contains images of the original documents that are held in the Turing collection at King’s College, Cambridge. For more information about this digital archive and tips on using the site see About the archive.

I ran across this archive when I followed a reference to the original paper on Turing machines, http://www.turingarchive.org/viewer/?id=466&title=01a.

I will be returning to this original description in one or more posts on Turing machines and semantics.

Closing In On A Million Open Government Data Sets

Filed under: Dataset,Geographic Data,Government,Government Data,Open Data — Patrick Durusau @ 7:57 pm

Closing In On A Million Open Government Data Sets by Jennifer Zaino.

From the post:

A million data sets. That’s the number of government data sets out there on the web that we have closed in on.

“The question is, when you have that many, how do you search for them, find them, coordinate activity between governments, bring in NGOs,” says James A. Hendler, Tetherless World Senior Constellation Professor, Department of Computer Science and Cognitive Science Department at Rensselaer Polytechnic Institute, and a principal investigator of its Linking Open Government Data project, as well as Internet web expert for data.gov. He also is connected with many other governments’ open data projects. “Semantic web tools organize and link the metadata about these things, making them searchable, explorable and extensible.”

To be more specific, Hendler at SemTech a couple of weeks ago said there are 851,000 open government data sets across 153 catalogues from 30-something countries, with the three biggest representatives, in terms of numbers, at the moment being the U.S., the U.K, and France. Last week, the one million threshold was crossed.

About 410,000 of these data sets are from the U.S. (federal, state, city, county, tribal included), including quite a large number of geo-data sets. The U.S. government’s goal is to put “lots and lots and lots of stuff out there” and let people figure out what they want to do with it, he notes.

My question about data that “…[is] searchable, explorable and extensible” is whether anyone wants to search, explore or extend it.

Simply piling up data to say you have a large pile of data doesn’t sound very useful.

I would rather have a smaller pile of data that included contract/testing transparency on anti-terrorism IT projects, for example. If the systems aren’t working, then disclosing them isn’t going to make them work any less well.

Not that anyone need fear transparency or failure to perform. The TSA has failed to perform for more than a decade now, failed to catch a single terrorist and it remains funded. Even when it starts groping children, passengers are so frightened that even that outrage passes without serious opposition.

Still, it would be easier to get people excited about mining government data if the data weren’t so random or marginal.

Pulp Fiction presented in chronological order

Filed under: Flowchart,Humor — Patrick Durusau @ 4:33 pm

Pulp Fiction presented in chronological order

Nathan Yau reports:

Quentin Tarantino’s Pulp Fiction twists and turns through plot lines and time. Designer Noah Smith untangled the story and put it in a linear flowchart.

Just the sort of thing you will need if you want to experiment with topic mapping “Pulp Fiction.”

And if not that, certainly an amusing way to begin the week.

Makes me wonder what untangling Downton Abbey would take?

Predictive Analytics: Evaluate Model Performance

Filed under: Predictive Analytics,Statistics — Patrick Durusau @ 4:19 pm

Predictive Analytics: Evaluate Model Performance by Ricky Ho.

Ricky finishes his multi-part series on models for machine learning by answering the one question left hanging:

OK, so which model should I use?

In previous posts, we have discussed various machine learning techniques including linear regression with regularization, SVM, Neural network, Nearest neighbor, Naive Bayes, Decision Tree and Ensemble models. How do we pick which model is the best ? or even whether the model we pick is better than a random guess ? In this posts, we will cover how we evaluate the performance of the model and what can we do next to improve the performance.

Best guess with no model

First of all, we need to understand the goal of our evaluation. Are we trying to pick the best model ? Are we trying to quantify the improvement of each model ? Regardless of our goal, I found it is always useful to think about what the baseline should be. Usually the baseline is what is your best guess if you don’t have a model.

For classification problem, one approach is to do a random guess (with uniform probability) but a better approach is to guess the output class that has the largest proportion in the training samples. For regression problem, the best guess will be the mean of output of training samples.

Ricky walks you through the steps and code to make an evaluation of each model.
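If you work in Python, scikit-learn's dummy estimators capture Ricky's no-model baselines exactly: guess the majority class for classification and the training mean for regression. The data below is randomly generated just to show the calls; it is not from Ricky's series.

import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y_class = rng.choice([0, 1], size=100, p=[0.7, 0.3])
y_reg = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=100)

# Majority-class guess for classification, training mean for regression.
majority = DummyClassifier(strategy="most_frequent").fit(X, y_class)
mean_guess = DummyRegressor(strategy="mean").fit(X, y_reg)

print("baseline accuracy:", majority.score(X, y_class))  # about the majority proportion
print("baseline R^2:", mean_guess.score(X, y_reg))       # zero by construction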

It is always better to have evidence that your choices were better than a coin flip.

Although, I am mindful of the wealth-adviser story in “Thinking, Fast and Slow” by Daniel Kahneman, where he was given eight years of investment outcome data for 28 wealth advisers. The results indicated there was no correlation between “skill” and the outcomes. Luck, and not skill, was being rewarded with bonuses.

The results were ignored by both management and advisers as inconsistent with their “…personal experiences from experience.” (pp. 215-216)

Do you think the same can be said of search results? Just curious.

What’s Your Default Search Engine?

Filed under: Search Engines,Search Interface,Searching — Patrick Durusau @ 3:55 pm

Bing’s Evolving Local Search by Matthew Hurst.

From the post:

Recently, there have been a number of announcements regarding the redesign of Bing’s main search experience. The key difference is the use of three parallel zones in the SERP. Along with the traditional page results area, there are two new results columns: the task pane, which highlights factual data and the social pane which currently highlights social information from individuals (I distinguish social from ‘people’ as entities – for example a restaurant – can have a social presence even though they are only vaguely regarded as people).

I don’t get out much but I can appreciate the utility of the aggregate results for local views.

Matthew writes:

  1. When we provide flat structured data (as Bing did in the past), while we continued to strive for high quality data, there is no burning light focused on any aspect of the data. However, when we require to join the data to the web (local results are ‘hanging off’ the associated web sites), the quality of the URL associated with the entity record becomes a critical issue.
  2. The relationship between the web graph and the entity graph is subtle and complex. Our legacy system made do with the notion of a URL associated with an entity. As we dug deeper into the problem we discovered a very rich set of relationships between entities and web sites. Some entities are members of chains, and the relationships between their chain home page and the entity is quite different from the relationship between a singleton business and its home page. This also meant that we wanted to treat the results differently. See below for the results for {starbucks in new york}
  3. The structure of entities in the real world is subtle and complex. Chains, franchises, containment (shop in mall, restaurant in casino, hotel in airport), proximity – all these qualities of how the world works scream out for rich modeling if the user is to be best supported in navigating her surroundings.

Truth be told, the structure of entities in the “real world” and their representatives (somewhere other than the “real” world), not to mention their relationships to each other, are all subtle and complex.

That is part of what makes searching, discovery, mapping such exciting areas for exploration. There is always something new just around the next corner.

Report of Second Phase of Seventh Circuit eDiscovery Pilot Program

Filed under: Law,Legal Informatics — Patrick Durusau @ 3:42 pm

Report of Second Phase of Seventh Circuit eDiscovery Pilot Program Published

From Legal Informatics:

The Seventh Circuit Electronic Discovery Pilot Program has published its Final Report on Phase Two, May 2010 to May 2012 (very large PDF file).

A principal purpose of the program is to determine the effects of the use of Principles Relating to the Discovery of Electronically Stored Information in litigation in the Circuit.

The report describes the results of surveys of lawyers who participated in efiling in the Seventh Circuit, and of judges and lawyers who participated in trials in which the Circuit’s Principles Relating to the Discovery of Electronically Stored Information were applied.

True enough, the report is “a very large PDF file.” At 969 pages and 111.5 MB. Don’t try downloading while you are on the road, unless you are in South Korea or Japan.

I don’t have the time today, but the report isn’t substantively 969 pages long. Pages of names and addresses, committee minutes, presentations, filler of various kinds. If you find it in a format other than PDF, I might be interested in generating a shorter version that might be of more interest.

The bottom line was that cooperation in discovery, as it relates to electronically stored information, reduces costs and yet maintains standards for representation.

Topic maps can play an important role not just in eDiscovery but in relating information together, whatever its original form.

True enough, there are services that perform those functions now, but have you ever taken one of their work products and merged it with another?

By habit or chance, the terms used may be close enough to provide a useful result, but how do you verify the results?

Rise above the Cloud hype with OpenShift

Filed under: Cloud Computing,Red Hat — Patrick Durusau @ 1:29 pm

Rise above the Cloud hype with OpenShift by Eric D. Schabell.

From the post:

Are you tired of requesting a new development machine for your application? Are you sick of having to setup a new test environment for your application? Do you just want to focus on developing your application in peace without ‘dorking with the stack’ all of the time? We hear you. We have been there too. Have no fear, OpenShift is here!

In this article will walk you through the simple steps it takes to setup not one, not two, not three, but up to five new machines in the Cloud with OpenShift. You will have your applications deployed for development, testing or to present them to the world at large in minutes. No more messing around.

We start with an overview of what OpenShift is, where it comes from and how you can get the client tooling setup on your workstation. You will then be taken on a tour of the client tooling as it applies to the entry level of OpenShift, called Express. In minutes you will be off and back to focusing on your application development, deploying to test it in OpenShift Express. When finished you will just discard your test machine and move on. When you have mastered this, it will be time to ramp up into the next level with OpenShift Flex. This opens up your options a bit so you can do more with complex applications and deployments that might need a bit more fire power. After this you will be fully capable of ascending into the OpenShift Cloud when you chose, where you need it and at a moments notice. This is how development is supposed to be, development without stack distractions.

Specific to the Red Hat Cloud but that doesn’t trouble me if it doesn’t trouble you.

What is important is that like many cloud providers, the goal is to make software development in the cloud as free from “extra” concerns as possible.

Think of users who rely upon network-based applications for word processing, spreadsheets, etc. Fewer of them would do so if every use of the application required steps that expose the network-based nature of the application. Users just want the application to work. (full stop)

A bit more of the curtain can be drawn back for developers, but even there the goal isn’t to master the intricacies of cloud computing but to produce robust applications that happen to run on the cloud.

This is one small step towards a computing fabric where developers write and deploy software. (full stop) The details of where it is executed and where data is actually stored are known only by computing fabric specialists. The application serves its users, produces the expected answers, delivers specified performance; what more do you need to know?

I would like to see topic maps playing a role in developing the transparency for the interconnected systems that grow into that fabric.

(I first saw this at DZone’s replication of the Java Code Geeks reposting at: http://www.dzone.com/links/r/rise_above_the_cloud_hype_with_openshift.html)
