Archive for September, 2013

Email Indexing Using Cloudera Search [Stepping Beyond “Hello World”]

Thursday, September 26th, 2013

Email Indexing Using Cloudera Search by Jeff Shmain

From the post:

Why would any company be interested in searching through its vast trove of email? A better question is: Why wouldn’t everybody be interested?

Email has become the most widespread method of communication we have, so there is much value to be extracted by making all emails searchable and readily available for further analysis. Some common use cases that involve email analysis are fraud detection, customer sentiment and churn, lawsuit prevention, and that’s just the tip of the iceberg. Each and every company can extract tremendous value based on its own business needs.

A little over a year ago we described how to archive and index emails using HDFS and Apache Solr. However, at that time, searching and analyzing emails were still relatively cumbersome and technically challenging tasks. We have come a long way in document indexing automation since then — especially with the recent introduction of Cloudera Search, it is now easier than ever to extract value from the corpus of available information.

In this post, you’ll learn how to set up Apache Flume for near-real-time indexing and MapReduce for batch indexing of email documents. Note that although this post focuses on email data, there is no reason why the same concepts could not be applied to instant messages, voice transcripts, or any other data (both structured and unstructured).

If you want a beyond “Hello World” introduction to: Flume, Solr, Cloudera Morphlines, HDFS, Hue’s Search application, and Cloudera Search, this is the post for you.

An added advantage: you can apply the basic principles in this post as you expand your knowledge of the Hadoop ecosystem.
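Before any of those tools can index an email, the raw message has to become a record of named fields. Here is a minimal Python sketch of that one step, using only the standard library `email` package. The field names (`from`, `to`, `subject`, `date`, `body`) and the sample message are my assumptions for illustration; they are not the schema or the Morphlines configuration from Jeff's post.

```python
from email import message_from_string
from email.utils import getaddresses, parseaddr

# A made-up message, used only to exercise the function below.
RAW = """\
From: Alice <alice@example.com>
To: Bob <bob@example.com>, carol@example.com
Subject: Quarterly numbers
Date: Thu, 26 Sep 2013 09:15:00 -0400

The numbers are attached.
"""

def email_to_record(raw):
    """Turn one raw, single-part email into a flat dict of fields."""
    msg = message_from_string(raw)
    # getaddresses splits a comma-separated header into (name, address) pairs.
    recipients = [addr for _, addr in getaddresses(msg.get_all("To", []))]
    return {
        "from": parseaddr(msg.get("From", ""))[1],
        "to": recipients,
        "subject": msg.get("Subject", ""),
        "date": msg.get("Date", ""),
        "body": msg.get_payload().strip(),
    }

record = email_to_record(RAW)
print(record)
```

A record like this is what you would hand to whatever indexer you use. Multipart messages and attachments need more care than this sketch gives them.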

Big Data Boot Camp Day 1, Simons Institute, Berkeley

Thursday, September 26th, 2013

Big Data Boot Camp Day 1, Simons Institute, Berkeley by Igor Carron.

Igor has posted links to videos and supplemental materials for:

  • Big Data: The Computation/Statistics Interface by Michael Jordan, UC Berkeley.
  • Algorithmic High-Dimensional Geometry I by Alex Andoni, Microsoft Research.
  • User-Friendly Tools for Random Matrices I, II, III by Joel Tropp, California Institute of Technology.

BTW, the Simons Institute for the Theory of Computing has a channel on YouTube.

As of September 26, 2013, some one hundred and five videos. Lots of top quality, cutting edge material.

Now, more than ever, ignorance is a self-inflicted wound.

GraphLab Internship Program…

Thursday, September 26th, 2013

GraphLab Internship Program (Machine Learning Summer Internship) by Danny Bickson.

From the post:

We are glad to announce our latest internship program for the summer of 2014. We have around 10 open positions, either at GraphLab/UW or affiliated companies we work with.

Would you like to have a chance to deploy cutting edge machine learning algorithms in practice? Do you want to get your hands on the largest and most interesting datasets out there? Do you have valuable applied experience working with machine learning in the cloud? If so, you should consider our internship program.

Candidates must be US-based PhD or master students in one of the following areas: machine learning, statistics, AI, systems, high performance computing, distributed algorithms, or math. We are especially interested in those who have used GraphLab/GraphChi for a research project or have contributed to the GraphLab community.

All interested applicants should send their resume to bickson@graphlab.com. If you are a company interested in having a GraphLab intern, please feel free to get in touch.
Here is a (very preliminary) list of open positions:

See the positions list at Danny’s post. And start your application sooner rather than later.

PS: They also do graphs at GraphLab. 😉

Time-varying social networks in a graph database…

Thursday, September 26th, 2013

Time-varying social networks in a graph database: a Neo4j use case by Ciro Cattuto, Marco Quaggiotto, André Panisson, and Alex Averbuch.

Abstract:

Representing and efficiently querying time-varying social network data is a central challenge that needs to be addressed in order to support a variety of emerging applications that leverage high-resolution records of human activities and interactions from mobile devices and wearable sensors. In order to support the needs of specific applications, as well as general tasks related to data curation, cleaning, linking, post-processing, and data analysis, data models and data stores are needed that afford efficient and scalable querying of the data. In particular, it is important to design solutions that allow rich queries that simultaneously involve the topology of the social network, temporal information on the presence and interactions of individual nodes, and node metadata. Here we introduce a data model for time-varying social network data that can be represented as a property graph in the Neo4j graph database. We use time-varying social network data collected by using wearable sensors and study the performance of real-world queries, pointing to strengths, weaknesses and challenges of the proposed approach.

A good start on modeling networks that vary based on time.

If the overhead sounds daunting, remember the graph data used here measured the proximity of actors every 20 seconds for three days.

Imagine if you added other social connections between those actors: attending the same schools or conferences, co-authoring papers, etc.

We are slowly losing our reliance on simplification of data and models to make them computationally tractable.
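To get a feel for the data volume and the kind of query involved, here is a small Python sketch. It is not the Neo4j data model from the paper; it simply groups made-up proximity records into 20-second frames and asks two questions that mix time and topology. The actor names and records are hypothetical.

```python
from collections import defaultdict

FRAME_SECONDS = 20  # sampling interval mentioned above

# Three days of 20-second frames.
frames_in_three_days = 3 * 24 * 60 * 60 // FRAME_SECONDS  # 12960

# (frame start in seconds, actor, actor): hypothetical sample records.
contacts = [
    (0, "ann", "bob"),
    (20, "ann", "bob"),
    (20, "bob", "cy"),
    (60, "ann", "cy"),
]

def build_frames(records):
    """Group undirected contacts by the frame in which they were observed."""
    frames = defaultdict(set)
    for start, a, b in records:
        frames[start].add(frozenset((a, b)))
    return frames

def neighbors_between(frames, actor, start, end):
    """Everyone seen near `actor` in frames with start <= t < end."""
    found = set()
    for t, pairs in frames.items():
        if start <= t < end:
            for pair in pairs:
                if actor in pair:
                    found |= pair - {actor}
    return found

def contact_seconds(frames, a, b):
    """Total observed contact time for one pair, in seconds."""
    pair = frozenset((a, b))
    return FRAME_SECONDS * sum(1 for pairs in frames.values() if pair in pairs)

frames = build_frames(contacts)
print(frames_in_three_days)
print(neighbors_between(frames, "bob", 0, 40))
print(contact_seconds(frames, "ann", "bob"))
```

Scanning every frame is fine for a toy. At nearly thirteen thousand frames per deployment, you can see why the authors care about how the data store indexes time.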

Explore Your Data with Elasticsearch

Thursday, September 26th, 2013

From the description:

As Honza Kral puts it, “Elasticsearch is a very buzz-word compliant piece of software.” By this he means, it’s open source, it can do REST, JSON, HTTP, it has real time, and even Lucene is somewhere in there. What does this all really mean? Well, simply, Elasticsearch is a distributed data store that’s very good at searching and analyzing data.

Honza, a Python programmer and Django core developer, visits SF Python, to show off what this powerful tool can do. He uses real data to demonstrate how Elasticsearch’s real-time analytics and visualizations tools can help you make sense of your application.

Follow along with Honza’s slides: http://crcl.to/6tdvs

There are clients for Elasticsearch so don’t worry about the deeply nested brackets in the examples. 😉

A very good presentation on exploring data with Elasticsearch.
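If the nested brackets do worry you, it helps to see that a query body is only nested dictionaries. The Python sketch below builds one and prints the JSON; no client or running server is involved. The shape (`query`, `bool`, `must`, `match`, `filter`, `range`) follows the current Elasticsearch query DSL as I understand it, and versions from 2013 expressed filters differently. The field names `title` and `created` are made up for the example and are not from Honza's slides.

```python
import json

def build_search_body(text, field="title", since=None, date_field="created"):
    """Build a query body as plain nested dicts.

    `field` and `date_field` are hypothetical field names.
    """
    must = [{"match": {field: text}}]
    filters = []
    if since is not None:
        filters.append({"range": {date_field: {"gte": since}}})
    return {"query": {"bool": {"must": must, "filter": filters}}}

body = build_search_body("django", since="2013-01-01")
print(json.dumps(body, indent=2))
```

Building the body from small functions keeps the bracket counting in one place, which is most of what a client library buys you.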

Computational Chemogenomics

Thursday, September 26th, 2013

Computational Chemogenomics by Edgar Jacoby (Novartis Pharma AG, Switzerland).

Description:

In the post-genomic era, one of the key challenges for drug discovery consists in making optimal use of comprehensive genomic data to identify effective new medicines. Chemogenomics addresses this challenge and aims to systematically identify all ligands and modulators for all gene products expressed, besides allowing accelerated exploration of their biological function.

Computational chemogenomics focuses on applications of compound library design and virtual screening to expand the bioactive chemical space, to target hopping of chemotypes to identify synergies within related drug discovery projects or to repurpose known drugs, to propose mechanisms of action of compounds, and to identify off-target effects by cross-reactivity analysis.

Both ligand-based and structure-based in silico approaches, as reviewed in this book, play important roles in all these applications. Computational chemogenomics is expected to increase the quality and productivity of drug discovery and lead to the discovery of new medicines.

If you are on the cutting edge of bioinformatics or want to keep up with the cutting edge in bioinformatics, this is a volume to consider.

The hard copy price is $149.95 so it may be a while before I acquire a copy of it.

GraphGist Wiki

Wednesday, September 25th, 2013

GraphGist Wiki

Quite naturally after I posted Why JIRA should use Neo4j I discovered it is part of a larger mother lode of graph gists!

You will find entries for the GraphGist Challenge, examples, graph design problems, tutorials, fun graph gists and philosopher graph gists.

One of the fun graphs is about Belgian beer, which led me to: Everyone loves beer. Several Neo4j projects about beer.

Late enough in the week that I suspect Lars is thinking about what establishments to visit this weekend. 😉

Why JIRA should use Neo4j

Wednesday, September 25th, 2013

Why JIRA should use Neo4j by Pieter-Jan Van Aeken.

From the post:

There are few developers in the world that have never used an issue tracker. But there are even fewer developers who have ever used an issue tracker which uses a graph database. This is a shame because issue tracking really maps much better onto a graph database, than it does onto a relational database. Proof of that is the JIRA database schema.

Now obviously, the example below does not have all of the features that a tool like JIRA provides. But it is only a proof of concept, you could map every feature of JIRA into a Neo4J database. What I’ve done below, is take out some of the core functionalities and implement those.

This caught my eye because I have been in discussions about an upgrade from an older version of JIRA to the latest and greatest.

It’s not every feature but enough to convey the flavor of a possible graph mapping.

Given the openness of a graph model, does this suggest a model for mocking up topic map models?

I first saw this in a tweet by Peter Neubauer.

Easier than Excel:…

Wednesday, September 25th, 2013

Easier than Excel: Social Network Analysis of DocGraph with Gephi by Janos G. Hajagos and Fred Trotter. (PDF)

From the session description:

The DocGraph dataset was released at Strata RX 2012. The dataset is the result of an FOI request to CMS by healthcare data activist Fred Trotter (co-presenter). The dataset is minimal where each row consists of just three numbers: 2 healthcare provider identifiers and a weighting factor. By combining these three numbers with other publicly available information sources novel conclusions can be made about delivery of healthcare to Medicare members. As an example of this approach see: http://tripleweeds.tumblr.com/post/42989348374/visualizing-the-docgraph-for-wyoming-medicare-providers

The DocGraph dataset consists of over 49,685,810 relationships between 940,492 different Medicare providers. Analyzing the complete dataset is too big for traditional tools but useful subsets of the larger dataset can be analyzed with Gephi. Gephi is an open source tool to visually explore and analyze graphs. This tutorial will teach participants how to use Gephi for social network analysis on the DocGraph dataset.

Outline of the tutorial:

Part 1: DocGraph and the network data model (30% of the time)

  • The DocGraph dataset
  • The raw data
  • Helper data (NPI associated data)
  • The graph / network data model
  • Nodes versus edges
  • How graph models are integral to social networking
  • Other Healthcare graph data sets

Part 2: Using Gephi to perform analysis (70% of the time)

  • Basic usage of Gephi
  • Saving and reading the GraphML format
  • Laying out edges and nodes of a graph
  • Navigating and exploring the graph
  • Generating graph metrics on the network
  • Filtering a subset of the graph
  • Producing the final output of the graph

Links from the last slide:

http://strata.oreilly.com/2012/11/docgraph-open-social-doctor-data.html (information)

https://github.com/jhajagos/DocGraph (code)

http://notonlydev.com/docgraph-data (open source $1 covers bandwidth fees)

https://groups.google.com/forum/#!forum/docgraph (mailing list)

Just in case you don’t have it bookmarked already: Gephi.

The type of workshop that makes an entire conference seem like lagniappe.

Just sorry I will have to appreciate it from afar.

Work through this one carefully. You will acquire useful skills doing so.
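If you want a head start, the three-number rows described in the session (two provider identifiers and a weight) map directly onto GraphML, the format the tutorial outline mentions reading into Gephi. The following Python sketch does that conversion with the standard library. The sample rows are made up, the comma-separated layout is my assumption about the file, and treating each row as a directed edge is also an assumption.

```python
import csv
import io
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"

# Made-up rows in the three-number shape: provider id, provider id, weight.
SAMPLE_ROWS = "1000000001,1000000002,45\n1000000002,1000000003,12\n"

def rows_to_graphml(csv_text):
    """Convert three-column rows into a GraphML string with weighted edges."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    root = ET.Element("graphml", {"xmlns": GRAPHML_NS})
    # Declare the edge attribute that carries the weighting factor.
    ET.SubElement(root, "key", {"id": "weight", "for": "edge",
                                "attr.name": "weight", "attr.type": "double"})
    graph = ET.SubElement(root, "graph", {"edgedefault": "directed"})
    node_ids = sorted({pid for source, target, _ in rows for pid in (source, target)})
    for node_id in node_ids:
        ET.SubElement(graph, "node", {"id": node_id})
    for source, target, weight in rows:
        edge = ET.SubElement(graph, "edge", {"source": source, "target": target})
        ET.SubElement(edge, "data", {"key": "weight"}).text = weight
    return ET.tostring(root, encoding="unicode")

graphml = rows_to_graphml(SAMPLE_ROWS)
print(graphml)
```

This reads everything into memory, so it only suits the subsets the session description talks about, not all 49 million relationships.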

Benchmarking Graph Databases

Wednesday, September 25th, 2013

Benchmarking Graph Databases by Alekh Jindal.

Speaking of data skepticism.

From the post:

Graph data management has recently received a lot of attention, particularly with the explosion of social media and other complex, inter-dependent datasets. As a result, a number of graph data management systems have been proposed. But this brings us to the question: What happens to the good old relational database systems (RDBMSs) in the context of graph data management?

The article names some of the usual graph database suspects.

But for its comparison, it selects only one (Neo4j) and compares it against three relational databases, MySQL, Vertica and VoltDB.

What’s missing? How about expanding to include GraphLab (GraphLab – Next Generation [Johnny Come Lately VCs]) and Giraph (Scaling Apache Giraph to a trillion edges) or some of the other heavy hitters (insert your favorite) in the graph world?

Nothing against Neo4j. It is making rapid progress on a query language and isn’t hard to learn. But it lacks the raw processing power of an application like Apache Giraph. Giraph, after all, is used to process the entire Facebook data set, not a “4k nodes and 88k edges” Facebook sample as in this comparison.

Not to mention that only two algorithms were used in this comparison: PageRank and Shortest Paths.

Personally I can imagine users being interested in running more than two algorithms. But that’s just me.

Every benchmarking project has to start somewhere but this sort of comparison doesn’t really advance the discussion of competing technologies.

Not that any comparison would be complete without a discussion of typical use cases and user observations on how each candidate did or did not meet their expectations.
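For readers who have not met the two algorithms in the comparison, here is a plain Python sketch of each: PageRank by power iteration and an unweighted shortest path by breadth-first search. This is not the code from the benchmark and says nothing about how Neo4j or the relational systems implement them. The four-node graph, the damping factor of 0.85 and the fixed iteration count are my choices for illustration.

```python
from collections import deque

# Tiny made-up directed graph: node -> list of nodes it links to.
graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

def pagerank(graph, damping=0.85, iterations=50):
    """PageRank by power iteration; every target must also be a key in graph."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in graph.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A node with no out-links spreads its rank evenly.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank

def shortest_path(graph, start, goal):
    """Fewest-hops path by breadth-first search, or None if unreachable."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for target in graph.get(node, []):
            if target not in previous:
                previous[target] = node
                queue.append(target)
    return None

ranks = pagerank(graph)
print(ranks)
print(shortest_path(graph, "d", "b"))
```

Both fit on a page, which is part of why they show up in benchmarks. It is also why two of them tell you little about how a system handles the rest of your workload.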

Machine Learning: The problem is…

Wednesday, September 25th, 2013

I am watching the Data Mining with Weka videos and Prof. Ian Witten observed that Weka makes machine learning easy but:

The problem is understanding what it is that you have done.

That’s really the rub isn’t it? You loaded data, the program ran without crashing, some output was displayed.

All well and good but does it mean anything?

Or does your boss tell you what a data set will show after you complete machine learning on it?

Not to single out machine learning because there are any number of ways to “cook” data long before it gets to the machine learning processor.

Take survey data, for example, where you ask some group of people for their responses.

A quick scan of survey methodology at Wikipedia and you will realize that services like Survey Monkey are for:

Monkey

I’ve heard the arguments of no money to do a survey correctly so mid-management makes up questions that lead to the correct result. Business decisions are justified on that type of survey data.

Collecting data and running machine learning algorithms are vital day to day activities in data science.

Even if you plan to fool others, don’t be fooled yourself. Develop a critical outlook and questions that should be asked of data sets, depending upon their point of origin.

PS: Do you know of any courses on “data skepticism?” That would make a great course title. 😉

Data Visualization at IRSA

Tuesday, September 24th, 2013

Data Visualization at IRSA by Vandana Desai.

From the post:

The Infrared Science Archive (IRSA) is part of the Infrared Processing and Analysis Center (IPAC) at Caltech. We curate the science products of NASA’s infrared and submillimeter missions, including Spitzer, WISE, Planck, 2MASS, and IRAS. In total, IRSA provides access to more than 20 billion astronomical measurements, including all-sky coverage in 20 bands, spanning wavelengths from 1 micron to 10 mm.

One of our core goals is to enable optimal scientific exploitation of these data sets by astronomers. Many of you already use IRSA; approximately 10% of all refereed astronomical journal articles cite data sets curated by IRSA. However, you may be unaware of our most recent visualization tools. We provide some of the highlights below. Whether you are a new or experienced user, we encourage you to try them out at irsa.ipac.caltech.edu.

Vandana reviews a number of new visualization features and points out additional education resources.

Even if you aren’t an astronomy buff, the tools and techniques here may inspire a new approach to your data.

Not to mention being a good example of data that is too large to move. Astronomers have been developing answers to that problem for more than a decade.

Might have some lessons for dealing with big data sets.

Three exciting Lucene features in one day

Tuesday, September 24th, 2013

Three exciting Lucene features in one day by Mike McCandless.

From the post:

The first feature, committed yesterday, is the new expressions module. This allows you to define a dynamic field for sorting, using an arbitrary String expression. There is builtin support for parsing JavaScript, but the parser is pluggable if you want to create your own syntax.

The second feature, also committed yesterday, is updateable numeric doc-values fields, letting you change previously indexed numeric values using the new updateNumericDocValue method on IndexWriter. It works fine with near-real-time readers, so you can update the numeric values for a few documents and then re-open a new near-real-time reader to see the changes.

Finally, the third feature is a new suggester implementation, FreeTextSuggester. It is a very different suggester than the existing ones: rather than suggest from a finite universe of pre-built suggestions, it uses a simple ngram language model to predict the “long tail” of possible suggestions based on the 1 or 2 previous tokens.

By anybody’s count, that was an extraordinary day!

Drop by Mike’s post for the details.
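The third feature is the easiest to picture in a few lines. The Python sketch below is not Lucene's FreeTextSuggester and uses no Lucene API. It is a toy bigram model that predicts the next word from the single previous token, which is the general idea the quoted description gives. The four-sentence corpus is made up.

```python
from collections import Counter, defaultdict

# Made-up training text.
corpus = [
    "graph databases store nodes",
    "graph databases store edges",
    "graph databases scale",
    "graph queries traverse edges",
]

def train_bigrams(sentences):
    """Count, for every token, which tokens follow it and how often."""
    following = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            following[prev][nxt] += 1
    return following

def suggest(model, text, limit=3):
    """Suggest likely next words based only on the last token typed."""
    tokens = text.lower().split()
    if not tokens:
        return []
    candidates = model.get(tokens[-1], Counter())
    return [word for word, _ in candidates.most_common(limit)]

model = train_bigrams(corpus)
print(suggest(model, "graph"))
print(suggest(model, "graph databases"))
```

Because the suggestions come from counts rather than a stored list of whole phrases, the model can complete text it has never seen in full, which is the “long tail” point in Mike's description.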

A Course in Machine Learning (book)

Tuesday, September 24th, 2013

A Course in Machine Learning by Hal Daumé III.

From the webpage:

Machine learning is the study of algorithms that learn from data and experience. It is applied in a vast variety of application areas, from medicine to advertising, from military to pedestrian. Any area in which you need to make sense of data is a potential consumer of machine learning.

CIML is a set of introductory materials that covers most major aspects of modern machine learning (supervised learning, unsupervised learning, large margin methods, probabilistic modeling, learning theory, etc.). Its focus is on broad applications with a rigorous backbone. A subset can be used for an undergraduate course; a graduate course could probably cover the entire material and then some.

You may obtain the written materials by purchasing a ($55) print copy, by downloading the entire book, or by downloading individual chapters below. If you find the electronic version of the book useful and would like to donate a small amount to support further development, that’s always appreciated! The current version is 0.9 (the “beta” pre-release).

Have you noticed that the quality of materials on the Internet is increasing? At least in some domains.

If you want to look at individual chapters:

  1. Front Matter
  2. Decision Trees
  3. Geometry and Nearest Neighbors
  4. The Perceptron
  5. Machine Learning in Practice
  6. Beyond Binary Classification
  7. Linear Models
  8. Probabilistic Modeling
  9. Neural Networks
  10. Kernel Methods
  11. Learning Theory
  12. Ensemble Methods
  13. Efficient Learning
  14. Unsupervised Learning
  15. Expectation Maximization
  16. Semi-Supervised Learning
  17. Graphical Models
  18. Online Learning
  19. Structured Learning
  20. Bayesian Learning
  21. Back Matter

Code and datasets said to be coming soon.

I first saw this at: A Course in Machine Learning (free book).

Data Science: Not Just for Big Data (Webinar)

Tuesday, September 24th, 2013

Data Science: Not Just for Big Data

October 16th at 11am EST

From the webpage:

These days, data science and big data have become synonymous phrases. But data doesn’t have to be big for data science to unlock big value.

Join Kalido CTO Darren Peirce as he hosts David Smith, Data Scientist at Revolution Analytics and Gregory Piatetsky, Editor of KDNuggets, two of today’s most influential data scientists, for an open-panel discussion. They’ll discuss why the value of the insights is not directly proportional to the size of a dataset.

If you are wondering whether data science can give your business an edge this may be the most important hour you’ll spend all week.

Confirmation that having the right data isn’t the same thing as having “big data.”

The NSA can mine all the telephone traffic if it wants. Mining the telephone traffic of security risks, a much smaller data set, is likely to be more productive.

See you at the webinar!

I first saw this at: Upcoming Data Science Webinar.

Rumors of Legends (the TMRM kind?)

Tuesday, September 24th, 2013

BioC: a minimalist approach to interoperability for biomedical text processing (numerous authors, see the article).

Abstract:

A vast amount of scientific information is encoded in natural language text, and the quantity of such text has become so great that it is no longer economically feasible to have a human as the first step in the search process. Natural language processing and text mining tools have become essential to facilitate the search for and extraction of information from text. This has led to vigorous research efforts to create useful tools and to create humanly labeled text corpora, which can be used to improve such tools. To encourage combining these efforts into larger, more powerful and more capable systems, a common interchange format to represent, store and exchange the data in a simple manner between different language processing systems and text mining tools is highly desirable. Here we propose a simple extensible mark-up language format to share text documents and annotations. The proposed annotation approach allows a large number of different annotations to be represented including sentences, tokens, parts of speech, named entities such as genes or diseases and relationships between named entities. In addition, we provide simple code to hold this data, read it from and write it back to extensible mark-up language files and perform some sample processing. We also describe completed as well as ongoing work to apply the approach in several directions. Code and data are available at http://bioc.sourceforge.net/.

From the introduction:

With the proliferation of natural language text, text mining has emerged as an important research area. As a result many researchers are developing natural language processing (NLP) and information retrieval tools for text mining purposes. However, while the capabilities and the quality of tools continue to grow, it remains challenging to combine these into more complex systems. Every new generation of researchers creates their own software specific to their research, their environment and the format of the data they study; possibly due to the fact that this is the path requiring the least labor. However, with every new cycle restarting in this manner, the sophistication of systems that can be developed is limited. (emphasis added)

That is the experience with creating electronic versions of the Hebrew Bible. Every project has started from a blank screen, requiring re-proofing of the same text, etc. As a result, there is no electronic encoding of the masora magna (think long margin notes). Duplicated effort has a real cost to scholarship.

The authors stray into legend land when they write:

Our approach to these problems is what we would like to call a ‘minimalist’ approach. How ‘little’ can one do to obtain interoperability? We provide an extensible mark-up language (XML) document type definition (DTD) defining ways in which a document can contain text, annotations and relations. Major XML elements may contain ‘infon’ elements, which store key-value pairs with any desired semantic information. We have adapted the term ‘infon’ from the writings of Devlin (1), where it is given the sense of a discrete item of information. An associated ‘key’ file is necessary to define the semantics that appear in tags such as the infon elements. Key files are simple text files where the developer defines the semantics associated with the data. Different corpora or annotation sets sharing the same semantics may reuse an existing key file, thus representing an accepted standard for a particular data type. In addition, key files may describe a new kind of data not seen before. At this point we prescribe no semantic standards. BioC users are encouraged to create their own key files to represent their BioC data collections. In time, we believe, the most useful key files will develop a life of their own, thus providing emerging standards that are naturally adopted by the community.

The “key files” don’t specify subject identities for the purposes of merging. But defining the semantics of data is a first step in that direction.

I like the idea of popular “key files” (read legends) taking on a life of their own due to their usefulness. An economic activity based on reducing the friction in using or re-using data. That should have legs.

BTW, don’t overlook the authors’ data and code, available at: http://bioc.sourceforge.net/.
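To see what “infon” key-value pairs look like in practice, here is a small Python sketch that builds and reads a document in the style the authors describe. The `infon` element with its `key` attribute comes from the quoted passage. The other element names (`collection`, `source`, `key`, `document`, `id`, `passage`, `offset`, `text`, `annotation`, `location`) follow my reading of the format and should be checked against the authors' DTD before you rely on them. The sample text and the key file name are made up.

```python
import xml.etree.ElementTree as ET

def add_infon(parent, key, value):
    """Attach one key-value pair to an element as an infon child."""
    infon = ET.SubElement(parent, "infon", {"key": key})
    infon.text = value
    return infon

def build_collection():
    """Build a one-document collection with one annotated passage."""
    collection = ET.Element("collection")
    ET.SubElement(collection, "source").text = "example"
    # Names the key file that defines what the infon keys mean (hypothetical).
    ET.SubElement(collection, "key").text = "example.key"
    document = ET.SubElement(collection, "document")
    ET.SubElement(document, "id").text = "doc-1"
    passage = ET.SubElement(document, "passage")
    add_infon(passage, "type", "title")
    ET.SubElement(passage, "offset").text = "0"
    ET.SubElement(passage, "text").text = "BRCA1 and breast cancer"
    annotation = ET.SubElement(passage, "annotation", {"id": "a1"})
    add_infon(annotation, "type", "gene")
    ET.SubElement(annotation, "location", {"offset": "0", "length": "5"})
    ET.SubElement(annotation, "text").text = "BRCA1"
    return collection

def infons(element):
    """Read the infon children of an element back into a dict."""
    return {i.get("key"): i.text for i in element.findall("infon")}

collection = build_collection()
passage = collection.find("document/passage")
annotation = passage.find("annotation")
print(ET.tostring(collection, encoding="unicode"))
print(infons(passage), infons(annotation))
```

Notice that nothing in the XML says what `type` or `gene` mean. That is the job of the key file, and it is where the legend comparison comes in.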

…OCLC Control Numbers Public Domain

Tuesday, September 24th, 2013

OCLC Declare OCLC Control Numbers Public Domain by Richard Wallis.

From the post:

I signalled that OCLC were looking at this, in my session (Linked Data Progress), at IFLA in Singapore a few weeks ago. I am now pleased to say that the wording I was hinting at has now appeared on the relevant pages of the OCLC web site:

Use of the OCLC Control Number (OCN)
OCLC considers the OCLC Control Number (OCN) to be an important data element, separate from the rest of the data included in bibliographic records. The OCN identifies the record, but is not part of the record itself. It is used in a variety of human and machine-readable processes, both on its own and in subsequent manipulations of catalog data. OCLC makes no copyright claims in individual bibliographic elements nor does it make any intellectual property claims to the OCLC Control Number. Therefore, the OCN can be treated as if it is in the public domain and can be included in any data exposure mechanism or activity as public domain data. OCLC, in fact, encourages these uses as they provide the opportunity for libraries to make useful connections between different bibliographic systems and services, as well as to information in other domains.

The announcement of this confirmation/clarification of the status of OCNs was made yesterday by my colleague Jim Michalko on the Hanging Together blog.

See: OCLC Control Number if you are interested in the details of OCNs (which are interesting in and of themselves).

Unlike the Perma.cc links, OCNs are not tied to any particular network protocol.

However you deliver an OCN, by postcard, phone or network query, an information system can respond with the information that corresponds to that OCN.

No one can promise you “forever,” but not tying identifiers to ephemeral network protocols is one way to get closer to “forever.”

…Link and Reference Rot in Legal Citations

Tuesday, September 24th, 2013

Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations by Jonathan Zittrain, Kendra Albert, Lawrence Lessig.

Abstract:

We document a serious problem of reference rot: more than 70% of the URLs within the Harvard Law Review and other journals, and 50% of the URLs found within U.S. Supreme Court opinions do not link to the originally cited information.

Given that, we propose a solution for authors and editors of new scholarship that involves libraries undertaking the distributed, long-term preservation of link contents.

Imagine trying to use a phone book where 70% of the addresses were wrong.

Or you are looking for your property deed and learn that only 50% of the references are correct.

Do those sound like acceptable situations?

Considering the Harvard Law Review and the U.S. Supreme Court put a good deal of effort into correct citations, the fate of the rest of the web must be far worse.

The about page for Perma reports:

Any author can go to the Perma.cc website and input a URL. Perma.cc downloads the material at that URL and gives back a new URL (a “Perma.cc link”) that can then be inserted in a paper.

After the paper has been submitted to a journal, the journal staff checks that the provided Perma.cc link actually represents the cited material. If it does, the staff “vests” the link and it is forever preserved. Links that are not “vested” will be preserved for two years, at which point the author will have the option to renew the link for another two years.

Readers who encounter Perma.cc links can click on them like ordinary URLs. This takes them to the Perma.cc site where they are presented with a page that has links both to the original web source (along with some information, including the date of the Perma.cc link’s creation) and to the archived version stored by Perma.cc.

I would caution that “forever” is a very long time.

What happens to the binding between an identifier and a URL when URLs are replaced by another network protocol?

After all the change over the history of the Internet, you don’t believe the current protocols will last “forever.” Yes?

A more robust solution would divorce identifiers/citations from any particular network protocol, whether you think it will last forever or not.

That separation of identifier from network protocol preserves the possibility of an online database such as Perma.cc but also databases that have local caches of the citations and associated content, databases that point to multiple locations for associated content, and databases that support currently unknown protocols to access content associated with an identifier.

Just as a database of citations from Codex Justinianus could point to the latest printed Latin text, online versions or future versions.

Citations can become permanent identifiers if they don’t rely on a particular network addressing system.
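A small Python sketch shows how little it takes to keep an identifier separate from the means of reaching its content. This is not how Perma.cc works; it is a toy registry in which one identifier maps to several access routes, and new kinds of route can be added later without touching the identifier. The identifier, route names and locations are all hypothetical.

```python
class CitationRegistry:
    """Map protocol-neutral identifiers to one or more access routes."""

    def __init__(self):
        self._routes = {}

    def register(self, identifier, route, location):
        """Record that `identifier` can be reached via `route` at `location`."""
        self._routes.setdefault(identifier, {})[route] = location

    def resolve(self, identifier, preferred=("local", "https", "print")):
        """Return (route, location) for the first preferred route available."""
        routes = self._routes.get(identifier, {})
        for route in preferred:
            if route in routes:
                return route, routes[route]
        # Fall back to any route we know, including ones added after
        # the preference list was written.
        for route, location in routes.items():
            return route, location
        return None

registry = CitationRegistry()
registry.register("cite:example-0001", "https", "https://example.org/archive/0001")
registry.register("cite:example-0001", "print", "Shelf 12, volume 3")
registry.register("cite:example-0002", "future-protocol", "some address we cannot predict")

print(registry.resolve("cite:example-0001"))
print(registry.resolve("cite:example-0002"))
```

The citation in the paper is `cite:example-0001`, not the URL. If the URL dies, only the registry entry changes.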

Crowdsourcing.org

Tuesday, September 24th, 2013

Crowdsourcing.org

From the about page:

Crowdsourcing.org is the leading industry resource offering the largest online repository of news, articles, videos, and site information on the topic of crowdsourcing and crowdfunding.

Founded in 2010, crowdsourcing.org, is a neutral professional association dedicated solely to crowdsourcing and crowdfunding. As one of the most influential and credible authorities in the crowdsourcing space, crowdsourcing.org is recognized worldwide for its intellectual capital, crowdsourcing and crowdfunding practice expertise and unbiased thought leadership.

Crowdsourcing.org’s mission is to serve as an invaluable source of information to analysts, researchers, journalists, investors, business owners, crowdsourcing experts and participants in crowdsourcing and crowdfunding platforms. (emphasis in original)

If you are interested in crowdsourcing, there are worse places to start searching. 😉

Seriously, Crowdsourcing.org hosts a directory of 2,482 crowdsourcing and crowdfunding sites (as of September 24, 2013) along with numerous other resources.

Court Listener

Tuesday, September 24th, 2013

Court Listener

From the about page:

Started as a part-time hobby in 2010, CourtListener is now a core project of the Free Law Project, a California Non-Profit corporation. The goal of the site is to provide powerful free legal tools for everybody while giving away all our data in bulk downloads.

We collect legal opinions from court websites and from data donations, and are aiming to have the best, most complete data on the open Web within the next couple years. We are slowly expanding to provide search and awareness tools for as many state courts as possible, and we already have tools for all of the Federal Appeals Courts. For more details on which jurisdictions we support, see our coverage page. If you’re able to help us acquire more cases, please get in touch.

This rather remarkable site has collected 905,842 court opinions as of September 24, 2013.

The default listing of cases is newest first but you can choose oldest first, most/least cited first and keyword relevance. Changing the listing order becomes interesting once you perform a keyword search (top search bar). The refinement (left hand side) works quite well, except that I could not filter search results by a judge’s name. On case names, separate the parties with “v.” as “vs” doesn’t work.

It is also possible to discover examples of changing legal terminology that impact your search results.

For example, try searching for the keyword phrase, “interstate commerce.” Now choose “Oldest first.” You will see Price v. Ralston (1790) and the next case is Crandall v. State of Nevada (1868). Hmmm, what happened to the early interstate commerce cases under John Marshall?

OK, so try “commerce.” Now set to “Oldest first.” Hmmm, a lot more cases. Yes? Under case name, type in “Gibbons” and press return. Now the top case is Gibbons v. Ogden (1824). The case name is a hyperlink so follow that now.

It is a long opinion by Chief Justice Marshall but at paragraph 5 he announces:

The power to regulate commerce extends to every species of commercial intercourse between the United States and foreign nations, and among the several States. It does not stop at the external boundary of a State.

The phrase “among the several States,” occurs 21 times in Gibbons v. Ogden, with no mention of the modern “interstate commerce.”

What we now call the “interstate commerce clause” played a major role in the New Deal legislation that ended the 1930’s depression in the United States. See Commerce Clause. Following the cases cited under “New Deal” will give you an interesting view of the conflicting sides. A conflict that still rages today.

The terminology problem, “among the several states” vs. “interstate commerce” is one that makes me doubt the efficacy of public access to law programs. Short of knowing the “right” search words, it is unlikely you would have found Gibbons v. Ogden. Well, short of reading through the entire corpus of Supreme Court decisions. 😉

Public access to law would be enhanced by mappings such as “interstate commerce” to “among the several states,” by distinguishing that “due process” didn’t always mean what it means today, and by further mappings to colloquial search expressions.
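As a rough illustration of such a mapping, here is a minimal Python sketch of query expansion. The names TERM_MAP, expand_query and matches are my own inventions for this example, the single mapping entry comes from the Gibbons v. Ogden search above, and nothing here reflects how CourtListener actually searches.

```python
# Hypothetical mapping from modern legal phrases to historical equivalents.
# The one entry comes from the Gibbons v. Ogden example; a real mapping
# would have to be curated by someone who knows the case law.
TERM_MAP = {
    "interstate commerce": ["among the several states"],
}


def expand_query(query, term_map=TERM_MAP):
    """Return the normalized query plus any mapped alternative phrasings."""
    normalized = query.strip().lower()
    variants = [normalized]
    for alternative in term_map.get(normalized, []):
        if alternative not in variants:
            variants.append(alternative)
    return variants


def matches(text, query, term_map=TERM_MAP):
    """True if the text contains the query or any mapped alternative."""
    lowered = text.lower()
    return any(variant in lowered for variant in expand_query(query, term_map))
```

With this in place, a search for “interstate commerce” would also match an opinion that only says “among the several States.” A flat dictionary cannot say when or where a phrase carried a particular meaning, which is the harder part of the problem.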

A topic map could capture those nuances and many more.

I guess the question is whether people should be free to search for the law or should they be freed by finding the law?

DBpedia 3.9 released…

Monday, September 23rd, 2013

DBpedia 3.9 released, including wider infobox coverage, additional type statements, and new YAGO and Wikidata links by Christopher Sahnwaldt.

From the post:

we are happy to announce the release of DBpedia 3.9.

The most important improvements of the new release compared to DBpedia 3.8 are:

1. the new release is based on updated Wikipedia dumps dating from March / April 2013 (the 3.8 release was based on dumps from June 2012), leading to an overall increase in the number of concepts in the English edition from 3.7 to 4.0 million things.

2. the DBpedia ontology is enlarged and the number of infobox to ontology mappings has risen, leading to richer and cleaner concept descriptions.

3. we extended the DBpedia type system to also cover Wikipedia articles that do not contain an infobox.

4. we provide links pointing from DBpedia concepts to Wikidata concepts and updated the links pointing at YAGO concepts and classes, making it easier to integrate knowledge from these sources.

The English version of the DBpedia knowledge base currently describes 4.0 million things, out of which 3.22 million are classified in a consistent Ontology, including 832,000 persons, 639,000 places (including 427,000 populated places), 372,000 creative works (including 116,000 music albums, 78,000 films and 18,500 video games), 209,000 organizations (including 49,000 companies and 45,000 educational institutions), 226,000 species and 5,600 diseases.

We provide localized versions of DBpedia in 119 languages. All these versions together describe 24.9 million things, out of which 16.8 million overlap (are interlinked) with the concepts from the English DBpedia. The full DBpedia data set features labels and abstracts for 12.6 million unique things in 119 different languages; 24.6 million links to images and 27.6 million links to external web pages; 45.0 million external links into other RDF datasets, 67.0 million links to Wikipedia categories, and 41.2 million YAGO categories.

Altogether the DBpedia 3.9 release consists of 2.46 billion pieces of information (RDF triples) out of which 470 million were extracted from the English edition of Wikipedia, 1.98 billion were extracted from other language editions, and about 45 million are links to external data sets.

Detailed statistics about the DBpedia data sets in 24 popular languages are provided at Dataset Statistics.

The main changes between DBpedia 3.8 and 3.9 are described below. For additional, more detailed information please refer to the Change Log.

Almost like an early holiday present isn’t it? 😉

I continue to puzzle over the notion of “extraction.”

Not that I have an alternative but extracting data only kicks the data can one step down the road.

When someone wants to use my extracted data, they are going to extract data from my extraction. And so on.

That seems incredibly wasteful and error-prone.

Enough money is spent doing the ETL shuffle every year that research on ETL avoidance should be a viable proposition.

XML to Cypher Converter/Geoff Converter

Monday, September 23rd, 2013

XML to Cypher Converter

From the webpage:

This service allows conversion of generic XML data into a Cypher CREATE statement, which can then be loaded into Neo4j.

And:

XML to Geoff Converter

From the webpage:

This service allows conversion of generic XML data into a Geoff interchange file, which can then be loaded into Neo4j.

Both services can be used as a web service, in addition to supporting the pasting in of XML in a form.
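To give a feel for what such a conversion involves, here is a minimal Python sketch that turns XML into a single Cypher CREATE statement. It is not the algorithm either service uses. The choices below (one node per element, the tag as the label, attributes as properties, element text stored under a "text" property, and a CHILD relationship for nesting) are assumptions made for illustration, as are the function names.

```python
import xml.etree.ElementTree as ET


def cypher_string(value):
    """Quote a Python string as a Cypher string literal."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def xml_to_cypher(xml_text):
    """Build one CREATE statement: a node per element, a CHILD edge per nesting."""
    root = ET.fromstring(xml_text)
    patterns = []
    counter = 0

    def visit(element, parent_var):
        nonlocal counter
        var = f"n{counter}"
        counter += 1
        props = dict(element.attrib)
        text = (element.text or "").strip()
        if text:
            # Assumption: element text goes under a "text" property,
            # overwriting any XML attribute that happens to share that name.
            props["text"] = text
        prop_text = ", ".join(
            f"`{key}`: {cypher_string(val)}" for key, val in props.items()
        )
        body = f" {{{prop_text}}}" if prop_text else ""
        patterns.append(f"({var}:`{element.tag}`{body})")
        if parent_var is not None:
            patterns.append(f"({parent_var})-[:CHILD]->({var})")
        for child in element:
            visit(child, var)

    visit(root, None)
    return "CREATE " + ",\n       ".join(patterns)


if __name__ == "__main__":
    print(xml_to_cypher('<person name="Ann"><city>Oslo</city></person>'))
```

The sketch ignores mixed content, tail text and namespaces, all of which a real converter has to decide how to handle.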

You will also want to visit Nigel Small’s Github page and his homepage.

While poking around I also found:

XML to Graph Converter

XML data can easily be converted into a graph. Simply paste the XML data into the left-hand side, convert into both Geoff and a Cypher CREATE statement, then view the results in the Neo4j console.

Definitely worth a deep look later this week with XML schemas.

Map: Where the People Are

Monday, September 23rd, 2013

Map: Where the People Are by Joshua Keating.

Maps created by Radical Cartography (yes, that is how I found the site) that show the distribution of the world population by latitude and longitude. Data for the year 2000.

Would be interesting to see latitude/longitude maps like this for every ten (10) years starting in about 1800.

Did the population increase in place or expand to new places? Or were there spikes of expansion followed by spikes of increased density?

Radical Cartography

Monday, September 23rd, 2013

Radical Cartography

A very rich site with examples of cartography that I find hard to describe.

Rather than an inadequate description, here is an example of a custom map that the site generated at my request:

2013 Calendar for Atlanta

For best viewing, save the image to your computer and view in a browser.

See Your Calendar if you want to generate a custom calendar for yourself.

Don’t skip exploring the other projects at this site.

…Crowd-Sourcing to Classify Strange Oceanic Creatures

Monday, September 23rd, 2013

Plankton Portal Uses Crowd-Sourcing to Classify Strange Oceanic Creatures

From the post:

Today, an online citizen-science project launches called “Plankton Portal” was created by researchers at the University of Miami Rosenstiel School of Marine and Atmospheric Sciences (RSMAS) in collaboration with the National Oceanic and Atmospheric Administration (NOAA) and the National Science Foundation (NSF) and developers at Zooniverse.org. Plankton Portal allows you to explore the open ocean from the comfort of your own home. You can dive hundreds of feet deep, and observe the unperturbed ocean and the myriad animals that inhabit Earth’s last frontier.

The goal of the site is to enlist volunteers to classify millions of underwater images to study plankton diversity, distribution and behavior in the open ocean. It was developed under the leadership of Dr. Robert K. Cowen, UM RSMAS Emeritus Professor in Marine Biology and Fisheries (MBF) and now the Director of Oregon State University’s Hatfield Marine Science Center, and by Research Associate Cedric Guigand and MBF graduate students Jessica Luo and Adam Greer.

Millions of plankton images are taken by the In Situ Ichthyoplankton Imaging System (ISIIS), a unique underwater robot engineered at the University of Miami in collaboration with Charles Cousin at Bellamare LLC and funded by NOAA and NSF. ISIIS operates as an ocean scanner that casts the shadow of tiny and transparent oceanic creatures onto a very high resolution digital sensor at very high frequency. So far, ISIIS has been used in several oceans around the world to detect the presence of larval fish, small crustaceans and jellyfish in ways never before possible. This new technology can help answer important questions ranging from how do plankton disperse, interact and survive in the marine environment, to predicting the physical and biological factors could influence the plankton community.

You can go to Zooniverse.org or jump directly to the Plankton Portal.

If plankton don’t excite you all that much, consider one of the other projects at Zooniverse:

Galaxy Zoo
How do galaxies form?
NASA’s Hubble Space Telescope archive provides hundreds of thousands of galaxy images.
Ancient Lives
Study the lives of ancient Greeks
The data gathered by Ancient Lives helps scholars study the Oxyrhynchus collection.
Moon Zoo
Explore the surface of the Moon
We hope to study the lunar surface in unprecedented detail.
WhaleFM
Hear Whales communicate
You can help marine researchers understand what whales are saying
Solar Stormwatch
Study explosions on the Sun
Explore interactive diagrams to learn about the Sun and the spacecraft monitoring it.
Seafloor Explorer
Help explore the ocean floor
The HabCam team and the Woods Hole Oceanographic Institution need your help!
PlanetHunters.org
Find planets around stars
Lightcurve changes from the Kepler spacecraft can indicate transiting planets.
Bat Detective
You’re hot on the trail of bats!
Help scientists characterise bat calls recorded by citizen scientists.
The Milky Way Project
How do stars form?
We’re asking you to help us find and draw circles on infrared image data from the Spitzer Space Telescope.
Snapshot Serengeti
Go wild in the Serengeti!
We need your help to classify all the different animals caught in millions of camera trap images.
Planet Four
Explore the Red Planet
Planetary scientists need your help to discover what the weather is like on Mars.
Notes from Nature
Take Notes from Nature
Transcribe museum records to take notes from nature, contribute to science.
SpaceWarps
Help us find gravitational lenses
Imagine a galaxy, behind another galaxy. Think you won’t see it? Think again.
Plankton Portal
No plankton means no life in the ocean
Plankton are a critically important food source for our oceans.
oldWeather
Model Earth’s climate using historic ship logs
Help scientists recover Arctic and worldwide weather observations made by US Navy and Coast Guard ships.
Cell Slider
Analyse real life cancer data.
You can help scientists from the world’s largest cancer research institution find cures for cancer.
CycloneCenter
Classify over 30 years of tropical cyclone data.
Scientists at NOAA’s National Climatic Data Center need your help.
Worm Watch Lab
Track genetic mysteries
We can better understand how our genes work by spotting the worms laying eggs.

I count eighteen (18) projects and this is just one of the many crowd-sourced project collections.

Question: If we overcome semantic impedance to work cooperatively on these projects, what is it that creates semantic impedance in other projects?

Or perhaps better: How do we or others benefit from the presence of semantic impedance?

The second question might lead to a strategy that replaces that benefit with a bigger one from using topic maps.

Broadening Google Patents [Patent Troll Indigestion]

Monday, September 23rd, 2013

Broadening Google Patents by Jon Orwant.

From the post:

Last year, we launched two improvements to Google Patents: the Prior Art Finder and European Patent Office (EPO) patents. Today we’re happy to announce the addition of documents from four new patent agencies: China, Germany, Canada, and the World Intellectual Property Organization (WIPO). Many of these documents may provide prior art for future patent applications, and we hope their increased discoverability will improve the quality of patents in the U.S. and worldwide.

The broadening of Google Patents is welcome news!

Especially following the broadening of “prior art” under the America Invents Act (AIA).

On the expansion of prior art, such as publication before date of filing the patent (old rule was before the date of invention), a good summary can be found at: The Changing Boundaries of Prior Art under the AIA: What Your Company Needs to Know.

The information you find needs to remain found, intertwined with other information you find.

Regular search engines won’t help you there. May I suggest topic maps?

Help, I need somebody*

Monday, September 23rd, 2013

Don Knuth has asked for help with Volume 4B of The Art of Computer Programming saying:

Volume 4B of The Art of Computer Programming will begin with a special section called ‘Mathematical Preliminaries Redux’, which extends the ‘Mathematical Preliminaries’ of Section 1.2 in Volume 1 to things that I didn’t know about in the 1960s. Most of this new material deals with probabilities and expectations of random events; there’s also an introduction to the theory of martingales.

You can have a sneak preview by looking at the current draft of pre-fascicle 5a (39 pages), last updated 31 August 2013. As usual, rewards will be given to whoever is first to find and report errors or to make valuable suggestions. I’m particularly interested in receiving feedback about the exercises (of which there are 99) and their answers (of which there are 99).

There’s stuff in here that isn’t in Wikipedia yet!


I worked particularly hard while preparing some of those exercises, attempting to improve on expositions that I found in the literature; and in several noteworthy cases, nobody has yet pointed out any errors. It would be nice to believe that I actually got the details right in my first attempt; but that seems unlikely, because I had hundreds of chances to make mistakes. So I fear that the most probable hypothesis is that nobody has been sufficiently motivated to check these things out as yet.


I still cling to a belief that these details are extremely instructive, and I’m uncomfortable with the prospect of printing a hardcopy edition with so many exercises unvetted. Thus I would like to enter here a plea for some readers to tell me explicitly, “Dear Don, I have read exercise N and its answer very carefully, and I believe that it is 100% correct,” where N is one of the following exercises in prefascicle 5a:

  • 24 (median of the cumulative binomial distribution)
  • 28 (Hoeffding’s theory of generalized cumulative binomial distributions)
  • 29 (the nearly forgotten inequality of Samuels)
  • 59 (the four functions theorem)
  • 61 (the FKG inequality)
  • 99 (Motwani and Raghavan’s generalized bound on random loop termination time)


Remember that you don’t have to work the exercise first; you’re allowed and even encouraged to peek at the answer. Please send success reports to the usual address for bug reports (taocp@cs.stanford.edu), if you have time to provide this extra help. Thanks in advance!

Given Knuth’s contribution to the field of computer programming, if you have the time and ability, helping proof the exercises would be a small repayment for his efforts.

BTW, standards editors should take heed of Knuth’s statement in the preface to this pre-fascicle:

This material has not yet been proofread as thoroughly as the manuscripts of Volumes 1, 2, 3, and 4A were at the time of their first printings. And those carefully-checked volumes, alas, were subsequently found to contain thousands of mistakes. (The Art of Computer Programming, Volume 4 Pre-Fascicle 5, Mathematical Preliminaries Redux, page iii)

Mistakes, even in great writing (See TAOCP’s ranking alongside Principia Mathematica, Theory of Games and Economic Behavior, Fractals: Form, Chance and Dimension, Cybernetics, QED, Quantum Mechanics, The Meaning of Relativity, Nature of the Chemical Bond, Search for Structure, Conservation of Orbital Symmetry, The Collected papers of Albert Einstein. Vol. 2 : The Swiss years : writings, 1900-1909 in 100 or so Books that shaped a Century of Science by Philip and Phylis Morrison) happen.

Discovery of a mistake should lead to acknowledgement of the mistake and the earliest correction possible.

Discovery of a mistake should not be met with: “everyone knows what we meant to say,” “other ‘standards’ do it that way,” or any of the other excuses for not fixing mistakes.

Mistakes are excusable as human error; failure to correct mistakes when identified is not.


*With apologies to the Beatles, the first line of “Help!” was just too appropriate to pass up!

…Hive Functions in Hadoop

Sunday, September 22nd, 2013

Cheat Sheet: How To Work with Hive Functions in Hadoop by Marc Holmes.

From the post:

Just a couple of weeks ago we published our simple SQL to Hive Cheat Sheet. That has proven immensely popular with a lot of folk to understand the basics of querying with Hive. Our friends at Qubole were kind enough to work with us to extend and enhance the original cheat sheet with more advanced features of Hive: User Defined Functions (UDF). In this post, Gil Allouche of Qubole takes us from the basics of Hive through to getting started with more advanced uses, which we’ve compiled into another cheat sheet you can download here.

The cheat sheet will be useful but so is this observation in the conclusion of the post:

One of the key benefits of Hive is using existing SQL knowledge, which is a common skill found across business analysts, data analysts, software engineers, data scientist and others. Hive has nearly no barriers for new users to start exploring and analyzing data.

I’m sure use of existing SQL knowledge isn’t the only reason for Hive’s success, but the Hive PoweredBy page shows it didn’t hurt!

Something to think about in creating a topic map query language. Yes, the queries executed by an engine will be traversing a topic map graph, but presenting it to the user as a graph query isn’t required.

Relationship Timelines

Sunday, September 22nd, 2013

Relationship Timelines by Skye Bender-deMoll.

From the post:

I finally had a chance to pull together a bunch of interesting timeline examples–mostly about the U.S. Congress. Although several of these are about networks, the primary features being visualized are changes in group structure and membership over time. Should these be called “alluvial diagrams”, “stream graphs” “Sankey charts”, “phase diagrams”, “cluster timelines”?

From the U.S. Congress to characters in the Lord of the Rings (movie version) and beyond, Skye explores visualization of dynamic relationships over time.

This raises the interesting issue of how to represent a dynamic relationship in a topic map.

For example, at some point in a topic map of a family, the mother and father did not know each other. At some later point they met, but were not yet married. Still later they were married and later still, had children. Other events in their lives happened before or after those major events.

Scope could mark off a segment of events, but you would have to create a date/time datatype or use one from the W3C, XML Schema Part 2: Datatypes Second Edition, for calculation of which scope precedes or follows another scope.
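One way to picture the calculation is a minimal Python sketch that treats each scope as a labelled time interval. The TimeScope class, the scope helper and the dates are all hypothetical, and Python’s datetime.fromisoformat accepts only a subset of the xsd:dateTime lexical forms, so this is an illustration under those assumptions and not a proposal for the topic map data model.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimeScope:
    """Hypothetical scope: a labelled half-open interval [start, end)."""
    label: str
    start: datetime
    end: datetime

    def precedes(self, other):
        """True if this scope ends before, or exactly when, the other begins."""
        return self.end <= other.start

    def contains(self, moment):
        """True if the moment falls inside this scope."""
        return self.start <= moment < self.end


def scope(label, start, end):
    """Build a TimeScope from ISO 8601 strings without a zone suffix."""
    return TimeScope(label, datetime.fromisoformat(start), datetime.fromisoformat(end))


# Made-up dates for the family example above.
acquainted = scope("acquainted", "1970-05-01T00:00:00", "1973-06-16T00:00:00")
married = scope("married", "1973-06-16T00:00:00", "2013-09-22T00:00:00")
```

Here acquainted.precedes(married) holds and married.contains(datetime(1980, 1, 1)) picks the scope in force on a given day. Overlapping scopes, open-ended scopes and uncertain dates would all need more than this.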

A closely related problem is to show what facts were known to a person at some point in time. Or as put by Howard Baker:

“What did the President know and when did he know it?” [During the Watergate Hearings]

That may again be a relevant question in the not too distant future.

Suggestions for a robust topic map modeling solution would be most welcome!