Archive for the ‘Tables’ Category

Scala DataTable

Monday, February 9th, 2015

Scala DataTable by Martin Cooper.

From the webpage:


Scala DataTable is a lightweight, in-memory table structure written in Scala. The implementation is entirely immutable. Modifying any part of the table, adding or removing columns, rows, or individual field values will create and return a new structure, leaving the old one completely untouched. This is quite efficient due to structural sharing.

Features:

  • Fully immutable implementation.
  • All changes use structural sharing for performance.
  • Table columns can be added, inserted, updated and removed.
  • Rows can be added, inserted, updated and removed.
  • Individual cell values can be updated.
  • Any inserts, updates or deletes keep the original structure and data completely unchanged.
  • Internal type checks and bounds checks to ensure data integrity.
  • RowData object allowing typed or untyped data access.
  • Full filtering and searching on row data.
  • Single and multi column quick sorting.
  • DataViews to store sets of filtered / sorted data.

If you are curious about immutable data structures and want to start with something familiar, this is your day!

See the GitHub page for example code and other details.
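For a feel of what structural sharing means, here is a minimal sketch in Python rather than Scala. It is not Scala DataTable's API or its internal layout: the `ImmutableTable` class and its method names are made up for illustration, and sharing happens only at the level of whole columns.

```python
from typing import Any, Dict, Iterable, Tuple


class ImmutableTable:
    """Toy immutable table (hypothetical, not Scala DataTable's API).

    Each column is a tuple; the table maps column names to those tuples.
    """

    def __init__(self, columns: Dict[str, Tuple[Any, ...]]):
        lengths = {len(col) for col in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same number of rows")
        # Copy only the name -> column mapping; the column tuples are shared.
        self._columns = dict(columns)

    def column(self, name: str) -> Tuple[Any, ...]:
        return self._columns[name]

    def with_column(self, name: str, values: Iterable[Any]) -> "ImmutableTable":
        """Return a new table with one column added or replaced."""
        new_columns = dict(self._columns)
        new_columns[name] = tuple(values)
        return ImmutableTable(new_columns)

    def with_cell(self, name: str, row: int, value: Any) -> "ImmutableTable":
        """Return a new table with a single cell changed."""
        old = self._columns[name]
        if not 0 <= row < len(old):
            raise IndexError("row out of range")
        new_columns = dict(self._columns)
        # Only the touched column is rebuilt; every other column is reused.
        new_columns[name] = old[:row] + (value,) + old[row + 1:]
        return ImmutableTable(new_columns)


t1 = ImmutableTable({"name": ("Ada", "Alan"), "age": (36, 41)})
t2 = t1.with_cell("age", 0, 37)

print(t1.column("age"))  # the original table still holds (36, 41)
print(t2.column("age"))  # the new table holds (37, 41)
print(t2.column("name") is t1.column("name"))  # untouched column is shared
```

The sketch still copies the whole column that changed, so it shows the idea of sharing untouched parts rather than an efficient implementation of it.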

A Comparison of Two Unsupervised Table Recognition Methods from Digital Scientific Articles

Monday, January 12th, 2015

A Comparison of Two Unsupervised Table Recognition Methods from Digital Scientific Articles by Stefan Klampfl, Kris Jack, Roman Kern.


In digital scientific articles tables are a common form of presenting information in a structured way. However, the large variability of table layouts and the lack of structural information in digital document formats pose significant challenges for information retrieval and related tasks. In this paper we present two table recognition methods based on unsupervised learning techniques and heuristics which automatically detect both the location and the structure of tables within an article stored as PDF. For both algorithms the table region detection first identifies the bounding boxes of individual tables from a set of labelled text blocks. In the second step, two different tabular structure detection methods extract a rectangular grid of table cells from the set of words contained in these table regions. We evaluate each stage of the algorithms separately and compare performance values on two data sets from different domains. We find that the table recognition performance is in line with state-of-the-art commercial systems and generalises to the non-scientific domain.

Excellent article if you have ever struggled with the endless tables in government documents.
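To make the second step in the abstract concrete (turning the words inside a table region into a rectangular grid of cells), here is a deliberately naive sketch. It is not either of the paper's methods: it is a simple gap-based heuristic, and the `Word` type, the `row_tol` and `min_gap` thresholds, and the assumption that `y` grows downward are all mine.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Word(NamedTuple):
    """A word with a horizontal extent and a vertical centre (hypothetical type)."""
    text: str
    x0: float
    x1: float
    y: float


def merge_spans(spans: Sequence[Tuple[float, float]], min_gap: float) -> List[List[float]]:
    """Merge horizontal spans that overlap or sit closer than min_gap."""
    merged: List[List[float]] = []
    for start, end in sorted(spans):
        if merged and start - merged[-1][1] <= min_gap:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def words_to_grid(words: Sequence[Word], row_tol: float = 2.0,
                  min_gap: float = 5.0) -> List[List[str]]:
    """Assign words to rows (by y) and columns (by merged x spans)."""
    columns = merge_spans([(w.x0, w.x1) for w in words], min_gap)
    rows: List[List[list]] = []
    last_y = None
    for w in sorted(words, key=lambda w: (w.y, w.x0)):
        # Start a new row when the vertical jump exceeds the tolerance.
        if last_y is None or w.y - last_y > row_tol:
            rows.append([[] for _ in columns])
        last_y = w.y
        col = next(i for i, (s, e) in enumerate(columns) if s <= w.x0 <= e)
        rows[-1][col].append((w.x0, w.text))
    # Within a cell, read words left to right.
    return [[" ".join(t for _, t in sorted(cell)) for cell in row] for row in rows]


words = [
    Word("Name", 10, 40, 100), Word("Age", 80, 100, 100),
    Word("Ada", 10, 30, 115), Word("Lovelace", 33, 70, 115.5),
    Word("36", 80, 92, 115),
]
grid = words_to_grid(words)
print(grid)  # [['Name', 'Age'], ['Ada Lovelace', '36']]
```

Real documents break this quickly (spanning headers, ragged columns, multi-line cells), which is exactly why the paper's more careful methods are worth reading.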

I first saw this in a tweet by Anita de Waard.

Extending GraphLab to tables

Sunday, February 23rd, 2014

Extending GraphLab to tables by Ben Lorica.

From the post:

GraphLab’s SFrame, an interesting and somewhat under-the-radar tool was unveiled at Strata Santa Clara. It is a disk-based, flat table representation that extends GraphLab to tabular data. With the addition of SFrame, users can leverage GraphLab’s many algorithms on data stored as either graphs or tables. More importantly SFrame increases GraphLab’s coverage of the data science workflow: it allows users with terabyte-sized datasets to clean their data and create new features directly within GraphLab (SFrame performance can scale linearly with the number of available cores).

The beta version of SFrame can read data from local disk, HDFS, S3 or a URL, and save to a human-readable .csv or a more efficient native format. Once an SFrame is created and saved to disk no reprocessing of the data is needed. Below is Python code that illustrates how to read a .csv file into SFrame, create a new data feature and save it to disk on S3:
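Those three steps (read a .csv, create a new feature, save the result) can be sketched without SFrame at all, using only Python's standard `csv` module. This is not the SFrame API and not the code from the post: it writes to a local stream rather than S3, and the `add_feature` helper and the `price`, `qty` and `total` column names are hypothetical.

```python
import csv
import io
from typing import Callable, Dict, TextIO


def add_feature(src: TextIO, dst: TextIO, new_column: str,
                fn: Callable[[Dict[str, str]], object]) -> None:
    """Stream rows from src to dst, appending one derived column."""
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames or []) + [new_column]
    writer = csv.DictWriter(dst, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # csv yields strings, so the feature function does its own parsing.
        row[new_column] = fn(row)
        writer.writerow(row)


# In-memory streams stand in for files opened with open(path, newline="").
src = io.StringIO("price,qty\n2.5,4\n1.0,3\n")
dst = io.StringIO()
add_feature(src, dst, "total", lambda r: float(r["price"]) * float(r["qty"]))
print(dst.getvalue())
```

Processing one row at a time keeps memory use flat for large files, although it has none of the disk-backed, multi-core behavior the post attributes to SFrame.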

Jay Gu wrote Introduction to SFrame, which isn’t as short as the coverage on the GraphLab Create FAQ.

Remember that Spark has integrated GraphX and so has also extended its reach into the data processing workflow.

The standard for graph software is growing by leaps and bounds!

Why the Obsession with Tables?

Thursday, May 2nd, 2013

Why the Obsession with Tables? by Robert Kosara.

From the post:

Lots of data are still presented and released as tables. But why, when we know that visual representations are so much easier to read and understand? Eric Newburger from the U.S. Census Bureau has an interesting theory.

In a short talk on visualization at the Census Bureau, he describes how in the 1880s, the Census published maps and charts. Many of those are actually amazingly well done, even by today’s standards. But starting with the 1890 census, they were replaced with tables.

This, according to Newburger, was due to an important innovation: the Hollerith Tabulating Machine. The new machines were much faster and could slice and dice the data in a lot of new ways, but their output ended up in tables. Throughout the 20th century, the Census created an enormous number of tables, with only a small fraction of the data shown as maps or charts.

Newburger argues that people don’t bother trying to read tables, whereas visualizations are much more likely to catch their attention and get them interested in the underlying data. We clearly have the means to create any visualization we want today, and there is plenty of data available, so why keep publishing tables? It’s a matter of the attitudes towards data, and these can be hard to change after more than 100 years.

Any suggestions for images of the maps and charts from the Census in the 1880s?

If the Hollerith Tabulating Machine is responsible for the default to tables, is it also responsible for spreadsheets?

Quicker for a machine to produce but less useful to an end user.

Introducing Tabula

Thursday, April 4th, 2013

Introducing Tabula by Manuel Aristarán, Mike Tigas.

From the post:

Tabula lets you upload a (text-based) PDF file into a simple web interface and magically pull tabular data into CSV format.

It is hard to say why governments and others imprison tabular data in PDF files.

I suspect they see some advantage in preventing comparison to other data or even checking the consistency of data in a single report.

Whatever their motivations, let’s disappoint them!

Details on how to help are in the blog post.

Better table search through Machine Learning and Knowledge

Friday, August 24th, 2012

Better table search through Machine Learning and Knowledge by Johnny Chen.

From the post:

The Web offers a trove of structured data in the form of tables. Organizing this collection of information and helping users find the most useful tables is a key mission of Table Search from Google Research. While we are still a long way away from the perfect table search, we made a few steps forward recently by revamping how we determine which tables are “good” (one that contains meaningful structured data) and which ones are “bad” (for example, a table that holds the layout of a Web page). In particular, we switched from a rule-based system to a machine learning classifier that can tease out subtleties from the table features and enables rapid quality improvement iterations. This new classifier is a support vector machine (SVM) that makes use of multiple kernel functions which are automatically combined and optimized using training examples. Several of these kernel combining techniques were in fact studied and developed within Google Research [1,2].
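The phrase "multiple kernel functions which are automatically combined" is easier to picture with a toy example. The sketch below is not Google's classifier and does not train an SVM: it only shows one simple, well-known heuristic for choosing combination weights, namely making each kernel's weight proportional to its (uncentered) kernel-target alignment with the labels. All function names and the XOR-style data are illustrative assumptions.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]
Kernel = Callable[[Vector, Vector], float]


def linear_kernel(x: Vector, y: Vector) -> float:
    return float(sum(a * b for a, b in zip(x, y)))


def rbf_kernel(x: Vector, y: Vector, gamma: float = 1.0) -> float:
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def gram(kernel: Kernel, points: Sequence[Vector]) -> List[List[float]]:
    """Kernel value for every pair of training points."""
    return [[kernel(p, q) for q in points] for p in points]


def alignment(k_matrix: List[List[float]], labels: Sequence[int]) -> float:
    """Uncentered alignment between K and the label matrix y y^T."""
    n = len(labels)
    inner = sum(k_matrix[i][j] * labels[i] * labels[j]
                for i in range(n) for j in range(n))
    k_norm = math.sqrt(sum(v * v for row in k_matrix for v in row))
    # The Frobenius norm of y y^T is n when every label is +1 or -1.
    return inner / (k_norm * n) if k_norm else 0.0


def alignment_weights(kernels: Sequence[Kernel], points: Sequence[Vector],
                      labels: Sequence[int]) -> List[float]:
    """Non-negative weights summing to 1, proportional to each alignment."""
    scores = [max(0.0, alignment(gram(k, points), labels)) for k in kernels]
    total = sum(scores)
    if total == 0:
        return [1.0 / len(kernels)] * len(kernels)
    return [s / total for s in scores]


def combine(kernels: Sequence[Kernel], weights: Sequence[float]) -> Kernel:
    """A non-negative weighted sum of kernels is itself a valid kernel."""
    def combined(x: Vector, y: Vector) -> float:
        return sum(w * k(x, y) for k, w in zip(kernels, weights))
    return combined


# XOR-style data: a linear kernel carries no signal here, an RBF kernel does.
points = [(0, 0), (1, 1), (0, 1), (1, 0)]
labels = [-1, -1, 1, 1]
kernels = [linear_kernel, rbf_kernel]
weights = alignment_weights(kernels, points, labels)
mixed = combine(kernels, weights)
print(weights)  # the linear kernel gets weight 0.0, the RBF kernel 1.0
```

The combined kernel `mixed` could then be handed to any kernel method; the point is only that the training labels, not a hand-written rule, decide how much each kernel counts.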

Important work on tables from Google Research.

Important in part because you can compare your efforts on accessible tables to theirs, to gain insight into what you are, or aren’t, doing “right.”

For any particular domain, you should be able to do better than a general solution.

BTW, I disagree with the “good” versus “bad” table distinction. I suspect that tables that hold the layout of web pages, say for a CMS, are more consistent than database tables of comparable size. And that data may or may not be important to you.

Important versus non-important data for a particular set of requirements is a defensible distinction.

“Good” versus “bad” tables is not.