I first noticed this item at Mathew Hurst's blog: Table Competition at ICDAR 2011.
As a markup person with some passing familiarity with table encoding issues, this is just awesome!
Update: Competition registration closes March 10, 2011. Registration consists of expressing interest in competing, by email, to the competition organisers.
The basic description is OK:
Motivation: Tables are a prominent element of communication in documents, often containing information that would otherwise take many paragraphs to express. The first step to table understanding is to derive the table's physical model, i.e. identify its location and its component cells, rows and columns. Several authors have dedicated themselves to these tasks using diverse methods; however, it is difficult to know which methods work best under which circumstances because of the diverse testing conditions each uses. This competition aims to address this lacuna in our field.
Tasks: This competition will involve two independent sub-competitions. Authors may choose to compete for one task or the other or both.
1. Table location sub-competition:
This task consists of identifying which lines in the document belong to the same table area and which do not.
2. Table segmentation sub-competition:
This task consists of identifying which column the cells of each table belong to, i.e. identifying which cells belong to the same column. Each cell should be assigned a start and an end column index (which differ from each other for spanning cells). Identifying row-spanning cells is not relevant for this competition.
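The announcement doesn't prescribe an output format, but to make the segmentation task concrete, here is a minimal sketch (in Python, with hypothetical field names of my own) of how a row's cells and their start/end column indices might be represented, including a column-spanning cell:

```python
from dataclasses import dataclass

@dataclass
class Cell:
    text: str
    start_col: int  # first column index the cell covers
    end_col: int    # equals start_col unless the cell spans columns

# One row of a toy table: the label spans columns 0-1, the figures sit in 2 and 3.
row = [
    Cell("Revenue 2010/2011", start_col=0, end_col=1),  # column-spanning cell
    Cell("1,234", start_col=2, end_col=2),
    Cell("5,678", start_col=3, end_col=3),
]

# The segmentation task asks which cells share a column: here, cells whose
# [start_col, end_col] ranges overlap would be assigned to the same column.
```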
But what I think will excite markup folks (and possibly topic map advocates) is the description of the data sets:
Description of the datasets: We have gathered 22 PDF financial statements. Our documents have lengths varying between 13 and 235 pages with very diverse page layouts, for example, pages can be organised in one or two columns and page headers and footers are included; each document contains between 3 and 162 tables. In Appendix A, we present some examples of pages in our dataset with tables that we consider hard to locate or segment. We randomly chose 19 documents for training and 3 for validation; our tougher cases turned out to be in the training set.
We then converted all files to ASCII using the pdftotext Linux utility (Red Hat Linux 7.2 (Enigma), October 22, 2001, Linux 2.4.7-10, pdftotext version 0.92, copyright 1996-2000 Derek B. Noonburg). As a result of the conversion, each line of each document became a line of ASCII, which, when imported into a database, becomes a record in a relational table. Apart from this, we collected an extra 19 PDF financial statements to form the test set; these were converted to ASCII using the same tool as the training set.
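As an aside, that conversion pipeline is easy to reproduce. Here is a minimal sketch, assuming a pdftotext binary is on the PATH (the organisers used version 0.92; any recent version should behave similarly) and using filenames and a table schema of my own invention, that converts one PDF and loads each line as a record in a SQLite table:

```python
import sqlite3
import subprocess

# Convert the PDF to plain text; pdftotext emits one text line per line of output.
subprocess.run(["pdftotext", "statement.pdf", "statement.txt"], check=True)

# Load every line as a record in a relational table, keyed by its line number.
conn = sqlite3.connect("statements.db")
conn.execute("CREATE TABLE IF NOT EXISTS lines (doc TEXT, line_no INTEGER, content TEXT)")
with open("statement.txt", encoding="utf-8", errors="replace") as fh:
    rows = [("statement.pdf", i, line.rstrip("\n")) for i, line in enumerate(fh, start=1)]
conn.executemany("INSERT INTO lines VALUES (?, ?, ?)", rows)
conn.commit()
conn.close()
```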
Table 1 below shows the resulting dimensions of the datasets and how they compare to those used by other authors (Wang et al. (2002)'s tables were automatically generated, and Pinto et al. (2003)'s belong to the same government statistics website). The sizes of the datasets in other papers are not far from ours. An exception is Cafarella et al. (2008), who created the first large repository of HTML tables, with 154 million tables. These consist of non-marked-up HTML tables detected using Wang and Hu (2002)'s algorithm, which is naturally subject to mistakes.
We then manually created the ground truth for this data, which involved: a) identifying which lines belong to tables and which do not; b) for each line, identifying how it should be clipped into cells; c) for each cell, identifying which table column it belongs to.
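Purely as an illustration (the actual ground-truth format is the organisers' to define), step (b), clipping a converted ASCII line into cells, might amount to recording character spans on the line, with step (c) then attaching a column index to each span:

```python
import re

# A converted ASCII line from a hypothetical table (monospaced columns).
line = "Operating expenses        1,234      5,678"

# Step (b): clip the line into cells. Here a run of 2+ spaces is treated as a
# cell boundary, and each cell is recorded as (start, end) character offsets.
clips = [(m.start(), m.end()) for m in re.finditer(r"\S+(?: \S+)*", line)]

# Step (c): assign each clipped cell to a table column.
ground_truth = [
    {"text": line[s:e], "span": (s, e), "column": col}
    for col, (s, e) in enumerate(clips)
]
print(ground_truth)
```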
Whether you choose to compete or not, this should prove to be very interesting.
Sorry, left off the dates from the original post:
Important dates:
- February 26, 2011 Training set is made available on the Competition Website
- March 10, 2011 Competition registration, which consists of expressing interest in competing, by email, to the competition organisers
- May 13, 2011 Validation set is made available on the Competition Website
- May 15, 2011 Submission of results by competitors, which should be executable files; if that is truly impossible, the test data will be given out to competitors, but results must then be submitted within no more than one hour (negotiable)
- June 15, 2011 Submission of summary paper for ICDAR’s proceedings, already including the identification of the competition’s winner
- September, 2011 Test set is made available on the Competition Website
- September, 2011 Announcement of the results will be made during ICDAR’2011, the competition session