Better table search through Machine Learning and Knowledge by Johnny Chen.
From the post:
The Web offers a trove of structured data in the form of tables. Organizing this collection of information and helping users find the most useful tables is a key mission of Table Search from Google Research. While we are still a long way away from the perfect table search, we made a few steps forward recently by revamping how we determine which tables are “good” (one that contains meaningful structured data) and which ones are “bad” (for example, a table that hold the layout of a Web page). In particular, we switched from a rule-based system to a machine learning classifier that can tease out subtleties from the table features and enables rapid quality improvement iterations. This new classifier is a support vector machine (SVM) that makes use of multiple kernel functions which are automatically combined and optimized using training examples. Several of these kernel combining techniques were in fact studied and developed within Google Research [1,2].
Important work on tables from Google Research.
Important in part because you can compare your efforts on accessible tables to theirs, to gain insight into what you are, or aren’t doing “right.”
For any particular domain, you should be able to do better than a general solution.
BTW, I disagree on the “good” versus “bad” table distinction. I suspect that tables that hold the layout of web pages, say for a CMS, are more consistent than database tables of comparable size. And that data, may or may not be important to you.
Important versus non-important data for a particular set of requirements is a defensible distinction.
“Good” versus “bad” tables is not.