RecordBreaker: Automatic structure for your text-formatted data
From the post:
This post was contributed by Michael Cafarella, an assistant professor of computer science at the University of Michigan. Mike’s research interests focus on databases, in particular managing Web data. Before becoming a professor, he was one of the founders of the Nutch and Hadoop projects with Doug Cutting. This first version of RecordBreaker was developed by Mike in conjunction with Cloudera.
RecordBreaker is a project that automatically turns your text-formatted data (logs, sensor readings, etc) into structured data, without any need to write parsers or extractors. In particular, RecordBreaker targets Avro as its output format. The project’s goal is to dramatically reduce the time spent preparing data for analysis, enabling more time for the analysis itself.
No quite “automatic” but a step in that direction and a useful one.
Think of “automatic” identification of subjects and associations in such files.
Like the files from campaign financing authorities.