Archive for the ‘PDF’ Category

Tabula: Extracting A Hit (sorry) Security List From PDF Report

Tuesday, December 5th, 2017

Benchmarking U.S. Government Websites, by Daniel Castro, Galia Nurko, and Alan McQuinn, provides a quick assessment of 468 of the most popular federal websites for “…page-load speed, mobile friendliness, security, and accessibility.”

Unfortunately, it has an ugly table layout:

Double column listings with the same headers?

There are 476 results on Stack Overflow this morning for extracting tables from PDF.

However, I need a one-cup-of-coffee, maybe two-cups-of-coffee, answer to extracting data from these tables.

Enter Tabula.

If you’ve ever tried to do anything with data provided to you in PDFs, you know how painful it is — there’s no easy way to copy-and-paste rows of data out of PDF files. Tabula allows you to extract that data into a CSV or Microsoft Excel spreadsheet using a simple, easy-to-use interface. Tabula works on Mac, Windows and Linux.

Tabula is easy to use: download, extract, start it, and point your web browser to http://localhost:8080. Then load your PDF file, select the table, and export the content.

I tried selecting the columns separately (one page at a time) but then used table recognition and selected the entirety of Table 6 (security evaluation). I don’t think it made any difference in the errors I was seeing in the result (dropping the first letter of site domains), but check with your data.

Warning: For some unknown reason, possibly a defect in the PDF and/or Tabula, the leading character from the second domain field was dropped on some entries. Not all, not consistently, but it was dropped. Not to mention missing the last line of entries on a couple of pages. Proofing is required!

Not to mention there were other recognition issues.

Capture wasn’t perfect due to underlying differences in the PDF:

,100,901,,100,"3,284",100,904,,100,"3,307",,,100,,,"3,340",,,,,,100,,,"9,012",

With proofing, we are way beyond two cups of coffee but once proofed, I tossed it into Calc and produced a single column CSV file: 2017-Benchmarking-US-Government-Websites-Security-Table-6.csv.
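The restacking step, turning the double-column listing into a single listing, can also be scripted. A minimal Python sketch, assuming each exported row holds the left listing's fields followed by the same number of fields for the right listing. The function name, the `width` parameter, and the sample domains are hypothetical, and real Tabula output will still need proofing first:

```python
import csv
import sys


def stack_double_columns(rows, width):
    """Turn rows holding two side-by-side listings into one listing.

    Each row is assumed to hold `width` fields for the left listing
    followed by `width` fields for the right listing.
    """
    left, right = [], []
    for row in rows:
        # Pad short rows so slicing always yields `width` fields per half.
        padded = list(row) + [""] * (2 * width - len(row))
        left.append(padded[:width])
        second = padded[width:2 * width]
        # Skip an empty right half (for example, the last row of a page).
        if any(field.strip() for field in second):
            right.append(second)
    # Reading order: the whole left listing, then the whole right listing.
    return left + right


# Hypothetical sample data, not taken from the report.
sample_rows = [
    ["example-a.gov", "100", "example-c.gov", "90"],
    ["example-b.gov", "95", "", ""],
]
csv.writer(sys.stdout).writerows(stack_double_columns(sample_rows, width=2))
```

Run it one page at a time if the listing restarts on each page, otherwise the left and right listings from different pages will be interleaved incorrectly.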


PS: I discovered a LibreOffice Calc “gotcha” in this exercise. If you select a column from the top and attempt to paste it under an existing column (same or different spreadsheet), you get the error message: “There is not enough room on the sheet to insert here.”

When you select a column from the top, it copies all the blank cells in that column so there truly isn’t sufficient space to paste it under another column. Tip: Always copy columns in Calc from the bottom of the column up.

Apache PDFBox 2 – Vulnerability Warning

Tuesday, July 5th, 2016

Apache PDFBox 2 by Dustin Marx.

From the post:

Apache PDFBox 2 was released earlier this year and Apache PDFBox 2.0.1 and Apache PDFBox 2.0.2 have since been released. Apache PDFBox is open source (Apache License Version 2) and Java-based (and so is easy to use with wide variety of programming language including Java, Groovy, Scala, Clojure, Kotlin, and Ceylon). Apache PDFBox can be used by any of these or other JVM-based languages to read, write, and work with PDF documents.

Apache PDFBox 2 introduces numerous bug fixes in addition to completed tasks and some new features. Apache PDFBox 2 now requires Java SE 6 (J2SE 5 was minimum for Apache PDFBox 1.x). There is a migration guide, Migration to PDFBox 2.0.0, that details many differences between PDFBox 1.8 and PDFBox 2.0, including updated dependencies (Bouncy Castle 1.53 and Apache Commons Logging 1.2) and “breaking changes to the library” in PDFBox 2.

PDFBox can be used to create PDFs. The next code listing is adapted from the Apache PDFBox 1.8 example “Create a blank PDF” in the Document Creation “Cookbook” examples. The referenced example explicitly closes the instantiated PDDocument and probably does so for benefit of those using a version of Java before JDK 7. For users of Java 7, however, try-with-resources is a better option for ensuring that the PDDocument instance is closed and it is supported because PDDocument implements AutoCloseable.

If you don’t know Apache PDFBox™, its homepage lists the following features:

  • Extract Text
  • Print
  • Split & Merge
  • Save as Image
  • Fill Forms
  • Create PDFs
  • Preflight
  • Signing

Warning: If you are using Apache PDFBox, update to the most recent version.

CVE-2016-2175 XML External Entity vulnerability (2016-05-27)

Due to a XML External Entity vulnerability we strongly recommend to update to the most recent version of Apache PDFBox.

Versions Affected: Apache PDFBox 1.8.0 to 1.8.11 and 2.0.0. Earlier, unsupported versions may be affected as well.

Mitigation: Upgrade to Apache PDFBox 1.8.12 respectively 2.0.1
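If you have many projects to audit, the version ranges in the advisory are easy to check in a script. A minimal Python sketch, assuming plain dotted numeric version strings (no suffixes such as release-candidate tags). The function names are hypothetical and the ranges are copied from the advisory text above:

```python
def parse_version(text):
    """Parse a dotted numeric version such as '1.8.11' into a 3-tuple."""
    parts = [int(part) for part in text.split(".")]
    parts += [0] * (3 - len(parts))  # treat '1.8' as '1.8.0'
    return tuple(parts)


def cve_2016_2175_status(version):
    """Classify a PDFBox version against the advisory's stated ranges."""
    v = parse_version(version)
    if (1, 8, 0) <= v <= (1, 8, 11) or v == (2, 0, 0):
        return "affected"
    if v < (1, 8, 0):
        # The advisory says earlier, unsupported versions may be affected.
        return "possibly affected (unsupported version)"
    return "not listed as affected"


for candidate in ("1.8.11", "1.8.12", "2.0.0", "2.0.1"):
    print(candidate, cve_2016_2175_status(candidate))
```

This only covers the one advisory quoted here. Later PDFBox releases have their own advisories, so "not listed as affected" is not the same as "safe."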

Manipulate PDFs with Python

Wednesday, January 14th, 2015

Manipulate PDFs with Python by Tim Arnold.

From the overview:

PDF documents are beautiful things, but that beauty is often only skin deep. Inside, they might have any number of structures that are difficult to understand and exasperating to get at. The PDF reference specification (ISO 32000-1) provides rules, but it is programmers who follow them, and they, like all programmers, are a creative bunch.

That means that in the end, a beautiful PDF document is really meant to be read and its internals are not to be messed with. Well, we are programmers too, and we are a creative bunch, so we will see how we can get at those internals.

Still, the best advice if you have to extract or add information to a PDF is: don’t do it. Well, don’t do it if there is any way you can get access to the information further upstream. If you want to scrape that spreadsheet data in a PDF, see if you can get access to it before it became part of the PDF. Chances are, now that it is inside the PDF, it is just a bunch of lines and numbers with no connection to its former structure of cells, formats, and headings.

If you cannot get access to the information further upstream, this tutorial will show you some of the ways you can get inside the PDF using Python. (emphasis in the original)

Definitely a collect-the-software-and-experiment type of post!

Is there a collection of “nasty” PDFs on the web? Thinking that would be a useful thing to have for testing tools such as the ones listed in this post. Not to mention getting experience with extracting information from them. Suggestions?
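Before collecting any software, you can get a first look at the internals with nothing but the standard library. This is my own minimal sketch, not code from the tutorial. It assumes the file starts with the `%PDF-` header and that indirect objects appear uncompressed as `N G obj`, so objects packed into object streams are not counted and binary stream data can produce false matches:

```python
import re


def pdf_summary(data: bytes) -> dict:
    """Report a few surface facts about the raw bytes of a PDF file."""
    header = re.match(rb"%PDF-(\d+\.\d+)", data)
    # Indirect objects are introduced by "<number> <generation> obj".
    objects = re.findall(rb"\b\d+\s+\d+\s+obj\b", data)
    return {
        "version": header.group(1).decode("ascii") if header else None,
        "indirect_objects": len(objects),
        "has_eof_marker": b"%%EOF" in data[-1024:],
    }


# A tiny hand-written fragment for illustration; it is not a valid,
# renderable PDF (no cross-reference table, no pages).
tiny = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n"
    b"trailer\n<< /Root 1 0 R >>\n%%EOF\n"
)
print(pdf_summary(tiny))
```

For a real file, read it with `open(path, "rb").read()` and pass the bytes in. Anything beyond counting markers calls for a proper PDF library.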

I first saw this in a tweet by Christophe Lalanne.

A Comparison of Two Unsupervised Table Recognition Methods from Digital Scientific Articles

Monday, January 12th, 2015

A Comparison of Two Unsupervised Table Recognition Methods from Digital Scientific Articles by Stefan Klampfl, Kris Jack, Roman Kern.


In digital scientific articles tables are a common form of presenting information in a structured way. However, the large variability of table layouts and the lack of structural information in digital document formats pose significant challenges for information retrieval and related tasks. In this paper we present two table recognition methods based on unsupervised learning techniques and heuristics which automatically detect both the location and the structure of tables within a article stored as PDF. For both algorithms the table region detection first identifies the bounding boxes of individual tables from a set of labelled text blocks. In the second step, two different tabular structure detection methods extract a rectangular grid of table cells from the set of words contained in these table regions. We evaluate each stage of the algorithms separately and compare performance values on two data sets from different domains. We find that the table recognition performance is in line with state-of-the-art commercial systems and generalises to the non-scientific domain.

Excellent article if you have ever struggled with the endless tables in government documents.

I first saw this in a tweet by Anita de Waard.

CERMINE: Content ExtRactor and MINEr

Wednesday, September 24th, 2014

CERMINE: Content ExtRactor and MINEr

From the webpage:

CERMINE is a Java library and a web service for extracting metadata and content from scientific articles in born-digital form. The system analyses the content of a PDF file and attempts to extract information such as:

  • Title of the article
  • Journal information (title, etc.)
  • Bibliographic information (volume, issue, page numbers, etc.)
  • Authors and affiliations
  • Keywords
  • Abstract
  • Bibliographic references

CERMINE at GitHub

I used the following three files for a very subjective test of the online interface:

I am mostly interested in extraction of bibliographic entries and can report that while CERMINE made some mistakes, it is quite useful.

I first saw this in a tweet by Docear.

QPDF – PDF Transformations

Wednesday, September 10th, 2014

QPDF – PDF Transformations

From the webpage:

QPDF is a command-line program that does structural, content-preserving transformations on PDF files. It could have been called something like pdf-to-pdf. It also provides many useful capabilities to developers of PDF-producing software or for people who just want to look at the innards of a PDF file to learn more about how they work.

QPDF is capable of creating linearized (also known as web-optimized) files and encrypted files. It is also capable of converting PDF files with object streams (also known as compressed objects) to files with no compressed objects or to generate object streams from files that don’t have them (or even those that already do). QPDF also supports a special mode designed to allow you to edit the content of PDF files in a text editor….

Government agencies often publish information in PDF, which often has restrictions on copying and printing.

I have briefly tested QPDF and it does take care of copying and printing restrictions. Be aware that QPDF has many other capabilities as well.
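For batches of files, the same step can be driven from a script. A minimal Python sketch, assuming `qpdf` is installed and on your PATH and that the file opens without a user password. The function names and file names are hypothetical; `--decrypt` is the qpdf option that writes an unencrypted copy:

```python
import shutil
import subprocess


def qpdf_decrypt_command(source, target):
    """Build the qpdf command line that writes an unencrypted copy."""
    return ["qpdf", "--decrypt", source, target]


def remove_restrictions(source, target):
    """Run qpdf on one file; raises if qpdf is missing or reports failure."""
    if shutil.which("qpdf") is None:
        raise RuntimeError("qpdf was not found on PATH")
    # check=True raises on any non-zero exit status. qpdf also uses a
    # non-zero status for warnings, so a stricter or looser policy may suit you.
    subprocess.run(qpdf_decrypt_command(source, target), check=True)


print(" ".join(qpdf_decrypt_command("report.pdf", "report-open.pdf")))
```

Building the command as a list, rather than a shell string, avoids quoting problems with file names that contain spaces.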

Domeo and Utopia for PDF…

Tuesday, July 1st, 2014

Domeo and Utopia for PDF, Achieving annotation interoperability by Paolo Ciccarese.

From the description:

The Annotopia Open Annotation universal Hub allows to achieve annotation interoperability between different annotation clients. This is a first small demo where the annotations created with the Domeo Web Annotation Tool can be seen by the users of the Utopia for PDF application.

The demonstration shows highlighting of text and attachment of a note to an HTML page in a web browser and then the same document is loaded as PDF and the highlighting and note appear as specified in the HTML page.

The Domeo Web Annotation Tool appears to have the capacity to be a topic map authoring tool against full text.

Definite progress on the annotation front!

Next question is how do we find all the relevant annotations despite differences in user terminology? Same problem that we have with searching but in annotations instead of the main text.

You could start from some location in the text but I’m not sure all users will annotate the same material. Some may comment on the article in general; others will annotate very specific text.

Definitely a topic map issue both in terms of subjects in the text as well as in the annotations.

Overview and Splitting PDF Files

Wednesday, June 4th, 2014

I have been seeing tweets from the Overview Project that as of today, you can split PDF files into pages without going through DocumentCloud or other tools.

I don’t have Overview installed so I can’t confirm that statement but if true, it is a step in the right direction.

Think about it for a moment.

If you “tag” a one hundred page PDF file with all the “tags” you need to return to that document, what happens? Sure, you can go back to that document, but then you have to search for the material you were tagging.

It is a question of the granularity of your “tagging.” Now imagine tagging a page in PDF. Is it now easier for you to return to that one page? Can you also say it would be easier for someone else to return to the same page following your path?

Which makes you wonder about citation practices that simply cite an article and not a location within the article.

Are they trying to make your job as a reader that much harder?


Thursday, May 22nd, 2014


From the webpage:

PDFium is an open-source PDF rendering engine.

Just in case you need a PDF rendering engine for your topic map application and/or want to make subjects out of the internal structure of PDF files.

I first saw this at Nat Torkington’s Four short links: 22 May 2014.

Is PDF the Problem?

Wednesday, May 14th, 2014

The solutions to all our problems may be buried in PDFs that nobody reads by Christopher Ingraham.

From the post:

What if someone had already figured out the answers to the world’s most pressing policy problems, but those solutions were buried deep in a PDF, somewhere nobody will ever read them?

According to a recent report by the World Bank, that scenario is not so far-fetched. The bank is one of those high-minded organizations — Washington is full of them — that release hundreds, maybe thousands, of reports a year on policy issues big and small. Many of these reports are long and highly technical, and just about all of them get released to the world as a PDF report posted to the organization’s Web site.

The World Bank recently decided to ask an important question: Is anyone actually reading these things? They dug into their Web site traffic data and came to the following conclusions: Nearly one-third of their PDF reports had never been downloaded, not even once. Another 40 percent of their reports had been downloaded fewer than 100 times. Only 13 percent had seen more than 250 downloads in their lifetimes. Since most World Bank reports have a stated objective of informing public debate or government policy, this seems like a pretty lousy track record.

I’m not so sure that the PDF format, annoying as it can be, lies at the heart of non-reading of World Bank reports.

Consider Rose Eveleth’s recent (2014) Academics Write Papers Arguing Over How Many People Read (And Cite) Their Papers.

Eveleth writes:

There are a lot of scientific papers out there. One estimate puts the count at 1.8 million articles published each year, in about 28,000 journals. Who actually reads those papers? According to one 2007 study, not many people: half of academic papers are read only by their authors and journal editors, the study’s authors write.

But not all academics accept that they have an audience of three. There’s a heated dispute around academic readership and citation—enough that there have been studies about reading studies going back for more than two decades.

In the 2007 study, the authors introduce their topic by noting that “as many as 50% of papers are never read by anyone other than their authors, referees and journal editors.” They also claim that 90 percent of papers published are never cited. Some academics are unsurprised by these numbers. “I distinctly remember focusing not so much on the hyper-specific nature of these research topics, but how it must feel as an academic to spend so much time on a topic so far on the periphery of human interest,” writes Aaron Gordon at Pacific Standard. “Academia’s incentive structure is such that it’s better to publish something than nothing,” he explains, even if that something is only read by you and your reviewers.

Fifty percent (50%) of papers have an audience of three? Bear in mind these aren’t papers from the World Bank but papers spread across a range of disciplines.

Before you decide that PDF format is the issue or that academic journal articles aren’t read, you need to consider other evidence from sources such as: Measuring Total Reading of Journal Articles, Donald W. King, Carol Tenopir, and Michael Clarke, D-Lib Magazine, October 2006, Volume 12 Number 10, ISSN 1082-9873.

King, Tenopir, and Clarke write in part:

The Myth of Low Use of Journal Articles

A myth that journal articles are read infrequently persisted over a number of decades (see, for example, Williams 1975, Lancaster 1978, Schauder 1994, Odlyzko 1996). In fact, early on this misconception led to a series of studies funded by the National Science Foundation (NSF) in the 1960s and 1970s to seek alternatives to traditional print journals, which were considered by many to be a huge waste of paper. The basis for this belief was generally twofold. First, many considered citation counts to be the principal indicator of reading articles, and studies showed that articles averaged about 10 to 20 citations to them (a number that has steadily grown over the past 25 years). Counts of citations to articles tend to be highly skewed with a few articles having a large number of citations and many with few or even no citation to them. This led to the perception that articles were read infrequently or simply not at all.

King, Tenopir, and Clarke make a convincing case that “readership” for an article is a more complex question than checking download statistics.

Let’s say that the question of usage/reading of reports/articles is open to debate. Depending on who you ask, some measures are thought to be better than others.

But there is a common factor that all of these studies ignore: Usage, however you define it, is based on article or paper level access.

What if instead of looking for an appropriate World Bank PDF (or other format) file, I could search for the data used in such a file? Or the analysis of some particular data that is listed in a file? I may or may not be interested in the article as a whole.

An author’s arrangement of data and their commentary on it is one presentation of data. Shouldn’t we divorce access to the data from reading it through the lens of the author?

If we want greater re-use of experimental, financial, survey and other data, then let’s stop burying it in an author’s presentation, whether delivered as print, PDF, or some other format.

I first saw this in a tweet by Duncan Hull.

Full-Text Indexing PDFs in Javascript

Saturday, November 9th, 2013

Full-Text Indexing PDFs in Javascript by Gary Sieling.

From the post:

Mozilla Labs received a lot of attention lately for a project impressive in it’s ambitions: rendering PDFs in a browser using only Javascript. The PDF spec is incredibly complex, so best of luck to the pdf.js team! On a different vein, Oliver Nightingale is implementing a Javascript full-text indexer in the Javascript – combining these two projects allows reproducing the PDF processing pipeline entirely in web browsers.

As a refresher, full text indexing lets a user search unstructured text, ranking resulting documents by a relevance score determined by word frequencies. The indexer counts how often each word occurs per document and makes minor modifications the text, removing grammatical features which are irrelevant to search. E.g. it might subtract “-ing” and change vowels to phonetic common denominators. If a word shows up frequently across the document set it is automatically considered less important, and it’s effect on resulting ranking is minimized. This differs from the basic concept behind Google PageRank, which boosts the rank of documents based on a citation graph.

Most database software provides full-text indexing support, but large scale installations are typically handled in more powerful tools. The predominant open-source product is Solr/Lucene, Solr being a web-app wrapper around the Lucene library. Both are written in Java.

Building a Javascript full-text indexer enables search in places that were previously difficult such as Phonegap apps, end-user machines, or on user data that will be stored encrypted. There is a whole field of research to encrypted search indices, but indexing and encrypting data on a client machine seems like a good way around this naturally challenging problem. (Emphasis added.)
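The indexing idea described in the quote (count words per document, then discount words that are common across the whole set) fits in a few lines. A minimal Python sketch of that weighting, not the scoring used by the JavaScript indexer in the post. The function names and sample documents are hypothetical, and the stemming step the quote mentions is left out:

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """docs maps name -> text. Returns (word counts per doc, weight per word)."""
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    doc_freq = Counter()
    for word_counts in counts.values():
        doc_freq.update(word_counts.keys())
    total = len(docs)
    # Words found in many documents get a lower weight than rare words.
    idf = {word: math.log(total / df) + 1.0 for word, df in doc_freq.items()}
    return counts, idf


def search(query, counts, idf):
    """Return document names ranked by summed (count * weight) of query words."""
    scores = {}
    for name, word_counts in counts.items():
        score = sum(word_counts[w] * idf.get(w, 0.0) for w in tokenize(query))
        if score > 0:
            scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)


docs = {
    "a": "pdf tables pdf",
    "b": "pdf indexing",
    "c": "pdf tables extraction tables",
}
counts, idf = build_index(docs)
print(search("tables", counts, idf))
```

Because “pdf” occurs in every sample document it carries the lowest weight, which is the behavior the quote describes for frequent words.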

The need for a full-text indexer without using one of the major indexing packages had not occurred to me.

Access to the user’s machine might be limited by time, for example. You would not want to waste cycles spinning up a major indexer when you don’t know the installed software.

Something to add to your USB stick. 😉

Docear 1.0 (stable),…

Thursday, October 17th, 2013

Docear 1.0 (stable), a new video, new manual, new homepage, new details page, … by Joeran Beel.

From the post:

It’s been almost two years since we released the first private Alpha of Docear and today, October 17 2013, Docear 1.0 (stable) is finally available for Windows, Mac, and Linux to download. We are really proud of what we accomplished in the past years and we think that Docear is better than ever. In addition to all the enhancements we made during the past years, we completely rewrote the manual with step-by-step instructions including an overview of supported PDF viewers, we changed the homepage, we created a new video, and we made the features & details page much more comprehensive. For those who already use Docear 1.0 RC4, there are not many changes (just a few bug fixes). For new users, we would like to explain what Docear is and what makes it so special.

Docear is a unique solution to academic literature management that helps you to organize, create, and discover academic literature. The three most distinct features of Docear are:

  1. A single-section user-interface that differs significantly from the interfaces you know from Zotero, JabRef, Mendeley, Endnote, … and that allows a more comprehensive organization of your electronic literature (PDFs) and the annotations you created (i.e highlighted text, comments, and bookmarks).
  2. A ‘literature suite concept’ that allows you to draft and write your own assignments, papers, theses, books, etc. based on the annotations you previously created.
  3. A research paper recommender system that allows you to discover new academic literature.

Aside from Docear’s unique approach, Docear offers many features more. In particular, we would like to point out that Docear is free, open source, not evil, and Docear gives you full control over your data. Docear works with standard PDF annotations, so you can use your favorite PDF viewer. Your reference data is directly stored as BibTeX (a text-based format that can be read by almost any other reference manager). Your drafts and folders are stored in Freeplane’s XML format, again a text-based format that is easy to process and understood by several other applications. And although we offer several online services such as PDF metadata retrieval, backup space, and online viewer, we do not force you to register. You can just install Docear on your computer, without any registration, and use 99% of Docear’s functionality.

But let’s get back to Docear’s unique approach for literature management…

Impressive “academic literature management” package!

I have done a lot of research over the years but unaided in large part by citation management software. Perhaps it is time to try a new approach.

Just scanning the documentation, it does not appear that I can share my Docear annotations with another user.

Unless we were fortunate enough to have used the same terminology the same way while doing our research.

That is to say any research project I undertake will result in the building of a silo that is useful to me, but that others will have to duplicate.

If true (I just scanned the documentation), that is an observation and not a criticism.

I will keep track of my experience with a view towards suggesting changes that could make Docear more transparent.


Tuesday, September 3rd, 2013


Not a recent release but version 2.02 of the PDF Toolkit is available at PDF Labs.

I could rant about PDF as a format but that won’t change the necessity of processing them.

The PDF Toolkit has the potential to take some of the pain out of that task.

BTW, as of today, the Pro version is only $3.99 and the proceeds support development of the GPL PDFtk.

Not a bad investment.

Working with PDFs…

Saturday, August 31st, 2013

Working with PDFs Using Command Line Tools in Linux by William J. Turkel.

From the post:

We have already seen that the default assumption in Linux and UNIX is that everything is a file, ideally one that consists of human- and machine-readable text. As a result, we have a very wide variety of powerful tools for manipulating and analyzing text files. So it makes sense to try to convert our sources into text files whenever possible. In the previous post we used optical character recognition (OCR) to convert pictures of text into text files. Here we will use command line tools to extract text, images, page images and full pages from Adobe Acrobat PDF files.

A great post if you are working with PDF files.

Freeing Information From Its PDF Chains

Friday, May 31st, 2013

Parsing PDFs at Scale with Node.js, PDF.js, and Lunr.js by Gary Sieling.

From the post:

Much information is trapped inside PDFs, and if you want to analyze it you’ll need a tool that extracts the text contents. If you’re processing many PDFs (XX millions), this takes time but parallelizes naturally. I’ve only seen this done on the JVM, and decided to do a proof of concept with new Javascript tools. This runs Node.js as a backend and uses PDF.js, from Mozilla Labs, to parse PDFs. A full-text index is also built, the beginning of a larger ingestion process.
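The “parallelizes naturally” point holds in any language. A minimal Python sketch of the pattern only, not the Node.js code from the post. It assumes you supply your own `extract_text` function (hypothetical here) that takes a path and returns text, for example by calling an external tool:

```python
from concurrent.futures import ThreadPoolExecutor


def extract_all(paths, extract_text, max_workers=4):
    """Apply extract_text to every path in parallel.

    Returns {path: text}, with None for files that failed to parse.
    """
    paths = list(paths)

    def safe(path):
        try:
            return extract_text(path)
        except Exception:
            return None  # one bad PDF should not stop the batch

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(paths, pool.map(safe, paths)))


# Stand-in extractor for illustration; a real one would parse the file.
def fake_extract(path):
    if path.endswith(".broken"):
        raise ValueError("cannot parse")
    return "text of " + path


print(extract_all(["a.pdf", "b.broken", "c.pdf"], fake_extract))
```

Threads suit extractors that wait on an external process or on disk. A CPU-bound, pure-Python parser would more likely want `ProcessPoolExecutor` instead.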

I like the phrase “[m]uch information is trapped inside PDFs….”

Despite window-dressing executive orders, information is going to continue to be trapped inside PDFs.

What information do you want to free from its PDF chains?

I first saw this at DZone.

Cool Tools: pdf2html

Friday, May 10th, 2013

Cool Tools: pdf2html by Derek Willis.

From the post:

A PDF does one thing very well: it presents an accurate image that can be viewed on just about any device. Unfortunately, PDFs also cause grief for anyone who wants to use the data they contain. Governments, in particular, have a habit of releasing PDFs when the information would be more useful and accessible as a spreadsheet. The tools for extracting text from PDFs can be flaky, but Lu Wang’s pdf2htmlEX project solves this problem. Pdf2htmlEX takes PDFs and converts them into HTML5 documents while preserving the layout and appearance of the original.

This looks very cool!

Of course, moving from HTML5 is left as an exercise for the reader. 😉

Indexing PDF for OSINT and Pentesting [or Not!]

Saturday, April 6th, 2013

Indexing PDF for OSINT and Pentesting by Alejandro Nolla.

From the post:

Most of us, when conducting OSINT tasks or gathering information for preparing a pentest, draw on Google hacking techniques like site:company.acme filetype:pdf “for internal use only” or something similar to search for potential sensitive information uploaded by mistake. At other times, a customer will ask us to find out if through negligence they have leaked this kind of sensitive information and we proceed to make some google hacking fu.

But, what happens if we don’t want to make this queries against Google and, furthermore, follow links from search that could potentially leak referrers? Sure we could download documents and review them manually in local but it’s boring and time consuming. Here is where Apache Solr comes into play for processing documents and creating an index of them to give us almost real time searching capabilities.

A nice outline of using Solr for internal security testing of PDF files.

At the same time, a nice outline of using Solr for external security testing of PDF files. 😉

You can sweep sites for new PDF files on a periodic basis and retain only those meeting particular criteria.

Low grade ore but even low grade ore can have a small diamond every now and again.

Introducing Tabula

Thursday, April 4th, 2013

Introducing Tabula by Manuel Aristarán, Mike Tigas.

From the post:

Tabula lets you upload a (text-based) PDF file into a simple web interface and magically pull tabular data into CSV format.

It is hard to say why governments and others imprison tabular data in PDF files.

I suspect they see some advantage in preventing comparison to other data or even checking the consistency of data in a single report.

Whatever their motivations, let’s disappoint them!

Details on how to help are in the blog post.

Bookmarks/Notes in PDF

Tuesday, January 29th, 2013

One of the advantages of the original topic maps standard, being based on HyTime, was its ability to point into documents. That is, the structure of a document could be treated as an anchor for linking into the document.

Sadly I am not writing to announce the availability of a HyTime utility for pointing into PDF.

I am writing to list links to resources for creating bookmarks/notes in PDF.

Not the next best thing but a pale substitute until something better comes along.

Open Source:

JPdfBookmarks: Pdf bookmarks editor: Active project with excellent documentation (including on bookmarks themselves). GPLv3 license.

Ahem, commercial options:



Nitro Reader

Others that I have overlooked?

Pointing into PDF is an important issue because scanning/reading the same introductory materials on graphs in dozens of papers is tiresome.

A link directly to the material of interest would save time and quite possibly serve as an extraction point for collating the important bits from several papers together.

Think of it as automated note taking with the advantage of not forgetting to write down the proper citation information.
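Short of HyTime, the coarsest form of pointing into a PDF is a page-level link. A minimal Python sketch, assuming the viewer honors the `#page=N` fragment identifier (support varies by viewer). The function name and the URL are hypothetical:

```python
from urllib.parse import urldefrag


def link_to_page(pdf_url, page):
    """Return pdf_url with a fragment asking the viewer to open `page`."""
    if page < 1:
        raise ValueError("PDF page numbers start at 1")
    # Drop any fragment already on the URL before adding the new one.
    base, _old_fragment = urldefrag(pdf_url)
    return f"{base}#page={page}"


print(link_to_page("https://example.org/paper.pdf", 12))
```

A page is still a blunt anchor compared to a paragraph or a table, but a link like this can be recorded next to the citation when taking notes.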

Utopia Documents

Thursday, December 27th, 2012

I was checking the “sponsored by” link for pdfx v1.0 and discovered Utopia Documents.

From the homepage:

Reading, redefined.

Utopia Documents brings a fresh new perspective to reading the scientific literature, combining the convenience and reliability of the PDF with the flexibility and power of the web. Free for Linux, Mac and Windows.

Building Bridges

The scientific article has been described as a Story That Persuades With Data, but all too often the link between data and narrative is lost somewhere in the modern publishing process. Utopia Documents helps to rebuild these connections, linking articles to underlying datasets, and making it easy to access online resources relating to an article’s content.

A Living Resource

Published articles form the ‘minutes of science‘, creating a stable record of ideas and discoveries. But no idea exists in isolation, and just because something has been published doesn’t mean that the story is over. Utopia Documents reconnects PDFs with the ongoing discussion, keeping you up-to-date with the latest knowledge and metrics.


Make private notes for yourself, annotate a document for others to see or take part in an online discussion.

Explore article content

Looking for clarification of given terms? Or more information about them? Do just that, with integrated semantic search.

Interact with live data

Interact directly with curated database entries- play with molecular structures; edit sequence and alignment data; even plot and export tabular data.

A finger on the pulse

Stay up to date with the latest news. Utopia connects what you read with live data from Altmetric, Mendeley, CrossRef, Scibite and others.

A user can register for an account (enabling comments on documents) or use the application anonymously.

Presently focused on the life sciences but no impediment to expansion into computer science for example.

It doesn’t solve semantic diversity issues so an opportunity for topic maps there.

Doesn’t address the issue of documents being good at information delivery but not so good for information storage.

But issues of semantic diversity and information storage are growth areas for Utopia Documents, not reservations about its use.

Suggest you start using and exploring Utopia Documents sooner rather than later!

pdfx v1.0 [PDF-to-XML]

Thursday, December 27th, 2012

pdfx v1.0

From the homepage:

Fully-automated PDF-to-XML conversion of scientific text

I submitted Static and Dynamic Semantics of NoSQL Languages, a paper I blogged about earlier this week. Twenty-four pages of lots of citations and equations.

I forgot to set a timer but it isn’t for the impatient. I think the conversion ran more than ten (10) minutes.

Some mathematical notation defeats the conversion process.

See: Static-and-Dynamic-Semantics-NoSQL-Languages.tar.gz for the original PDF plus the HTML and PDF outputs.

For occasional conversions where heavy math notation isn’t required, this may prove to be quite useful.

Layout-aware text extraction from full-text PDF of scientific articles

Monday, October 8th, 2012

Layout-aware text extraction from full-text PDF of scientific articles by Cartic Ramakrishnan, Abhishek Patnia, Eduard Hovy and Gully APC Burns. (Source Code for Biology and Medicine 2012, 7:7 doi:10.1186/1751-0473-7-7)



The Portable Document Format (PDF) is the most commonly used file format for online scientific publications. The absence of effective means to extract text from these PDF files in a layout-aware manner presents a significant challenge for developers of biomedical text mining or biocuration informatics systems that use published literature as an information source. In this paper we introduce the ‘Layout-Aware PDF Text Extraction’ (LA-PDFText) system to facilitate accurate extraction of text from PDF files of research articles for use in text mining applications.


Our paper describes the construction and performance of an open source system that extracts text blocks from PDF-formatted full-text research articles and classifies them into logical units based on rules that characterize specific sections. The LA-PDFText system focuses only on the textual content of the research articles and is meant as a baseline for further experiments into more advanced extraction methods that handle multi-modal content, such as images and graphs. The system works in a three-stage process: (1) Detecting contiguous text blocks using spatial layout processing to locate and identify blocks of contiguous text, (2) Classifying text blocks into rhetorical categories using a rule-based method and (3) Stitching classified text blocks together in the correct order resulting in the extraction of text from section-wise grouped blocks. We show that our system can identify text blocks and classify them into rhetorical categories with Precision = 0.96% Recall = 0.89% and F1 = 0.91%. We also present an evaluation of the accuracy of the block detection algorithm used in step 2. Additionally, we have compared the accuracy of the text extracted by LA-PDFText to the text from the Open Access subset of PubMed Central. We then compared this accuracy with that of the text extracted by the PDF2Text system, commonly used to extract text from PDF. Finally, we discuss preliminary error analysis for our system and identify further areas of improvement.


LA-PDFText is an open-source tool for accurately extracting text from full-text scientific articles. The release of the system is available at

Scanning TOCs from a variety of areas can uncover goodies like this one.

What is the most recent “unexpected” paper/result outside your “field” that you have found?