After reading Jeff Larson’s account of his text mining adventures in ProPublica’s Jeff Larson on the NSA Crypto Story, I encountered a triplet of posts from Gary Sieling on Postgres and full-text indexes.
In order of appearance:
Fixing Issues Where Postgres Optimizer Ignores Full Text Indexes
GIN vs GiST For Faceted Search with Postgres Full Text Indexes
Querying Multiple Postgres Full-Text Indexes
If Postgres and full-text indexing are among your project requirements, these are must-read posts.
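For readers new to Postgres full-text search, here is a minimal sketch of the pattern Gary’s posts discuss. The table and column names are hypothetical, not taken from his examples:

```sql
-- Hypothetical documents table; names are illustrative only.
CREATE TABLE documents (
    id    serial PRIMARY KEY,
    body  text
);

-- A GIN index over a tsvector expression. GiST is the alternative
-- Gary benchmarks in the second post:
--   CREATE INDEX ... USING gist (to_tsvector('english', body));
CREATE INDEX documents_body_fts
    ON documents
    USING gin (to_tsvector('english', body));

-- The query must use the same to_tsvector expression as the index,
-- otherwise the optimizer can fall back to a sequential scan.
SELECT id
FROM documents
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'crypto & nsa');
```

Because this is an expression index, Postgres only considers it when the query repeats the indexed expression exactly, which is one of the ways the optimizer ends up ignoring full-text indexes, the problem Gary’s first post works through.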
Gary does note in the middle post that Solr with default options (no tuning) outperforms Postgres.
Solr would have been the better option for Jeff Larson when compared to Postgres. But the difference in that case is the contrast between structured data and “dumpster data.”
It appears that the hurly-burly race to enable “connecting the dots” post-9/11 was a response to findings like this one:
Structural barriers to performing joint intelligence work. National intelligence is still organized around the collection disciplines of the home agencies, not the joint mission. The importance of integrated, all-source analysis cannot be overstated. Without it, it is not possible to “connect the dots.” No one component holds all the relevant information.
Yep, that’s a #1-with-a-bullet problem.
Response? From the Manning and Snowden leaks, one can only guess that “dumpster data” is the preferred solution.
By “dumpster data” I mean data from different sources, agencies, etc., simply dumped into one large data store.
No wonder the NSA runs 600,000 queries a day, or about 20 million queries a month. That is a lot of data dumpster diving.
Secrecy may be hiding that data from the public, but poor planning is hiding it from the NSA.