Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

September 9, 2013

Google encrypts data amid backlash….

Filed under: Cybersecurity,NSA,Security — Patrick Durusau @ 4:59 pm

Google encrypts data amid backlash against NSA spying by Craig Timberg.

Be a smart consumer: Don’t pay extra for the encryption.

How did your data get to Google?

Digital equivalent of Lady Godiva?

Covering up data after it gets to Google may make your mother feel better, but it is wholly ineffectual.

From the post:

Encrypting information flowing among data centers will not make it impossible for intelligence agencies to snoop on individual users of Google services, nor will it have any effect on legal requirements that the company comply with court orders or valid national security requests for data. But company officials and independent security experts said that increasingly widespread use of encryption technology makes mass surveillance more difficult — whether conducted by governments or other sophisticated hackers.

Nor does it help if the NSA obtains a copy as it streams into Google.

Encrypting data before it leaves your computer makes surveillance more difficult.

The “I have nothing to hide” crowd needs to realize that encrypting data flows, all data flows, is a contribution to privacy around the world.
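To make that concrete, here is a minimal sketch of encrypting a file before it ever leaves your machine, using Python’s third-party cryptography package. The file names are placeholders and this is an illustration, not an endorsement of any particular tool.

    # Encrypt locally; only the ciphertext ever travels to Google (or anyone else).
    # Assumes: pip install cryptography
    from cryptography.fernet import Fernet

    key = Fernet.generate_key()        # keep this key offline; lose it, lose the data
    cipher = Fernet(key)

    with open("notes.txt", "rb") as f:
        ciphertext = cipher.encrypt(f.read())

    with open("notes.txt.enc", "wb") as f:
        f.write(ciphertext)            # upload the .enc file, not the plaintext

    # Later, with the same key:
    # plaintext = Fernet(key).decrypt(ciphertext)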

Until that happy day when we can all go dark from intelligence surveillance, be careful out there. Someone is watching over you. (Not for your best interest.)

PS: Be sure to read ‘I’ve Got Nothing to Hide’ and Other Misunderstandings of Privacy by Daniel J. Solove. Great essay on privacy from 2007 that mentions the NSA among others.

The Monstrous Cost of Work Failure

Filed under: Marketing,Project Management,Topic Maps — Patrick Durusau @ 3:34 pm

Failure Infographic

I first saw this posted by Randy Krum.

The full-sized infographic is at AtTask.

Would you care to guess what accounts for 60% to 80% of project failures?

According to the ASAPM (American Society for the Advancement of Project Management):

According to the Meta Group, 60% – 80% of project failures can be attributed directly to poor requirements gathering, analysis, and management. (emphasis added)

Requirements: what some programmers are too busy coding to collect, and what some managers fear because of accountability.

Topic maps can’t solve your human management problems.

Topic maps can address:

  • Miscommunication between business and IT – $30 Billion per year
  • 58% of workers spending half of each workday filing, deleting, and sorting information

Reducing information shuffling is like adding more staff for the same bottom line.

Interested?

September 8, 2013

Scaling Writes

Filed under: Graphs,Neo4j,Scalability — Patrick Durusau @ 5:58 pm

Scaling Writes by Max De Marzi.

From the post:

Most of the applications using Neo4j are read heavy and scale by getting more powerful servers or adding additional instances to the HA cluster. Writes however can be a little bit trickier. Before embarking on any of the following strategies it is best that the server is tuned.

Max encountered someone who wants to write data. How weird is that? 😉

Seriously, your numbers may vary from Max’s but you will be on your way to tuning write performance after reading this post.
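One common write-scaling tactic, whether or not it is among the strategies Max covers, is to batch many small writes into a single request. Here is a hedged Python sketch against Neo4j’s transactional Cypher HTTP endpoint; the endpoint path and parameter syntax are assumptions that depend on your Neo4j version.

    import requests

    people = [{"name": "Ada"}, {"name": "Grace"}, {"name": "Max"}]

    # One HTTP round trip for the whole batch instead of one per node.
    payload = {
        "statements": [
            {"statement": "CREATE (p:Person {name: {name}})", "parameters": p}
            for p in people
        ]
    }

    resp = requests.post(
        "http://localhost:7474/db/data/transaction/commit",
        json=payload,
    )
    resp.raise_for_status()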

Don’t depend on the NSA to capture your data. Freedom of Information requests take time and often have omissions. Test your recovery options with any writing solution.

Think Big… Right Start Big Data Projects [Religion]

Filed under: Marketing,Topic Maps — Patrick Durusau @ 4:45 pm

Think Big… Right Start Big Data Projects by Rod Bodkin (Think Big Analytics).

Rod lists three “Must Dos:”

  1. Test and learn
  2. Incremental adoption
  3. Change management

I won’t try to summarize Rod’s points. You will be better off reading the original post.

I would add this point: “Leave your technology religion at the door.”

Realize most customers have no religious convictions about software. And they are not interested in having religious convictions about software or knowing yours.

You may well be convinced your software or approach will be the salvation of the human race, solar system or even the galaxy.

My suggestion is you keep that belief to yourself. Unless your client wants world salvation rhetoric for fund raising or some other purpose.

Clients, from your local place of worship to Wall Street, from the NSA to the KGB, and everywhere in between, have some need other than paying you for your products or services.

The question you need to answer is: how does your product or service meet that need?

Postgres and Full Text Indexes

Filed under: Indexing,PostgreSQL,Solr — Patrick Durusau @ 4:06 pm

After reading Jeff Larson’s account of his text mining adventures in ProPublica’s Jeff Larson on the NSA Crypto Story, I encountered a triplet of posts from Gary Sieling on Postgres and full text indexes.

In order of appearance:

Fixing Issues Where Postgres Optimizer Ignores Full Text Indexes

GIN vs GiST For Faceted Search with Postgres Full Text Indexes

Querying Multiple Postgres Full-Text Indexes

If Postgres and full text indexing are project requirements, these are must read posts.
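As a minimal sketch of the setup those posts revolve around, a GIN index over to_tsvector() plus a query whose expression matches the index, here it is from Python with psycopg2. The connection string, table and column names are assumptions.

    import psycopg2

    conn = psycopg2.connect("dbname=docs user=postgres")
    cur = conn.cursor()

    # Expression index; the query below must use the same expression,
    # including the 'english' configuration, or the planner may ignore it.
    # (IF NOT EXISTS needs a newer Postgres; drop it on older versions.)
    cur.execute("""
        CREATE INDEX IF NOT EXISTS docs_body_fts
        ON documents USING gin (to_tsvector('english', body))
    """)
    conn.commit()

    cur.execute("""
        SELECT id, title
        FROM documents
        WHERE to_tsvector('english', body) @@ plainto_tsquery('english', %s)
    """, ("encryption standards",))

    for row in cur.fetchall():
        print(row)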

Gary does note in the middle post that Solr with default options (no tuning) outperforms Postgres.

Solr would have been the better option for Jeff Larson when compared to Postgres.

But the difference in that case is a contrast between structured data and “dumpster data.”

It appears that the hurly-burly race to enable “connecting the dots” post-9/11 was driven by findings like this one:

Structural barriers to performing joint intelligence work. National intelligence is still organized around the collection disciplines of the home agencies, not the joint mission. The importance of integrated, all-source analysis cannot be overstated. Without it, it is not possible to “connect the dots.” No one component holds all the relevant information.

Yep, a #1-with-a-bullet problem.

Response? From the Manning and Snowden leaks, one can only guess that “dumpster data” is the preferred solution.

By “dumpster data” I mean that data from different sources, agencies, etc., are simply dumped into a large data store.

No wonder the NSA runs 600,000 queries a day, or about 20 million queries a month. That is a lot of data dumpster diving.

Secrecy may be hiding that data from the public, but poor planning is hiding it from the NSA.

September 7, 2013

Most popular porn searches, by state

Filed under: Humor — Patrick Durusau @ 7:46 pm

Most popular porn searches, by state by Nathan Yau.

I can’t say I am surprised by hentai being the most popular search in Alabama.

Looking forward to seeing a color version at state hospitality centers. 😉

September 6, 2013

Open Source = See Where It Keeps Its Brain

Filed under: Cybersecurity,NSA,Open Source,Security — Patrick Durusau @ 5:52 pm

A recent article by James Ball, Julian Borger and Glenn Greenwald, How US and UK spy agencies defeat internet privacy and security, confirms that non-open source software is dangerous to your privacy, business, military and government (if you are not the U.S.).

One brief quote from an article you need to digest in full:

Funding for the program – $254.9m for this year – dwarfs that of the Prism program, which operates at a cost of $20m a year, according to previous NSA documents. Since 2011, the total spending on Sigint enabling has topped $800m. The program “actively engages US and foreign IT industries to covertly influence and/or overtly leverage their commercial products’ designs”, the document states. None of the companies involved in such partnerships are named; these details are guarded by still higher levels of classification.

Among other things, the program is designed to “insert vulnerabilities into commercial encryption systems”. These would be known to the NSA, but to no one else, including ordinary customers, who are tellingly referred to in the document as “adversaries”.

No names, but it isn’t hard to guess whose software products have backdoors.

How to know if your system is vulnerable to the U.S. government?

Find the Gartner Report that includes your current office suite or other software.

Compare the names in the Gartner report to your non-open source software. Read ’em and weep.

How to stop being vulnerable to the U.S. government?

A bit harder but doable.

Support the Apache Software Foundation and other open source software projects.

As Ginny Weasley finds in the Harry Potter series, it’s important to know where magical objects keep their brains.

The same is true for software. Just because you can’t see into it doesn’t mean it can’t see you. It may be spying on you.

Open software is far less likely to spy on you. Why? Because the backdoor or security compromise would be visible to anyone. Including people who would blow the whistle.

OpenOffice or other open source software not meeting your needs?

For OpenDocument Format (ODF) (used by numerous open source software projects), post your needs: office-comment-subscribe@lists.oasis-open.org (subscription link).

Support the open source project of your choice.

Or not, if you like being spied on by software you paid for.

September 5, 2013

NSA Crackers

Filed under: Cryptography,Cybersecurity,NSA,Security — Patrick Durusau @ 7:42 pm

Revealed: The NSA’s Secret Campaign to Crack, Undermine Internet Security by Jeff Larson, ProPublica, Nicole Perlroth, The New York Times, and Scott Shane, The New York Times.

From the story:

The National Security Agency is winning its long-running secret war on encryption, using supercomputers, technical trickery, court orders and behind-the-scenes persuasion to undermine the major tools protecting the privacy of everyday communications in the Internet age, according to newly disclosed documents.

The agency has circumvented or cracked much of the encryption, or digital scrambling, that guards global commerce and banking systems, protects sensitive data like trade secrets and medical records, and automatically secures the e-mails, Web searches, Internet chats and phone calls of Americans and others around the world, the documents show.

Many users assume — or have been assured by Internet companies — that their data is safe from prying eyes, including those of the government, and the N.S.A. wants to keep it that way. The agency treats its recent successes in deciphering protected information as among its most closely guarded secrets, restricted to those cleared for a highly classified program code-named Bullrun, according to the documents, provided by Edward J. Snowden, the former N.S.A. contractor.

Read the full story for the details.

If that weren’t bad enough, consider this from The NSA has cracked the secure internet: 3 things to know about the latest Snowden leaks by Jeff John Roberts:

Despite Thursday’s detailed revelations, the precise scope of the government’s power to break encryption is not clear. This is in part because the New York Times and Guardian did not publish all that they know. While the government asked the news agencies not to publish the stories, they only withheld certain details.

So, rather than a corrupt government withholding information from the public, the press decides it wants to withhold information as well?

That’s rather cold comfort from the defenders of the public’s right to know.

I understand why Glenn Greenwald has been releasing the Snowden documents in dribs and drabs.

You can see the evidence for yourself. Watch the news cycles. As one set of Snowden leaks starts to die off, suddenly there is another release from Greenwald.

Glenn is forty-six (46) now so he may be able to stay in the headlines for another nineteen years and retire to write books with more Snowden leaks. It’s a meal ticket.

The news media needs to choose sides.

It can side with inevitably corrupt governments and their venal servants or choose to side with the public.

Members of the public need to make their choices as well.

40 Maps…

Filed under: Mapping,Maps — Patrick Durusau @ 6:46 pm

40 Maps That Will Help You Make Sense of the World

From the post:

If you’re a visual learner like myself, then you know maps, charts and infographics can really help bring data and information to life. Maps can make a point resonate with readers and this collection aims to do just that.

Hopefully some of these maps will surprise you and you’ll learn something new. A few are important to know, some interpret and display data in a beautiful or creative way, and a few may even make you chuckle or shake your head.

If you enjoy this collection of maps, the Sifter highly recommends the r/MapPorn sub reddit. You should also check out ChartsBin.com. There were also fantastic posts on Business Insider and Bored Panda earlier this year that are worth checking out. Enjoy!

A must see collection of maps!

I’m not vouching for the accuracy of any of the maps.

After all, 20. Map of Countries with the Most Violations of Bribery shows none for the United States. Must have an odd definition of bribery.

United States Senators are paid $174,000 per year.

It cost $10.5 million on average to win a United States Senate seat.

Let’s see, six year term at $174,000 per year = $1,044,000. And you are going to spend more than 10X that amount to get the job? Plus that amount to retain it for another six years?

Or some public spirited person is going to give you > $10.5 million with no strings attached.

If you believe that last statement, please log off the Internet and never return. You are unsafe. (full stop)
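For the record, the arithmetic above, spelled out:

    salary = 174_000
    term_years = 6
    campaign_cost = 10_500_000

    term_pay = salary * term_years        # 1,044,000
    print(term_pay)
    print(campaign_cost / term_pay)       # roughly 10x the pay for the job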

Stinger Phase 2:…

Filed under: Hive,Hortonworks,SQL,STINGER — Patrick Durusau @ 6:28 pm

Stinger Phase 2: The Journey to 100x Faster Hive on Hadoop by Carter Shanklin.

From the post:

The Stinger Initiative is Hortonworks’ community-facing roadmap laying out the investments Hortonworks is making to improve Hive performance 100x and evolve Hive to SQL compliance to simplify migrating SQL workloads to Hive.

We launched the Stinger Initiative along with Apache Tez to evolve Hadoop beyond its MapReduce roots into a data processing platform that satisfies the need for both interactive query AND petabyte scale processing. We believe it’s more feasible to evolve Hadoop to cover interactive needs rather than move traditional architectures into the era of big data.

If you don’t think SQL is all that weird, ;-), this is a status update for you!

Serious progress is being made by a broad coalition of more than 60 developers.

Take the challenge and download HDP 2.0 Beta.

You can help build the future of SQL-IN-Hadoop.

But only if you participate.

Introducing Cloudera Search

Filed under: Cloudera,Hadoop,Search Engines — Patrick Durusau @ 6:15 pm

Introducing Cloudera Search

Cloudera Search 1.0 has hit the streets!

Download

Prior coverage of Cloudera Search: Hadoop for Everyone: Inside Cloudera Search.

Enjoy!

September 4, 2013

Text Analysis With R

Filed under: R,Text Analytics — Patrick Durusau @ 6:59 pm

Text Analysis With R for Students of Literature by Matthew L. Jockers.

A draft text asking for feedback but it has to be more enjoyable than some of the standards I have been reading. 😉

For that matter, some of the techniques Matthew describes should be useful in working with standards drafts.
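For instance, the word-frequency pass that kind of analysis usually starts with looks something like this, sketched in Python rather than the book’s R (the file name is a placeholder):

    import re
    from collections import Counter

    with open("draft-standard.txt", encoding="utf-8") as f:
        words = re.findall(r"[a-z']+", f.read().lower())

    # The twenty most frequent tokens, a first glimpse of an author's habits.
    for word, count in Counter(words).most_common(20):
        print(count, word)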

What’s under the hood in Cassandra 2.0

Filed under: Cassandra — Patrick Durusau @ 6:48 pm

What’s under the hood in Cassandra 2.0 by Jonathan Ellis.

If you haven’t already downloaded Cassandra 2.0, Jonathan has twenty-three (23) reasons why you should.

September 3, 2013

…Integrate Tableau and Hadoop…

Filed under: Hadoop,Hortonworks,Tableau — Patrick Durusau @ 7:34 pm

How To Integrate Tableau and Hadoop with Hortonworks Data Platform by Kim Truong.

From the post:

Chances are you’ve already used Tableau Software if you’ve been involved with data analysis and visualization solutions for any length of time. Tableau 6.1.4 introduced the ability to visualize large, complex data stored in Hadoop with Hortonworks Data Platform via Hive and the Hortonworks Hive ODBC driver.

If you want to get hands on with Tableau as quickly as possible, we recommend using the Hortonworks Sandbox and the ‘Visualize Data with Tableau’ tutorial.

(…)

Kim has a couple of great resources from Tableau to share with you so jump to her post now.

That’s right. I want you to look at someone else’s blog. That won’t catch on at capture sites with advertising, but then that’s not me.
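If you want to sanity-check the Hive ODBC data source outside of Tableau, here is a hedged sketch with Python’s pyodbc. The DSN and table names are assumptions that depend on how you configured the Hortonworks Hive ODBC driver and the Sandbox.

    import pyodbc

    # Same ODBC DSN Tableau would use; verify it answers a trivial query.
    conn = pyodbc.connect("DSN=HortonworksHive", autocommit=True)
    cur = conn.cursor()

    cur.execute("SELECT COUNT(*) FROM sample_07")
    print(cur.fetchone()[0])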

IDH Hbase & Lucene Integration

Filed under: HBase,IDH HBase,Lucene — Patrick Durusau @ 7:00 pm

IDH Hbase & Lucene Integration by Ritu Kama.

From the post:

HBase is a non-relational, column-oriented database that runs on top of the Hadoop Distributed File System (HDFS). Hbase’s tables contain rows and columns. Each table has an element defined as a Primary Key which is used for all Get/Put/Scan/Delete operations on those tables. To some extent this can be a shortcoming because one may want to search within, say, a given column.

The IDH Integration with Lucene

The Intel® Distribution for Apache Hadoop* (IDH) solves this problem by incorporating native features that permit straightforward integration with Lucene. Lucene is a search library that acts upon documents containing data fields and their values. The IDH-to-Lucene integration leverages the HBase Observer and Endpoint concepts, and therein lies the flexibility to access the HBase data with Lucene searches more robustly.

The Observers can be likened to triggers in RDBMS’s, while the Endpoints share some conceptual similarity to stored procedures. The mapping of Hbase records and Lucene documents is done by a convenience class called IndexMetadata. The Hbase observer monitors data updates to the Hbase table and builds indexes synchronously. The Indexes are stored in multiple shards with each shard tied to a region. The Hbase Endpoint dispatches search requests from the client to those regions.

When entering data into an HBase table you’ll need to create an HBase-Lucene mapping using the IndexMetadata class. During the insertion, text in the columns that are mapped get broken into indexes and stored in the Lucene index file. This process of creating the Lucene index is done automatically by the IDH implementation. Once the Lucene index is created, you can search on any keyword. The implementation searches for the word in the Lucene index and retrieves the row ID’s of the target word. Then, using those keys you can directly access the relevant rows in the database.

IDH’s HBase-Lucene integration extends HBase’s capability and provides many advantages:

  1. Search not only by row key but also by values.
  2. Use multiple query types such as Starts, Ends, Contains, Range, etc.
  3. Ranking scores for the search are also available.

(…)

Interested yet?

See Ritu’s post for sample code and configuration procedures.

Definitely one for the short list on downloads to make.
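In the meantime, here is a toy Python sketch of the flow Ritu describes: index mapped columns as rows are written, let a search return row keys, then fetch the rows by key. It is illustrative only and does not use the IDH IndexMetadata API.

    index = {}   # term -> set of row keys (stand-in for the Lucene shards)
    table = {}   # row key -> row dict    (stand-in for the HBase table)

    def put(row_key, row, indexed_columns=("title", "body")):
        table[row_key] = row
        for col in indexed_columns:              # "observer" step: index on write
            for term in row.get(col, "").lower().split():
                index.setdefault(term, set()).add(row_key)

    def search(term):
        keys = index.get(term.lower(), set())    # "endpoint" step: search returns keys
        return [table[k] for k in keys]          # then a direct Get per key

    put("row1", {"title": "HBase and Lucene", "body": "search column values"})
    put("row2", {"title": "Plain HBase", "body": "row key access only"})
    print(search("lucene"))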

Elastisch 1.3.0-beta2 Is Released

Filed under: Clojure,ElasticSearch — Patrick Durusau @ 6:47 pm

Elastisch 1.3.0-beta2 Is Released

From the post:

Elastisch is a battle tested, small but feature rich and well documented Clojure client for ElasticSearch. It supports virtually every Elastic Search feature and has solid documentation.

Solid documentation. Well, the guides page says “10 minutes” to study Getting Started. And, Getting Started says it will take about “15 minutes to read and study the provided code examples.” No estimate on reading the prose. 😉

Just teasing.

If you are developing or maintaining your Clojure skills, this is a good opportunity to add a popular search engine to your skill set.

PDFtk

Filed under: PDF — Patrick Durusau @ 6:33 pm

PDFtk

Not a recent release but version 2.02 of the PDF Toolkit is available at PDF Labs.

I could rant about PDF as a format but that won’t change the necessity of processing them.

The PDF Toolkit has the potential to take some of the pain out of that task.
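A minimal sketch, assuming the pdftk binary is on your PATH (file names are placeholders): merge two PDFs, then pull a page range out of the result.

    import subprocess

    # Merge two files into one.
    subprocess.run(
        ["pdftk", "part1.pdf", "part2.pdf", "cat", "output", "combined.pdf"],
        check=True,
    )

    # Extract pages 1-5 from the merged file.
    subprocess.run(
        ["pdftk", "combined.pdf", "cat", "1-5", "output", "first_five.pdf"],
        check=True,
    )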

BTW, as of today, the Pro version is only $3.99 and the proceeds support development of the GPL PDFtk.

Not a bad investment.

OrientDB 1.5.1

Filed under: OrientDB — Patrick Durusau @ 6:24 pm

Release OrientDB 1.5.1 containing hot-fixes for 1.5 by Luca Garulli.

From the post:

Orient Technologies has just released the version 1.5.1 of OrientDB Standard and Graph Editions.

It pays to keep up to date.

Encounter the latest problems instead of old ones.

As they say, “use the latest version Luke!” 😉

Cassandra [2.0]

Filed under: Cassandra — Patrick Durusau @ 6:17 pm

Cassandra [2.0]

Cassandra 2.0 dropped today from the Apache Software Foundation.

If you don’t know Cassandra, check out the Getting Started guide.

Or visit Planet Cassandra that describes Cassandra this way:

Apache Cassandra is a massively scalable open source NoSQL database. Cassandra is perfect for managing large amounts of structured, semi-structured, and unstructured data across multiple data centers and the cloud. Cassandra delivers linear scalability and performance across many commodity servers with no single point of failure, and provides a powerful dynamic data model designed for maximum flexibility and fast response times.
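As a hedged taste of one headline 2.0 feature, lightweight transactions (compare-and-set), here is a sketch using the DataStax Python driver. The keyspace and table names are assumptions.

    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("demo")     # assumes the keyspace already exists

    session.execute("""
        CREATE TABLE IF NOT EXISTS users (username text PRIMARY KEY, email text)
    """)

    # Succeeds only if the row does not already exist -- no read-then-write race.
    result = session.execute(
        "INSERT INTO users (username, email) VALUES (%s, %s) IF NOT EXISTS",
        ("patrick", "patrick@example.com"),
    )
    print(result.one())   # the returned row carries an [applied] flag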

Enjoy!

Apache CouchDB 1.4.0 Released

Filed under: CouchDB — Patrick Durusau @ 6:09 pm

Apache CouchDB 1.4.0 Released

From the webpage:

Apache CouchDB 1.4.0 has been released and is available for download.

CouchDB is a database that completely embraces the web. Store your data with JSON documents. Access your documents with your web browser, via HTTP. Query, combine, and transform your documents with JavaScript. CouchDB works well with modern web and mobile apps. You can even serve web apps directly out of CouchDB. And you can distribute your data, or your apps, efficiently using CouchDB’s incremental replication. CouchDB supports master-master setups with automatic conflict detection.

Grab your copy here:

http://couchdb.apache.org/

Maybe I should spend more time reading graceless prose. 😉
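If you want to kick the tires, here is a minimal sketch of that HTTP-and-JSON interface from Python, assuming a local CouchDB that does not require authentication (the database name is a placeholder):

    import requests

    base = "http://localhost:5984"

    requests.put(base + "/scratch")                      # create a database
    requests.put(base + "/scratch/note-1",               # create a document
                 json={"type": "note", "text": "CouchDB stores JSON documents"})

    doc = requests.get(base + "/scratch/note-1").json()  # read it back over HTTP
    print(doc["text"], doc["_rev"])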

There are several more really nice software announcements to report for today.

Summingbird [Twitter open sources]

Filed under: Hadoop,Storm,Summingbird,Tweets — Patrick Durusau @ 5:59 pm

Twitter open sources Storm-Hadoop hybrid called Summingbird by Derrick Harris.

I look away for a few hours to review a specification and look what pops up:

Twitter has open sourced a system that aims to mitigate the tradeoffs between batch processing and stream processing by combining them into a hybrid system. In the case of Twitter, Hadoop handles batch processing, Storm handles stream processing, and the hybrid system is called Summingbird. It’s not a tool for every job, but it sounds pretty handy for those it’s designed to address.

Twitter’s blog post announcing Summingbird is pretty technical, but the problem is pretty easy to understand if you think about how Twitter works. Services like Trending Topics and search require real-time processing of data to be useful, but they eventually need to be accurate and probably analyzed a little more thoroughly. Storm is like a hospital’s triage unit, while Hadoop is like longer-term patient care.

This description of Summingbird from the project’s wiki does a pretty good job of explaining how it works at a high level.

(…)

While the Summingbird announcement is heavy sledding, it is well written. The projects spawned by Summingbird are rife with possibilities.

I appreciate Derrick’s comment:

It’s not a tool for every job, but it sounds pretty handy for those it’s designed to address.

I don’t know of any tools “for every job,” the opinions of some graph advocates notwithstanding. 😉

If Summingbird fits your problem set, spend some serious time seeing what it has to offer.

September 2, 2013

Defining Usability

Filed under: Design,Interface Research/Design,Usability — Patrick Durusau @ 7:39 pm

Over the Labor Day holiday weekend (U.S.) I had a house full of librarians.

That happens when you are married to a librarian, who has a first cousin who is a librarian and your child is also a librarian.

It’s no surprise they talked about library issues and information technology issues in libraries in particular.

One primary concern was how to define “usability” for a systems engineer.

Patrons could “request” items and would be assured that their request had been accepted. However, the “receiver” module for that message, used by circulation, had no way to retrieve the requests.

From a systems perspective, the system was accepting requests, as designed. While circulation (who fulfills the requests) could not retrieve the messages, that was also part of the system design.

The user’s expectation that their request would be seen and acted upon was being disappointed.

Disappointment of a user expectation, even if within system design parameters, is, by definition, a failure of the UI.

The IT expectation that users would, after enough silence, make in-person or phone requests was the one that should have been disappointed.

Or to put it another way, IT systems do not exist to provide employment for people interested in IT.

They exist solely and proximately to assist users in tasks that may have very little to do with IT.

Users are interested in “real life” (a counter-part to “real world”) research, discovery, publication, invention, business, pleasure and social interaction.

Neo4j 1.9.3 now available

Filed under: Graphs,Neo4j — Patrick Durusau @ 7:11 pm

Neo4j 1.9.3 now available by Ian Robinson.

From the post:

Today we’re pleased to release Neo4j 1.9.3. This is a bugfix release in the 1.9.x series and has no new features (though it does restore an old way of registering unmanaged extensions with the server).

If you’re on an earlier 1.9 release then you’re strongly encouraged to upgrade (which can be performed without downtime in an HA cluster). You can download it from the neo4j.org web site.

The Fall season of point releases approaches! 😉

W3C Cheatsheet

Filed under: W3C,XLink,XML — Patrick Durusau @ 7:07 pm

W3C Cheatsheet

You can see the cheatsheet in action or look at the developer documentation.

Interesting resource but needs wider coverage.

Do you recall a Windows executable that was an index of all the XML standards? I remember it quite distinctly but haven’t seen it in years now. Freeware product with updates.

I will look on old external drives and laptops to see if I have a copy.

It would be very useful to have a complete index to W3C work with scoping by version and default to the latest “official” release.

September 1, 2013

Compression Bombs

Filed under: Cybersecurity,Security,Topic Map Software — Patrick Durusau @ 6:49 pm

Vulnerabilities that just won’t die – Compression Bombs

From the post:

Recently Cyberis has reviewed a number of next-generation firewalls and content inspection devices – a subset of the test cases we formed related to compression bombs – specifically delivered over HTTP. The research prompted us to take another look at how modern browsers handle such content given that the vulnerability (or perhaps more accurately, ‘common weakness’ – http://cwe.mitre.org/data/definitions/409.html) has been reported and well known for over ten years. The results surprised us – in short, the majority of web browsers are still vulnerable to compression bombs leading to various denial-of-service conditions, including in some cases, full exhaustion of all available disk space with no user input.

“[F]ull exhaustion of all available disk space with no user input” sounds bad to me.

Does your topic map software protect itself against compression bombs?
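One mitigation is to never decompress without a ceiling. A minimal sketch using Python’s zlib incremental API; the 10 MB cap is an arbitrary example.

    import zlib

    MAX_OUTPUT = 10 * 1024 * 1024   # refuse to inflate beyond 10 MB

    def safe_decompress(data, wbits=zlib.MAX_WBITS | 16):   # | 16 accepts gzip framing
        d = zlib.decompressobj(wbits)
        out = d.decompress(data, MAX_OUTPUT)
        if d.unconsumed_tail:        # more output was pending past the cap
            raise ValueError("possible compression bomb: output limit exceeded")
        return out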

Notes on DIH Architecture: Solr’s Data Import Handler

Filed under: Searching,Solr — Patrick Durusau @ 6:43 pm

Notes on DIH Architecture: Solr’s Data Import Handler by Mark Bennett.

From the post:

What the world really needs are some awesome examples of extending DIH (Solr DataImportHandler), beyond the classes and unit tests that ship with Solr. That’s a tall order given DIH’s complexity, and sadly this post ain’t it either! After doing a lot of searches online, I don’t think anybody’s written an “Extending DIH Guide” yet – everybody still points to the Solr wiki, quick start, FAQ, source code and unit tests.

However, in this post, I will review a few concepts to keep in mind. And who knows, maybe in a future post I’ll have some concrete code.

When I make notes, I highlight the things that are different from what I’d expect and why, so I’m going to start with that. Sure DIH has an XML config where you tell it about your database or filesystem or RSS feed, and map those things into your Solr schema, so no surprise there. But the layering of that configuration really surprised me. (and turns out there’s good reasons for it)

If you aspire to be Solr proficient, print this article and work through it.

It will be time well spent.
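As a small aside, DIH can also be driven from outside Solr over its HTTP handler. A minimal sketch that starts a full import and polls until it finishes (host and core name are assumptions):

    import time
    import requests

    dih = "http://localhost:8983/solr/mycore/dataimport"

    requests.get(dih, params={"command": "full-import", "clean": "true", "wt": "json"})

    while True:
        status = requests.get(dih, params={"command": "status", "wt": "json"}).json()
        if status.get("status") != "busy":
            break
        time.sleep(5)

    print(status.get("statusMessages", {}))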

Sane Data Updates Are Harder Than You Think

Filed under: Data,Data Collection,Data Quality,News — Patrick Durusau @ 6:35 pm

Sane Data Updates Are Harder Than You Think by Adrian Holovaty.

From the post:

This is the first in a series of three case studies about data-parsing problems from a journalist’s perspective. This will be meaty, this will be hairy, this will be firmly in the weeds.

We’re in the middle of an open-data renaissance. It’s easier than ever for somebody with basic tech skills to find a bunch of government data, explore it, combine it with other sources, and republish it. See, for instance, the City of Chicago Data Portal, which has hundreds of data sets available for immediate download.

But the simplicity can be deceptive. Sure, the mechanics of getting data are easy, but once you start working with it, you’ll likely face a variety of rather subtle problems revolving around data correctness, completeness, and freshness.

Here I’ll examine some of the most deceptively simple problems you may face, based on my eight years’ experience dealing with government data in a journalistic setting —most recently as founder of EveryBlock, and before that as creator of chicagocrime.org and web developer at washingtonpost.com. EveryBlock, which was shut down by its parent company NBC News in February 2013, was a site that gathered and sorted dozens of civic data sets geographically. It gave you a “news feed for your block”—a frequently updated feed of news and discussions relevant to your home address. In building this huge public-data-parsing machine, we dealt with many different data situations and problems, from a wide variety of sources.

My goal here is to raise your awareness of various problems that may not be immediately obvious and give you reasonable solutions. My first theme in this series is getting new or changed records.

A great introduction to deep problems that are lurking just below the surface of any available data set.

Not only do data sets change but reactions to and criticisms of data sets change.

What would you offer as an example of “stable” data?

I tried to think of one for this post and came up empty.

You could claim the text of the King James Bible is “stable” data.

But only from a very narrow point of view.

The printed text is stable, but the opinions, criticisms, and commentaries on the King James Bible have been anything but stable.

Imagine that you have a stock price ticker application and all it reports are the current prices for some stock X.

Is that sufficient or would it be more useful if it reported the price over the last four hours as a percentage of change?

Perhaps we need a modern data Heraclitus to proclaim:

“No one ever reads the same data twice”
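A minimal sketch of one answer to Adrian’s “new or changed records” problem: fingerprint each record and diff against the previous snapshot. The field names and the id key are assumptions about your data set.

    import hashlib
    import json

    def fingerprint(record):
        canonical = json.dumps(record, sort_keys=True)      # stable serialization
        return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

    def diff(previous, current):
        """previous/current: dicts mapping record id -> fingerprint."""
        new = [rid for rid in current if rid not in previous]
        changed = [rid for rid in current
                   if rid in previous and current[rid] != previous[rid]]
        removed = [rid for rid in previous if rid not in current]
        return new, changed, removed

    records = [{"id": "A1", "status": "open"}, {"id": "B2", "status": "closed"}]
    snapshot = {r["id"]: fingerprint(r) for r in records}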
