Archive for the ‘Government Data’ Category

Beta Testing eFOIA (FBI)

Thursday, December 3rd, 2015

Want to Obtain FBI Records a Little Quicker? Try New eFOIA System

From the post:

The FBI recently began open beta testing of eFOIA, a system that puts Freedom of Information Act (FOIA) requests into a medium more familiar to an ever-increasing segment of the population. This new system allows the public to make online FOIA requests for FBI records and receive the results from a website where they have immediate access to view and download the released information.

Previously, FOIA requests have only been made through regular mail, fax, or e-mail, and all responsive material was sent to the requester through regular mail either in paper or disc format. “The eFOIA system,” says David Hardy, chief of the FBI’s Record/Information Dissemination Section, “is for a new generation that’s not paper-based.” Hardy also notes that the new process should increase FBI efficiency and decrease administrative costs.

The eFOIA system continues in an open beta format to optimize the process for requesters. The Bureau encourages requesters to try eFOIA and to e-mail with any questions or difficulties encountered while using it. In several months, the FBI plans to move eFOIA into full production mode.

The post gives a list of things you need to know/submit in order to help with beta testing of the eFOIA system.

Why help the FBI?

It’s true that I often chide the FBI for padding its terrorism statistics by framing the mentally ill, and its project management skills are certainly nothing to write home about.

Still, there are men and women in the FBI who do capture real criminals and not just the gullible or people who have offended the recording or movie industries. There are staffers, like the ones behind the eFOIA project, who are trying to do a public service, despite the bad apples in the FBI barrel.

Let’s give them a hand, even though decisions on particular FOIA requests may be quite questionable. That’s not the fault of the technology or the people who are trying to make it work.

What are you going to submit a FOIA request about?

I first saw this in a tweet by Nieman Lab.

Progress on Connecting Votes and Members of Congress (XQuery)

Tuesday, December 1st, 2015

Not nearly to the planned end point, but I have corrected a file I generated with XQuery that provides the name-id numbers for members of the House and a link to their websites.

It is a rough draft but you can find it at:

While I was casting about for the resources for this posting, I had the sinking feeling that I had wasted a lot of time and effort when I found:

But, if you read that file carefully, what is the one thing it lacks?

A link to every member’s website at “…”

Isn’t that interesting?

Of all the things to omit, why that one?

Especially since you can’t auto-generate the website names from the member names. What appear to be older names use just the member’s last name. But that strategy must have broken down pretty quickly when members with the same last name appeared.

The conflicting names and even some non-conflicting names follow a new naming protocol that appears to be

That will work for a while until the next generation starts inheriting positions in the House.
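
A minimal sketch of the collision problem described above: last names shared by more than one member are exactly the cases where a lastname-only website scheme breaks down. The member list here is a hypothetical illustration, not real House data.

```python
from collections import Counter

# Hypothetical (name-id, last-name) pairs; the real list would come
# from the roll call XML discussed in the previous post.
members = [
    ("A000374", "Abraham"),
    ("R000575", "Rogers"),
    ("R000395", "Rogers"),
    ("A000370", "Adams"),
]

# Count how many members share each last name.
counts = Counter(last for _, last in members)

# Conflicts are the names where lastname-only URLs cannot work.
conflicts = sorted(last for last, n in counts.items() if n > 1)
print(conflicts)
```

Running a check like this against the full member list is how you would find the names that need the longer naming protocol.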

Anyway, that is as far as I got today, but at least it is a useful list for looking up the name-id of members of the House and obtaining their websites.

The next step will be hitting the websites to extract contact information.

Yes, I know that has the “official” contact information, along with their forms for email, etc.

If I wanted to throw my comment into a round file I could do that myself.

No, what I want to extract is their local office data, so that when they are “back home” meeting with constituents, the average voter has a better chance of being one of those constituents. Not just those who maxed out on campaign donation limits.

Connecting Roll Call Votes to Members of Congress (XQuery)

Monday, November 30th, 2015

Apologies for the lack of posting today but I have been trying to connect up roll call votes in the House of Representatives to additional information on members of Congress.

In case you didn’t know, roll call votes are reported in XML and have this form:

<recorded-vote><legislator name-id="A000374" sort-field="Abraham"
unaccented-name="Abraham" party="R" state="LA">Abraham</legislator></recorded-vote>
<recorded-vote><legislator name-id="A000370" sort-field="Adams"
unaccented-name="Adams" party="D" state="NC">Adams</legislator></recorded-vote>
<recorded-vote><legislator name-id="A000055" sort-field="Aderholt"
unaccented-name="Aderholt" party="R" state="AL">Aderholt</legislator></recorded-vote>
<recorded-vote><legislator name-id="A000371" sort-field="Aguilar"
unaccented-name="Aguilar" party="D" state="CA">Aguilar</legislator></recorded-vote>

For a full example:

With the name-id attribute value, I can automatically construct URIs to the Biographical Directory of the United States Congress, for example, the entry on Abraham, Ralph.
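
As a sketch of that step, here is the URI construction in Python. The Biographical Directory URL pattern shown is an assumption based on the 2015-era site and should be verified before relying on it; the XML sample just restates the fragment above with closing tags.

```python
import xml.etree.ElementTree as ET

# A small, well-formed sample in the shape of the roll call XML.
votes_xml = """<recorded-votes>
  <recorded-vote>
    <legislator name-id="A000374" sort-field="Abraham"
                unaccented-name="Abraham" party="R" state="LA">Abraham</legislator>
  </recorded-vote>
  <recorded-vote>
    <legislator name-id="A000370" sort-field="Adams"
                unaccented-name="Adams" party="D" state="NC">Adams</legislator>
  </recorded-vote>
</recorded-votes>"""

# Assumed URL pattern for the Biographical Directory (2015-era).
BIO_URL = "http://bioguide.congress.gov/scripts/biodisplay.pl?index={}"

root = ET.fromstring(votes_xml)

# Map each name-id to a constructed Biographical Directory URI.
uris = {leg.get("name-id"): BIO_URL.format(leg.get("name-id"))
        for leg in root.iter("legislator")}

for name_id, uri in sorted(uris.items()):
    print(name_id, uri)
```

The same extraction in XQuery would walk `//recorded-vote/legislator/@name-id`; the point is that the name-id value alone is enough to build the biography link.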

More information than a poke with a sharp stick would give you, but it’s only self-serving cant.

One of the things that would be nice to link up with roll call votes would be the homepages of those voting.

Continuing with Ralph Abraham, mapping A000374 to would be helpful in gathering other information, such as the various offices where Representative Abraham can be contacted.

If you are reading the URIs, you might think just prepending the last name of each representative to “” would be sufficient. Well, it would be, except that there are eighty-three cases where representatives share last names and/or a new naming scheme has more than the last name +

After I was satisfied that there wasn’t a direct mapping between the current uses of name-id and House member websites, I started creating such a mapping that you can drop into XQuery as a lookup table and/or use as an external file.
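
A minimal sketch of what such a lookup table might look like as an external XML file, generated here with Python. The name-id-to-website pairs below are illustrative assumptions, not entries from the finished table.

```python
import xml.etree.ElementTree as ET

# Hypothetical name-id -> website pairs; the real table would cover
# every current House member.
member_sites = {
    "A000374": "http://abraham.house.gov",
    "A000370": "http://adams.house.gov",
}

# Build a simple <members> document suitable for use as an external
# file from XQuery.
root = ET.Element("members")
for name_id, site in sorted(member_sites.items()):
    ET.SubElement(root, "member", {"name-id": name_id, "website": site})

xml_str = ET.tostring(root, encoding="unicode")
print(xml_str)
```

From XQuery the file could then be consulted with something like `doc("member-sites.xml")//member[@name-id = $id]/@website`, or the same entries could be pasted directly into a query as an inline element.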

The lookup table should be finished tomorrow so check back.

PS: Yes, I am aware there are tables of contact information for members of Congress but I have yet to see one that lists all their local offices. Moreover, a lookup table for XQuery may encourage people to connect more data to their representatives. Such as articles in local newspapers, property deeds and other such material.

Now over 1,000,000 Items to Search on [Cause to Celebrate?]

Wednesday, October 7th, 2015

Now over 1,000,000 Items to Search on Communications and More Added by Andrew Weber.

From the post:

This has been a great year as we continue our push to develop and refine  There were email alerts added in February, treaties and better default text in March, the Federalist Papers and more browse options in May, and accessibility and user requested features in July.  With this October update, Senate Executive Communications from THOMAS have migrated to  There is an About Executive Communications page that provides more detail about the scope of coverage, searching, viewing, and obtaining copies.

Not to mention a new video “help” series, Legislative Subject Terms and Popular and Short Titles.

All good and from one of the few government institutions that merits respect, the Library of Congress.

Why the “Cause to Celebrate?”

This is an excellent start and certainly has shown itself to be far more responsive to user requests than vendors are to reports of software vulnerabilities.

But we are still at the higher level of data, legislation, regulations, etc.

Where needs to follow is a dive downward to identify who obtains the benefits of legislation/regulations? Who obtains permits, for what and at what market value? Who obtains benefits, credits, allowances? Who wins contracts and where does that money go as it tracks down the prime contractor -> sub-prime contractor -> etc. pipeline?

It is ironic that when candidates for president talk about tax reform, they tend to focus on the tax tables, which are two (2) pages out of the current 6,455 pages of the IRC (in pdf,

Knowing who benefits and by how much for the rest of the pages of the IRC isn’t going to make government any cleaner.

But, when paired with campaign contributions, it will give everyone an even footing on buying favors from the government.

Just as public disclosure enables a relatively fair stock exchange, in the case of government it will enable relative fairness in corruption.

Disclosing Government Contracts

Friday, August 21st, 2015

The More the Merrier? How much information on government contracts should be published and who will use it by Gavin Hayman.

From the post:

A huge bunch of flowers to Rick Messick for his excellent post asking two key questions about open contracting. And some luxury cars, expensive seafood and a vat or two of cognac.

Our lavish offerings all come from Slovakia, where in 2013 the Government Public Procurement Office launched a new portal publishing all its government contracts. All these items were part of the excessive government contracting uncovered by journalists, civil society and activists. In the case of the flowers, teachers investigating spending at the Department of Education uncovered florists’ bills for thousands of euros. Spending on all of these has subsequently declined: a small victory for fiscal probity.

The flowers, cars, and cognac help to answer the first of two important questions that Rick posed: Will anyone look at contracting information? In the case of Slovakia, it is clear that lowering the barriers to access information did stimulate some form of response and oversight.

The second question was equally important: “How much contracting information should be disclosed?”, especially in commercially sensitive circumstances.

These are two of the key questions that we have been grappling with in our strategy at the Open Contracting Partnership. We thought that we would share our latest thinking below, in a post that is a bit longer than usual. So grab a cup of tea and have a read. We’ll definitely be looking forward to your continued thoughts on these issues.

Not a short read so do grab some coffee (outside of Europe) and settle in for a good read.

Disclosure: I’m financially interested in government disclosure in general and contracts in particular. With openness comes more effort to conceal semantics, which increases the need for topic maps to pierce the darkness.

I don’t think openness reduces the amount of fraud and misconduct in government; it only gives the alignment between citizens and the career interests of a prosecutor a sporting chance to catch someone out.

Disclosure should be as open as possible and what isn’t disclosed voluntarily, well, one hopes for brave souls who will leak the remainder.

Support disclosure of government contracts and leakers of the same.

If you need help “connecting the dots,” consider topic maps.

(beta)

Wednesday, July 15th, 2015

From the announcement post:

In February this year we announced that we will be iteratively improving the user experience. Today we are launching the new Beta site. There are many changes and we hope you will like them.

  • Dataset pages have been greatly simplified so that you can get to your data within two clicks.
  • We have re-written many of the descriptions to simplify explanations.
  • We have launched which is aimed at non-developers to search and then download data.
  • We have also greatly improved and revised our API documentation. For example have a look here
  • We have added content from our blog and twitter feeds into the home page and I hope you agree that we are now presenting a more cohesive offering.

We are still working on datasets, and those in the pipeline waiting for release imminently are

  • Bills meta-data for bills going through the Parliamentary process.
  • Commons Select Committee meta-data.
  • Deposited Papers
  • Lords Attendance data

Let us know what you think.

There could be some connection between what the government says publicly and what it does privately. As they say, “anything is possible.”

Curious, what do you make of the Thesaurus?

Typing the “related” links to say how resources are related would be a step in the right direction. Apparently there is an organization with the title “‘Sdim Curo Plant!” (other sources report this is Welsh for “Children are Unbeatable”), which turns out to be the preferred label.

The entire set has 107,337 records and can be downloaded, albeit in 500-record chunks. That should improve over time according to: Downloading data from data.parliament.
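
As a quick check on the arithmetic of those 500-record chunks (a sketch only; the actual download API and its paging parameters are not shown here):

```python
def page_offsets(total, page_size=500):
    """Starting offset of each request when responses are capped
    at page_size records."""
    return list(range(0, total, page_size))

offsets = page_offsets(107_337)
print(len(offsets))   # number of 500-record requests needed
print(offsets[-1])    # offset of the final, partial chunk
```

So fetching the whole thesaurus at 500 records per request takes 215 requests, the last of which returns only 337 records.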

I have always been interested in what terms other people use and this looks like an interesting data set, that is part of a larger interesting data set.


Nominations by the U.S. President

Monday, July 13th, 2015

Nominations by the U.S. President

The Library of Congress created this resource, which enables you to search for nominations by U.S. Presidents starting in 1981. There is information about the nomination process, the records, and related nomination resources at About Nominations of the U.S. Congress.

Unfortunately I did not find a link to bulk data for presidential nominations nor an API for the search engine behind this webpage.

I say that because matching up nominees and/or their sponsors with campaign contributions would help get a price range on becoming the ambassador to Uruguay, etc.

I wrote to Ask a Law Librarian to check on the status of bulk data and/or an API. Will amend this post when I get a response.

Oh, there will be a response. For all the ills and failures of the U.S. government, which are legion, it is capable of assembling vast amounts of information and training people to perform research on it. Not in every case but if it falls within the purview of the Law Library of Congress, I am confident of a useful answer.

World Factbook 2015 (paper, online, downloadable)

Wednesday, June 24th, 2015

World Factbook 2015 (GPO)

From the webpage:

The Central Intelligence Agency’s World Factbook provides brief information on the history, geography, people, government, economy, communications, transportation, military, and transnational issues for 267 countries and regions around the world.

The CIA’s World Factbook also contains several appendices and maps of major world regions, which are located at the very end of the publication. The appendices cover abbreviations, international organizations and groups, selected international environmental agreements, weights and measures, cross-reference lists of country and hydrographic data codes, and geographic names.

For maps, it provides a country map for each country entry and a total of 12 regional reference maps that display the physical features and political boundaries of each world region. It also includes a pull-out Flags of the World, a Physical Map of the World, a Political Map of the World, and a Standard Time Zones of the World map.

Who should read The World Factbook? It is a great one-stop reference for anyone looking for an expansive body of international data on world statistics, and has been a must-have publication for:

  • US Government officials and diplomats
  • News organizations and researchers
  • Corporations and geographers
  • Teachers, professors, librarians, and students
  • Anyone who travels abroad or who is interested in foreign countries

The print version is $89.00 (U.S.), is 923 pages long and weighs in at 5.75 lb. in paperback.

A convenient and frequently updated alternative is the online CIA World Factbook.

I can’t compare the two versions because I am not going to spend $89.00 for an arm wrecker. 😉

You can also download a copy of the HTML version.

I downloaded and unzipped the file, only to find that the last update was in June, 2014.

That may be updated soon or it may not. I really don’t know.

If you just need background information that is unlikely to change or you want to avoid surveillance on what countries you look at and for how long, download the 2014 HTML version or pony up for the 2015 paper version.

Saudi Cables (or file dump?)

Saturday, June 20th, 2015

WikiLeaks publishes the Saudi Cables

From the post:

Today, Friday 19th June at 1pm GMT, WikiLeaks began publishing The Saudi Cables: more than half a million cables and other documents from the Saudi Foreign Ministry that contain secret communications from various Saudi Embassies around the world. The publication includes “Top Secret” reports from other Saudi State institutions, including the Ministry of Interior and the Kingdom’s General Intelligence Services. The massive cache of data also contains a large number of email communications between the Ministry of Foreign Affairs and foreign entities. The Saudi Cables are being published in tranches of tens of thousands of documents at a time over the coming weeks. Today WikiLeaks is releasing around 70,000 documents from the trove as the first tranche.

Julian Assange, WikiLeaks publisher, said: “The Saudi Cables lift the lid on an increasingly erratic and secretive dictatorship that has not only celebrated its 100th beheading this year, but which has also become a menace to its neighbours and itself.

The Kingdom of Saudi Arabia is a hereditary dictatorship bordering the Persian Gulf. Despite the Kingdom’s infamous human rights record, Saudi Arabia remains a top-tier ally of the United States and the United Kingdom in the Middle East, largely owing to its globally unrivalled oil reserves. The Kingdom frequently tops the list of oil-producing countries, which has given the Kingdom disproportionate influence in international affairs. Each year it pushes billions of petro-dollars into the pockets of UK banks and US arms companies. Last year it became the largest arms importer in the world, eclipsing China, India and the combined countries of Western Europe. The Kingdom has since the 1960s played a major role in the Organization of Petroleum Exporting Countries (OPEC) and the Cooperation Council for the Arab States of the Gulf (GCC) and dominates the global Islamic charity market.

For 40 years the Kingdom’s Ministry of Foreign Affairs was headed by one man: Saud al Faisal bin Abdulaziz, a member of the Saudi royal family, and the world’s longest-serving foreign minister. The end of Saud al Faisal’s tenure, which began in 1975, coincided with the royal succession upon the death of King Abdullah in January 2015. Saud al Faisal’s tenure over the Ministry covered its handling of key events and issues in the foreign relations of Saudi Arabia, from the fall of the Shah and the second Oil Crisis to the September 11 attacks and its ongoing proxy war against Iran. The Saudi Cables provide key insights into the Kingdom’s operations and how it has managed its alliances and consolidated its position as a regional Middle East superpower, including through bribing and co-opting key individuals and institutions. The cables also illustrate the highly centralised bureaucratic structure of the Kingdom, where even the most minute issues are addressed by the most senior officials.

Since late March 2015 the Kingdom of Saudi Arabia has been involved in a war in neighbouring Yemen. The Saudi Foreign Ministry in May 2015 admitted to a breach of its computer networks. Responsibility for the breach was attributed to a group calling itself the Yemeni Cyber Army. The group subsequently released a number of valuable “sample” document sets from the breach on file-sharing sites, which then fell under censorship attacks. The full WikiLeaks trove comprises thousands of times the number of documents and includes hundreds of thousands of pages of scanned images of Arabic text. In a major journalistic research effort, WikiLeaks has extracted the text from these images and placed them into our searchable database. The trove also includes tens of thousands of text files and spreadsheets as well as email messages, which have been made searchable through the WikiLeaks search engine.

By coincidence, the Saudi Cables release also marks two other events. Today marks three years since WikiLeaks founder Julian Assange entered the Ecuadorian Embassy in London seeking asylum from US persecution, having been held for almost five years without charge in the United Kingdom. Also today Google revealed that it had been forced to hand over more data to the US government in order to assist the prosecution of WikiLeaks staff under US espionage charges arising from our publication of US diplomatic cables.

A searcher with good Arabic skills is going to be necessary to take full advantage of this release.

I am unsure about the title “Saudi Cables” because some of the documents I retrieved searching for “Bush” were public interviews and statements. Hardly the burning secrets that are hinted at by “cables.” See, for example, Exclusive Interview with Daily Telegraph 27-2-2005.doc or Interview with Wall Street Joutnal 26-4-2004.doc.

Putting “public document” in the words to exclude filter doesn’t eliminate the published interviews.

This has the potential, particularly out of more than 500,000 documents, to have some interesting tidbits. The first step would be to winnow out all published and/or public statements, in English and/or Arabic. Not discarded but excluded from search results until you need to make connections between secret statements and public ones.
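
A minimal sketch of that first winnowing pass, flagging file names that suggest published material so they can be excluded from search results without being discarded. The marker list is an assumption and would need tuning, especially for Arabic titles; the second file name is a hypothetical example, the first is from the release.

```python
# Markers that suggest a file is a published/public statement.
# These are assumed heuristics, not a definitive classifier.
PUBLIC_MARKERS = ("interview", "statement", "press", "speech")

def is_likely_public(filename):
    """True if the file name contains a marker of published material."""
    name = filename.lower()
    return any(marker in name for marker in PUBLIC_MARKERS)

docs = [
    "Exclusive Interview with Daily Telegraph 27-2-2005.doc",
    "Embassy cable 2011-04-12.pdf",  # hypothetical file name
]

# Partition rather than delete: keep both sets for later comparison
# of secret statements against public ones.
public = [d for d in docs if is_likely_public(d)]
private = [d for d in docs if not is_likely_public(d)]
```

Partitioning, rather than deleting, preserves the ability to connect secret statements with their public counterparts later.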

A second step would be to identify the author/sender/receiver of each document so they can be matched to known individuals and events.

This is a great opportunity to practice your Arabic NLP processing skills. Or Arabic for that matter.

Hopefully Wikileaks will not decide to act as public censor with regard to these documents.

Governments do enough withholding of the truth. They don’t need the assistance of Wikileaks.

The Political One Percent of the One Percent:…

Wednesday, June 10th, 2015

The Political One Percent of the One Percent: Megadonors fuel rising cost of elections in 2014 by Peter Olsen-Phillips, Russ Choma, Sarah Bryner, and Doug Weber.

From the post:

In the 2014 elections, 31,976 donors — equal to roughly one percent of one percent of the total population of the United States — accounted for an astounding $1.18 billion in disclosed political contributions at the federal level. Those big givers — what we have termed the “Political One Percent of the One Percent” — have a massively outsized impact on federal campaigns.

They’re mostly male, tend to be city-dwellers and often work in finance. Slightly more of them skew Republican than Democratic. A small subset — barely five dozen — earned the (even more) rarefied distinction of giving more than $1 million each. And a minute cluster of three individuals contributed more than $10 million apiece.

The last election cycle set records as the most expensive midterms in U.S. history, and the country’s most prolific donors accounted for a larger portion of the total amount raised than in either of the past two elections.

The $1.18 billion they contributed represents 29 percent of all fundraising that political committees disclosed to the Federal Election Commission in 2014. That’s a greater share of the total than in 2012 (25 percent) or in 2010 (21 percent).

It’s just one of the main takeaways in the latest edition of the Political One Percent of the One Percent, a joint analysis of elite donors in America by the Center for Responsive Politics and the Sunlight Foundation.

BTW, although the report says conservatives “edged their liberal opponents,” the Republicans raised $553 million and Democrats raised $505 million from donors on the one-percent-of-the-one-percent list. The $48 million difference isn’t rounding-error size, but once you break one-half billion dollars, it doesn’t seem as large as it might otherwise.

As far as I can tell, the report does not reproduce the addresses of the one percent of one percent donors. For that you need to use the advanced search option at the FEC and put 8810 (no dollar sign needed) in the first “amount range” box, set the date range to 2014 to 2015 and then search. Quite a long list so you may want to do it by state.

To get the individual location information, you can follow the transaction number at the end of each record returned by your query, which returns a PDF page. Somewhere on that page will be the address information for the donor.

As far as campaign finance goes, the report indicates you need to find another way to influence the political process. Any donation much below the one-percent-of-the-one-percent minimum, i.e., $8,810, isn’t going to buy you any influence. In fact, you are subsidizing the cost of a campaign that benefits the big donors the most. If big donors want to buy those campaigns, let them support the entire campaign.

In a sound bite: Don’t subsidize major political donors with small contributions.

Once you have identified the one percent of one percent donors, you can start to work out the other relationships between those donors and the levers of power.

Fast Track to the Corporate Wish List [Is There A Hacker In The House?]

Tuesday, June 9th, 2015

Fast Track to the Corporate Wish List by David Dayen.

From the post:

Some time in the next several days, the House will likely vote on trade promotion authority, enabling the Obama administration to proceed with its cherished Trans-Pacific Partnership (TPP). Most House Democrats want no part of the deal, which was crafted by and for corporations. And many Tea Party Republicans don’t want to hand the administration any additional powers, even in service of a victory dearly sought by the GOP’s corporate allies. The vote, which has been repeatedly delayed as both the White House and House GOP leaders try to round up support, is expected to be extremely close.

The Obama administration entered office promising to renegotiate unbalanced trade agreements, which critics believe have cost millions of manufacturing jobs in the past 20 years. But they’ve spent more than a year pushing the TPP, a deal with 11 Pacific Rim nations that mostly adheres to the template of corporate favors masquerading as free trade deals. Of the 29 TPP chapters, only five include traditional trade measures like reducing tariffs and opening markets. Based on leaks and media reports—the full text remains a well-guarded secret—the rest appears to be mainly special-interest legislation.

Pharmaceutical companies, software makers, and Hollywood conglomerates get expanded intellectual property enforcement, protecting their patents and their profits. Some of this, such as restrictions on generic drugs, is at the expense of competition and consumers. Firms get improved access to poor countries with nonexistent labor protections, like Vietnam or Brunei, to manufacture their goods. TPP provides assurances that regulations, from food safety to financial services, will be “harmonized” across borders. In practice, that means a regulatory ceiling. In one of the most contested provisions, corporations can use the investor-state dispute settlement (ISDS) process, and appeal to extra-judicial tribunals that bypass courts and usual forms of due process to seek monetary damages equaling “expected future profits.”

How did we reach this point—where “trade deals” are Trojan horses for fulfilling corporate wish lists, and where all presidents, Democrat or Republican, ultimately pay fealty to them? One place to look is in the political transfer of power, away from Congress and into a relatively obscure executive branch office, the Office of the United States Trade Representative (USTR).

USTR has become a way station for hundreds of officials who casually rotate between big business and the government. Currently, Michael Froman, former Citigroup executive and chief of staff to Robert Rubin, runs USTR, and his actions have lived up to the agency’s legacy as the white-shoe law firm for multinational corporations. Under Froman’s leadership, more ex-lobbyists have funneled through USTR, practically no enforcement of prior trade violations has taken place, and new agreements like TPP are dubiously sold as progressive achievements, laced with condescension for anyone who disagrees.

David does a great job of sketching the background both for the Trans-Pacific Partnership but also the U.S. Trade Representative.

Given the hundreds of people, nation states, and corporations that have access to the Trans-Pacific Partnership text, don’t you wonder why it remains secret?

I don’t think President Obama and his business cronies realize that secrecy of an agreement that will affect the vast majority of American citizens strikes at the legitimacy of government itself. True enough, corporations that own entire swaths of Congress are going to get more benefits than the average American. Those benefits are out in the open and citizens can press for benefits as well.

The benefits that accrue to corporations under the Trans-Pacific Partnership will be gained in secret, with little or no opportunity for the average citizen to object. There is something fundamentally unfair about the secret securing of benefits for corporations.

I hope that Obama doesn’t complain about “illegal” activity that foils his plan to secretly favor corporations. I won’t be listening. Will you?

Yemen Cyber Army will release 1M of records per week to stop Saudi Attacks

Sunday, May 31st, 2015

Yemen Cyber Army will release 1M of records per week to stop Saudi Attacks by Pierluigi Paganini.

From the post:

Hackers of the Yemen Cyber Army (YCA) had dumped another 1,000,000 records obtained by violating systems at the Saudi Ministry of Foreign Affairs.

The hacking crew known as the Yemen Cyber Army is continuing its campaign against the Government of Saudi Arabia.

The Yemen Cyber Army (YCA) has released other data from the stolen archive belonging to the Saudi Ministry of Foreign Affairs. The data breach was confirmed by the authorities; Osama bin Ahmad al-Sanousi, a senior official at the kingdom’s Foreign Ministry, made the announcement last week.

Now the hackers have released a new data dump containing 1,000,000 records of the Saudi VISA database, and they have announced that every week they will release a new lot of 1M records. The Yemen Cyber Army have also shared secret documents of the private Saudi MOFA with Wikileaks.

The hackers of the Yemen Cyber Army have released 10 records from the archive, including a huge amount of data.

Mirror #1 :
Mirror #2 :
Mirror #3 :

The website has published a detailed analysis of the dump published by the Yemen Cyber Army and reports that the latest dump is mostly visa data.

Good to know that the Yemen Cyber Army is backing up their data with Wikileaks, but I don’t think of Wikileaks as a transparent source of government documents. For reasons best known to themselves, Wikileaks has taken on the role of government censor with regard to the information it releases. Acknowledging the critical role Wikileaks has played in recent public debates doesn’t blind me to their arrogation of the role of public censor.

Speaking of data dumps, where are the diplomatic records from Iraq? Before or since becoming a puppet government for the United States?

In the meantime, keep watching for more data dumps from the Yemen Cyber Army.

Open Data: Getting Started/Finding

Friday, May 8th, 2015

Data Science – Getting Started With Open Data

23 Resources for Finding Open Data

Ryan Swanstrom has put together two posts that will have you finding and using open data.

“Open data” can be a boon to researchers and others, but you should ask the following questions (among others) of any data set:

  1. Who collected the data?
  2. Why was the data collected?
  3. How was the recorded data selected?
  4. How large was the potential data pool?
  5. Was the original data cleaned after collection?
  6. If the original data was cleaned, by what criteria?
  7. How was the accuracy of the data measured?
  8. What instruments were used to collect the data?
  9. How were the instruments used to collect the data developed?
  10. How were the instruments used to collect the data validated?
  11. What publications have relied upon the data?
  12. How did you determine the semantics of the data?

That’s not a complete set of questions, but it is a good starting point.
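One way to make a checklist like that operational is to record the answers as structured metadata alongside the data set and flag the gaps before you rely on the data. A minimal sketch (the field names here are my own invention, not any metadata standard):

```python
# Sketch: record provenance answers for a data set and flag unanswered ones.
# Field names are illustrative placeholders, not from any standard.

PROVENANCE_QUESTIONS = [
    "collector", "purpose", "selection_method", "population_size",
    "cleaned", "cleaning_criteria", "accuracy_measure",
    "instruments", "instrument_development", "instrument_validation",
    "citing_publications", "semantics",
]

def provenance_gaps(metadata: dict) -> list:
    """Return the provenance questions a data set's metadata leaves unanswered."""
    return [q for q in PROVENANCE_QUESTIONS if not metadata.get(q)]

# A data set that answers only the first two questions:
census_meta = {"collector": "U.S. Census Bureau", "purpose": "decennial count"}
missing = provenance_gaps(census_meta)
```

Anything left in `missing` is a question you should answer, or at least consciously shrug at, before building on the data.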

Just because data is available, open, free, etc. doesn’t mean that it is useful. The best example is the still-in-print Budge translation The book of the dead : the papyrus of Ani in the British Museum. The original was published in 1895, making the current reprints more than a century out of date.

It is a very attractive reproduction (it is rare to see hieroglyphic text with inter-linear transliteration and translation in modern editions) of the papyrus of Ani, but it gives a misleading impression of the state of modern knowledge and translation of Middle Egyptian.

Of course, some readers are satisfied with century old encyclopedias as well, but I would not rely upon them or their sources for advice.

Open But Recorded Access

Thursday, May 7th, 2015

Search Airmen Certificate Information

Registry of certified pilots.

From the search page:


I didn’t perform a search so I don’t have a feel for what, if any, validation is done on the requested searcher information.

If you are on Tor, you might want to consider using the address for Wrigley field, 1060 W Addison St, Chicago, IL 60613, to see if it complains.

Bureau of Transportation Statistics

Thursday, May 7th, 2015

Bureau of Transportation Statistics

I discovered this site while looking for “official” statistics to debunk claims about air travel and screening for terrorists. (Begging National Security Questions #1)

I didn’t find it an easy site to navigate but that probably reflects my lack of familiarity with the data being collected. A short guide with a very good index would be quite useful.

A real treasure trove of transportation information (from the about page):

Major Programs of the Bureau of Transportation Statistics (BTS)

It is important to remember that federal agencies (and their equivalents under other governments) have distinct agendas. When confronting outlandish claims from one of the security agencies, it helps to have contradictory data gathered by other, “disinterested,” agencies of the same government.

Security types can dismiss your evidence and analysis as “that’s what you think.” After all, their world is nothing but suspicion and conjecture. Why shouldn’t that be true for others?

Not as easy to dismiss data and analysis by other government agencies.

NOAA weather data – Valuing Open Data – Guessing – History Repeats

Sunday, April 26th, 2015

Tech titans ready their clouds for NOAA weather data by Greg Otto.

From the post:

It’s fitting that the 20 terabytes of data the National Oceanic and Atmospheric Administration produces every day will now live in the cloud.

The Commerce Department took a step Tuesday to make NOAA data more accessible as Commerce Secretary Penny Pritzker announced a collaboration among some of the country’s top tech companies to give the public a range of environmental, weather and climate data to access and explore.

Amazon Web Services, Google, IBM, Microsoft and the Open Cloud Consortium have entered into a cooperative research and development agreement with the Commerce Department that will push NOAA data into the companies’ respective cloud platforms to increase the quantity of and speed at which the data becomes publicly available.

“The Commerce Department’s data collection literally reaches from the depths of the ocean to the surface of the sun,” Pritzker said during a Monday keynote address at the American Meteorological Society’s Washington Forum. “This announcement is another example of our ongoing commitment to providing a broad foundation for economic growth and opportunity to America’s businesses by transforming the department’s data capabilities and supporting a data-enabled economy.”

According to Commerce, the data used could come from a variety of sources: Doppler radar, weather satellites, buoy networks, tide gauges, and ships and aircraft. Commerce expects this data to launch new products and services that could benefit consumer goods, transportation, health care and energy utilities.
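For readers who want to try the data, the radar archive that ended up in the AWS cloud can be reached over plain HTTPS. A sketch of building the address for one station’s scans on one day (the bucket name and key layout are assumptions based on the publicly documented NEXRAD-on-AWS archive, so verify them against the current documentation before relying on this):

```python
# Sketch: construct a URL for NOAA NEXRAD Level II radar data in the AWS
# open-data archive. Bucket name and key layout are assumptions; check the
# current NEXRAD-on-AWS documentation before use.
from datetime import date

NEXRAD_BUCKET = "noaa-nexrad-level2"  # assumed public bucket name

def nexrad_prefix(day: date, station: str) -> str:
    """Key prefix under which one station's volume scans for one day live."""
    return f"{day.year:04d}/{day.month:02d}/{day.day:02d}/{station}"

def nexrad_url(day: date, station: str) -> str:
    """Anonymous HTTPS listing URL for that prefix."""
    return f"https://{NEXRAD_BUCKET}.s3.amazonaws.com/{nexrad_prefix(day, station)}/"

# KTLX is the Oklahoma City radar station.
url = nexrad_url(date(2015, 4, 20), "KTLX")
```

No credentials, no FOIA request, no three-month wait: that is the practical difference the cloud agreement is supposed to make.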

The original press release has this cheery note on the likely economic impact of this data:

So what does this mean to the economy? According to a 2013 McKinsey Global Institute Report, open data could add more than $3 trillion in total value annually to the education, transportation, consumer products, electricity, oil and gas, healthcare, and consumer finance sectors worldwide. If more of this data could be efficiently released, organizations will be able to develop new and innovative products and services to help us better understand our planet and keep communities resilient from extreme events.

Ah, yes, that would be the Open data: Unlocking innovation and performance with liquid information, on which the summary page says:

Open data can help unlock $3 trillion to $5 trillion in economic value annually across seven sectors.

But you need to read the full report (PDF) in order to find footnote 3 on “economic value:”

3. Throughout this report we express value in terms of annual economic surplus in 2013 US dollars, not the discounted value of future cash flows; this valuation represents estimates based on initiatives where open data are necessary but not sufficient for realizing value. Often, value is achieved by combining analysis of open and proprietary information to identify ways to improve business or government practices. Given the interdependence of these factors, we did not attempt to estimate open data’s relative contribution; rather, our estimates represent the total value created.

That is a disclosure that the estimate of $3 to $5 trillion is a guess and/or speculation.

Odd how the guess/speculation disclosure drops out of the Commerce Department press release, and by the time it reaches Greg’s story it reads:

open data could add more than $3 trillion in total value annually to the education, transportation, consumer products, electricity, oil and gas, healthcare, and consumer finance sectors worldwide.

From guess/speculation to no mention to fact, all in the short space of three publications.

Does the valuing of open data remind you of:


(Image from:

The date of 1609 is important. Wikipedia has an article on Virginia, 1609-1610, titled, Starving Time. That year, only sixty (60) out of five hundred (500) colonists survived.

Does “Excellent Fruites by Planting” sound a lot like “new and innovative products and services?”

It does to me.

I first saw this in a tweet by Kirk Borne.

A Scary Earthquake Map – Oklahoma

Wednesday, April 22nd, 2015

Earthquakes in Oklahoma – Earthquake Map


Great example of how visualization can make the case that “standard” industry practices are in fact damaging the public.

The map is interactive and the screen shot above is only one example.

The main site is located at:

From the homepage:

Oklahoma experienced 585 magnitude 3+ earthquakes in 2014 compared to 109 events recorded in 2013. This rise in seismic events has the attention of scientists, citizens, policymakers, media and industry. See what information and research state officials and regulators are relying on as the situation progresses.

The next stage of data mapping should be identifying the owners or those who profited from the waste water disposal wells and their relationships to existing oil and gas interests, as well as their connections to members of the Oklahoma legislature.
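That kind of mapping is, at bottom, a small relationship graph: wells, operators, parent companies, donations, legislators. A toy sketch with invented names (every entity and link below is a placeholder, not a claim about any real company or official):

```python
# Sketch: the relationship mapping described above, as a toy edge list.
# All names and links are invented placeholders.
from collections import defaultdict

edges = [
    ("Well #12", "operated_by", "Acme Disposal LLC"),
    ("Acme Disposal LLC", "subsidiary_of", "BigOil Corp"),
    ("BigOil Corp", "donated_to", "Sen. Example"),
]

links = defaultdict(list)
for src, rel, dst in edges:
    links[src].append(dst)

def reachable(start: str) -> set:
    """Every entity connected to `start` by following relationships forward."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in links.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen
```

Once the real ownership and donation records are loaded, `reachable("Well #12")` is the question reporters and voters actually want answered.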

What is it that Republicans call it? Ah, accountability, as in holding teachers and public agencies “accountable.” Looks to me like it is time to hold some oil and gas interests and their owners, “accountable.”

PS: Said to not be a “direct” result of fracking but of the disposal of water used for fracking. Close enough for my money. You?

Research Reports by U.S. Congress and UK House of Commons

Sunday, April 12th, 2015

Research Reports by U.S. Congress and UK House of Commons by Gary Price.

Gary’s post covers the Congressional Research Service (CRS) (US) and the House of Commons Library Research Service (UK).

Truly amazing, I know, for an open and transparent government like the United States Government, but CRS reports are not routinely made available to the public, so we have to rely on the kindness of strangers to make them available. Gary reports:

The good news is that Steven Aftergood, director of the Government Secrecy Project at the Federation of American Scientists (FAS), gets ahold of many of these reports and shares them on the FAS website.

The House of Commons Library Research Service appears to not mind officially sharing its research with anyone with web access.

Unlike some government agencies and publications, the CRS and LRS enjoy reputations for high quality scholarship and accuracy. You still need to evaluate their conclusions and the evidence cited or not, but outright deception and falsehood aren’t part of their traditions.

Unleashing the Power of Data to Serve the American People

Sunday, February 22nd, 2015

Unleashing the Power of Data to Serve the American People by Dr. DJ Patil.

You can read (and listen) to Patil’s high level goals as the first ever U.S. Chief Data Scientist at his post.

His goals are too abstract and general to attract meaningful disagreement and that isn’t the purpose of this post.

I posted the link to his comments to urge you to contact Patil (or rather his office) with concrete plans for how his office can assist you in finding and using data. The sooner the better.

No doubt some areas are already off-limits for improved data access and some priorities are already set.

That said, contacting Patil before he and his new office have solidified in place can play an important role in establishing the scope of his office. On a lesser scale, the same situation that confronted George Washington as the first U.S. President. Nothing was set in stone and every act established a precedent for those who came after him.

Now is the time to press for an expansive and far-reaching role for the U.S. Chief Data Scientist within the federal bureaucracy.

offers email alerts

Thursday, February 19th, 2015

offers email alerts

From the post:

Beginning today [5 February 2015], the free legislative information website offers users a new optional email-alerts system that makes tracking legislative action even easier. Users can elect to receive email alerts for tracking:

  • A specific bill in the current Congress: Receive an email when there are updates to a specific bill (new cosponsors, committee action, vote taken, etc.); emails are sent once a day if there has been a change in a particular bill’s status since the previous day.
  • A specific member’s legislative activity: Receive an email when a specific member introduces or cosponsors a bill; emails are sent once a day if a member has introduced or cosponsored a bill since the previous day.
  • Congressional Record: Receive an email as soon as a new issue of the Congressional Record is available on

The alerts system is a new feature available to anyone who creates a free account on the site. Creating an account also enables users to save searches. Create an account and sign up for alerts at

If you are interested in legislation or in influencing those who vote on it, you should sign up for these alerts. No promises, other than this: if you aren’t heard, your opinion won’t be considered.

You should also use it to verify the content of legislation when you get “…the world is ending as we know it…” emails from interest groups. You are not well-informed if you are completely reliant on the opinions of others. Misguided perhaps, but not well-informed.

The US Patent and Trademark Office should switch from documents to data

Sunday, February 15th, 2015

The US Patent and Trademark Office should switch from documents to data by Justin Duncan.

From the post:

The debate over patent reform — one of Silicon Valley’s top legislative priorities — is once again in focus with last week’s introduction of the Innovation Act (H.R. 9) by House Judiciary Committee Chairman Bob Goodlatte (R-Va.), Rep. Peter DeFazio (D-Ore.), Subcommittee on Courts, Intellectual Property, and the Internet Chairman Darrell Issa (R-Calif.) and Ranking Member Jerrold Nadler (D-N.Y.), and 15 other original cosponsors.

The Innovation Act largely takes aim at patent trolls (formally “non-practicing entities”), who use patent litigation as a business strategy and make money by threatening lawsuits against other companies. While cracking down on litigious patent trolls is important, that challenge is only one facet of what should be a larger context for patent reform.

The need to transform patent information into open data deserves some attention, too.

The United States Patent and Trademark Office (PTO), the agency within the Department of Commerce that grants patents and registers trademarks, plays a crucial role in empowering American innovators and entrepreneurs to create new technologies. Ironically, many of the PTO’s own systems and technologies are out of date.

Last summer, Data Transparency Coalition advisor Joel Gurin and his colleagues organized an Open Data Roundtable with the Department of Commerce, co-hosted by the Governance Lab at New York University (GovLab) and the White House Office of Science and Technology Policy (OSTP). The roundtable focused on ways to improve data management, dissemination, and use at the Department of Commerce. It shed some light on problems faced by the PTO.

According to GovLab’s report of the day’s findings and recommendations, the PTO is currently working to improve the use and availability of some patent data by putting it in a more centralized, easily searchable form.

To make patent applications easier to navigate – for inventors, investors, the public, and the agency itself – the PTO should more fully embrace the use of structured data formats, like XML, to express the information currently collected as PDFs or text documents.

Justin’s post is a brief history of efforts to improve access to patent and trademark information, mostly focusing on the need for the USPTO (US Patent and Trademark Office) to stop relying on PDF as its default format.
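The difference between a PDF and structured data is worth making concrete. Once a patent record is in XML, a field like the expiration date is one lookup rather than a scraping project. The element names below are invented for illustration; the real USPTO XML schemas are considerably more involved:

```python
# Sketch: why structured formats beat PDFs for patent data.
# Element names are invented placeholders, not the USPTO schema.
import xml.etree.ElementTree as ET

record = """
<patent>
  <number>9000000</number>
  <title>Example apparatus</title>
  <expiration>2032-03-06</expiration>
</patent>
"""

root = ET.fromstring(record)
expiration = root.findtext("expiration")  # one line, no PDF scraping
```

Multiply that one-line lookup across millions of patents and you have the case for the switch Justin is making.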

Other potential improvements:

Additional GovLab recommendations included:

  • PTO [should] make more information available about the scope of patent rights, including expiration dates, or decisions by the agency and/or courts about patent claims.
  • PTO should add more context to its data to make it usable by non-experts – e.g. trademark transaction data and trademark assignment.
  • Provide Application Programming Interfaces (APIs) to enable third parties to build better interfaces for the existing legacy systems. Access to Patent Application Information Retrieval (PAIR) and Patent Trial and Appeal Board (PTAB) data are most important here.
  • Improve access to Cooperative Patent Classification (CPC)/U.S. Patent Classification (USPC) harmonization data; tie this data more closely to economic data to facilitate analysis.

Tying in related information, as the first and last recommendations on the GovLab list suggest, is another step in the right direction.

But only a step.

If you have ever searched the USPTO patent database you know that making the data “searchable” is only a nod and a wink toward accessibility. Making the data searchable at all is nothing to sneeze at, but USPTO reform should have a higher target than simply being “searchable.”

Outside of patent search specialists (and not all of them), what ordinary citizen is going to be able to navigate the terms of art across domains when searching patents?

The USPTO should go beyond making patents literally “searchable” and instead make patents “reliably” searchable. By “reliable” searching I mean searching that returns all the relevant patents. A safe harbor if you will that protects inventors, investors and implementers from costly suits arising out of the murky wood filled with traps, intellectual quicksand and formulaic chants that are the USPTO patent database.
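One small piece of what “reliable” search would require is expanding a query across the terms of art that different domains use for the same thing, so a search for one term does not silently miss patents that use another. A toy sketch (the synonym table is an invented stand-in for the cross-domain vocabulary mapping real reliability would demand):

```python
# Sketch: expand a query with synonymous terms of art before matching.
# The synonym table is a toy placeholder.
SYNONYMS = {
    "fastener": {"fastener", "clip", "clasp", "catch"},
    "display": {"display", "screen", "monitor"},
}

def expand(query_terms):
    """Union of each query term with its known synonyms."""
    expanded = set()
    for term in query_terms:
        expanded |= SYNONYMS.get(term, {term})
    return expanded

def matches(query_terms, document_text):
    """True if any expanded query term appears in the document."""
    doc_words = set(document_text.lower().split())
    return bool(expand(query_terms) & doc_words)

# A query for "fastener" should still find a patent that says "clasp":
hit = matches({"fastener"}, "A spring-loaded clasp for securing panels")
```

A naive keyword search for “fastener” misses that document entirely; an expanded one finds it. Scale that problem up across every technical domain and you see why ordinary citizens cannot navigate the database today.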

I first saw this in a tweet by Joel Gurin.

Federal Spending Data Elements

Sunday, February 15th, 2015

Federal Spending Data Elements

From the webpage:

The data elements in the below list represent the existing Federal Funding Accountability and Transparency Act (FFATA) data elements currently displayed on and the additional data elements that will be posted pursuant to the DATA Act. These elements are currently being deliberated on and discussed by the Federal community as a part of DATA Act implementation. At this point, this list is exhaustive. However, additional data elements may be standardized for transparency reporting in the future based on agency or community needs.

Join the Conversation

At this time, we are asking for comments in response to the following questions:

  1. Which data elements are most crucial to your current reporting and/or analysis?
  2. In setting standards, what are industry standards the Treasury and OMB should be considering?
  3. What are some of the considerations that Treasury and OMB should take into account when establishing data standards?

Just reading the responses to the questions on GitHub will give you a sense of what other community members are thinking about.

What responses are you going to contribute?

I first saw this in a tweet by Hudson Hollister.

Mercury [March 5, 2015, Washington, DC]

Saturday, February 14th, 2015

Mercury Registration Deadline: February 17, 2015.

From the post:

The Intelligence Advanced Research Projects Activity (IARPA) will host a Proposers’ Day Conference for the Mercury Program on March 5, in anticipation of the release of a new solicitation in support of the program. The Conference will be held from 8:30 AM to 5:00 PM EST in the Washington, DC metropolitan area. The purpose of the conference will be to provide introductory information on Mercury and the research problems that the program aims to address, to respond to questions from potential proposers, and to provide a forum for potential proposers to present their capabilities and identify potential team partners.

Program Description and Goals

Past research has found that publicly available data can be used to accurately forecast events such as political crises and disease outbreaks. However, in many cases, relevant data are not available, have significant lag times, or lack accuracy. Little research has examined whether data from foreign Signals Intelligence (SIGINT) can be used to improve forecasting accuracy in these cases.

The Mercury Program seeks to develop methods for continuous, automated analysis of SIGINT in order to anticipate and/or detect political crises, disease outbreaks, terrorist activity, and military actions. Anticipated innovations include: development of empirically driven sociological models for population-level behavior change in anticipation of, and response to, these events; processing and analysis of streaming data that represent those population behavior changes; development of data extraction techniques that focus on volume, rather than depth, by identifying shallow features of streaming SIGINT data that correlate with events; and development of models to generate probabilistic forecasts of future events. Successful proposers will combine cutting-edge research with the ability to develop robust forecasting capabilities from SIGINT data.

Mercury will not fund research on U.S. events, or on the identification or movement of specific individuals, and will only leverage existing foreign SIGINT data for research purposes.

The Mercury Program will consist of both unclassified and classified research activities and expects to draw upon the strengths of academia and industry through collaborative teaming. It is anticipated that teams will be multidisciplinary, and might include social scientists, mathematicians, statisticians, computer scientists, content extraction experts, information theorists, and SIGINT subject matter experts with applied experience in the U.S. SIGINT System.

Attendees must register no later than 6:00 pm EST, February 27, 2015 at Directions to the conference facility and other materials will be provided upon registration. No walk-in registrations will be allowed.
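The “shallow features of streaming data that correlate with events” idea in the solicitation can be illustrated with a toy detector: flag a day as anomalous when its message volume spikes well above the trailing window. Real Mercury-style models would be far richer; this only shows the shape of the idea.

```python
# Toy illustration: flag days whose count exceeds the trailing window's
# mean by more than `threshold` standard deviations.
from statistics import mean, stdev

def anomalies(daily_counts, window=7, threshold=3.0):
    """Indices of days whose count spikes above the trailing window."""
    flagged = []
    for i in range(window, len(daily_counts)):
        past = daily_counts[i - window:i]
        mu, sigma = mean(past), stdev(past)
        if sigma and daily_counts[i] > mu + threshold * sigma:
            flagged.append(i)
    return flagged

counts = [100, 102, 98, 101, 99, 103, 100, 97, 350, 101]
spikes = anomalies(counts)  # day 8 stands out
```

The hard part, as the rest of this post argues, is not the arithmetic; it is knowing what the spike means.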

I might be interested if you can hide me under a third or fourth level sub-contractor. 😉

Seriously, it isn’t that I despair of the legitimate missions of intelligence agencies, but I do despise waste on approaches known not to work. Government funding, even unlimited funding, isn’t going to magically confer the correct semantics on data or enable analysts to meaningfully share their work products across domains.

You would think that going on fourteen (14) years post-9/11 without being one step closer to preventing a similar event would be a “wake-up” call to someone. If not in the U.S. intelligence community, perhaps in intelligence communities who tire of aping the U.S. community with no better results.

OpenGov Voices: Bringing transparency to earmarks buried in the budget

Saturday, February 14th, 2015

OpenGov Voices: Bringing transparency to earmarks buried in the budget by Matthew Heston, Madian Khabsa, Vrushank Vora, Ellery Wulczyn and Joe Walsh.

From the post:

Last week, President Obama kicked off the fiscal year 2016 budget cycle by unveiling his $3.99 trillion budget proposal. Congress has the next eight months to write the final version, leaving plenty of time for individual senators and representatives, state and local governments, corporate lobbyists, bureaucrats, citizens groups, think tanks and other political groups to prod and cajole for changes. The final bill will differ from Obama’s draft in major and minor ways, and it won’t always be clear how those changes came about. Congress will reveal many of its budget decisions after voting on the budget, if at all.

We spent this past summer with the Data Science for Social Good program trying to bring transparency to this process. We focused on earmarks – budget allocations to specific people, places or projects – because they are “the best known, most notorious, and most misunderstood aspect of the congressional budgetary process” — yet remain tedious and time-consuming to find. Our goal: to train computers to extract all the earmarks from the hundreds of pages of mind-numbing legalese and numbers found in each budget.

Watchdog groups such as Citizens Against Government Waste and Taxpayers for Common Sense have used armies of human readers to sift through budget documents, looking for earmarks. The White House Office of Management and Budget enlisted help from every federal department and agency, and the process still took three months. In comparison, our software is free and transparent and generates similar results in only 15 minutes. We used the software to construct the first publicly available database of earmarks that covers every year back to 1995.

Despite our success, we barely scratched the surface of the budget. Not only do earmarks comprise a small portion of federal spending but senators and representatives who want to hide the money they budget for friends and allies have several ways to do it:

I was checking the Sunlight Foundation Blog for any updated information on the soon to be released indexes of federal data holdings when I encountered this jewel on earmarks.

Important to read/support because:

  1. By dramatically reducing the human time investment to find earmarks, it frees up that time to be spent gathering deeper information about each earmark.
  2. It represents a major step forward in the ability to discover relationships between players in the data (what the NSA wants to do but with a rationally chosen data set).
  3. It will educate you on earmarks and their hiding places.
  4. It is an inspirational example of how darkness can be replaced with transparency, some of it anyway.
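A crude version of the extraction step can be sketched with a regular expression that flags budget lines pairing a dollar amount with a named recipient. The real DSSG system used much richer models; this only shows why machines beat armies of human readers on sheer volume:

```python
# Crude sketch of earmark-candidate extraction: a dollar amount followed
# by "for"/"to" and a capitalized recipient name. Toy regex, not the
# DSSG team's actual method.
import re

EARMARK_LINE = re.compile(
    r"\$(?P<amount>[\d,]+)\s+(?:for|to)\s+(?:the\s+)?"
    r"(?P<recipient>(?:[A-Z]\w+\s*)+)"
)

def find_candidates(text):
    """(amount, recipient) pairs for every earmark-shaped phrase in text."""
    return [(m.group("amount"), m.group("recipient").strip())
            for m in EARMARK_LINE.finditer(text)]

sample = ("Provided, that $2,500,000 for the Example County Bridge Project "
          "shall remain available.")
hits = find_candidates(sample)
```

Run over hundreds of pages of legalese, even a crude pass like this turns months of human reading into a reviewable candidate list in minutes.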

Will transparency reduce earmarks? I rather doubt it because a sense of shame doesn’t seem to motivate elected and appointed officials.

What transparency can do is create a more level playing field for those who want to buy government access and benefits.

For example, if I knew what it cost to have the following exemption in the FOIA:

Exemption 9: Geological information on wells.

it might be possible to raise enough funds to purchase the deletion of:

Exemption 5: Information that concerns communications within or between agencies which are protected by legal privileges, that include but are not limited to:

3. Deliberative Process Privilege

Which is where some staffers hide their negotiations with former staffers as they prepare to exit the government.

I don’t know that matching what Big Oil paid for the geological information on wells exemption would be enough but it would set a baseline for what it takes to start the conversation.

I say “Big Oil paid…” assuming that most of us don’t equate matters of national security with geological information. Do you have another explanation for such an offbeat provision?

If government is (and I think it is) for sale, then let’s open up the bidding process.

A big win for open government: Sunlight gets U.S. to…

Saturday, February 14th, 2015

A big win for open government: Sunlight gets U.S. to release indexes of federal data by Matthew Rumsey and Sean Vitka and John Wonderlich.

From the post:

For the first time, the United States government has agreed to release what we believe to be the largest index of government data in the world.

On Friday, the Sunlight Foundation received a letter from the Office of Management and Budget (OMB) outlining how they plan to comply with our FOIA request from December 2013 for agency Enterprise Data Inventories. EDIs are comprehensive lists of a federal agency’s information holdings, providing an unprecedented view into data held internally across the government. Our FOIA request was submitted 14 months ago.

These lists of the government’s data were not public, however, until now. More than a year after Sunlight’s FOIA request and with a lawsuit initiated by Sunlight about to be filed, we’re finally going to see what data the government holds.

Since 2013, federal agencies have been required to construct a list of all of their major data sets, subject only to a few exceptions detailed in President Obama’s executive order as well as some information exempted from disclosure under the FOIA.

Many kudos to the Sunlight Foundation!

As to using the word “win,” do we need to wait and see which Enterprise Data Inventories are in fact produced?

I say that because the executive order of President Obama that is cited in the post, provides these exemptions from disclosure:

4(d) Nothing in this order shall compel or authorize the disclosure of privileged information, law enforcement information, national security information, personal information, or information the disclosure of which is prohibited by law.

Will that be taken as an excuse to not list the data collections at all?

Or, will the NSA say:

one (1) collection of telephone metadata, timeSpan: 4 (d) exempt, size: 4 (d) exempt, metadataStructure: 4 (d) exempt source: 4 (d) exempt

Do they mean internal NSA phone logs? Do they mean some other source?

Or will they simply not list telephone metadata at all?
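When the inventories do arrive, checking them for exactly these 4(d)-shaped holes is a mechanical job. The field names below follow the “data.json” metadata convention agencies were directed to use under the open data policy; treat them as assumptions and check the schema the released files actually ship with:

```python
# Sketch: scan an Enterprise Data Inventory for data sets that are listed
# but not marked fully public. Field names assume the data.json convention.
import json

inventory_json = """
{"dataset": [
  {"title": "Telephone metadata", "accessLevel": "restricted public"},
  {"title": "Visitor logs", "accessLevel": "public"}
]}
"""

inventory = json.loads(inventory_json)

def withheld(inv):
    """Titles of data sets listed but not marked fully public."""
    return [d["title"] for d in inv["dataset"]
            if d.get("accessLevel") != "public"]

hidden = withheld(inventory)
```

The output is a to-do list for follow-up FOIA requests: everything acknowledged to exist but not released.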

What’s exempt under FOIA? (From

Not all records can be released under the FOIA.  Congress established certain categories of information that are not required to be released in response to a FOIA request because release would be harmful to governmental or private interests.   These categories are called "exemptions" from disclosures.  Still, even if an exemption applies, agencies may use their discretion to release information when there is no foreseeable harm in doing so and disclosure is not otherwise prohibited by law.  There are nine categories of exempt information and each is described below.  

Exemption 1: Information that is classified to protect national security.  The material must be properly classified under an Executive Order.

Exemption 2: Information related solely to the internal personnel rules and practices of an agency.

Exemption 3: Information that is prohibited from disclosure by another federal law. Additional resources on the use of Exemption 3 can be found on the Department of Justice FOIA Resources page.

Exemption 4: Information that concerns business trade secrets or other confidential commercial or financial information.

Exemption 5: Information that concerns communications within or between agencies which are protected by legal privileges, that include but are not limited to:

  1. Attorney-Work Product Privilege
  2. Attorney-Client Privilege
  3. Deliberative Process Privilege
  4. Presidential Communications Privilege

Exemption 6: Information that, if disclosed, would invade another individual’s personal privacy.

Exemption 7: Information compiled for law enforcement purposes if one of the following harms would occur.  Law enforcement information is exempt if it: 

  • 7(A). Could reasonably be expected to interfere with enforcement proceedings
  • 7(B). Would deprive a person of a right to a fair trial or an impartial adjudication
  • 7(C). Could reasonably be expected to constitute an unwarranted invasion of personal privacy
  • 7(D). Could reasonably be expected to disclose the identity of a confidential source
  • 7(E). Would disclose techniques and procedures for law enforcement investigations or prosecutions
  • 7(F). Could reasonably be expected to endanger the life or physical safety of any individual

Exemption 8: Information that concerns the supervision of financial institutions.

Exemption 9: Geological information on wells.

And the exclusions:

Congress has provided special protection in the FOIA for three narrow categories of law enforcement and national security records. The provisions protecting those records are known as “exclusions.” The first exclusion protects the existence of an ongoing criminal law enforcement investigation when the subject of the investigation is unaware that it is pending and disclosure could reasonably be expected to interfere with enforcement proceedings. The second exclusion is limited to criminal law enforcement agencies and protects the existence of informant records when the informant’s status has not been officially confirmed. The third exclusion is limited to the Federal Bureau of Investigation and protects the existence of foreign intelligence or counterintelligence, or international terrorism records when the existence of such records is classified. Records falling within an exclusion are not subject to the requirements of the FOIA. So, when an office or agency responds to your request, it will limit its response to those records that are subject to the FOIA.

You can spot, as well as I can, the truck-sized holes that may prevent disclosure.

One analytic challenge upon the release of the Enterprise Data Inventories will be to determine what is present and what is missing but should be present. Another will be to assist the Sunlight Foundation in its pursuit of additional FOIAs to obtain data listed but not available. Perhaps I should call this an important victory, although of a battle and not of the long-term war for government transparency.


FBI Records: The Vault

Wednesday, February 11th, 2015

FBI Records: The Vault

From the webpage:

The Vault is our new FOIA Library, containing 6,700 documents and other media that have been scanned from paper into digital copies so you can read them in the comfort of your home or office. 

Included here are many new FBI files that have been released to the public but never added to this website; dozens of records previously posted on our site but removed as requests diminished; files from our previous FOIA Library, and new, previously unreleased files.

The Vault includes several new tools and resources for your convenience:

  • Searching for Topics: You can browse or search for specific topics or persons (like Al Capone or Marilyn Monroe) by viewing our alphabetical listing, by using the search tool in the upper right of this site, or by checking the different category lists that can be found in the menu on the right side of this page. In the search results, click on the folder to see all of the files for that particular topic.
  • Searching for Key Words: Thanks to new technology we have developed, you can now search for key words or phrases within some individual files. You can search across all of our electronic files by using the search tool in the upper right of this site, or you can search for key words within a specific document by typing in terms in the search box in the upper right hand of the file after it has been opened and loaded. Note: since many of the files include handwritten notes or are not always in optimal condition due to age, this search feature does not always work perfectly.
  • Viewing the Files: We are now using an open source web document viewer, so you no longer need your own file software to view our records. When you click on a file, it loads in a reader that enables you to view one or two pages at a time, search for key words, shrink or enlarge the size of the text, use different scroll features, and more. In many cases, the quality and clarity of the individual files has also been improved.
  • Requesting a Status Update: Use our new Check the Status of Your FOI/PA Request tool to determine where your request stands in our process. Status information is updated weekly. Note: You need your FOI/PA request number to use this feature.

Please note: the content of the files in the Vault encompasses all time periods of Bureau history and do not always reflect the current views, policies, and priorities of the FBI.

New files will be added on a regular basis, so please check back often.

This may be meant as a distraction, but I don’t know from what.

I suppose there is some value in knowing that ineffectual law enforcement investigations did not begin with 9/11.
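The key-word search caveat quoted above (OCR of aged, often handwritten pages produces garbled text) also suggests a client-side workaround once you have downloaded files: fuzzy matching instead of exact substring search. A minimal sketch using Python’s standard-library difflib; the 0.8 similarity threshold is an assumption you would tune for your own searches:

```python
import difflib

def fuzzy_find(term, text, threshold=0.8):
    """Return words in OCR'd text that approximately match `term`.

    OCR of decades-old scans garbles characters, so an exact search
    misses hits like 'sauccr' for 'saucer'.  difflib's similarity
    ratio tolerates such single-character errors.
    """
    term = term.lower()
    hits = []
    for word in text.lower().split():
        word = word.strip(".,;:()\"'")
        if difflib.SequenceMatcher(None, term, word).ratio() >= threshold:
            hits.append(word)
    return hits

# 'sauccr' is a plausible OCR error for 'saucer'
page = "Witness described a flying sauccr over the airfield"
print(fuzzy_find("saucer", page))  # ['sauccr']
```

The same single-word search strategy the site recommends ("saucer" rather than "flying saucer") applies here: shorter terms leave the ratio more room to absorb OCR damage.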

Encouraging open data usage…

Saturday, February 7th, 2015

Encouraging open data usage by commercial developers: Report

From the post:

The second Share-PSI workshop was very different from the first. Apart from presentations in two short plenary sessions, the majority of the two days was spent in facilitated discussions around specific topics. This followed the success of the bar camp sessions at the first workshop, that is, sessions proposed and organised in an ad hoc fashion, enabling people to discuss whatever subject interests them.

Each session facilitator was asked to focus on three key questions:

  1. What X is the thing that should be done to publish or reuse PSI?
  2. Why does X facilitate the publication or reuse of PSI?
  3. How can one achieve X and how can you measure or test it?

This report summarises the 7 plenary presentations, 17 planned sessions and 7 bar camp sessions. As well as the Share-PSI project itself, the workshop benefited from sessions led by 8 other projects. The agenda for the event includes links to all papers, slides and notes, with many of those notes being available on the project wiki. In addition, the #sharepsi tweets from the event are archived, as are a number of photo albums from Makx Dekkers, Peter Krantz and José Luis Roda. The event received a generous write-up on the host’s Web site (in Portuguese). The spirit of the event is captured in this video by Noël Van Herreweghe of CORVe.

To avoid confusion, PSI in this context means Public Sector Information, not Published Subject Identifier (PSI).

Amazing coincidence that the W3C has smudged yet another name. You may recall the W3C decided to confuse URIs and IRIs in its latest attempt to re-write history, calling both by the acronym URI:

Within this specification, the term URI refers to a Universal Resource Identifier as defined in [RFC 3986] and extended in [RFC 2987] [RFC 3987] with the new name IRI. The term URI has been retained in preference to IRI to avoid introducing new names for concepts such as “Base URI” that are defined or referenced across the whole family of XML specifications. (Corrected the RFC listing as shown.) (XQuery and XPath Data Model 3.1, N. Walsh, J. Snelson, Editors, W3C Candidate Recommendation (work in progress), 18 December 2014.)

Interesting discussion, but I would pay very close attention to market demand (perhaps I should say, commercial market demand) before planning a start-up based on government data. There is unlimited demand for free data, or even better, free enhanced data, but that should not be confused with enhanced data that can be sold to support a start-up on an ongoing basis.

To give you an idea of the uncertainty of conditions for start-ups relying on open data, let me quote the final bullet points of this article:

  • There is a lack of knowledge of what can be done with open data which is hampering uptake.
  • There is a need for many examples of success to help show what can be done.
  • Any long term re-use of PSI must be based on a business plan.
  • Incubators/accelerators should select projects to support based on the business plan.
  • Feedback from re-users is an important component of the ecosystem and can be used to enhance metadata.
  • The boundary between what the public and private sectors can, should and should not do needs to be better defined to allow the public sector to focus on its core task and businesses to invest with confidence.
  • It is important to build an open data infrastructure, both legal and technical, that supports the sharing of PSI as part of normal activity.
  • Licences and/or rights statements are essential and should be machine readable. This is made easier if the choice of licences is minimised.
  • The most valuable data is the data that the public sector already charges for.
  • Include domain experts who can articulate real problems in hackathons (whether they write code or not).
  • Involvement of the user community and timely response to requests is essential.
  • There are valid business models that should be judged by their effectiveness and/or social impact rather than financial gain.

Just so you know, that last point:

There are valid business models that should be judged by their effectiveness and/or social impact rather than financial gain.

that is not a business model, unless you have renewal financing from some source other than financial gain. That is a charity model, where you are the object of the charity.
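The machine-readable licence point in the list above is worth making concrete. A minimal sketch of a rights statement a publisher might ship alongside a dataset; the field names here are illustrative, not a standard vocabulary (in practice you would reach for something like ccREL or DCAT metadata):

```python
import json

# Illustrative machine-readable rights statement for a published
# dataset; the field names are hypothetical, not a standard schema.
rights = {
    "dataset": "city-transport-2015",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "attribution_required": True,
    "commercial_use_allowed": True,
}
print(json.dumps(rights, sort_keys=True))
```

The point of the recommendation is that a re-user’s tooling can check `commercial_use_allowed` programmatically instead of a lawyer reading prose, which is also why minimising the number of distinct licences makes the machine-readable form easier.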

Forty and Seven Inspectors General Hit a Stone Wall

Thursday, February 5th, 2015

Inspectors general testify against agency ‘stonewalling’ before Congress by Sarah Westwood.

From the post:

Frustration with federal agencies that block probes from their inspectors general bubbled over Tuesday in a congressional hearing that dug into allegations of obstruction from a number of government watchdogs.

The Peace Corps, Environmental Protection Agency and Justice Department inspectors general each argued to members of the House Oversight and Government Reform Committee that some of their investigations had been thwarted or stalled by officials who refused to release necessary information to their offices.

Committee members from both parties doubled down on criticisms of the Justice Department’s lack of transparency and called for solutions to the government-wide problem during their first official hearing of the 114th Congress.

“If you can’t do your job, then we can’t do our job in Congress,” Chairman Jason Chaffetz, R-Utah, told the three witnesses and the scores of agency watchdogs who also attended, including the Department of Homeland Security and General Service Administration inspectors general.

Michael Horowitz, the Justice Department’s inspector general, testified that the FBI began reviewing requested documents in 2010 in what he said was a clear violation of federal law that is supposed to grant watchdogs unfettered access to agency records.

The FBI’s process, which involves clearing the release of documents with the attorney general or deputy attorney general, “seriously impairs inspector general independence, creates excessive delays, and may lead to incomplete, inaccurate or significantly delayed findings or recommendations,” Horowitz said.

Perhaps no surprise that the FBI shows up in the non-transparency column. But given the number of inspectors general with similar problems (47), it seems to be part of a larger herd.

If you are interested in going further into this issue, there was a hearing last August (2014), Obstructing Oversight: Concerns from Inspectors General, which is here in ASCII and here with video and witness statements in PDF.

Both sources omit the following documents:

  • Sept. 9, 2014, letter to Chairman Issa from OMB, submitted by Chairman Issa (p. 58)
  • Aug. 5, 2014, letter to Reps. Issa, Cummings, Carper, and Coburn from 47 IGs, submitted by Rep. Chaffetz (p. 61)
  • Aug. 8, 2014, letter to OMB from Reps. Carper, Coburn, Issa and Cummings, submitted by Rep. Walberg (p. 69)
  • Statement for the record from The Institute of Internal Auditors (p. 71)

Isn’t it rather lame to leave these items in the table of contents but omit them from the ASCII version and not even include them with the witness statements?

I’m curious who the other forty-four (44) inspectors general might be. Aren’t you?

If you know where to find these appendix materials, please send me a pointer.

I think it will be more effective to list all of the inspectors general who have encountered this stone-wall treatment than to treat them as all and sundry.

Chairman Jason Chaffetz suggests that by controlling funding, Congress can force transparency. I would use a finer knife. Cut all funding for health care and retirement benefits in the agencies/departments in question. See how the rank and file in the agencies like them apples.

Assuming transparency results, I would not restore those benefits retroactively. Staff chose to support, explicitly or implicitly, illegal behavior. Making bad choices has negative consequences. It would be a teaching opportunity for all future federal staff members.

[U.S.] President’s Fiscal Year 2016 Budget

Wednesday, February 4th, 2015

Data for the President’s Fiscal Year 2016 Budget

From the webpage:

Each year, after the President’s State of the Union address, the Office of Management and Budget releases the Administration’s Budget, offering proposals on key priorities and newly announced initiatives. This year we are releasing all of the data included in the President’s Fiscal Year 2016 Budget in a machine-readable format here on GitHub. The Budget process should be a reflection of our values as a country, and we think it’s important that members of the public have as many tools at their disposal as possible to see what is in the President’s proposals. And, if they’re motivated to create their own visualizations or products from the data, they should have that chance as well.

You can see the full Budget on Medium.

About this Repository

This repository includes three data files that contain an extract of the Office of Management and Budget (OMB) budget database. These files can be used to reproduce many of the totals published in the Budget and examine unpublished details below the levels of aggregation published in the Budget.

The user guide file contains detailed information about this data, its format, and its limitations. In addition, OMB provides additional data tables, explanations and other supporting documents in XLS format on its website.

Feedback and Issues

Please submit any feedback or comments on this data, or the Budget process here.

Before you start cheering too loudly, spend a few minutes with the User Guide. Not impenetrable but not an easy stroll either. I suspect the additional data tables, etc. are going to be necessary for interpretation of the main files.

Writing up how to use this data set would be a large but worthwhile undertaking.

A larger but also worthwhile project would be to track how the initial allocations in the budget change through the legislative process, that is, to know on a day-to-day basis which departments, programs, etc. are up or down. Tied to votes in Congress and particular amendments, that could prove to be very interesting.
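As a starting point for working with the repository’s extract, here is a minimal sketch that sums one fiscal-year column per agency. The column names (“Agency Name” and a “2016” year column) are assumptions about the CSV layout; check the user guide mentioned above before relying on them:

```python
import csv
from collections import defaultdict

def totals_by_agency(path, year="2016"):
    """Sum one fiscal-year column per agency in an OMB budget extract.

    'Agency Name' and the year-named columns are assumptions about the
    CSV layout -- verify against the repository's user guide.  Amounts
    in the extract are reported in thousands of dollars.
    """
    totals = defaultdict(int)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            raw = (row.get(year) or "").replace(",", "").strip()
            if raw.lstrip("-").isdigit():  # skip blanks and placeholders
                totals[row["Agency Name"]] += int(raw)
    return dict(totals)
```

Reproducing a published total from the Budget with a loop like this is also a quick sanity check that you are reading the extract the way OMB intended.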

Update: A tweet from Aaron Kirschenfeld directed us to: The U.S. Tax Code Is a Travesty by John Cassidy. Cassidy says to take a look at table S-9 in the numbers section under “Loophole closers.” The trick to the listed loopholes is that very few people qualify for the loophole. See Cassidy’s post for the details.

Other places that merit special attention?

Update: DHS Budget Justification 2016 (3906 pages, PDF). First saw this in a tweet by Dave Maass.

Project Blue Book Collection (UFO’s)

Thursday, January 22nd, 2015

Project Blue Book Collection

From the webpage:

This site was created by The Black Vault to house 129,491 pages, comprising more than 10,000 cases of the declassified Project Blue Book, Project Sign and Project Grudge files. Project Blue Book (along with Sign and Grudge) was the name given to the official investigation by the United States military to determine what the Unidentified Flying Object (UFO) phenomenon was. It lasted from 1947 to 1969. Below you will find the case files compiled for research, and available free to download.

The CNN report Air Force UFO files land on Internet by Emanuella Grinberg notes that Roswell is omitted from these files.

You won’t find anything new here; the files have been available on microfilm for years, but being searchable and on the Internet is a step forward in terms of accessibility.

When I say “searchable,” the site notes:

1) A search is a good start — but is not 100% — There are more than 10,000 .pdf files here and although all of them are indexed in the search engine, the quality of the original documents, given the fact that many of them are more than 6 decades old, is very poor. This means that when they are converted to text for searching, many of the words are not readable to a computer. As a tip: make your search as basic as possible. Searching for a location? Just search a city, then the state, to see what comes up. Searching for a type of UFO? Use “saucer” vs. “flying saucer” or longer expression. It will increase the chances of finding what you are looking for.

2) The text may look garbled on the search results page (but not the .pdf!) — This is normal. For the same reason above… converting a sentence that may read ok to the human eye, may be gibberish to a computer due to the quality of the decades old state of many of the records. Don’t let that discourage you. Load the .PDF and see what you find. If you searched for “Hollywood” and a .pdf hit came up for Rome, New York, there is a reason why. The word “Hollywood” does appear in the file…so check it out!

3) Not everything was converted to .pdfs — There are a few case files in the Blue Book system that were simply too large to convert. They are:

undated/xxxx-xx-9667997-[BLANK][ 8,198 Pages ]
undated/xxxx-xx-9669100-[ILLEGIBLE]-[ILLEGIBLE]-/ [ 1,450 Pages ]
undated/xxxx-xx-9669191-[ILLEGIBLE]/ [ 3,710 Pages ]

These files will be sorted at a later date. If you are interested in helping, please email

I tried to access the not-yet-processed files but was redirected. I will see what is required to view them.

If you are interested in trying your skills at PDF conversion/improvement, the main data set should be more than sufficient.

If you are interested in automatic discovery of what or who was blacked out of government reports, this is also an interesting data set. Personally I think blacking out passages should be forbidden. People should have to accept the consequences of their actions, good or bad. We require that of citizens, why not government staff?

I assume crowd-sourcing corrections has already been considered. 130K pages is a fairly small number when it comes to crowd sourcing. Surely there are more than 10,000 people interested in the data set, which works out to about 13 pages each. If each one did 100 pages, you would have more than enough overlap to do statistics to choose the best corrections.
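The overlap-and-statistics idea above can be sketched as simple majority voting over redundant transcriptions of the same page. This is a minimal sketch; the page identifiers and volunteer submissions are hypothetical:

```python
from collections import Counter

def consensus(transcriptions):
    """Pick the most common transcription for each page.

    `transcriptions` maps a page id to the list of texts submitted by
    different volunteers; with several volunteers per page, the
    majority reading usually wins over any one volunteer's typos.
    """
    return {page: Counter(texts).most_common(1)[0][0]
            for page, texts in transcriptions.items()}

submissions = {
    "blue-book-p1": ["flying saucer seen",
                     "flying saucer seen",
                     "flying sauccr seen"],
}
print(consensus(submissions))  # {'blue-book-p1': 'flying saucer seen'}
```

A real effort would vote at the word or line level rather than on whole pages, so that two volunteers who disagree on one word do not cancel each other out, but the principle is the same.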

For those of you who see patterns in UFO reports, a good way to reach across the myriad sightings and reports would be to topic map the entire collection.

Personally I suspect at least some of the reports do concern alien surveillance and the absence in the intervening years indicates they have lost interest. Given our performance since the 1940’s, that’s not hard to understand.