Archive for the ‘Government Data’ Category

Congress.gov Officially Out of Beta

Tuesday, October 28th, 2014

Congress.gov Officially Out of Beta

From the post:

The free legislative information website, Congress.gov, is officially out of beta form, and beginning today includes several new features and enhancements. URLs that include beta.congress.gov will be redirected to Congress.gov. The site now includes the following:

New Feature: Resources

  • A new resources section providing an A to Z list of hundreds of links related to Congress
  • An expanded list of “most viewed” bills each day, archived to July 20, 2014

New Feature: House Committee Hearing Videos

  • Live streams of House Committee hearings and meetings, and an accompanying archive dating back to January 2012

Improvement: Advanced Search

  • Support for 30 new fields, including nominations, Congressional Record and name of member

Improvement: Browse

  • Days in session calendar view
  • Roll Call votes
  • Bill by sponsor/co-sponsor

When the Library of Congress, in collaboration with the U.S. Senate, U.S. House of Representatives and the Government Printing Office (GPO), released Congress.gov as a beta site in the fall of 2012, it included bill status and summary, member profiles and bill text from the two most recent congresses at that time – the 111th and 112th.

Since that time, Congress.gov has expanded with the additions of the Congressional Record, committee reports, direct links from bills to cost estimates from the Congressional Budget Office, legislative process videos, committee profile pages, nominations, historic access reaching back to the 103rd Congress and user accounts enabling saved personal searches. Users have been invited to provide feedback on the site’s functionality, which has been incorporated along with the data updates.

Plans are in place for ongoing enhancements in the coming year, including addition of treaties, House and Senate Executive Communications and the Congressional Record Index.

Field Value Lists:

Use search fields in the main search box (available on most pages), or via the advanced and command line search pages. Use terms or codes from the Field Value Lists with corresponding search fields: Congress [congressId], Action – Words and Phrases [billAction], Subject – Policy Area [billSubject], or Subject (All) [allBillSubjects].

Congresses (44 entries; the list stops with the 70th Congress, 1927–1929)

Legislative Subject Terms, Subject Terms (541), Geographic Entities (279), Organizational Names (173). (total 993)

Major Action Codes (98)

Policy Area (33)

Search options:

Search Form: “Choose collections and fields from dropdown menus. Add more rows as needed. Use Major Action Codes and Legislative Subject Terms for more precise results.”

Command Line: “Combine fields with operators. Refine searches with field values: Congresses, Major Action Codes, Policy Areas, and Legislative Subject Terms. To use facets in search results, copy your command line query and paste it into the home page search box.”

Search Tips Overview: “You can search using the quick search available on most pages or via the advanced search page. Advanced search gives you the option of using a guided search form or a command line entry box.” (includes examples)
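For the command line search, a query is just field:value terms combined with operators. Here is a toy sketch in Python of composing one from the Field Value Lists above — note the exact Congress.gov query syntax is an assumption for illustration; consult the search tips page for the real rules:

```python
# Sketch: composing a command-line query from field/value pairs.
# The field codes (congressId, billSubject, ...) come from the Field
# Value Lists; the field:value AND syntax is assumed for illustration.

def build_query(**fields):
    """Join field:value terms with AND, quoting values that contain spaces."""
    terms = []
    for field, value in fields.items():
        if " " in value:
            value = f'"{value}"'
        terms.append(f"{field}:{value}")
    return " AND ".join(terms)

query = build_query(congressId="113", billSubject="Health")
print(query)  # congressId:113 AND billSubject:Health
```

Per the quoted tip above, a query like this would be pasted into the home page search box to get faceted results.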


You can follow this project @congressdotgov.

Orientation to Legal Research & Congress.gov is available both as a seminar (in-person) and webinar (online).


I first saw this at Congress.gov is Out of Beta with New Features by Africa S. Hands.

A $23 million venture fund for the government tech set

Tuesday, September 16th, 2014

A $23 million venture fund for the government tech set by Nancy Scola.

Nancy tells a compelling story of a new VC firm, GovTech, which is looking for startups focused on providing governments with better technology infrastructure.

Three facts from the story stand out:

“The U.S. government buys 10 eBays’ worth of stuff just to operate,” from software to heavy-duty trucking equipment.

…working with government might be a tortuous slog, but Bouganim says that he saw that behind that red tape lay a market that could be worth in the neighborhood of $500 billion a year.

What most people don’t realize is government spends nearly $74 billion on technology annually. As a point of comparison, the video game market is a $15 billion annual market.

See Nancy’s post for the full flavor of the story but it sounds like there is gold buried in government IT.

Another way to look at it is the government is already spending $74 billion a year on technology that is largely an object of mockery and mirth. Effective software may be sufficiently novel and threatening to either attract business or a buy-out.

While you are pondering possible opportunities, existing systems, their structures and data are “subjects” in topic map terminology. Which means topic maps can protect existing contracts and relationships, while delivering improved capabilities and data.

Promote topic maps as “in addition to” existing IT systems and you will encounter less resistance both from within and without the government.

Don’t be squeamish about associating with governments, of whatever side. Their money spends just like everyone else’s. You can ask AT&T and IBM about supporting both sides in a conflict.

I first saw this in a tweet by Mike Bracken.

Army can’t track spending on $4.3b system to track spending, IG finds

Sunday, September 14th, 2014

Army can’t track spending on $4.3b system to track spending, IG finds, by Mark Flatten.

From the post:

More than $725 million was spent by the Army on a high-tech network for tracking supplies and expenses that failed to comply with federal financial reporting rules meant to allow auditors to track spending, according to an inspector general’s report issued Wednesday.

The Global Combat Support System-Army, a logistical support system meant to track supplies, spare parts and other equipment, was launched in 1997. In 2003, the program switched from custom software to a web-based commercial software system.

About $95 million was spent before the switch was made, according to the report from the Department of Defense IG.

As of this February, the Army had spent $725.7 million on the system, which is ultimately expected to cost about $4.3 billion.

The problem, according to the IG, is that the Army has failed to comply with a variety of federal laws that require agencies to standardize reporting and prepare auditable financial statements.

The report is full of statements like this one:

PMO personnel provided a system change request, which they indicated would correct four account attributes in July 2014. In addition, PMO personnel provided another system change request they indicated would correct the remaining account attribute (Prior Period Adjustment) in late FY 2015.

PMO = Project Management Office (in this case, of GCSS–Army).

The lack of identification of personnel speaking on behalf of the project or various offices pervades the report. Moreover, the same is true for twenty-seven (27) other reports on issues with this project.

If the sources of statements and information were identified in these reports, then it would be possible to track people across reports and to identify who has failed to follow up on representations made in the reports.

The first step towards accountability is identification of decision makers in audit reports.

Tracking decision makers from one position to another and linking them to specific decisions is a natural application of topic maps.
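To make the point concrete, here is a toy Python sketch of what identification would enable: collecting every representation a named official makes, across reports, in one place. The names and report numbers below are invented:

```python
# Illustrative only: if audit reports named the personnel behind each
# representation, statements could be grouped per person across reports.
# All speakers, reports and claims here are hypothetical.
from collections import defaultdict

statements = [
    {"report": "IG-2014-087", "speaker": "J. Doe (PMO)",
     "claim": "account attributes corrected by July 2014"},
    {"report": "IG-2015-012", "speaker": "J. Doe (PMO)",
     "claim": "prior period adjustment corrected in late FY 2015"},
    {"report": "IG-2015-012", "speaker": "A. Smith (PMO)",
     "claim": "system change request submitted"},
]

by_person = defaultdict(list)
for s in statements:
    by_person[s["speaker"]].append((s["report"], s["claim"]))

# Every representation "J. Doe" made, across reports:
for report, claim in by_person["J. Doe (PMO)"]:
    print(report, "-", claim)
```

With twenty-eight reports on one project, even this crude grouping would show who followed up on what.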

I first saw this in Links I Liked by Chris Blattman, September 7, 2014.

6,482 Datasets Available

Tuesday, August 26th, 2014

6,482 Datasets Available Across 22 Federal Agencies In Data.json Files by Kin Lane.

From the post:

It has been a few months since I ran any of my federal government data.json harvesting, so I picked back up my work, and will be doing more work around datasets that federal agencies have been making available, and telling the stories across my network.

I’m still surprised at how many people are unaware that 22 of the top federal agencies have data inventories of their public data assets, available in the root of their domain as a data.json file. This means you can go to many of these domains and find a machine-readable list of that agency’s current inventory of public datasets.
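Reading one of these inventories takes only a few lines. A live version would fetch https://<agency>.gov/data.json; the inline sample below stands in so the sketch is self-contained, and its "dataset"/"title" layout follows the Project Open Data metadata schema:

```python
# Minimal sketch of reading an agency data.json inventory. The sample
# document is invented; the top-level "dataset" list with "title" and
# "modified" fields follows the Project Open Data metadata schema.
import json

sample = """
{
  "dataset": [
    {"title": "Enforcement Actions", "modified": "2014-08-01"},
    {"title": "Facility Registrations", "modified": "2014-07-15"}
  ]
}
"""

inventory = json.loads(sample)
titles = [d["title"] for d in inventory["dataset"]]
print(titles)  # ['Enforcement Actions', 'Facility Registrations']
```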

See Kin’s post for links to the agency data.json files.

You may also want to read: What Happened With Federal Agencies And Their Data.json Files, which details Kin’s earlier efforts with tracking agency data.json files.

Kin points out that these data.json files are governed by: OMB M-13-13 Open Data Policy—Managing Information as an Asset. It’s pretty joyless reading but if you are interested in the policy details or the requirements agencies must meet, it’s required reading.

If you are looking for datasets to clean up or combine together, it would be hard to imagine a more diverse set to choose from.

FDA Recall Data

Wednesday, July 16th, 2014

OpenFDA Provides Ready Access to Recall Data by Taha A. Kass-Hout.

From the post:

Every year, hundreds of foods, drugs, and medical devices are recalled from the market by manufacturers. These products may be labeled incorrectly or might pose health or safety issues. Most recalls are voluntary; in some cases they may be ordered by the U.S. Food and Drug Administration. Recalls are reported to the FDA, and compiled into its Recall Enterprise System, or RES. Every week, the FDA releases an enforcement report that catalogues these recalls. And now, for the first time, there is an Application Programming Interface (API) that offers developers and researchers direct access to all of the drug, device, and food enforcement reports, dating back to 2004.

The recalls in this dataset provide an illuminating window into both the safety of individual products and the safety of the marketplace at large. Recent reports have included such recalls as certain food products (for not containing the vitamins listed on the label), a soba noodle salad (for containing unlisted soy ingredients), and a pain reliever (for not following laboratory testing requirements).

You will get warnings that this data is “not for clinical use.”

Sounds like a treasure trove of data if you are looking for products still being sold despite being recalled.

Or if you want to advertise for “victims” of faulty products that have been recalled.

I think both of those are non-clinical uses. ;-)
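If you want to poke at the recall data yourself, a query URL can be assembled in a couple of lines. The api.fda.gov enforcement endpoint and the search/limit parameters follow the openFDA documentation as of this post; verify against the current docs before relying on them:

```python
# Sketch: building an openFDA enforcement-report query URL.
# Endpoint and parameter names per the openFDA docs (check current docs);
# the Class I filter is just an example value.
from urllib.parse import urlencode

base = "https://api.fda.gov/food/enforcement.json"
params = {
    "search": 'classification:"Class I"',  # the most serious recalls
    "limit": 5,
}
url = f"{base}?{urlencode(params)}"
print(url)
```

Swap `food` for `drug` or `device` to query the other enforcement datasets the post mentions.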

Free Companies House data to boost UK economy

Tuesday, July 15th, 2014

Free Companies House data to boost UK economy

From the post:

Companies House is to make all of its digital data available free of charge. This will make the UK the first country to establish a truly open register of business information.

As a result, it will be easier for businesses and members of the public to research and scrutinise the activities and ownership of companies and connected individuals. Last year (2013/14), customers searching the Companies House website spent £8.7 million accessing company information on the register.

This is a considerable step forward in improving corporate transparency; a key strand of the G8 declaration at the Lough Erne summit in 2013.

It will also open up opportunities for entrepreneurs to come up with innovative ways of using the information.

This change will come into effect from the second quarter of 2015 (April – June).

In a sidebar, Business Secretary Vince Cable said in part:

Companies House is making the UK a more transparent, efficient and effective place to do business.

I’m not sure about “efficient,” but providing incentives for lawyers and others to track down insider trading and other business as usual practices and arming them with open data would be a start in the right direction.

I first saw this in a tweet by Hadley Beeman.


MuckRock

Wednesday, July 9th, 2014


From the webpage:

MuckRock is an open news tool powered by state and federal Freedom of Information laws and you: Requests are based on your questions, concerns and passions, and you are free to embed, share and write about any of the verified government documents hosted here. Want to learn more? Check out our about page. MuckRock has been funded in part by grants from the Sunlight Foundation, the Freedom of the Press Foundation and the Knight Foundation.

Join Our Mailing List »

An amazing site.

I found MuckRock while looking for documents released by mistake by DHS. DHS Releases Trove of Documents Related to Wrong “Aurora” in Response to Freedom of Information Act (FOIA) Request (Maybe the DHS needs a topic map?)

I’ve signed up for their mailing list. Thinking about what government lies I want to read. ;-)

Looks like a great place to use your data mining/analysis skills.


Graphing 173 Million Taxi Rides

Thursday, June 26th, 2014

Interesting taxi rides dataset by Danny Bickson.

From the post:

I got the following from my collaborator Zach Nation: a NY taxi ride dataset that was not properly anonymized and was reverse engineered to find interesting insights in the data.

Danny mapped the data using GraphLab and asks some interesting questions of the data.

BTW, Danny is offering the iPython notebook to play with!


This is the same data set I mentioned in: On Taxis and Rainbows

Friendly Fire: Death, Delay, and Dismay at the VA

Wednesday, June 25th, 2014

Friendly Fire: Death, Delay, and Dismay at the VA by Sen. Tom Coburn, M.D.

From the introduction:

Too many men and women who bravely fought for our freedom are losing their lives, not at the hands of terrorists or enemy combatants, but from friendly fire in the form of medical malpractice and neglect by the Department of Veterans Affairs (VA).

Split-second medical decisions in a war zone or in an emergency room can mean the difference between life and death. Yet at the VA, the urgency of the battlefield is lost in the lethargy of the bureaucracy. Veterans wait months just to see a doctor and the Department has systemically covered up delays and deaths they have caused. For decades, the Department has struggled to deliver timely care to veterans.

The reason veterans care has suffered for so long is Congress has failed to hold the VA accountable. Despite years of warnings from government investigators about efforts to cook the books, it took the unnecessary deaths of veterans denied care from Atlanta to Phoenix to prompt Congress to finally take action. On June 11, 2014, the Senate approved a bipartisan bill to allow veterans who cannot receive a timely doctor’s appointment to go to another doctor outside of the VA.

But the problems at the VA are far deeper than just scheduling. After all, just getting to see a doctor does not guarantee appropriate treatment. Veterans in Boston receive top-notch care, while those treated in Phoenix suffer from subpar treatment. Over the past decade, more than 1,000 veterans may have died as a result of VA malfeasance, and the VA has paid out nearly $1 billion to veterans and their families for its medical malpractice.

The waiting list cover-ups and uneven care are reflective of a much larger culture within the VA, where administrators manipulate both data and employees to give an appearance that all is well.

I am digesting the full report but I’m not sure enabling veterans to see doctors outside the VA is the same thing as holding the VA “accountable.”

From the early reports in this growing tragedy, there appear to be any number of “dark places” where data failed to be collected, where data was altered, or where the VA simply refused to collect data that might have driven better oversight.

I don’t think the VA is unique in any of these practices so mapping what is known, what could have been known and dark places in the VA data flow, could be informative both for the VA and other agencies as well.

I first saw this at Full Text Reports, Beyond the Waiting Lists, New Senate Report Reveals a Culture of Crime, Cover-Up and Coercion within the VA.

Congress.gov Adds New Features

Tuesday, June 24th, 2014

Congress.gov Adds New Features

From the post:

  • User Accounts & Saved Searches: Users have the option of creating a private account that lets them save their personal searches. The feature gives users a quick and easy index from which to re-run their searches for new and updated information.
  • Congressional Record Search-by-Speaker: New metadata has been added to the Congressional Record that enables searching the daily transcript of congressional floor action by member name from 2009 – present. The member profile pages now also feature a link that returns a list of all Congressional Record articles in which that member was speaking.
  • Nominations: Users can track presidential nominees from appointment to hearing to floor votes with the new nominations function. The data goes back to 1981 and features faceted search, like the rest of Congress.gov, so users can narrow their searches by congressional session, type of nomination and status.

Other updates include expanded “About” and “Frequently Asked Questions” sections and the addition of committee referral and committee reports to bill-search results.

The website describes itself as: Congress.gov is the official source for federal legislative information. A collaboration among the Library of Congress, the U.S. Senate, the U.S. House of Representatives and the Government Printing Office, Congress.gov is a free resource that provides searchable access to bill status and summary, bill text, member profiles, the Congressional Record, committee reports, direct links from bills to cost estimates from the Congressional Budget Office, legislative process videos, committee profile pages and historic access reaching back to the 103rd Congress.

Before you get too excited, the 103rd Congress was in session 1993-1994. A considerable amount of material but far from complete.

The utility of topic maps is easy to demonstrate with the increased ease of tracking presidential nominations.

Rather than just tracking a bald nomination, wouldn’t it be handy to have all the political donations made by the nominee from the FEC? Or for that matter, their “friend” graph that shows their relationships with members of the president’s insider group?

All of that is easy enough to find, but then every searcher has to find the same information. If it were found and presented with the nominee, then other users would not have to work to re-find the information.
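A toy sketch of that merge in Python: pairing nominations with FEC donation records on the nominee’s name. Both datasets here are invented, and real matching would take far more than exact name equality, but it shows the find-once, present-with-the-nominee idea:

```python
# Hypothetical data throughout: a nominations list and FEC-style
# donation records, joined on the nominee's name so the donation
# history travels with the nomination record.
nominations = [
    {"nominee": "Jane Roe", "position": "District Judge"},
    {"nominee": "John Poe", "position": "Ambassador"},
]
donations = [
    {"contributor": "Jane Roe", "amount": 2500, "recipient": "Some PAC"},
]

enriched = []
for nom in nominations:
    gifts = [d for d in donations if d["contributor"] == nom["nominee"]]
    enriched.append({**nom, "donations": gifts})

print(enriched[0]["donations"][0]["amount"])  # 2500
```

In topic map terms, the nominee is the subject and the donations become properties of that subject, found once and reused by every later searcher.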


I first saw this in New Features Added to Congress.gov by Africa S. Hands.

Feds and Big Data

Thursday, June 12th, 2014

Federal Agencies and the Opportunities and Challenges of Big Data by Nicole Wong.

June 19, 2014
1:00 pm – 5:00 pm

From the post:

On June 19, the Obama Administration will continue the conversation on big data as we co-host our fourth big data conference, this time with the Georgetown University McCourt School of Public Policy’s Massive Data Institute.  The conference, “Improving Government Performance in the Era of Big Data: Opportunities and Challenges for Federal Agencies”, will build on prior workshops at MIT, NYU, and Berkeley, and continue to engage both subject matter experts and the public in a national discussion about the future of data innovation and policy.

Drawing from the recent White House working group report, Big Data: Seizing Opportunities, Preserving Values, this event will focus on the opportunities and challenges posed by Federal agencies’ use of data, best practices for sharing data within and between agencies and other partners, and measures the government may use to ensure the protection of privacy and civil liberties in a big data environment.

You can find more information about the workshop and the webcast here.

We hope you will join us!

Nicole Wong is U.S. Deputy Chief Technology Officer at the White House Office of Science & Technology Policy

From approximately 1:30 to 2:25 p.m.: Panel One: Open Data and Information Sharing, moderated by Nick Sinai, Deputy U.S. Chief Technology Officer.

Could be some useful intelligence on how sharing of data is viewed now. Perhaps you could throttle back a topic map to be just a little ahead of where agencies are now. So it would not look like such a big step.


New Data Sets Available in Census Bureau API

Monday, June 9th, 2014

New Data Sets Available in Census Bureau API

From the post:

Today the U.S. Census Bureau added several data sets to its application programming interface, including 2013 population estimates and 2012 nonemployer statistics.

The Census Bureau API allows developers to create a variety of apps and tools, such as ones that allow homebuyers to find detailed demographic information about a potential new neighborhood. By combining Census Bureau statistics with other data sets, developers can create tools for researchers to look at a variety of topics and how they impact a community.

Data sets now available in the API are:

  • July 1, 2013, national, state, county and Puerto Rico population estimates
  • 2012-2060 national population projections
  • 2007 Economic Census national, state, county, place and region economy-wide key statistics
  • 2012 Economic Census national economy-wide key statistics
  • 2011 County Business Patterns at the national, state and county level (2012 forthcoming)
  • 2012 national, state and county nonemployer statistics (businesses without paid employees)

The API also includes three decades (1990, 2000 and 2010) of census statistics and statistics from the American Community Survey covering one-, three- and five-year periods of data collection. Developers can access the API online and share ideas through the Census Bureau’s Developers Forum. Developers can use the Discovery Tool to examine the variables available in each dataset.

In case you are looking for census data to crunch!
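A request against these data sets is just a URL. The api.census.gov host is real; the dataset path and variable names below are assumptions for illustration — use the Discovery Tool mentioned above to find the ones you need, and note the API may also require a key parameter:

```python
# Sketch: composing a Census Bureau API request for population
# estimates. The dataset path and the STNAME/POP variables are
# assumed for illustration; check the Discovery Tool for real ones.
from urllib.parse import urlencode

base = "https://api.census.gov/data/2013/pep/natstprc"  # assumed path
params = {"get": "STNAME,POP", "for": "state:*"}
url = f"{base}?{urlencode(params)}"
print(url)
```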


UK Houses of Parliament launches Open Data portal

Friday, June 6th, 2014

UK Houses of Parliament launches Open Data portal

From the webpage:

Datasets related to the UK Houses of Parliament are now available via data.parliament.uk – the institution’s new dedicated Open Data portal.

Site developers are currently seeking feedback on the portal ahead of the next release, details of how to get in touch can be found by clicking here.

From the alpha release of the portal:

Welcome to the first release of data.parliament.uk – the home of Open Data from the UK Houses of Parliament. This is an alpha release and contains a limited set of features and data. We are seeking feedback from users about the platform and the data on it so please contact us.

I would have to agree that the portal presently contains “limited data.” ;-)

What would be helpful for non-U.K. data miners, as well as ones in the U.K., would be some sense of what data is available.

A PDF file listing data that is currently maintained on the UK Houses of Parliament, their members, record of proceedings, transcripts, etc. would be a good starting point.

Pointers anyone?


OpenFDA

Monday, June 2nd, 2014


Not all the news out of government is bad.

Consider openFDA which is putting

More than 3 million adverse drug event reports at your fingertips.

From the “about” page:

OpenFDA is an exciting new initiative in the Food and Drug Administration’s Office of Informatics and Technology Innovation spearheaded by FDA’s Chief Health Informatics Officer. OpenFDA offers easy access to FDA public data and highlights projects using these data in both the public and private sector to further regulatory or scientific missions, educate the public, and save lives.

What does it do?

OpenFDA provides API and raw download access to a number of high-value structured datasets. The platform is currently in public beta with one featured dataset, FDA’s publicly available drug adverse event reports.

In the future, openFDA will provide a platform for public challenges issued by the FDA and a place for the community to interact with each other and FDA domain experts with the goal of spurring innovation around FDA data.

We’re currently focused on working on datasets in the following areas:

  • Adverse Events: FDA’s publicly available drug adverse event reports, a database that contains millions of adverse event and medication error reports submitted to FDA covering all regulated drugs.
  • Recalls (coming soon): Enforcement Report and Product Recalls Data, containing information gathered from public notices about certain recalls of FDA-regulated products
  • Documentation (coming soon): Structured Product Labeling Data, containing detailed product label information on many FDA-regulated products

We’ll be releasing a number of updates and additional datasets throughout the upcoming months.

OK, I’m Twitter follower #522 @openFDA.

What’s your @openFDA number?

A good experience, i.e., people making good use of released data, asking for more data, etc., is what will drive more open data. Make every useful government data project count.

A New Nation Votes

Thursday, May 15th, 2014

A New Nation Votes: American Election Returns 1787-1825

From the webpage:

A New Nation Votes is a searchable collection of election returns from the earliest years of American democracy. The data were compiled by Philip Lampi. The American Antiquarian Society and Tufts University Digital Collections and Archives have mounted it online for you with funding from the National Endowment for the Humanities.

Currently there are 18,040 elections that have been digitized.

Interesting data set and certainly one that could be supplemented with all manner of other materials.

Among other things, the impact or lack thereof from extension of the voting franchise would make an interesting study.



EnviroAtlas

Monday, May 12th, 2014


From the homepage:

What is EnviroAtlas?

EnviroAtlas is a collection of interactive tools and resources that allows users to explore the many benefits people receive from nature, often referred to as ecosystem services.

Why is EnviroAtlas useful?

Though critically important to human well-being, ecosystem services are often overlooked. EnviroAtlas seeks to measure and communicate the type, quality, and extent of the goods and services that humans receive from nature so that their true value can be considered in decision-making processes.

Using EnviroAtlas, many types of users can access, view, and analyze diverse information to better understand how various decisions can affect an array of ecological and human health outcomes. EnviroAtlas is available to the public and houses a wealth of data and research.

EnviroAtlas integrates over 300 data layers listed in: Available EnviroAtlas data.

News about the cockroaches infesting the United States House/Senate makes me forget there are agencies laboring to provide benefits to citizens.

Whether this environmental goldmine will be enough to result in a saner environmental policy remains to be seen.

I first saw this in a tweet by Margaret Palmer.

R Client for the U.S. Federal Register API

Thursday, May 8th, 2014

R Client for the U.S. Federal Register API by Thomas Leeper.

From the webpage:

This package provides access to the API for the United States Federal Register. The API provides access to all Federal Register contents since 1994, including Executive Orders by Presidents Clinton, Bush, and Obama and all “Public Inspection” Documents made available prior to publication in the Register. The API returns basic details about each entry in the Register and provides URLs for HTML, PDF, and plain text versions of the contents thereof, and the data are fully searchable. The federalregister package provides access to all version 1 API endpoints.

If you are interested in law, policy development, or just general awareness of government activity, this is an important client for you!

More than 30 years ago I had a hard copy subscription to the Federal Register. Even then it was a mind numbing amount of detail. Today it is even worse.

This API enables any number of business models based upon quick access to current and historical Federal Register data.
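You don’t need R to hit the same API the federalregister package wraps; here is a sketch of a v1 documents search URL in Python. The conditions[term] and per_page parameters follow the Federal Register API documentation — check the current docs before relying on them:

```python
# Sketch: a Federal Register API v1 documents search URL.
# Parameter names per the API docs (verify against current docs);
# the search term is just an example value.
from urllib.parse import urlencode

base = "https://www.federalregister.gov/api/v1/documents.json"
params = {
    "conditions[term]": "executive order",
    "per_page": 10,
}
url = f"{base}?{urlencode(params)}"
print(url)
```

The JSON response carries the basic details and the HTML/PDF/plain-text URLs the post describes.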


IRS Data?

Wednesday, April 9th, 2014

New, Improved IRS Data Available on OpenSecrets.org by Robert Maguire.

From the post:

Among the more than 160,000 comments the IRS received recently on its proposed rule dealing with candidate-related political activity by 501(c)(4) organizations, the Center for Responsive Politics was the only organization to point to deficiencies in a critical data set the IRS makes available to the public.

This month, the IRS released the newest version of that data, known as 990 extracts, which have been improved considerably. Now, the data is searchable and browsable on OpenSecrets.org.

“Abysmal” IRS data

Back in February, CRP had some tough words for the IRS concerning the information. In the closing pages of our comment on the agency’s proposed guidelines for candidate-related political activity, we wrote that “the data the IRS provides to the public — and the manner in which it provides it — is abysmal.”

While I am glad to see better access to 501(c) 990 data, in a very real sense, this isn’t “IRS data” is it?

This is data that the government collected under penalty of law from tax entities in the United States.

Granting it was sent in “voluntarily” but there is a lot of data that entities and individuals send to local, state and federal government “voluntarily.” Not all of it is data that most of us would want handed out because other people are curious.

As I said, I like better access to 990 data but we need to distinguish between:

  1. Government sharing data it collected from citizens or other entities, and
  2. Government sharing data about government meetings, discussions, contacts with citizens/contractors, policy making, processes and the like.

If I’m not seriously mistaken, most of the open data from government involves a great deal of #1 and very little of #2.

Is that your impression as well?

One quick example. The United States Congress, with some reluctance, seems poised to deliver near real-time information on legislative proposals before Congress. Which is a good thing.

But there has been no discussion of tracking the final editing of bills to trace the insertion or deletion of language, by whom and with whose agreement. Which is a bad thing.

It makes no difference how public the process is up to final edits, if the final version is voted upon before changes can be found and charged to those responsible.

USGS Maps!

Tuesday, March 25th, 2014

USGS Maps (Google Map Gallery)

Wicked cool!

Followed a link from this post:

Maps were made for public consumption, not for safekeeping under lock and key. From the dawn of society, people have used maps to learn what’s around us, where we are and where we can go.

Since 1879, the U.S. Geological Survey (USGS) has been dedicated to providing reliable scientific information to better understand the Earth and its ecosystems. Mapping is an integral part of what we do. From the early days of mapping on foot in the field to more modern methods of satellite photography and GPS receivers, our scientists have created over 193,000 maps to understand and document changes to our environment.

Government agencies and NGOs have long used our maps for everything from community planning to finding hiking trails. Farmers depend on our digital elevation data to help them produce our food. Historians look to our maps from years past to see how the terrain and built environment have changed over time.

While specific groups use USGS as a resource, we want the public at-large to find and use our maps, as well. The content of our maps—the information they convey about our land and its heritage—belongs to all Americans. Our maps are intended to serve as a public good. The more taxpayers use our maps and the more use they can find in the maps, the better.

We recognize that our expertise lies in mapping, so partnering with Google, which has expertise in Web design and delivery, is a natural fit. Google Maps Gallery helps us organize and showcase our maps in an efficient, mobile-friendly interface that’s easy for anyone to find what they’re looking for. Maps Gallery not only publishes USGS maps in high-quality detail, but makes it easy for anyone to search for and discover new maps.

My favorite line:

Maps were made for public consumption, not for safekeeping under lock and key.

Very true. Equally true for all the research and data that is produced at the behest of the government.


Sunday, March 23rd, 2014


From the about page:

The DARPA Open Catalog is a list of DARPA-sponsored open source software products and related publications. Each resource link shown on this site links back to information about each project, including links to the code repository and software license information.

This site reorganizes the resources of the Open Catalog (specifically the XDATA program) in a way that is easily sortable based on language, project or team. More information about XDATA’s open source software toolkits and peer-reviewed publications can be found on the DARPA Open Catalog, located at

For more information about this site, e-mail us at

A great public service for anyone interested in DARPA XDATA projects.

You could view this as encouragement to donate time to government hackathons.

I disagree.

Donating services to an organization that pays for IT and then accepts crap results, encourages poor IT management.

Possible Elimination of FR and CFR indexes (Pls Read, Forward, Act)

Saturday, March 22nd, 2014

Possible Elimination of FR and CFR indexes

I don’t think I have ever posted with (Pls Read, Forward, Act) in the headline, but this merits it.

From the post:

Please see the following message from Emily Feltren, Director of Government Relations for AALL, and contact her if you have any examples to share.

Hi Advocates—

Last week, the House Oversight and Government Reform Committee reported out the Federal Register Modernization Act (HR 4195). The bill, introduced the night before the mark up, changes the requirement to print the Federal Register and Code of Federal Regulations to “publish” them, eliminates the statutory requirement that the CFR be printed and bound, and eliminates the requirement to produce an index to the Federal Register and CFR. The Administrative Committee of the Federal Register governs how the FR and CFR are published and distributed to the public, and will continue to do so.

While the entire bill is troubling, I most urgently need examples of why the Federal Register and CFR indexes are useful and how you use them. Stories in the next week would be of the most benefit, but later examples will help, too. I already have a few excellent examples from our Print Usage Resource Log – thanks to all of you who submitted entries! But the more cases I can point to, the better.

Interestingly, the Office of the Federal Register itself touted the usefulness of its index when it announced the retooled index last year:

Thanks in advance for your help!

Emily Feltren
Director of Government Relations

American Association of Law Libraries

25 Massachusetts Avenue, NW, Suite 500

Washington, D.C. 20001


This is seriously bad news so I decided to look up the details.

Federal Register

Title 44, Section 1504 Federal Register, currently reads in part:

Documents required or authorized to be published by section 1505 of this title shall be printed and distributed immediately by the Government Printing Office in a serial publication designated the ”Federal Register.” The Public Printer shall make available the facilities of the Government Printing Office for the prompt printing and distribution of the Federal Register in the manner and at the times required by this chapter and the regulations prescribed under it. The contents of the daily issues shall be indexed and shall comprise all documents, required or authorized to be published, filed with the Office of the Federal Register up to the time of the day immediately preceding the day of distribution fixed by regulations under this chapter. (emphasis added)

By comparison, H.R. 4195 — 113th Congress (2013-2014) reads in relevant part:

The Public Printer shall make available the facilities of the Government Printing Office for the prompt publication of the Federal Register in the manner and at the times required by this chapter and the regulations prescribed under it. (Missing index language here.) The contents of the daily issues shall constitute all documents, required or authorized to be published, filed with the Office of the Federal Register up to the time of the day immediately preceding the day of publication fixed by regulations under this chapter.

Code of Federal Regulations (CFRs)

Title 44, Section 1510 Code of Federal Regulations, currently reads in part:

(b) A codification published under subsection (a) of this section shall be printed and bound in permanent form and shall be designated as the ”Code of Federal Regulations.” The Administrative Committee shall regulate the binding of the printed codifications into separate books with a view to practical usefulness and economical manufacture. Each book shall contain an explanation of its coverage and other aids to users that the Administrative Committee may require. A general index to the entire Code of Federal Regulations shall be separately printed and bound. (emphasis added)

By comparison, H.R. 4195 — 113th Congress (2013-2014) reads in relevant part:

(b) Code of Federal Regulations.–A codification prepared under subsection (a) of this section shall be published and shall be designated as the `Code of Federal Regulations’. The Administrative Committee shall regulate the manner and forms of publishing this codification. (Missing index language here.)

I would say that indexes for the Federal Register and the Code of Federal Regulations are history should this bill pass as written.

Is this a problem?

Consider the task of tracking the number of pages in the Federal Register versus the pages in the Code of Federal Regulations that may be impacted:

Federal Register – > 70,000 pages per year.

The page count for final general and permanent rules in the 50-title CFR seems less dramatic than that of the oft-cited Federal Register, which now tops 70,000 pages each year (it stood at 79,311 pages at year-end 2013, the fourth-highest level ever). The Federal Register contains lots of material besides final rules. (emphasis added) (New Data: Code of Federal Regulations Expanding, Faster Pace under Obama by Wayne Crews.)

Code of Federal Regulations – 175,496 pages (2013), including a 1,170-page index.

Now, new data from the National Archives shows that the CFR stands at 175,496 at year-end 2013, including the 1,170-page index. (emphasis added) (New Data: Code of Federal Regulations Expanding, Faster Pace under Obama by Wayne Crews.)

The bottom line: 175,496 pages of regulations, being amended by a week-day publication that adds more than 70,000 pages per year.
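A quick sanity check on those numbers, using only the figures quoted above, shows just how little the index costs relative to what it covers:

```python
# Figures from the Crews data quoted above.
cfr_total_pages = 175_496   # CFR at year-end 2013, including the index
cfr_index_pages = 1_170     # the general index the bill would eliminate
fr_pages_2013 = 79_311      # Federal Register pages for 2013

# The index is a tiny fraction of the CFR it makes navigable.
index_share = cfr_index_pages / cfr_total_pages
print(f"Index share of CFR: {index_share:.2%}")

# Pages of regulation each index page helps a reader reach.
print(f"One index page per {cfr_total_pages // cfr_index_pages} CFR pages")
```

Less than one percent of the page count, gone, and with it the only systematic way into the other ninety-nine percent.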

We don’t need indexes to access that material?

Congress, I don’t think “access” means what you think it means.

PS: As a research guide, you are unlikely to do better than: A Research Guide to the Federal Register and the Code of Federal Regulations by Richard J. McKinney at the Law Librarians’ Society of Washington, DC website.

I first saw this in a tweet by Aaron Kirschenfeld.

UK statistics and open data…

Tuesday, March 18th, 2014

UK statistics and open data: MPs’ inquiry report published Owen Boswarva.

From the post:

This morning the Public Administration Select Committee (PASC), a cross-party group of MPs chaired by Bernard Jenkin, published its report on Statistics and Open Data.

This report is the product of an inquiry launched in July 2013. Witnesses gave oral evidence in three sessions; you can read the transcripts and written evidence as well.

Useful if you are looking for rhetoric and examples of use of government data.

Ironic that just last week the news broke that Google has given British security the power to censor “unsavoury” (but legal) content from YouTube. UK gov wants to censor legal but “unsavoury” YouTube content by Lisa Vaas.

Lisa writes:

Last week, the Financial Times revealed that Google has given British security the power to quickly yank terrorist content offline.

The UK government doesn’t want to stop there, though – what it really wants is the power to pull “unsavoury” content, regardless of whether it’s actually illegal – in other words, it wants censorship power.

The news outlet quoted UK’s security and immigration minister, James Brokenshire, who said that the government must do more to deal with material “that may not be illegal but certainly is unsavoury and may not be the sort of material that people would want to see or receive.”

I’m not sure why the UK government wants to block content that people don’t want to see or receive. They simply won’t look at it. Yes?

But, intellectual coherence has never been a strong point of most governments and the UK in particular of late.

Is this more evidence for my contention that “open data” for government means only the data government wants you to have?

The FIRST Act, Retro Legislation?

Tuesday, March 11th, 2014

Language in FIRST act puts United States at Severe Disadvantage Against International Competitors by Ranit Schmelzer.

From the press release:

The Scholarly Publishing and Academic Research Coalition (SPARC), an international alliance of nearly 800 academic and research libraries, today announced its opposition to Section 303 of H.R. 4186, the Frontiers in Innovation, Research, Science and Technology (FIRST) Act. This provision would impose significant barriers to the public’s ability to access the results of taxpayer-funded research.

Section 303 of the bill would undercut the ability of federal agencies to effectively implement the widely supported White House Directive on Public Access to the Results of Federally Funded Research and undermine the successful public access program pioneered by the National Institutes of Health (NIH) – recently expanded through the FY14 Omnibus Appropriations Act to include the Departments Labor, Education and Health and Human Services. Adoption of Section 303 would be a step backward from existing federal policy in the directive, and put the U.S. at a severe disadvantage among our global competitors.

“This provision is not in the best interests of the taxpayers who fund scientific research, the scientists who use it to accelerate scientific progress, the teachers and students who rely on it for a high-quality education, and the thousands of U.S. businesses who depend on public access to stay competitive in the global marketplace,” said Heather Joseph, SPARC Executive Director. “We will continue to work with the many bipartisan members of the Congress who support open access to publicly funded research to improve the bill.”

[the parade of horribles follows]

SPARC‘s press release never quotes a word from H.R. 4186. Not one. Commentary, but nary a word of its object.

I searched at Thomas (the Congressional information service at the Library of Congress) for H.R. 4186 and came up empty by bill number. Switching to the Congressional Record for Monday, March 10, 2014, I did find the bill being introduced and the setting of a hearing on it. The GPO has not (as of today) posted the text of H.R. 4186, but when it does, follow this link: H.R. 4186.

Even more importantly, SPARC doesn’t point out who is responsible for the objectionable section appearing in the bill. Bills don’t write themselves and as far as I know, Congress doesn’t have a random bill generator.

The bottom line is that someone, an identifiable someone, asked for longer embargo wording to be included. If the SPARC press release is accurate, the most likely candidates are Chairman Lamar Smith (R-TX 21st District) or Rep. Larry Bucshon (R-IN 8th District).

The Wikipedia page on Indiana’s 8th Congressional District needs to be updated, but the district covers southwestern Indiana, including Evansville. You might want to check Bucshon‘s page at Wikipedia and the links there to other resources.

Wikipedia on the 21st Congressional District of Texas places it north of San Antonio, the seventh largest city in the United States. Lamar Smith‘s page at Wikipedia makes for some interesting reading.

Odds are that in and around Evansville and San Antonio there are people interested in longer embargo periods on federally funded research.

Those are at least some starting points for effective opposition to this legislation, assuming it was reported accurately by SPARC. Let’s drop the pose of disinterested legislators trying valiantly to serve the public good. Not impossible, just highly unlikely. Let’s argue about who is getting paid and for what benefits.

Or as Captain Ahab advises:

All visible objects, man, are but as pasteboard masks. But in each event –in the living act, the undoubted deed –there, some unknown but still reasoning thing puts forth the mouldings of its features from behind the unreasoning mask. If man will strike, strike through the mask! [Melville, Moby Dick, Chapter XXXVI]

Legislation as a “pasteboard mask” is a useful image. There is not a contour, dimple, shade or expression that wasn’t bought and paid for by someone. You have to strike through the mask to discover who.

Are you game?

PS: Curious, where would you go next (data wise, I don’t have the energy to lurk in garages) in terms of searching for the buyers of longer embargoes in H.R. 4186?

Visualising UK Ministerial Lobbying…

Thursday, March 6th, 2014

Visualising UK Ministerial Lobbying & “Buddying” Over Eight Months by Roland Dunn.

From the post:


[This is a companion piece to our visualisation of ministerial lobbying – open it up and take a look!].

Eight Months Worth of Lobbying Data

Turns out that James Ball, together with the folks at Who’s Lobbying had collected together all the data regarding ministerial meetings from all the different departments across the UK’s government (during May to December 2010), tidied the data up, and put them together in one spreadsheet:

It’s important to understand that despite the current UK government stating that it is the most open and transparent ever, each department publishes its ministerial meetings in ever so slightly different formats. On that page for example you can see Dept of Health Ministerial gifts, hospitality, travel and external meetings January to March 2013, and DWP ministers’ meetings with external organisations: January to March 2013. Two lists containing slightly different sets of data. So, the work that Who’s Lobbying and James Ball did in tallying this data up is considerable. But not many people have the time to tie such data-sets together, meaning the data contained in them is somewhat more opaque than you might at first be led to believe. What’s needed is one pan-governmental set of data.

An example to follow in making “open” data a bit more “transparent.”

Not entirely transparent for as the author notes, minutes from the various meetings are not available.

Or I suppose when minutes are available, their completeness would be questionable.
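The “tying data-sets together” work described above is mostly schema mapping: translating each department’s slightly different column names into one pan-governmental record shape. A minimal Python sketch of that idea, with invented sample rows and column names (the real headers vary by department):

```python
import csv
import io

# Two departments publishing "the same" data under different headers.
# These sample rows and column names are invented for illustration.
health_csv = """Minister,Date of meeting,Organisation,Purpose
Jane Smith,2010-06-01,Acme Pharma,Drug pricing
"""
dwp_csv = """Name of Minister,Date,External organisation,Purpose of meeting
John Jones,2010-06-02,Widget Lobby,Benefits reform
"""

# One mapping per department, onto a single shared schema.
COLUMN_MAPS = {
    "health": {"Minister": "minister", "Date of meeting": "date",
               "Organisation": "org", "Purpose": "purpose"},
    "dwp": {"Name of Minister": "minister", "Date": "date",
            "External organisation": "org", "Purpose of meeting": "purpose"},
}

def normalise(raw_csv, dept):
    """Yield each row re-keyed into the shared schema."""
    mapping = COLUMN_MAPS[dept]
    for row in csv.DictReader(io.StringIO(raw_csv)):
        yield {mapping[k]: v for k, v in row.items()}

meetings = list(normalise(health_csv, "health")) + list(normalise(dwp_csv, "dwp"))
print(meetings[0]["org"])  # Acme Pharma
```

Trivial for two departments; the considerable work is writing and maintaining a mapping for every department, every quarter, every time a format quietly changes.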

I first saw this in a tweet by Steve Peters.

Data Science – Chicago

Monday, March 3rd, 2014

OK, I shortened the headline.

The full headline reads: Accenture and MIT Alliance in Business Analytics launches data science challenge in collaboration with Chicago: New annual contest for MIT students to recognize best data analytics and visualization ideas.: The Accenture and MIT Alliance in Business Analytics

Don’t try that without coffee in the morning.

From the post:

The Accenture and MIT Alliance in Business Analytics have launched an annual data science challenge for 2014 that is being conducted in collaboration with the city of Chicago.

The challenge invites MIT students to analyze Chicago’s publicly available data sets and develop data visualizations that will provide the city with insights that can help it better serve residents, visitors, and businesses. Through data visualization, or visual renderings of data sets, people with no background in data analysis can more easily understand insights from complex data sets.

The headline is longer than the first paragraph of the story.

I didn’t see an explanation for why the challenge is limited to:

The challenge is now open and ends April 30. Registration is free and open to active MIT students 18 and over (19 in Alabama and Nebraska). Register and see the full rule here:

Find a sponsor and set up an annual data mining challenge for your school or organization.

Although I would suggest you take a pass on Bogotá, Mexico City, Rio de Janeiro, Moscow, Washington, D.C. and similar places where truthful auditing could be hazardous to your health.

Or as one of my favorite Dilbert cartoons had the pointy-haired boss observing:

When you find a big pot of crazy it’s best not to stir it.

One Thing Leads To Another (#NICAR2014)

Sunday, March 2nd, 2014

A tweet this morning read:

overviewproject ‏@overviewproject 1h
.@djournalismus talking about handling 2.5 million offshore leaks docs. Content equivalent to 50,000 bibles. #NICAR14

That sounds interesting! Can’t ever tell when a leaked document will prove useful. But where to find this discussion?

Following #NICAR14 leaves you with the impression this is a conference. (I didn’t recognize the hashtag immediately.)

Searching the web, the hashtag led me to: 2014 Computer-Assisted Reporting Conference. (NICAR = National Institute for Computer-Assisted Reporting)

The handle @djournalismus offers the name Sebastian Mondial.

Checking the speakers list, I found this presentation:

Inside the global offshore money maze
Event: 2014 CAR Conference
Speakers: David Donald, Mar Cabra, Margot Williams, Sebastian Mondial
Date/Time: Saturday, March 1 at 2 p.m.
Location: Grand Ballroom West
Audio file: No audio file available.

The International Consortium of Investigative Journalists “Secrecy For Sale: Inside The Global Offshore Money Maze” is one of the largest and most complex cross-border investigative projects in journalism history. More than 110 journalists in about 60 countries analyzed a 260 GB leaked hard drive to expose the systematic use of tax havens. Learn how this multinational team mined 2.5 million files and cracked open the impenetrable offshore world by creating a web app that revealed the ownership behind more than 100,000 anonymous “shell companies” in 10 offshore jurisdictions.
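At its core, the “offshore money maze” is a graph problem: officers and shell companies are nodes, ownership records are edges, and exposing a network means walking the links. A toy Python sketch of that traversal, with all names invented:

```python
from collections import defaultdict, deque

# Toy ownership records as (owner, owned entity) pairs.
# All names here are invented for illustration.
records = [
    ("A. Nominee", "Shell Co 1"),
    ("Shell Co 1", "Shell Co 2"),
    ("Shell Co 2", "Offshore Trust"),
    ("B. Director", "Shell Co 2"),
]

# Adjacency list: owner -> entities it owns or controls.
edges = defaultdict(list)
for owner, entity in records:
    edges[owner].append(entity)

def reachable(start):
    """Every entity reachable from `start` through ownership links (BFS)."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(sorted(reachable("A. Nominee")))
# ['Offshore Trust', 'Shell Co 1', 'Shell Co 2']
```

Scale that from four records to 2.5 million leaked files across 10 jurisdictions and you have the shape, if not the difficulty, of what the ICIJ team built.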

Along the way I discovered the speakers list, who cover a wide range of subjects of interest to anyone mining data.

Another treasure is the Tip Sheets and Tutorial page. Here are six (6) selections out of sixty-one (61) items to pique your interest:

  • Follow the Fracking
  • Maps and charts in R: real newsroom examples
  • Wading through the sea of data on hospitals, doctors, medicine and more
  • Free the data: Getting government agencies to give up the goods
  • Campaign Finance I: Mining FEC data
  • Danger! Hazardous materials: Using data to uncover pollution

Not to mention that NICAR2012 and NICAR2013 are also accessible from the NICAR2014 page, with their own “tip” listings.

If you find this type of resource useful, be sure to check out Investigative Reporters and Editors (IRE).

About the IRE:

Investigative Reporters and Editors, Inc. is a grassroots nonprofit organization dedicated to improving the quality of investigative reporting. IRE was formed in 1975 to create a forum in which journalists throughout the world could help each other by sharing story ideas, newsgathering techniques and news sources.

IRE provides members access to thousands of reporting tip sheets and other materials through its resource center and hosts conferences and specialized training throughout the country. Programs of IRE include the National Institute for Computer Assisted Reporting, DocumentCloud and the Campus Coverage Project.

Learn more about joining IRE and the benefits of membership.

Sounds like a win-win offer to me!


SEC Filings for Humans

Tuesday, February 25th, 2014

SEC Filings for Humans by Meris Jensen.

After a long and sad history of the failure of the SEC to make EDGAR useful:

Rank and Filed gathers data from EDGAR, indexes it, and returns it in formats meant to help investors research, investigate and discover companies on their own. I started googling ‘How to build a website’ seven months ago. The SEC has web developers, software developers, database administrators, XBRL experts, legions of academics who specialize in SEC filings, and all this EDGAR data already cached in the cloud. The Commission’s mission is to protect investors, maintain fair, orderly and efficient markets, and facilitate capital formation. Why did I have to build this? (emphasis added)

I don’t know the answer to Meris’ question but I can tell you that Rank and Filed is an incredible resource for financial information.

And yet another demonstration that government should not manage open data. Make it available. (full stop)

I first saw this at Nathan Yau’s A human-readable explorer for SEC filings.

R Markdown:… [Open Analysis, successor to Open Data?]

Tuesday, February 25th, 2014

R Markdown: Integrating A Reproducible Analysis Tool into Introductory Statistics by Ben Baumer.


Nolan and Temple Lang argue that “the ability to express statistical computations is an essential skill.” A key related capacity is the ability to conduct and present data analysis in a way that another person can understand and replicate. The copy-and-paste workflow that is an artifact of antiquated user-interface design makes reproducibility of statistical analysis more difficult, especially as data become increasingly complex and statistical methods become increasingly sophisticated. R Markdown is a new technology that makes creating fully-reproducible statistical analysis simple and painless. It provides a solution suitable not only for cutting edge research, but also for use in an introductory statistics course. We present evidence that R Markdown can be used effectively in introductory statistics courses, and discuss its role in the rapidly-changing world of statistical computation. (emphasis in original)

I would have made the author’s third point about R Markdown the first:

Third, the separation of computing from presentation is not necessarily honest… More subtly and less perniciously, the copy-and-paste paradigm enables, and in many cases even encourages, selective reporting. That is, the tabular output from R is admittedly not of presentation quality. Thus the student may be tempted or even encouraged to prettify tabular output before submitting. But while one is fiddling with margins and headers, it is all too tempting to remove rows or columns that do not suit the student’s purpose. Since the commands used to generate the table are not present, the reader is none the wiser.

I have to admit, though, that reproducibility has a lot going for it.
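The reproducibility idea can be sketched outside R, too: generate the presentation table directly from the data in one script, so nothing can be quietly dropped between analysis and report. A minimal Python illustration, with toy figures:

```python
# The published table is generated from the raw data in one step,
# never copied and pasted, so every row the data contains appears.
# (Toy agency names and outlay figures, for illustration only.)
data = {"Agency A": 41.2, "Agency B": 7.9, "Agency C": 23.5}

def render_table(rows):
    """Render a Markdown table straight from the data."""
    lines = ["| Agency | Outlays ($B) |", "|--------|--------------|"]
    for name, value in sorted(rows.items()):
        lines.append(f"| {name} | {value:.1f} |")
    return "\n".join(lines)

print(render_table(data))
```

Because the table only exists as output of the code, a reader who reruns the script gets the same table, and a row can only vanish if the code visibly changes.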

Can you imagine reproducible analysis from the OMB? Complete with machine readable data sets? Or for any other agency reports. Or for that matter, for all publications by registered lobbyists. That could be real interesting.

Open Analysis (OA) as a natural successor to Open Data.

That works for me.


PS: More resources:

Create Dynamic R Statistical Reports Using R Markdown

R Markdown

Using R Markdown with RStudio

Writing papers using R Markdown

If journals started requiring R Markdown as a condition for publication, some aspects of research would become more transparent.

Some will say that authors will resist.

Assume Science or Nature has accepted your article on the condition that you use R Markdown.

Honestly, are you really going to say no?

I first saw this in a tweet by Scott Chamberlain.


Saturday, February 22nd, 2014

OpenRFPs: Open RFP Data for All 50 States by Clay Johnson.

From the post:

Tomorrow at CodeAcross we’ll be launching our first community-based project, OpenRFPs. The goal is to liberate the data inside of every state RFP listing website in the country. We hope you’ll find your own state’s RFP site, and contribute a parser.

The Department of Better Technology’s goal is to improve the way government works by making it easier for small, innovative businesses to provide great technology to government. But those businesses can barely make it through the front door when the RFPs themselves are stored in archaic systems, with sloppy user interfaces and disparate data formats, or locked behind paywalls.

I have posted to the announcement suggesting they use UBL. But in any event, mapping the semantics of RFPs, to enable wider participation would make an interesting project.
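Each state’s RFP site would need its own parser, but every parser can emit one shared record shape. A minimal Python sketch using only the standard library; the markup and field names are invented for illustration, since a real parser would be written against its state’s actual HTML:

```python
from html.parser import HTMLParser

# Invented sample of one state's RFP listing page.
SAMPLE = """
<table>
  <tr><td class="title">Road resurfacing RFP</td><td class="due">2014-04-01</td></tr>
  <tr><td class="title">IT services RFP</td><td class="due">2014-04-15</td></tr>
</table>
"""

class RFPParser(HTMLParser):
    """Collect each listing as a {'title': ..., 'due': ...} record."""
    def __init__(self):
        super().__init__()
        self.records = []
        self._field = None
        self._current = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "td" and cls in ("title", "due"):
            self._field = cls

    def handle_data(self, data):
        if self._field and data.strip():
            self._current[self._field] = data.strip()
            self._field = None
            if len(self._current) == 2:   # both fields seen: record complete
                self.records.append(self._current)
                self._current = {}

parser = RFPParser()
parser.feed(SAMPLE)
print(parser.records[0])  # {'title': 'Road resurfacing RFP', 'due': '2014-04-01'}
```

Fifty such parsers, one output format: that is the whole OpenRFPs proposition, and the per-state parser is the part volunteers can contribute piecemeal.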

I first saw this in a tweet by Tim O’Reilly.

Fiscal Year 2015 Budget (US) Open Government?

Friday, February 21st, 2014

Fiscal Year 2015 Budget

From the description:

Each year, the Office of Management and Budget (OMB) prepares the President’s proposed Federal Government budget for the upcoming Federal fiscal year, which includes the Administration’s budget priorities and proposed funding.

For Fiscal Year (FY) 2015– which runs from October 1, 2014, through September 30, 2015– OMB has produced the FY 2015 Federal Budget in four print volumes plus an all-in-one CD-ROM:

  1. the main “Budget” document with the Budget Message of the President, information on the President’s priorities and budget overviews by agency, and summary tables;
  2. “Analytical Perspectives” that contains analyses that are designed to highlight specified subject areas;
  3. “Historical Tables” that provides data on budget receipts, outlays, surpluses or deficits, and Federal debt over time;
  4. an “Appendix” with detailed information on individual Federal agency programs and appropriation accounts that constitute the budget.
  5. A CD-ROM version of the Budget is also available which contains all the FY 2015 budget documents in PDF format along with some additional supporting material in spreadsheet format.

You will also want a “Green Book,” the 2014 version carried this description:

Each February when the President releases his proposed Federal Budget for the following year, Treasury releases the General Explanations of the Administration’s Revenue Proposals. Known as the “Green Book” (or Greenbook), the document provides a concise explanation of each of the Administration’s Fiscal Year 2014 tax proposals for raising revenue for the Government. This annual document clearly recaps each proposed change, reviewing the provisions in the Current Law, outlining the Administration’s Reasons for Change to the law, and explaining the Proposal for the new law. Ideal for anyone wanting a clear summary of the Administration’s policies and proposed tax law changes.

Did I mention that the four volumes for the budget in print with CD-ROM are $250? And last year the Green Book was $75?

For $325.00, you can have print and PDF versions of the Budget plus a print copy of the Green Book.


  1. Would machine readable versions of the Budget + Green Book make it easier to explore and compare the information within?
  2. Are PDFs and print volumes what President Obama considers to be “open government?”
  3. Who has the advantage in policy debates, the OMB and Treasury with machine readable versions of these documents or the average citizen who has the PDFs and print?
  4. Do you think OMB and Treasury didn’t get the memo, Open Data Policy – Managing Information as an Asset?

Public policy debates cannot be fairly conducted without meaningful access to data on public policy issues.