Archive for the ‘Government Data’ Category

USGS Maps!

Tuesday, March 25th, 2014

USGS Maps (Google Map Gallery)

Wicked cool!

Followed a link from this post:

Maps were made for public consumption, not for safekeeping under lock and key. From the dawn of society, people have used maps to learn what’s around us, where we are and where we can go.

Since 1879, the U.S. Geological Survey (USGS) has been dedicated to providing reliable scientific information to better understand the Earth and its ecosystems. Mapping is an integral part of what we do. From the early days of mapping on foot in the field to more modern methods of satellite photography and GPS receivers, our scientists have created over 193,000 maps to understand and document changes to our environment.

Government agencies and NGOs have long used our maps for everything from community planning to finding hiking trails. Farmers depend on our digital elevation data to help them produce our food. Historians look to our maps from years past to see how the terrain and built environment have changed over time.

While specific groups use USGS as a resource, we want the public at-large to find and use our maps, as well. The content of our maps—the information they convey about our land and its heritage—belongs to all Americans. Our maps are intended to serve as a public good. The more taxpayers use our maps and the more use they can find in the maps, the better.

We recognize that our expertise lies in mapping, so partnering with Google, which has expertise in Web design and delivery, is a natural fit. Google Maps Gallery helps us organize and showcase our maps in an efficient, mobile-friendly interface that’s easy for anyone to find what they’re looking for. Maps Gallery not only publishes USGS maps in high-quality detail, but makes it easy for anyone to search for and discover new maps.

My favorite line:

Maps were made for public consumption, not for safekeeping under lock and key.

Very true. Equally true for all the research and data that is produced at the behest of the government.


Sunday, March 23rd, 2014


From the about page:

The DARPA Open Catalog is a list of DARPA-sponsored open source software products and related publications. Each resource link shown on this site links back to information about each project, including links to the code repository and software license information.

This site reorganizes the resources of the Open Catalog (specifically the XDATA program) in a way that is easily sortable based on language, project or team. More information about XDATA’s open source software toolkits and peer-reviewed publications can be found on the DARPA Open Catalog, located at

For more information about this site, e-mail us at

A great public service for anyone interested in DARPA XDATA projects.

You could view this as encouragement to donate time to government hackathons.

I disagree.

Donating services to an organization that pays for IT and then accepts crap results encourages poor IT management.

Possible Elimination of FR and CFR indexes (Pls Read, Forward, Act)

Saturday, March 22nd, 2014

Possible Elimination of FR and CFR indexes

I don’t think I have ever posted with (Pls Read, Forward, Act) in the headline, but this merits it.

From the post:

Please see the following message from Emily Feltren, Director of Government Relations for AALL, and contact her if you have any examples to share.

Hi Advocates—

Last week, the House Oversight and Government Reform Committee reported out the Federal Register Modernization Act (HR 4195). The bill, introduced the night before the mark up, changes the requirement to print the Federal Register and Code of Federal Regulations to “publish” them, eliminates the statutory requirement that the CFR be printed and bound, and eliminates the requirement to produce an index to the Federal Register and CFR. The Administrative Committee of the Federal Register governs how the FR and CFR are published and distributed to the public, and will continue to do so.

While the entire bill is troubling, I most urgently need examples of why the Federal Register and CFR indexes are useful and how you use them. Stories in the next week would be of the most benefit, but later examples will help, too. I already have a few excellent examples from our Print Usage Resource Log – thanks to all of you who submitted entries! But the more cases I can point to, the better.

Interestingly, the Office of the Federal Register itself touted the usefulness of its index when it announced the retooled index last year:

Thanks in advance for your help!

Emily Feltren
Director of Government Relations

American Association of Law Libraries

25 Massachusetts Avenue, NW, Suite 500

Washington, D.C. 20001


This is seriously bad news so I decided to look up the details.

Federal Register

Title 44, Section 1504 Federal Register, currently reads in part:

Documents required or authorized to be published by section 1505 of this title shall be printed and distributed immediately by the Government Printing Office in a serial publication designated the ”Federal Register.” The Public Printer shall make available the facilities of the Government Printing Office for the prompt printing and distribution of the Federal Register in the manner and at the times required by this chapter and the regulations prescribed under it. The contents of the daily issues shall be indexed and shall comprise all documents, required or authorized to be published, filed with the Office of the Federal Register up to the time of the day immediately preceding the day of distribution fixed by regulations under this chapter. (emphasis added)

By comparison, H.R. 4195 — 113th Congress (2013-2014) reads in relevant part:

The Public Printer shall make available the facilities of the Government Printing Office for the prompt publication of the Federal Register in the manner and at the times required by this chapter and the regulations prescribed under it. (Missing index language here.) The contents of the daily issues shall constitute all documents, required or authorized to be published, filed with the Office of the Federal Register up to the time of the day immediately preceding the day of publication fixed by regulations under this chapter.

Code of Federal Regulations (CFRs)

Title 44, Section 1510 Code of Federal Regulations, currently reads in part:

(b) A codification published under subsection (a) of this section shall be printed and bound in permanent form and shall be designated as the ”Code of Federal Regulations.” The Administrative Committee shall regulate the binding of the printed codifications into separate books with a view to practical usefulness and economical manufacture. Each book shall contain an explanation of its coverage and other aids to users that the Administrative Committee may require. A general index to the entire Code of Federal Regulations shall be separately printed and bound. (emphasis added)

By comparison, H.R. 4195 — 113th Congress (2013-2014) reads in relevant part:

(b) Code of Federal Regulations.–A codification prepared under subsection (a) of this section shall be published and shall be designated as the `Code of Federal Regulations’. The Administrative Committee shall regulate the manner and forms of publishing this codification. (Missing index language here.)

I would say that indexes for the Federal Register and the Code of Federal Regulations are history should this bill pass as written.

Is this a problem?

Consider the task of tracking the number of pages in the Federal Register versus the pages in the Code of Federal Regulations that may be impacted:

Federal Register – > 70,000 pages per year.

The page count for final general and permanent rules in the 50-title CFR seems less dramatic than that of the oft-cited Federal Register, which now tops 70,000 pages each year (it stood at 79,311 pages at year-end 2013, the fourth-highest level ever). The Federal Register contains lots of material besides final rules. (emphasis added) (New Data: Code of Federal Regulations Expanding, Faster Pace under Obama by Wayne Crews.)

Code of Federal Regulations – 175,496 pages (2013), including a 1,170-page index.

Now, new data from the National Archives shows that the CFR stands at 175,496 at year-end 2013, including the 1,170-page index. (emphasis added) (New Data: Code of Federal Regulations Expanding, Faster Pace under Obama by Wayne Crews.)

The bottom line is that 175,496 pages are being impacted by more than 70,000 pages per year, published every weekday.
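
To put those figures on a daily footing, here is a rough back-of-the-envelope sketch in Python. The page counts come from the Crews post quoted above. The figure of 250 issues per year is my assumption (weekdays minus federal holidays), not an official number, and the function names are made up for illustration.

```python
# Page counts quoted above (year-end 2013).
FR_PAGES_2013 = 79_311    # Federal Register pages published in 2013
CFR_PAGES_2013 = 175_496  # Code of Federal Regulations pages, index included
CFR_INDEX_PAGES = 1_170   # pages in the CFR general index

# ASSUMPTION: roughly 250 issues per year (weekdays minus federal holidays).
PUBLICATION_DAYS = 250


def pages_per_issue(total_pages: int, days: int = PUBLICATION_DAYS) -> float:
    """Average number of pages in one daily issue."""
    return total_pages / days


def index_share(index_pages: int, total_pages: int) -> float:
    """Fraction of the total page count taken up by the index."""
    return index_pages / total_pages


if __name__ == "__main__":
    print(f"Federal Register pages per issue: {pages_per_issue(FR_PAGES_2013):.0f}")
    print(f"CFR index as share of CFR: {index_share(CFR_INDEX_PAGES, CFR_PAGES_2013):.2%}")
    print(f"One year of FR relative to CFR: {FR_PAGES_2013 / CFR_PAGES_2013:.0%}")
```

Under that assumption the Federal Register averages a little over 300 pages per issue, while the CFR index is well under one percent of the Code it covers. A small index is the only map to a very large and fast-moving territory.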

We don’t need indexes to access that material?

Congress, I don’t think “access” means what you think it means.

PS: As a research guide, you are unlikely to do better than: A Research Guide to the Federal Register and the Code of Federal Regulations by Richard J. McKinney at the Law Librarians’ Society of Washington, DC website.

I first saw this in a tweet by Aaron Kirschenfeld.

UK statistics and open data…

Tuesday, March 18th, 2014

UK statistics and open data: MPs’ inquiry report published by Owen Boswarva.

From the post:

This morning the Public Administration Select Committee (PASC), a cross-party group of MPs chaired by Bernard Jenkin, published its report on Statistics and Open Data.

This report is the product of an inquiry launched in July 2013. Witnesses gave oral evidence in three sessions; you can read the transcripts and written evidence as well.

Useful if you are looking for rhetoric and examples of use of government data.

Ironic that just last week the news broke that Google has given British security the power to quickly pull terrorist content from YouTube, and that the UK government wants to extend that power to “unsavoury” (but legal) content. See: UK gov wants to censor legal but “unsavoury” YouTube content by Lisa Vaas.

Lisa writes:

Last week, the Financial Times revealed that Google has given British security the power to quickly yank terrorist content offline.

The UK government doesn’t want to stop there, though – what it really wants is the power to pull “unsavoury” content, regardless of whether it’s actually illegal – in other words, it wants censorship power.

The news outlet quoted UK’s security and immigration minister, James Brokenshire, who said that the government must do more to deal with material “that may not be illegal but certainly is unsavoury and may not be the sort of material that people would want to see or receive.”

I’m not sure why the UK government wants to block content that people don’t want to see or receive. They simply won’t look at it. Yes?

But, intellectual coherence has never been a strong point of most governments and the UK in particular of late.

Is this more evidence for my contention that “open data” for government means only the data government wants you to have?

The FIRST Act, Retro Legislation?

Tuesday, March 11th, 2014

Language in FIRST act puts United States at Severe Disadvantage Against International Competitors by Ranit Schmelzer.

From the press release:

The Scholarly Publishing and Academic Research Coalition (SPARC), an international alliance of nearly 800 academic and research libraries, today announced its opposition to Section 303 of H.R. 4186, the Frontiers in Innovation, Research, Science and Technology (FIRST) Act. This provision would impose significant barriers to the public’s ability to access the results of taxpayer-funded research.

Section 303 of the bill would undercut the ability of federal agencies to effectively implement the widely supported White House Directive on Public Access to the Results of Federally Funded Research and undermine the successful public access program pioneered by the National Institutes of Health (NIH) – recently expanded through the FY14 Omnibus Appropriations Act to include the Departments of Labor, Education and Health and Human Services. Adoption of Section 303 would be a step backward from existing federal policy in the directive, and put the U.S. at a severe disadvantage among our global competitors.

“This provision is not in the best interests of the taxpayers who fund scientific research, the scientists who use it to accelerate scientific progress, the teachers and students who rely on it for a high-quality education, and the thousands of U.S. businesses who depend on public access to stay competitive in the global marketplace,” said Heather Joseph, SPARC Executive Director. “We will continue to work with the many bipartisan members of the Congress who support open access to publicly funded research to improve the bill.”

[the parade of horribles follows]

SPARC’s press release never quotes a word from H.R. 4186. Not one. Commentary but nary a part of its object.

I searched at Thomas (the Congressional information service at the Library of Congress) for H.R. 4186 and came up empty by bill number. Switching to the Congressional Record for Monday, March 10, 2014, I did find the bill being introduced and the setting of a hearing on it. The GPO has not (as of today) posted the text of H.R. 4186, but when it does, follow this link: H.R. 4186.

Even more importantly, SPARC doesn’t point out who is responsible for the objectionable section appearing in the bill. Bills don’t write themselves and as far as I know, Congress doesn’t have a random bill generator.

The bottom line is that someone, an identifiable someone, asked for longer embargo wording to be included. If the SPARC press release is accurate, the most likely someones asked are Chairman Lamar Smith (R-TX 21st District) or Rep. Larry Bucshon (R-IN 8th District).

Wikipedia has a page on the 8th Congressional District of Indiana, which covers the southwestern corner of the state, including Evansville and Terre Haute. You might want to check Bucshon’s page at Wikipedia and links there to other resources.

Wikipedia on the 21st Congressional District of Texas places it north of San Antonio, the seventh largest city in the United States. Lamar Smith’s page at Wikipedia has some interesting reading.

Odds are in and around Evansville and San Antonio there are people interested in longer embargo periods on federally funded research.

Those are at least some starting points for effective opposition to this legislation, assuming it was reported accurately by SPARC. Let’s drop the pose of disinterested legislators trying valiantly to serve the public good. Not impossible, just highly unlikely. Let’s argue about who is getting paid and for what benefits.

Or as Captain Ahab advises:

All visible objects, man, are but as pasteboard masks. But in each event –in the living act, the undoubted deed –there, some unknown but still reasoning thing puts forth the mouldings of its features from behind the unreasoning mask. If man will strike, strike through the mask! [Melville, Moby Dick, Chapter XXXVI]

Legislation as a “pasteboard mask” is a useful image. There is not a contour, dimple, shade or expression that wasn’t bought and paid for by someone. You have to strike through the mask to discover who.

Are you game?

PS: Curious, where would you go next (data wise, I don’t have the energy to lurk in garages) in terms of searching for the buyers of longer embargoes in H.R. 4186?

Visualising UK Ministerial Lobbying…

Thursday, March 6th, 2014

Visualising UK Ministerial Lobbying & “Buddying” Over Eight Months by Roland Dunn.

From the post:


[This is a companion piece to our visualisation of ministerial lobbying – open it up and take a look!].

Eight Months Worth of Lobbying Data

Turns out that James Ball, together with the folks at Who’s Lobbying had collected together all the data regarding ministerial meetings from all the different departments across the UK’s government (during May to December 2010), tidied the data up, and put them together in one spreadsheet:

It’s important to understand that despite the current UK government stating that it is the most open and transparent ever, each department publishes its ministerial meetings in ever so slightly different formats. On that page for example you can see Dept of Health Ministerial gifts, hospitality, travel and external meetings January to March 2013, and DWP ministers’ meetings with external organisations: January to March 2013. Two lists containing slightly different sets of data. So, the work that Who’s Lobbying and James Ball did in tallying this data up is considerable. But not many people have the time to tie such data-sets together, meaning the data contained in them is somewhat more opaque than you might at first be led to believe. What’s needed is one pan-governmental set of data.

An example to follow in making “open” data a bit more “transparent.”
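
As a rough illustration of what tying such data-sets together involves, here is a minimal Python sketch that maps two differently shaped department listings onto one common record layout. The column names, sample rows and department labels are all hypothetical. They are not taken from the actual government spreadsheets or from the Who’s Lobbying data.

```python
import csv
import io

# HYPOTHETICAL column layouts: each department names the same facts differently.
COLUMN_MAPS = {
    "DH": {"Minister": "minister", "Date": "date",
           "Name of organisation": "organisation", "Purpose of meeting": "purpose"},
    "DWP": {"Minister name": "minister", "Month": "date",
            "External organisation": "organisation", "Reason": "purpose"},
}

# HYPOTHETICAL sample data standing in for the published CSV files.
DH_CSV = """Minister,Date,Name of organisation,Purpose of meeting
A. Minister,2013-01,Example Health Trust,Introductory meeting
"""
DWP_CSV = """Minister name,Month,External organisation,Reason
B. Minister,2013-02,Example Pensions Group,Policy discussion
"""


def normalise(department: str, raw_csv: str) -> list[dict]:
    """Rename one department's columns to the shared field names."""
    mapping = COLUMN_MAPS[department]
    rows = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        record = {common: row[original].strip() for original, common in mapping.items()}
        record["department"] = department  # keep track of where the row came from
        rows.append(record)
    return rows


def merge(sources: dict[str, str]) -> list[dict]:
    """Combine every department's rows into one list with a common layout."""
    merged = []
    for department, raw_csv in sources.items():
        merged.extend(normalise(department, raw_csv))
    return merged


if __name__ == "__main__":
    for record in merge({"DH": DH_CSV, "DWP": DWP_CSV}):
        print(record)
```

The renaming itself is trivial. The real work is in writing and maintaining a mapping for every department and every reporting period, which is exactly the effort a single pan-governmental format would make unnecessary.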

Not entirely transparent, for as the author notes, minutes from the various meetings are not available.

Or I suppose when minutes are available, their completeness would be questionable.

I first saw this in a tweet by Steve Peters.

Data Science – Chicago

Monday, March 3rd, 2014

OK, I shortened the headline.

The full headline reads: Accenture and MIT Alliance in Business Analytics launches data science challenge in collaboration with Chicago: New annual contest for MIT students to recognize best data analytics and visualization ideas.: The Accenture and MIT Alliance in Business Analytics

Don’t try that without coffee in the morning.

From the post:

The Accenture and MIT Alliance in Business Analytics have launched an annual data science challenge for 2014 that is being conducted in collaboration with the city of Chicago.

The challenge invites MIT students to analyze Chicago’s publicly available data sets and develop data visualizations that will provide the city with insights that can help it better serve residents, visitors, and businesses. Through data visualization, or visual renderings of data sets, people with no background in data analysis can more easily understand insights from complex data sets.

The headline is longer than the first paragraph of the story.

I didn’t see an explanation for why the challenge is limited to:

The challenge is now open and ends April 30. Registration is free and open to active MIT students 18 and over (19 in Alabama and Nebraska). Register and see the full rule here:

Find a sponsor and set up an annual data mining challenge for your school or organization.

Although I would suggest you take a pass on Bogotá, Mexico City, Rio de Janeiro, Moscow, Washington, D.C. and similar places where truthful auditing could be hazardous to your health.

Or as one of my favorite Dilbert cartoons had the pointy-haired boss observing:

When you find a big pot of crazy it’s best not to stir it.

One Thing Leads To Another (#NICAR2014)

Sunday, March 2nd, 2014

A tweet this morning read:

overviewproject ‏@overviewproject 1h
.@djournalismus talking about handling 2.5 million offshore leaks docs. Content equivalent to 50,000 bibles. #NICAR14

That sounds interesting! Can’t ever tell when a leaked document will prove useful. But where to find this discussion?

Following #NICAR14 leaves you with the impression this is a conference. (I didn’t recognize the hashtag immediately.)

Searching on the web, the hashtag led me to: 2014 Computer-Assisted Reporting Conference. (NICAR = National Institute for Computer-Assisted Reporting)

The handle @djournalismus offers the name Sebastian Mondial.

Checking the speakers list, I found this presentation:

Inside the global offshore money maze
Event: 2014 CAR Conference
Speakers: David Donald, Mar Cabra, Margot Williams, Sebastian Mondial
Date/Time: Saturday, March 1 at 2 p.m.
Location: Grand Ballroom West
Audio file: No audio file available.

The International Consortium of Investigative Journalists “Secrecy For Sale: Inside The Global Offshore Money Maze” is one of the largest and most complex cross-border investigative projects in journalism history. More than 110 journalists in about 60 countries analyzed a 260 GB leaked hard drive to expose the systematic use of tax havens. Learn how this multinational team mined 2.5 million files and cracked open the impenetrable offshore world by creating a web app that revealed the ownership behind more than 100,000 anonymous “shell companies” in 10 offshore jurisdictions.

Along the way I discovered the speakers list. The speakers cover a wide range of subjects of interest to anyone mining data.

Another treasure is the Tip Sheets and Tutorial page. Here are six (6) selections out of sixty-one (61) items to pique your interest:

  • Follow the Fracking
  • Maps and charts in R: real newsroom examples
  • Wading through the sea of data on hospitals, doctors, medicine and more
  • Free the data: Getting government agencies to give up the goods
  • Campaign Finance I: Mining FEC data
  • Danger! Hazardous materials: Using data to uncover pollution

Not to mention that NICAR2012 and NICAR2013 are also accessible from the NICAR2014 page, with their own “tip” listings.

If you find this type of resource useful, be sure to check out Investigative Reporters and Editors (IRE)

About the IRE:

Investigative Reporters and Editors, Inc. is a grassroots nonprofit organization dedicated to improving the quality of investigative reporting. IRE was formed in 1975 to create a forum in which journalists throughout the world could help each other by sharing story ideas, newsgathering techniques and news sources.

IRE provides members access to thousands of reporting tip sheets and other materials through its resource center and hosts conferences and specialized training throughout the country. Programs of IRE include the National Institute for Computer Assisted Reporting, DocumentCloud and the Campus Coverage Project

Learn more about joining IRE and the benefits of membership.

Sounds like a win-win offer to me!


SEC Filings for Humans

Tuesday, February 25th, 2014

SEC Filings for Humans by Meris Jensen.

After a long and sad history of the failure of the SEC to make EDGAR useful:

Rank and Filed gathers data from EDGAR, indexes it, and returns it in formats meant to help investors research, investigate and discover companies on their own. I started googling ‘How to build a website’ seven months ago. The SEC has web developers, software developers, database administrators, XBRL experts, legions of academics who specialize in SEC filings, and all this EDGAR data already cached in the cloud. The Commission’s mission is to protect investors, maintain fair, orderly and efficient markets, and facilitate capital formation. Why did I have to build this? (emphasis added)

I don’t know the answer to Meris’ question but I can tell you that Rank and Filed is an incredible resource for financial information.

And yet another demonstration that government should not manage open data. Make it available. (full stop)

I first saw this at Nathan Yau’s A human-readable explorer for SEC filings.

R Markdown:… [Open Analysis, successor to Open Data?]

Tuesday, February 25th, 2014

R Markdown: Integrating A Reproducible Analysis Tool into Introductory Statistics by Ben Baumer,


Nolan and Temple Lang argue that “the ability to express statistical computations is an essential skill.” A key related capacity is the ability to conduct and present data analysis in a way that another person can understand and replicate. The copy-and-paste workflow that is an artifact of antiquated user-interface design makes reproducibility of statistical analysis more difficult, especially as data become increasingly complex and statistical methods become increasingly sophisticated. R Markdown is a new technology that makes creating fully-reproducible statistical analysis simple and painless. It provides a solution suitable not only for cutting edge research, but also for use in an introductory statistics course. We present evidence that R Markdown can be used effectively in introductory statistics courses, and discuss its role in the rapidly-changing world of statistical computation. (emphasis in original)

The author’s third point for R Markdown I would have made the first:

Third, the separation of computing from presentation is not necessarily honest… More subtly and less perniciously, the copy-and-paste paradigm enables, and in many cases even encourages, selective reporting. That is, the tabular output from R is admittedly not of presentation quality. Thus the student may be tempted or even encouraged to prettify tabular output before submitting. But while one is fiddling with margins and headers, it is all too tempting to remove rows or columns that do not suit the student’s purpose. Since the commands used to generate the table are not present, the reader is none the wiser.

Although I have to admit that reproducibility has a lot going for it.

Can you imagine reproducible analysis from the OMB? Complete with machine readable data sets? Or for any other agency reports. Or for that matter, for all publications by registered lobbyists. That could be real interesting.

Open Analysis (OA) as a natural successor to Open Data.

That works for me.


PS: More resources:

Create Dynamic R Statistical Reports Using R Markdown

R Markdown

Using R Markdown with RStudio

Writing papers using R Markdown

If journals started requiring R Markdown as a condition for publication, some aspects of research would become more transparent.

Some will say that authors will resist.

Assume Science or Nature has accepted your article on the condition of your use of R Markdown.

Honestly, are you really going to say no?

I first saw this in a tweet by Scott Chamberlain.


Saturday, February 22nd, 2014

OpenRFPs: Open RFP Data for All 50 States by Clay Johnson.

From the post:

Tomorrow at CodeAcross we’ll be launching our first community-based project, OpenRFPs. The goal is to liberate the data inside of every state RFP listing website in the country. We hope you’ll find your own state’s RFP site, and contribute a parser.

The Department of Better Technology’s goal is to improve the way government works by making it easier for small, innovative businesses to provide great technology to government. But those businesses can barely make it through the front door when the RFPs themselves are stored in archaic systems, with sloppy user interfaces and disparate data formats, or locked behind paywalls.

I have posted to the announcement suggesting they use UBL. But in any event, mapping the semantics of RFPs, to enable wider participation would make an interesting project.

I first saw this in a tweet by Tim O’Reilly.

Fiscal Year 2015 Budget (US) Open Government?

Friday, February 21st, 2014

Fiscal Year 2015 Budget

From the description:

Each year, the Office of Management and Budget (OMB) prepares the President’s proposed Federal Government budget for the upcoming Federal fiscal year, which includes the Administration’s budget priorities and proposed funding.

For Fiscal Year (FY) 2015– which runs from October 1, 2014, through September 30, 2015– OMB has produced the FY 2015 Federal Budget in four print volumes plus an all-in-one CD-ROM:

  1. the main “Budget” document with the Budget Message of the President, information on the President’s priorities and budget overviews by agency, and summary tables;
  2. “Analytical Perspectives” that contains analyses that are designed to highlight specified subject areas;
  3. “Historical Tables” that provides data on budget receipts, outlays, surpluses or deficits, Federal debt over a time period
  4. an “Appendix” with detailed information on individual Federal agency programs and appropriation accounts that constitute the budget.
  5. A CD-ROM version of the Budget is also available which contains all the FY 2015 budget documents in PDF format along with some additional supporting material in spreadsheet format.

You will also want a “Green Book.” The 2014 version carried this description:

Each February when the President releases his proposed Federal Budget for the following year, Treasury releases the General Explanations of the Administration’s Revenue Proposals. Known as the “Green Book” (or Greenbook), the document provides a concise explanation of each of the Administration’s Fiscal Year 2014 tax proposals for raising revenue for the Government. This annual document clearly recaps each proposed change, reviewing the provisions in the Current Law, outlining the Administration’s Reasons for Change to the law, and explaining the Proposal for the new law. Ideal for anyone wanting a clear summary of the Administration’s policies and proposed tax law changes.

Did I mention that the four volumes for the budget in print with CD-ROM are $250? And last year the Green Book was $75?

For $325.00, you can have print and PDF copies of the Budget plus a print copy of the Green Book.


  1. Would machine readable versions of the Budget + Green Book make it easier to explore and compare the information within?
  2. Are PDFs and print volumes what President Obama considers to be “open government?”
  3. Who has the advantage in policy debates, the OMB and Treasury with machine readable versions of these documents or the average citizen who has the PDFs and print?
  4. Do you think OMB and Treasury didn’t get the memo? Open Data Policy-Managing Information as an Asset

Public policy debates cannot be fairly conducted without meaningful access to data on public policy issues.

Islamic Finance: A Quest for Publically Available Bank-level Data

Wednesday, February 12th, 2014

Islamic Finance: A Quest for Publically Available Bank-level Data by Amin Mohseni-Cheraghlou.

From the post:

Attend a seminar or read a report on Islamic finance and chances are you will come across a figure between $1 trillion and $1.6 trillion, referring to the estimated size of the global Islamic assets. While these aggregate global figures are frequently mentioned, publically available bank-level data have been much harder to come by.

Considering the rapid growth of Islamic finance, its growing popularity in both Muslim and non-Muslim countries, and its emerging role in global financial industry, especially after the recent global financial crisis, it is imperative to have up-to-date and reliable bank-level data on Islamic financial institutions from around the globe.

To date, there is a surprising lack of publically available, consistent and up-to-date data on the size of Islamic assets on a bank-by-bank basis. In fairness, some subscription-based datasets, such as Bureau Van Dijk’s Bankscope, do include annual financial data on some of the world’s leading Islamic financial institutions. Bank-level data are also compiled by The Banker’s Top Islamic Financial Institutions Report and Ernst & Young’s World Islamic Banking Competitiveness Report, but these are not publically available and require subscription premiums, making it difficult for many researchers and experts to access. As a result, data on Islamic financial institutions are associated with some level of opaqueness, creating obstacles and challenges for empirical research on Islamic finance.

The recent opening of the Global Center for Islamic Finance by World Bank Group President Jim Yong Kim may lead to exciting venues and opportunities for standardization, data collection, and empirical research on Islamic finance. In the meantime, the Global Financial Development Report (GFDR) team at the World Bank has also started to take some initial steps towards this end.

I can think of two immediate benefits from publicly available data on Islamic financial institutions:

First, hopefully it will increase demands for meaningful transparency in Western financial institutions.

Second, it will blunt government hand waving and propaganda about the purposes of Islamic financial institutions, which, on a par with financial institutions everywhere, want to remain solvent, serve the needs of their customers and play active roles in their communities. Nothing more sinister than that.

Perhaps the best way to vanquish suspicion is with transparency. Except for the fringe cases who treat lack of evidence as proof of secret evil doing.

…Desperately Seeking Data Integration

Tuesday, January 21st, 2014

Why the US Government is Desperately Seeking Data Integration by David Linthicum.

From the post:

“When it comes to data, the U.S. federal government is a bit of a glutton. Federal agencies manage on average 209 million records, or approximately 8.4 billion records for the entire federal government, according to Steve O’Keeffe, founder of the government IT network site, MeriTalk.”

Check out these stats, in a December 2013 MeriTalk survey of 100 federal records and information management professionals. Among the findings:

  • Only 18 percent said their agency had made significant progress toward managing records and email in electronic format, and are ready to report.
  • One in five federal records management professionals say they are “completely prepared” to handle the growing volume of government records.
  • 92 percent say their agency “has a lot of work to do to meet the direction.”
  • 46 percent say they do not believe or are unsure about whether the deadlines are realistic and obtainable.
  • Three out of four say the Presidential Directive on Managing Government Records will enable “modern, high-quality records and information management.”

I’ve been working with the US government for years, and I can tell that these facts are pretty accurate. Indeed, the paper glut is killing productivity. Even the way they manage digital data needs a great deal of improvement.

I don’t doubt a word of David’s post. Do you?

What I do doubt is the ability of the government to integrate its data. At least unless and until it makes some fundamental choices about the route it will take to data integration.

First, replacement of existing information systems is a non-goal. Unless that is an a priori assumption, the politics, both on Capitol Hill and internal to any agency, program, etc., will doom a data integration effort before it begins.

The first non-goal means that the ROI of data integration must be high enough to be evident even with current systems in place.

Second, integration of the most difficult cases is not the initial target for any data integration project. It would be offensive to cite all the “boil the ocean” projects that have failed in Washington, D.C. Let’s just agree that judicious picking of high-value, reasonable-effort integration cases is a good proving ground.

Third, the targets and costs for meeting those targets of data integration, along with expected ROI, will be agreed upon by all parties before any work starts. Avoidance of mission creep is essential to success. Not to mention that public goals and metrics will enable everyone to decide if the goals have been met.

Fourth, employment of traditional vendors, unemployed programmers, geographically dispersed staff, etc. are also non-goals of the project. With the money that can be saved by robust data integration, departments can feather their staffs as much as they like.

If you need proof of the fourth requirement, consider the various Apache projects that are now the underpinnings for “big data” in its many forms.

It is possible to solve the government’s data integration issues. But not without some hard choices being made up front about the project.

Sorry, forgot one:

Fifth, the project leader should seek a consensus among the relevant parties but ultimately has the authority to make decisions for the project. If every dispute can have one or more parties running to their supervisor or congressional backer, the project is doomed before it starts. The buck stops with the project manager and nowhere else.

Extracting Insights – FBO.Gov

Tuesday, January 21st, 2014

Extracting Insights from FBO.Gov data – Part 1

Extracting Insights from FBO.Gov data – Part 2

Extracting Insights from FBO.Gov data – Part 3

Dave Fauth has written a great three part series on extracting “insights” from large amounts of data.

From the third post in the series:

Earlier this year, Sunlight Foundation filed a lawsuit under the Freedom of Information Act. The lawsuit requested solicitation and award notices from In November, Sunlight received over a decade’s worth of information and posted the information on-line for public downloading. I want to say a big thanks to Ginger McCall and Kaitlin Devine for the work that went into making this data available.

In the first part of this series, I looked at the data and munged the data into a workable set. Once I had the data in a workable set, I created some heatmap charts of the data looking at agencies and who they awarded contracts to. In part two of this series, I created some bubble charts looking at awards by Agency and also the most popular Awardees.

In the third part of the series, I am going to look at awards by date and then display that information in a calendar view. Then we will look at the types of awards.

For the date analysis, we are going to use all of the data going back to 2000. We have six data files that we will join together, filter on the ‘Notice Type’ field, and then calculate the counts by date for the awards. The goal is to see when awards are being made.
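The date analysis described above (join the files, filter on ‘Notice Type’, count by date) can be sketched in a few lines of Python. This is only a rough illustration, not Dave's code: the ‘Notice Type’ field comes from his post, but the "Date" column name, the "Award Notice" value and the tiny in-memory files are assumptions made up for the example.

```python
import csv
import io
from collections import Counter


def count_awards_by_date(files, notice_type="Award Notice"):
    """Read several CSV files as one stream, keep rows of one
    notice type, and count how many fall on each date."""
    counts = Counter()
    for f in files:
        for row in csv.DictReader(f):
            # 'Notice Type' is named in the post; the value is assumed.
            if row.get("Notice Type") == notice_type:
                # "Date" is a hypothetical column name.
                counts[row["Date"]] += 1
    return counts


# Two small in-memory stand-ins for the six data files.
file_a = io.StringIO(
    "Date,Notice Type,Agency\n"
    "2013-09-30,Award Notice,A\n"
    "2013-09-30,Presolicitation,B\n"
)
file_b = io.StringIO(
    "Date,Notice Type,Agency\n"
    "2013-09-30,Award Notice,C\n"
    "2013-10-01,Award Notice,A\n"
)

awards_by_date = count_awards_by_date([file_a, file_b])
```

The resulting counts per date are what a calendar view would be drawn from.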

The most compelling lesson from this series is that data doesn’t always easily give up its secrets.

If you make it to the end of the series, you will find the government, on occasion, does the right thing. I’ll admit it, I was very surprised. ;-)

Medicare Spending Data…

Sunday, January 19th, 2014

Medicare Spending Data May Be Publicly Available Under New Policy by Gavin Baker.

From the post:

On Jan. 14, the Centers for Medicare & Medicaid Services (CMS) announced a new policy that could bring greater transparency to Medicare, one of the largest programs in the federal government. CMS revoked its long-standing policy not to release publicly any information about Medicare’s payments to doctors. Under the new policy, the agency will evaluate requests for such information on a case-by-case basis. Although the impact of the change is not yet clear, it creates an opportunity for a welcome step forward for data transparency and open government.

Medicare’s tremendous size and impact – expending an estimated $551 billion and covering roughly 50 million beneficiaries in 2012 – mean that increased transparency in the program could have big effects. Better access to Medicare spending data could permit consumers to evaluate doctor quality, allow journalists to identify waste or fraud, and encourage providers to improve health care delivery.

Until now, the public hasn’t been able to learn how much Medicare pays to particular medical businesses. In 1979, a court blocked Medicare from releasing such information after doctors fought to keep it secret. However, the court lifted the injunction in May 2013, freeing CMS to consider whether to release the data.

In turn, CMS asked for public comments about what it should do and received more than 130 responses. The Center for Effective Government was among the organizations that filed comments, calling for more transparency in Medicare spending and urging CMS to revoke its previous policy implementing the injunction. After considering those comments, CMS adopted its new policy.

The change may allow the public to examine the reimbursement amounts paid to medical providers under Medicare. Under the new approach, CMS will not release those records wholesale. Instead, the agency will wait for specific requests for the data and then evaluate each to consider if disclosure would invade personal privacy. While information about patients is clearly off-limits, it’s not clear what kind of information about doctors CMS will consider private, so it remains to be seen how much information is ultimately disclosed under the new policy. It should be noted, however, that the U.S. Supreme Court has held that businesses don’t have “personal privacy” under the Freedom of Information Act (FOIA), and the government already discloses the amounts it pays to other government contractors.

The announcement from CMS: Modified Policy on Freedom of Information Act Disclosure of Amounts Paid to Individual Physicians under the Medicare Program

The case by case determination of a physician’s privacy rights is an attempt to discourage requests for public information.

If all physician payment data, say by procedure, were available in state by state data sets, local residents in a town of 500 would know that 2,000 x-rays a year is on the high side, without ever knowing any patient’s identity.

If you are a U.S. resident, take this opportunity to push for greater transparency in Medicare spending. Be polite and courteous but also be persistent. You need no more reason than an interest in how Medicare is being spent.

Let’s have an FOIA (Freedom of Information Act) request pending for every physician in the United States within 90 days of the CMS rule becoming final.

It’s not final yet, but when it is, let slip the leash on the dogs of FOIA.

Data Analytic Recidivism Tool (DART) [DAFT?]

Sunday, December 29th, 2013

Data Analytic Recidivism Tool (DART)

From the website:

The Data Analytic Recidivism Tool (DART) helps answer questions about recidivism in New York City.

  • Are people that commit a certain type of crime more likely to be re-arrested?
  • What about people in a certain age group or those with prior convictions?

DART lets users look at recidivism rates for selected groups defined by characteristics of defendants and their cases.

A direct link to the DART homepage.

After looking at the interface, which groups recidivists in groups of 250, I’m not sure DART is all that useful.

It did spark an idea that might help with the federal government’s acquisition problems.

Why not create the equivalent of DART but call it:

Data Analytic Failure Tool (DAFT).

And in DAFT track federal contractors, their principals, contracts, and the program officers who play any role in those contracts.

So that when contractors fail, as so many of them do, it will be easy to track the individuals involved on both sides of the failure.

And every contract will have a preamble that recites any prior history of failure and the people involved in that failure, on all sides.

Such that any subsequent supervisor has to sign off with full knowledge of the prior lack of performance.

If criminal recidivism is to be avoided, shouldn’t failure recidivism be avoided as well?
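A minimal sketch of what DAFT might track, assuming nothing more than the description above. All of the names here (the `Contract` record, the `failure_preamble` function, the contractors and people) are hypothetical and invented for illustration; no such system or data set exists that I know of.

```python
from dataclasses import dataclass


@dataclass
class Contract:
    contract_id: str
    contractor: str
    principals: list        # people on the contractor side
    program_officers: list  # people on the government side
    failed: bool = False


def failure_preamble(history, contractor, principals, program_officers):
    """List prior failed contracts that share the contractor or
    any individual, on either side, with a proposed contract."""
    people = set(principals) | set(program_officers)
    return [
        c.contract_id
        for c in history
        if c.failed
        and (
            c.contractor == contractor
            or people & (set(c.principals) | set(c.program_officers))
        )
    ]


# Hypothetical history: two failures, one success.
history = [
    Contract("C-1", "Acme", ["Pat"], ["Lee"], failed=True),
    Contract("C-2", "Bolt", ["Sam"], ["Kim"], failed=False),
    Contract("C-3", "Core", ["Pat"], ["Ray"], failed=True),
]

# A new contractor, but with a principal who was part of two failures.
prior = failure_preamble(history, "Delta", ["Pat"], ["Kim"])
```

The point of tracking people as well as companies is that the same individuals moving to a new contractor still show up in the preamble.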

Discover Your Neighborhood with Census Explorer

Wednesday, December 25th, 2013

Discover Your Neighborhood with Census Explorer by Michael Ratcliffe.

From the post:

Our customers often want to explore neighborhood-level statistics and see how their communities have changed over time. Our new Census Explorer interactive mapping tool makes this easier than ever. It provides statistics on a variety of topics, such as percent of people who are foreign-born, educational attainment and homeownership rate. Statistics from the 2008 to 2012 American Community Survey power Census Explorer.

While you may be familiar with other ways to find neighborhood-level statistics, Census Explorer provides an interactive map for states, counties and census tracts. You can even look at how these neighborhoods have changed over time because the tool includes information from the 1990 and 2000 censuses in addition to the latest American Community Survey statistics. Seeing these changes is possible because the annual American Community Survey replaced the decennial census long form, giving communities throughout the nation more timely information than just once every 10 years.

Topics currently available in Census Explorer:

  • Total population
  • Percent 65 and older
  • Foreign-born population percentage
  • Percent of the population with a high school degree or higher
  • Percent with a bachelor’s degree or higher
  • Labor force participation rate
  • Home ownership rate
  • Median household income

Fairly coarse (census tract level) data but should be useful for any number of planning purposes.

For example, you could cross this data with traffic ticket and arrest data to derive “police presence” statistics.

Or add “citizen watcher” data from tweets about police car # and locations.

Different data sets often use different boundaries for areas.

Consider creating topic map based filters so when the boundaries change (a favorite activity of local governments) so will your summaries of that data.
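As a rough sketch of the idea, assume the summaries are never stored by area but are always derived from tract-level values plus a separate tract-to-area mapping. The code below uses a plain dictionary as a stand-in for that mapping (it is not a topic map), and the tract names, ward names and population figures are invented for the example.

```python
from collections import defaultdict


def summarize(tract_values, tract_to_area):
    """Sum tract-level values into whatever areas the
    current boundary mapping defines."""
    totals = defaultdict(int)
    for tract, value in tract_values.items():
        totals[tract_to_area[tract]] += value
    return dict(totals)


# Hypothetical tract-level population counts.
population = {"tract-1": 1200, "tract-2": 800, "tract-3": 500}

# Hypothetical ward boundaries before and after a redrawing:
# tract-2 moves from ward A to ward B.
wards_old = {"tract-1": "A", "tract-2": "A", "tract-3": "B"}
wards_new = {"tract-1": "A", "tract-2": "B", "tract-3": "B"}

before = summarize(population, wards_old)
after = summarize(population, wards_new)
```

Only the mapping changes when the boundaries do; the underlying data and the summary code stay the same.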

…2013 World Ocean Database…

Sunday, December 22nd, 2013

NOAA releases 2013 World Ocean Database: The largest collection of scientific information about the oceans

From the post:

NOAA has released the 2013 World Ocean Database, the largest, most comprehensive collection of scientific information about the oceans, with records dating as far back as 1772. The 2013 database updates the 2009 version and contains nearly 13 million temperature profiles, compared with 9.1 in the 2009 database, and just fewer than six million salinity measurements, compared with 3.5 in the previous database. It integrates ocean profile data from approximately 90 countries around the world, collected from buoys, ships, gliders, and other instruments used to measure the “pulse” of the ocean.

Profile data of the ocean are measurements taken at many depths, from the surface to the floor, at a single location, during the time it takes to lower and raise the measuring instruments through the water. “This product is a powerful tool being used by scientists around the globe to study how changes in the ocean can impact weather and climate,” said Tim Boyer, an oceanographer with NOAA’s National Oceanographic Data Center.

In addition to using the vast amount of temperature and salinity measurements to monitor changes in heat and salt content, the database captures other measurements, including: oxygen, nutrients, chlorofluorocarbons and chlorophyll, which all reveal the oceans’ biological structure.

For the details on this dataset see: WOD Introduction.

The introduction notes under 1.1.5 Data Fusion:

It is not uncommon in oceanography that measurements of different variables made from the same sea water samples are often maintained as separate databases by different principal investigators. In fact, data from the same oceanographic cast may be located at different institutions in different countries. From its inception, NODC recognized the importance of building oceanographic databases in which as much data from each station and each cruise as possible are placed into standard formats, accompanied by appropriate metadata that make the data useful to future generations of scientists. It was the existence of such databases that allowed the International Indian Ocean Expedition Atlas (Wyrtki, 1971) and Climatological Atlas of the World Ocean (Levitus, 1982) to be produced without the time-consuming, laborious task of gathering data from many different sources. Part of the development of WOD13 has been to expand this data fusion activity by increasing the number of variables that NODC/WDC makes available as part of standardized databases.

As the NODC (National Oceanographic Data Center) demonstrates, it is possible to curate data sources in order to present a uniform data collection.

But a curated data set remains inconsistent with data sets not curated by the same authority.

And combining curated data with non-curated data requires effort with the curated data, again.

Hard to map towards a destination without knowing its location.

Topic maps can capture the basis for curation, which will enable faster and more accurate integration of foreign data sets in the future.

UNESCO Open Access Publications [Update]

Thursday, December 19th, 2013

UNESCO Open Access Publications

From the webpage:

Building peaceful, democratic and inclusive knowledge societies across the world is at the heart of UNESCO’s mandate. Universal access to information is one of the fundamental conditions to achieve global knowledge societies. This condition is not a reality in all regions of the world.

In order to help reduce the gap between industrialized countries and those in the emerging economy, UNESCO has decided to adopt an Open Access Policy for its publications by making use of a new dimension of knowledge sharing – Open Access.

Open Access means free access to scientific information and unrestricted use of electronic data for everyone. With Open Access, expensive prices and copyrights will no longer be obstacles to the dissemination of knowledge. Everyone is free to add information, modify contents, translate texts into other languages, and disseminate an entire electronic publication.

For UNESCO, adopting an Open Access Policy means to make thousands of its publications freely available to the public. Furthermore, Open Access is also a way to provide the public with an insight into the work of the Organization so that everyone is able to discover and share what UNESCO is doing.

You can access and use our resources for free by clicking here.

In May of 2013 UNESCO announced its Open Access policy.

Many organizations profess a belief in “Open Access.”

The real test is whether they practice “Open Access.”


Thursday, December 19th, 2013


I don’t know enough about the Brazilian economy to say if the visualizations are helpful or not.

What I can tell you is the visualizations are impressive!

Thoughts on the site as an interface to open data?

PS: This appears to be a government supported website so not all government sponsored websites are poor performers.

Aberdeen – 1398 to Present

Sunday, December 15th, 2013

A Text Analytic Approach to Rural and Urban Legal Histories

From the post:

Aberdeen has the earliest and most complete body of surviving records of any Scottish town, running in near-unbroken sequence from 1398 to the present day. Our central focus is on the ‘provincial town’, especially its articulations and interactions with surrounding rural communities, infrastructure and natural resources. In this multi-disciplinary project, we apply text analytical tools to digitised Aberdeen Burgh Records, which are a UNESCO listed cultural artifact. The meaningful content of the Records is linguistically obscured, so must be interpreted. Moreover, to extract and reuse the content with Semantic Web and linked data technologies, it must be machine readable and richly annotated. To accomplish this, we develop a text analytic tool that specifically relates to the language, content, and structure of the Records. The result is an accessible, flexible, and essential precursor to the development of Semantic Web and linked data applications related to the Records. The applications will exploit the artifact to promote Aberdeen Burgh and Shire cultural tourism, curriculum development, and scholarship.

The scholarly objective of this project is to develop the analytic framework, methods, and resource materials to apply a text analytic tool to annotate and access the content of the Burgh records. Amongst the text analytic issues to address in historical perspective are: the identification and analysis of legal entities, events, and roles; and the analysis of legal argumentation and reasoning. Amongst the legal historical issues are: the political and legal culture and authority in the Burgh and Shire, particularly pertaining to the management and use of natural resources. Having an understanding of these issues and being able to access them using Semantic Web/linked data technologies will then facilitate exploitation in applications.

This project complements a distinct, existing collaboration between the Aberdeen City & Aberdeenshire Archives (ACAA) and the University (Connecting and Projecting Aberdeen’s Burgh Records, jointly led by Andrew Mackillop and Jackson Armstrong) (the RIISS Project), which will both make a contribution to the project (see details on application form). This multi-disciplinary application seeks funding from Dot.Rural chiefly for the time of two specialist researchers: a Research Fellow to interpret the multiple languages, handwriting scripts, archaic conventions, and conceptual categories emerging from these records; and subcontracting the A-I to carry out the text analytic and linked data tasks on a given corpus of previously transcribed council records, taking the RF’s interpretation as input.

Now there’s a project for tracking changing semantics over the hills and valleys of time!

Will be interesting to see how they capture semantics that are alien to our own.

Or how they preserve relationships between ancient semantic concepts.

Requesting Datasets from the Federal Government

Friday, December 13th, 2013

Requesting Datasets from the Federal Government by Eruditio Loginquitas.

From the post:

Much has been made of “open government” of late, with the U.S.’s federal government releasing tens of thousands of data sets from pretty much all public-facing offices. Many of these sets are available off of their respective websites. Many are offered in a centralized way at I finally spent some time on this site in search of datasets with location data to continue my learning of Tableau Public (with an eventual planned move to ArcMap).

I’ve been appreciating how much data are required to govern effectively but also how much data are created in the work of governance, particularly in an open and transparent society. There are literally billions of records and metrics required to run an efficient modern government. In a democracy, the tendency is to make information available—through sunshine laws and open meetings laws and data requests. The openness is particularly pronounced in cases of citizen participation, academic research, and journalistic requests. These are all aspects of a healthy interchange between citizens and their government…and further, digital government.

Public Requests for Data

One of the more charming aspects of the site involves a public thread which enables people to make requests for the creation of certain data sets by developers. People would make the case for the need for certain information. Some would offer “trades” by making promises about how they would use the data and what they would make available to the larger public. Others would simply make a request for the data. Still others would just post “requests,” which were actually just political or personal statements. (The requests site may be viewed here.)

What datasets would you like to see?

The rejected requests can be interesting, for example:

Properties Owned by Congressional Members Rejected

Congressional voting records Rejected

I don’t think the government has detailed information sufficient to answer the one about property owned by members of Congress.

On the other hand there are only 535 members so manual data mining in each state should turn up most of the public information fairly easily. The non-public information could be more difficult.

The voting records request is puzzling since that is public record. And various rant groups print up their own analysis of voting records.

I don’t know, given the number of requests “Under Review” if it would be a good use of time but requesting the data behind opaque reports might illuminate the areas being hidden from transparency.

Scout [NLP, Move up from Twitter Feeds to Court Opinions]

Tuesday, December 3rd, 2013


From the about page:

Scout is a free service that provides daily insight to how our laws and regulations are shaped in Washington, DC and our state capitols.

These days, you can receive electronic alerts to know when a company is in the news, when a TV show is scheduled to air or when a sports team wins. Now, you can also be alerted when our elected officials take action on an issue you care about.

Scout allows anyone to subscribe to customized email or text alerts on what Congress is doing around an issue or a specific bill, as well as bills in the state legislature and federal regulations. You can also add external RSS feeds to complement a Scout subscription, such as press releases from a member of Congress or an issue-based blog.

Anyone can create a collection of Scout alerts around a topic, for personal organization or to make it easy for others to easily follow a whole topic at once.

Researchers can use Scout to see when Congress talks about an issue over time. Members of the media can use Scout to track when legislation important to their beat moves ahead in Congress or in state houses. Non-profits can use Scout as a tool to keep tabs on how federal and state lawmakers are making policy around a specific issue.

Early testing of Scout during its open beta phase alerted Sunlight and allies in time to successfully stop an overly broad exemption to the Freedom of Information Act from being applied to legislation that was moving quickly in Congress. Read more about that here.

Thank you to the Stanton Foundation, who contributed generous support to Scout’s development.

What kind of alerts?

If your manager suggests a Twitter feed to test NLP, classification, sentiment, etc. code, ask to use Federal Court (U.S.) Court Opinion Feed instead.

Not all data is written in one hundred and forty (140) character chunks. ;-)

PS: Be sure to support/promote the Sunlight Foundation for making this data available.

Casualty Count for Obamacare (0)

Wednesday, November 20th, 2013

5 lessons IT leaders can learn from Obamacare rollout mistakes by Teena Hammond.

Teena reports on five lessons to be learned from the rollout:

  1. If you’re going to launch a new website, decide whether to use in-house talent or outsource. If you opt to outsource, hire a good contractor.
  2. Follow the right steps to hire the best vendor for the project, and properly manage the relationship.
  3. Have one person in charge of the project with absolute veto power.
  4. Do not gloss over any problems along the way. Be open and honest about the progress of the project. And test the site.
  5. Be ready for success or failure. Hope for the best but prepare for the worst and have guidelines to manage any potential failure.

There is a sixth lesson that emerges from Vaughn Bullard, CEO and founder of Build.Automate Inc., who is quoted in part saying:

“The contractor telling the government that it was ready despite the obvious major flaws in the system is just baffling to me. If I had an employee that did something similar, I would have terminated their employment. It’s pretty simple.”

What it comes down to in the end, Bullard said, is that, “Quality and integrity count in all things.”

To avoid repeated failures in the future (sixth lesson), terminate those responsible for the current failure.

All contractors and their staffs. Track the staffs in order to avoid the same staff moving to other contractors.

Terminate all appointed or hired staff who are responsible for the contract and/or management of the project.

Track former staff employment by contractors and refuse contracts wherever they are employed.

You may have noticed that the reported casualty count for the Obamacare failure has been zero.

What incentive exists for the next group of contract/project managers and/or contractors for “quality and integrity?”

That would be the same as the casualty count, zero.

PS: Before you protest the termination and ban of failures as cruel, consider its advantages as a wealth redistribution program.

The government may not get better service but it will provide opportunities for fraud and poor quality work from new participants.

Not to mention there are IT service providers who exhibit quality and integrity. Absent traditional mis-management, the government could happen upon one of those.

The tip for semantic technologies is to under-promise and over-deliver. Always.

Free Access to EU Satellite Data

Thursday, November 14th, 2013

Free Access to EU Satellite Data (Press Release, Brussels, 13 November 2013).

From the release:

The European Commission will provide free, full and open access to a wealth of important environmental data gathered by Copernicus, Europe’s Earth observation system. The new open data dissemination regime, which will come into effect next month, will support the vital task of monitoring the environment and will also help Europe’s enterprises, creating new jobs and business opportunities. Sectors positively stimulated by Copernicus are likely to be services for environmental data production and dissemination, as well as space manufacturing. Indirectly, a variety of other economic segments will see the advantages of accurate earth observation, such as transport, oil and gas, insurance and agriculture. Studies show that Copernicus – which includes six dedicated satellite missions, the so-called Sentinels, to be launched between 2014 and 2021 – could generate a financial benefit of some € 30 billion and create around 50.000 jobs by 2030. Moreover, the new open data dissemination regime will help citizens, businesses, researchers and policy makers to integrate an environmental dimension into all their activities and decision making procedures.

To make maximum use of this wealth of information, researchers, citizens and businesses will be able to access Copernicus data and information through dedicated Internet-based portals. This free access will support the development of useful applications for a number of different industry segments (e.g. agriculture, insurance, transport, and energy). Other examples include precision agriculture or the use of data for risk modelling in the insurance industry. It will fulfil a crucial role, meeting societal, political and economic needs for the sustainable delivery of accurate environmental data.

More information on the Copernicus web site at:

The “€ 30 billion” financial benefit seems a bit soft after looking at the study reports on the economic value of Copernicus.

For example, if Copernicus is used to monitor illegal dumping (D. Drimaco, Waste monitoring service to improve waste management practices and detect illegal landfills), how is a financial benefit calculated for illegal dumping prevented?

If you are the Office of Management and Budget (U.S.), you could simply make up the numbers and report them in near indecipherable documents. (Free Sequester Data Here!)

I don’t doubt there will be economic benefits from Copernicus but questions remain: how much and for whom?

I first saw this in a tweet by Stefano Bertolo.

Implementations of Data Catalog Vocabulary

Tuesday, November 5th, 2013

Implementations of Data Catalog Vocabulary

From the post:

The Government Linked Data (GLD) Working Group today published the Data Catalog Vocabulary (DCAT) as a Candidate Recommendation. DCAT allows governmental and non-governmental data catalogs to publish their entries in a standard machine-readable format so they can be managed, aggregated, and presented in other catalogs.

Originally developed at DERI, DCAT has evolved with input from a variety of stakeholders and is now stable and ready for widespread use. If you have a collection of data sources, please consider publishing DCAT metadata for it, and if you run a data catalog or portal, please consider making use of DCAT metadata you find. The Working Group is eager to receive comments reports of use at and is maintaining an Implementation Report.

If you know anyone in the United States government, please suggest this to them.
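For anyone wondering what publishing DCAT metadata might look like, here is a minimal sketch that builds one catalog entry as JSON-LD in Python. The `dcat:` and `dct:` terms are from the DCAT and Dublin Core vocabularies; the titles, keywords and the example.org URL are placeholders invented for illustration, and a real catalog would carry many more properties.

```python
import json

# One catalog, one dataset, one downloadable distribution.
catalog = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
    },
    "@type": "dcat:Catalog",
    "dct:title": "Example agency data catalog",  # placeholder title
    "dcat:dataset": [
        {
            "@type": "dcat:Dataset",
            "dct:title": "Example contract awards",  # placeholder title
            "dcat:keyword": ["contracts", "awards"],
            "dcat:distribution": [
                {
                    "@type": "dcat:Distribution",
                    # Placeholder URL, not a real download location.
                    "dcat:downloadURL": {"@id": "http://example.org/awards.csv"},
                    "dcat:mediaType": "text/csv",
                }
            ],
        }
    ],
}

# Serialize to the machine-readable form another catalog could harvest.
dcat_json = json.dumps(catalog, indent=2)
```

Because every catalog uses the same property names, an aggregator can read entries from many portals without per-portal parsing.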

The more time the U.S. government spends on innocuous data, the less time it has to spy on its citizens and the citizens and governments of other countries.

I say innocuous data because I have yet to see any government release information that would discredit the current regime.

Wasn’t true for the Pentagon Papers, the Watergate tapes or the Snowden releases.

Can you think of any voluntary release of data by any government that discredited a current regime?

The reason for secrecy isn’t to protect techniques or sources.

Guess whose incompetence would be exposed by transparency?

Open Data Index

Monday, November 4th, 2013

Open Data Index by Armin Grossenbacher.

From the post:

There are lots of indexes.

The most famous one may be the Index Librorum Prohibitorum listing books prohibited by the Catholic Church. It contained eminent scientists and intellectuals (see the list in Wikipedia) and was abolished only in 1966, after more than 400 years.

Open Data Index

One index that everybody would like to be listed in, and with a high rank, is the Open Data Index.

‘An increasing number of governments have committed to open up data, but how much key information is actually being released? …. Which countries are the most advanced and which are lagging in relation to open data? The Open Data Index has been developed to help answer such questions by collecting and presenting information on the state of open data around the world – to ignite discussions between citizens and governments.’

I haven’t seen the movie review guide that appeared in Our Sunday Visitor in years but when I was in high school it was the best movie guide around. Just pick the ones rated as morally condemned. ;-)

There are two criteria I don’t see mentioned for rating open data:

  1. How easy/hard is it to integrate a particular data set with other data from the same source or organization?
  2. Is the data supportive, neutral or negative with regard to established government policies?

Do you know of any open data sets where those questions are used to rate them?

the /unitedstates project

Tuesday, October 29th, 2013

the /unitedstates project

From the webpage:

/unitedstates is a shared commons of data and tools for the United States. Made by the public, used by the public.

There you will find:

bill-nicknames: Tiny spreadsheet of common nicknames for bills and laws.

citation: Stand-alone legal citation detector. Text in, citations out.

congress-legislators: Detailed data on members of Congress, past and present.

congress: Scrapers and parsers for the work of Congress, all day, every day.

glossary: A public domain glossary for the United States.

licensing: Policy guidelines for the licensing of US government information.

uscode: Parser for the US Code.

wish-list: Post ideas for new projects.

Can you guess what the #1 wish on the project list is?

Campaign finance donor de-duplicator

Semantics and Delivery of Useful Information [Bills Before the U.S. House]

Monday, October 21st, 2013

Lars Marius Garshol pointed out in Semantic Web adoption and the users that the question “What do semantic technologies do better than non-semantic technologies?” has yet to be answered.

Tim O’Reilly tweeted about Madison Federal today, a resource that raises the semantic versus non-semantic technology question.

In a nutshell, Madison Federal has all the bills pending before the U.S. House of Representatives online.

If you login with Facebook, you can:

  • Add a bill edit / comment
  • Enter a community suggestion
  • Enter a community comment
  • Subscribe to future edits/comments on a bill

So far, so good.

You can pick any bill but the one I chose as an example is: Postal Executive Accountability Act.

I will quote just a few lines of the bill:

2. Limits on executive pay

    (a) Limitation on compensation Section 1003 of title 39, United States Code, 
         is amended:

         (1) in subsection (a), by striking the last sentence; and
         (2) by adding at the end the following:

                  (1) Subject to paragraph (2), an officer or employee of the Postal 
                      Service may not be paid at a rate of basic pay that exceeds 
                      the rate of basic pay for level II of the Executive Schedule 
                      under section 5312 of title 5.

What would be the first thing you want to know?

Hmmm, what about subsection (a) of Section 1003 of title 39 of the United States Code, since we are striking the last sentence?

39 USC § 1003 – Employment policy [Legal Information Institute], which reads:

(a) Except as provided under chapters 2 and 12 of this title, section 8G of the Inspector General Act of 1978, or other provision of law, the Postal Service shall classify and fix the compensation and benefits of all officers and employees in the Postal Service. It shall be the policy of the Postal Service to maintain compensation and benefits for all officers and employees on a standard of comparability to the compensation and benefits paid for comparable levels of work in the private sector of the economy. No officer or employee shall be paid compensation at a rate in excess of the rate for level I of the Executive Schedule under section 5312 of title 5.

OK, so now we know that (1) is striking:

No officer or employee shall be paid compensation at a rate in excess of the rate for level I of the Executive Schedule under section 5312 of title 5.

Semantics? No, just a hyperlink.

For the added text, we want to know what is meant by:

… rate of basic pay that exceeds the rate of basic pay for level II of the Executive Schedule under section 5312 of title 5.

The Legal Information Institute is already ahead of Congress because their system provides the hyperlink we need: 5312 of title 5.

If you notice something amiss when you follow that link, congratulations! You have discovered your first congressional typo and/or error.

5312 of title 5 defines Level I of the Executive Schedule, which includes the Secretary of State, Secretary of the Treasury, Secretary of Defense, Attorney General and others. Base rate for Executive Schedule Level I is $199,700.

On the other hand, 5313 of title 5 defines Level II of the Executive Schedule, which includes Department of Agriculture, Deputy Secretary of Agriculture; Department of Defense, Deputy Secretary of Defense, Secretary of the Army, Secretary of the Navy, Secretary of the Air Force, Under Secretary of Defense for Acquisition, Technology and Logistics; Department of Education, Deputy Secretary of Education; Department of Energy, Deputy Secretary of Energy and others. Base rate for Executive Schedule Level II is $178,700.

Assuming someone catches or comments that 5312 should be 5313, top earners at the Postal Service may be about to take a $21,000.00 pay reduction.
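The mismatch and the pay difference can both be checked mechanically. What follows is only a sketch: the section-to-level pairings and base rates are the ones quoted in this post (I have not re-verified them here), and the names check_levels, LEVEL_BY_SECTION and BASE_PAY are my own inventions for illustration.

```python
import re

# Pairings and figures as stated in this post, not independently looked up.
LEVEL_BY_SECTION = {"5312": "I", "5313": "II"}
BASE_PAY = {"I": 199_700, "II": 178_700}

# Matches "level II of the Executive Schedule under section 5312 of title 5".
PATTERN = re.compile(
    r"level\s+([IVX]+)\s+of\s+the\s+Executive\s+Schedule\s+"
    r"under\s+section\s+(\d+)\s+of\s+title\s+5"
)


def check_levels(text):
    """Return (level, section, ok) for each Executive Schedule reference.

    ok is True when the cited section is the one that defines that level.
    """
    results = []
    for level, section in PATTERN.findall(text):
        ok = LEVEL_BY_SECTION.get(section) == level
        results.append((level, section, ok))
    return results


bill_text = (
    "may not be paid at a rate of basic pay that exceeds the rate of basic pay "
    "for level II of the Executive Schedule under section 5312 of title 5."
)
current_law = (
    "in excess of the rate for level I of the Executive Schedule "
    "under section 5312 of title 5."
)

print(check_levels(current_law))  # level I with section 5312: consistent
print(check_levels(bill_text))    # level II with section 5312: flagged

# Difference between the two base rates quoted above.
pay_cut = BASE_PAY["I"] - BASE_PAY["II"]
print(pay_cut)
```

Nothing here understands the bill. It is a lookup table and a regular expression, which is the same class of tool as the hyperlink.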

We got all that from mechanical hyperlinks, no semantic technology required.
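How mechanical is mechanical? A rough sketch of a citation-to-hyperlink step is below. It is not the /unitedstates citation tool, it handles only the two phrasings that appear in this post, and the Legal Information Institute URL pattern is my assumption about how their U.S. Code pages are addressed.

```python
import re

# Assumed URL pattern for Legal Information Institute U.S. Code pages.
LII_URL = "https://www.law.cornell.edu/uscode/text/{title}/{section}"

# Two phrasings only: "section 5312 of title 5" and "39 USC § 1003".
SECTION_OF_TITLE = re.compile(r"[Ss]ection\s+(\d+[A-Za-z]?)\s+of\s+title\s+(\d+)")
USC_STYLE = re.compile(r"(\d+)\s+U\.?S\.?C\.?\s+§\s*(\d+[A-Za-z]?)")


def find_citations(text):
    """Return sorted, de-duplicated (title, section) pairs cited in text."""
    found = {(title, section) for section, title in SECTION_OF_TITLE.findall(text)}
    found |= set(USC_STYLE.findall(text))
    return sorted(found)


def to_links(text):
    """Turn every detected citation into a URL using the assumed pattern."""
    return [LII_URL.format(title=t, section=s) for t, s in find_citations(text)]


sample = (
    "Section 1003 of title 39, United States Code, is amended ... "
    "under section 5312 of title 5. See also 39 USC § 1003."
)

for link in to_links(sample):
    print(link)
```

Text in, links out. The same section cited two different ways collapses to one link because both phrasings reduce to the same (title, section) pair.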

Where you might need semantic technology is when reading 39 USC § 1003 – Employment policy [Legal Information Institute] where it says (in part):

…It shall be the policy of the Postal Service to maintain compensation and benefits for all officers and employees on a standard of comparability to the compensation and benefits paid for comparable levels of work in the private sector of the economy….

Some questions:

Question: What are “comparable levels of work in the private sector of the economy?”

Question: On what basis is work for the Postal Service compared to work in the private economy?

Question: Examples of comparable jobs in the private economy and their compensation?

Question: What policy or guideline documents have been developed by the Postal Service for evaluation of Postal Service vs. work in the private economy?

Question: What studies have been done, by whom, using what practices, on comparing compensation for Postal Service work to work in the private economy?

That would be a considerable amount of information with what I suspect would be a large amount of duplication as reports or studies are cited by numerous sources.

Semantic technology would be necessary for the purpose of deduping and navigating such a body of information effectively.

Pick a bill. Where would you put the divide between mechanical hyperlinks and semantic technologies?

PS: You may remember that the House of Representatives had their own “post office” which they ran as a slush fund. The thought of the House holding someone “accountable” is too bizarre for words.