Archive for the ‘Government Data’ Category

SEC Filings for Humans

Tuesday, February 25th, 2014

SEC Filings for Humans by Meris Jensen.

After a long and sad history of the failure of the SEC to make EDGAR useful:

Rank and Filed gathers data from EDGAR, indexes it, and returns it in formats meant to help investors research, investigate and discover companies on their own. I started googling ‘How to build a website’ seven months ago. The SEC has web developers, software developers, database administrators, XBRL experts, legions of academics who specialize in SEC filings, and all this EDGAR data already cached in the cloud. The Commission’s mission is to protect investors, maintain fair, orderly and efficient markets, and facilitate capital formation. Why did I have to build this? (emphasis added)

I don’t know the answer to Meris’ question but I can tell you that Rank and Filed is an incredible resource for financial information.

And yet another demonstration that government should not manage open data. Make it available. (full stop)

I first saw this at Nathan Yau’s A human-readable explorer for SEC filings.

R Markdown:… [Open Analysis, successor to Open Data?]

Tuesday, February 25th, 2014

R Markdown: Integrating A Reproducible Analysis Tool into Introductory Statistics by Ben Baumer, et al.


Nolan and Temple Lang argue that “the ability to express statistical computations is an essential skill.” A key related capacity is the ability to conduct and present data analysis in a way that another person can understand and replicate. The copy-and-paste workflow that is an artifact of antiquated user-interface design makes reproducibility of statistical analysis more difficult, especially as data become increasingly complex and statistical methods become increasingly sophisticated. R Markdown is a new technology that makes creating fully-reproducible statistical analysis simple and painless. It provides a solution suitable not only for cutting edge research, but also for use in an introductory statistics course. We present evidence that R Markdown can be used effectively in introductory statistics courses, and discuss its role in the rapidly-changing world of statistical computation. (emphasis in original)

The authors’ third point for R Markdown I would have made the first:

Third, the separation of computing from presentation is not necessarily honest… More subtly and less perniciously, the copy-and-paste paradigm enables, and in many cases even encourages, selective reporting. That is, the tabular output from R is admittedly not of presentation quality. Thus the student may be tempted or even encouraged to prettify tabular output before submitting. But while one is fiddling with margins and headers, it is all too tempting to remove rows or columns that do not suit the student’s purpose. Since the commands used to generate the table are not present, the reader is none the wiser.

Although I have to admit that reproducibility has a lot going for it.
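R Markdown is R-specific, but the underlying principle, that the commands generating a table should travel with the table, is language-neutral. A minimal Python sketch of that workflow, using invented data, where the report table is emitted by code rather than pasted in by hand:

```python
# Reproducibility sketch: the table in the report is generated by code,
# so every row that appears (or is omitted) is traceable to a command.
# The data values are invented; the point is the workflow.

def summary_table(rows):
    """Render a list of (label, value) pairs as a markdown table."""
    lines = ["| Statistic | Value |", "|---|---|"]
    for label, value in rows:
        lines.append(f"| {label} | {value} |")
    return "\n".join(lines)

data = [4.2, 5.1, 3.9, 6.0, 4.8]
stats = [
    ("n", len(data)),
    ("mean", round(sum(data) / len(data), 2)),
    ("min", min(data)),
    ("max", max(data)),
]

report = summary_table(stats)
print(report)
```

Dropping a row here would require editing the code that readers can see, which is exactly the honesty argument the paper makes.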

Can you imagine reproducible analysis from the OMB? Complete with machine-readable data sets? Or for any other agency report? Or, for that matter, for all publications by registered lobbyists? That could be really interesting.

Open Analysis (OA) as a natural successor to Open Data.

That works for me.


PS: More resources:

Create Dynamic R Statistical Reports Using R Markdown

R Markdown

Using R Markdown with RStudio

Writing papers using R Markdown

If journals started requiring R Markdown as a condition for publication, some aspects of research would become more transparent.

Some will say that authors will resist.

Assume Science or Nature has accepted your article on the condition that you use R Markdown.

Honestly, are you really going to say no?

I first saw this in a tweet by Scott Chamberlain.


Saturday, February 22nd, 2014

OpenRFPs: Open RFP Data for All 50 States by Clay Johnson.

From the post:

Tomorrow at CodeAcross we’ll be launching our first community-based project, OpenRFPs. The goal is to liberate the data inside of every state RFP listing website in the country. We hope you’ll find your own state’s RFP site, and contribute a parser.

The Department of Better Technology’s goal is to improve the way government works by making it easier for small, innovative businesses to provide great technology to government. But those businesses can barely make it through the front door when the RFPs themselves are stored in archaic systems, with sloppy user interfaces and disparate data formats, or locked behind paywalls.

I have posted to the announcement suggesting they use UBL. But in any event, mapping the semantics of RFPs to enable wider participation would make an interesting project.
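To make the parser-per-state idea concrete, here is a sketch of the normalization layer such a project would need. The field names and state formats below are invented for illustration, not taken from any actual state RFP site:

```python
# Each state's scraper emits whatever fields its site exposes; a mapping
# layer reconciles them to one shared schema (hypothetical fields).

COMMON_FIELDS = ("rfp_id", "title", "agency", "due_date")

# Per-state mappings from the site's native field names to the schema.
STATE_MAPPINGS = {
    "VA": {"SolicitationNumber": "rfp_id", "Description": "title",
           "IssuingAgency": "agency", "CloseDate": "due_date"},
    "TX": {"bid_no": "rfp_id", "bid_title": "title",
           "dept": "agency", "deadline": "due_date"},
}

def normalize(state, record):
    """Translate one scraped record into the common schema."""
    mapping = STATE_MAPPINGS[state]
    out = {common: record[native] for native, common in mapping.items()}
    out["state"] = state
    return out

va = normalize("VA", {"SolicitationNumber": "VA-2014-17",
                      "Description": "Road salt supply",
                      "IssuingAgency": "VDOT", "CloseDate": "2014-03-15"})
print(va["rfp_id"], va["agency"])
```

The hard part, of course, is not the mapping code but agreeing on what the common fields mean across fifty sites, which is where explicit semantic mappings earn their keep.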

I first saw this in a tweet by Tim O’Reilly.

Fiscal Year 2015 Budget (US) Open Government?

Friday, February 21st, 2014

Fiscal Year 2015 Budget

From the description:

Each year, the Office of Management and Budget (OMB) prepares the President’s proposed Federal Government budget for the upcoming Federal fiscal year, which includes the Administration’s budget priorities and proposed funding.

For Fiscal Year (FY) 2015, which runs from October 1, 2014, through September 30, 2015, OMB has produced the FY 2015 Federal Budget in four print volumes plus an all-in-one CD-ROM:

  1. the main “Budget” document with the Budget Message of the President, information on the President’s priorities and budget overviews by agency, and summary tables;
  2. “Analytical Perspectives” that contains analyses that are designed to highlight specified subject areas;
  3. “Historical Tables” that provides data on budget receipts, outlays, surpluses or deficits, and Federal debt over a time period; and
  4. an “Appendix” with detailed information on individual Federal agency programs and appropriation accounts that constitute the budget.
  5. A CD-ROM version of the Budget is also available which contains all the FY 2015 budget documents in PDF format along with some additional supporting material in spreadsheet format.

You will also want a “Green Book,” the 2014 version carried this description:

Each February when the President releases his proposed Federal Budget for the following year, Treasury releases the General Explanations of the Administration’s Revenue Proposals. Known as the “Green Book” (or Greenbook), the document provides a concise explanation of each of the Administration’s Fiscal Year 2014 tax proposals for raising revenue for the Government. This annual document clearly recaps each proposed change, reviewing the provisions in the Current Law, outlining the Administration’s Reasons for Change to the law, and explaining the Proposal for the new law. Ideal for anyone wanting a clear summary of the Administration’s policies and proposed tax law changes.

Did I mention that the four volumes for the budget in print with CD-ROM are $250? And last year the Green Book was $75?

For $325.00, you can have a print and pdf of the Budget plus a print copy of the Green Book.


  1. Would machine readable versions of the Budget + Green Book make it easier to explore and compare the information within?
  2. Are PDFs and print volumes what President Obama considers to be “open government?”
  3. Who has the advantage in policy debates, the OMB and Treasury with machine readable versions of these documents or the average citizen who has the PDFs and print?
  4. Do you think OMB and Treasury didn’t get the memo? Open Data Policy-Managing Information as an Asset

Public policy debates cannot be fairly conducted without meaningful access to data on public policy issues.
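For contrast, here is a sketch of how little code a budget question takes once the data is machine readable. The figures and CSV layout below are invented stand-ins, not actual OMB numbers:

```python
# With a Historical-Tables-style CSV, year-over-year comparisons become
# a few lines of code instead of retyping numbers from a PDF.

import csv
import io

# Hypothetical outlays by agency, in billions, as a CSV extract.
raw = """agency,fy2014,fy2015
Defense,520.5,495.6
Education,67.3,68.6
Energy,27.9,27.9
"""

rows = list(csv.DictReader(io.StringIO(raw)))
changes = {r["agency"]: round(float(r["fy2015"]) - float(r["fy2014"]), 1)
           for r in rows}
print(changes)
```

That is the asymmetry in question three above: one side of the policy debate can run this in seconds, the other is paging through print volumes.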

Islamic Finance: A Quest for Publically Available Bank-level Data

Wednesday, February 12th, 2014

Islamic Finance: A Quest for Publically Available Bank-level Data by Amin Mohseni-Cheraghlou.

From the post:

Attend a seminar or read a report on Islamic finance and chances are you will come across a figure between $1 trillion and $1.6 trillion, referring to the estimated size of the global Islamic assets. While these aggregate global figures are frequently mentioned, publically available bank-level data have been much harder to come by.

Considering the rapid growth of Islamic finance, its growing popularity in both Muslim and non-Muslim countries, and its emerging role in global financial industry, especially after the recent global financial crisis, it is imperative to have up-to-date and reliable bank-level data on Islamic financial institutions from around the globe.

To date, there is a surprising lack of publically available, consistent and up-to-date data on the size of Islamic assets on a bank-by-bank basis. In fairness, some subscription-based datasets, such as Bureau Van Dijk’s Bankscope, do include annual financial data on some of the world’s leading Islamic financial institutions. Bank-level data are also compiled by The Banker’s Top Islamic Financial Institutions Report and Ernst & Young’s World Islamic Banking Competitiveness Report, but these are not publically available and require subscription premiums, making them difficult for many researchers and experts to access. As a result, data on Islamic financial institutions are associated with some level of opaqueness, creating obstacles and challenges for empirical research on Islamic finance.

The recent opening of the Global Center for Islamic Finance by World Bank Group President Jim Yong Kim may lead to exciting venues and opportunities for standardization, data collection, and empirical research on Islamic finance. In the meantime, the Global Financial Development Report (GFDR) team at the World Bank has also started to take some initial steps towards this end.

I can think of two immediate benefits from publicly available data on Islamic financial institutions:

First, hopefully it will increase demands for meaningful transparency in Western financial institutions.

Second, it will blunt government hand-waving and propaganda about the purposes of Islamic financial institutions, which, like financial institutions everywhere, want to remain solvent, serve the needs of their customers and play active roles in their communities. Nothing more sinister than that.

Perhaps the best way to vanquish suspicion is with transparency. Except for the fringe cases who treat lack of evidence as proof of secret evil doing.

…Desperately Seeking Data Integration

Tuesday, January 21st, 2014

Why the US Government is Desperately Seeking Data Integration by David Linthicum.

From the post:

“When it comes to data, the U.S. federal government is a bit of a glutton. Federal agencies manage on average 209 million records, or approximately 8.4 billion records for the entire federal government, according to Steve O’Keeffe, founder of the government IT network site, MeriTalk.”

Check out these stats, in a December 2013 MeriTalk survey of 100 federal records and information management professionals. Among the findings:

  • Only 18 percent said their agency had made significant progress toward managing records and email in electronic format, and are ready to report.
  • One in five federal records management professionals say they are “completely prepared” to handle the growing volume of government records.
  • 92 percent say their agency “has a lot of work to do to meet the direction.”
  • 46 percent say they do not believe or are unsure about whether the deadlines are realistic and obtainable.
  • Three out of four say the Presidential Directive on Managing Government Records will enable “modern, high-quality records and information management.”

I’ve been working with the US government for years, and I can tell that these facts are pretty accurate. Indeed, the paper glut is killing productivity. Even the way they manage digital data needs a great deal of improvement.

I don’t doubt a word of David’s post. Do you?

What I do doubt is the ability of the government to integrate its data. At least unless and until it makes some fundamental choices about the route it will take to data integration.

First, replacement of existing information systems is a non-goal. Unless that is an a priori assumption, the politics, both on Capitol Hill and internal to any agency, program, etc., will doom a data integration effort before it begins.

The first non-goal means that the ROI of data integration must be high enough to be evident even with current systems in place.

Second, integration of the most difficult cases is not the initial target for any data integration project. It would be offensive to cite all the “boil the ocean” projects that have failed in Washington, D.C. Let’s just agree that judicious picking of high-value and reasonable-effort integration cases is a good proving ground.

Third, the targets of data integration and the costs for meeting those targets, along with expected ROI, will be agreed upon by all parties before any work starts. Avoidance of mission creep is essential to success. Not to mention that public goals and metrics will enable everyone to decide if the goals have been met.

Fourth, employment of traditional vendors, unemployed programmers, geographically dispersed staff, etc., is also a non-goal of the project. With the money that can be saved by robust data integration, departments can feather their staffs as much as they like.

If you need proof of the fourth requirement, consider the various Apache projects that are now the underpinnings for “big data” in its many forms.

It is possible to solve the government’s data integration issues. But not without some hard choices being made up front about the project.

Sorry, forgot one:

Fifth, the project leader should seek a consensus among the relevant parties but ultimately has the authority to make decisions for the project. If every dispute can have one or more parties running to their supervisor or congressional backer, the project is doomed before it starts. The buck stops with the project manager and nowhere else.

Extracting Insights – FBO.Gov

Tuesday, January 21st, 2014

Extracting Insights from FBO.Gov data – Part 1

Extracting Insights from FBO.Gov data – Part 2

Extracting Insights from FBO.Gov data – Part 3

Dave Fauth has written a great three part series on extracting “insights” from large amounts of data.

From the third post in the series:

Earlier this year, the Sunlight Foundation filed a lawsuit under the Freedom of Information Act. The lawsuit requested solicitation and award notices from FBO.gov. In November, Sunlight received over a decade’s worth of information and posted the information on-line for public downloading. I want to say a big thanks to Ginger McCall and Kaitlin Devine for the work that went into making this data available.

In the first part of this series, I looked at the data and munged the data into a workable set. Once I had the data in a workable set, I created some heatmap charts of the data looking at agencies and who they awarded contracts to. In part two of this series, I created some bubble charts looking at awards by Agency and also the most popular Awardees.

In the third part of the series, I am going to look at awards by date and then displaying that information in a calendar view. Then we will look at the types of awards.

For the date analysis, we are going to use all of the data going back to 2000. We have six data files that we will join together, filter on the ‘Notice Type’ field, and then calculate the counts by date for the awards. The goal is to see when awards are being made.
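The join, filter, and count step described above can be sketched in a few lines of Python. The file contents and field names here are invented stand-ins for the FBO.gov extracts:

```python
# Concatenate several CSV extracts, keep only award notices, and count
# awards per date (the 'Notice Type' filter from the quoted workflow).

import csv
import io
from collections import Counter

# Stand-ins for the six FBO.gov data files.
files = [
    "date,notice_type\n2013-01-06,Award\n2013-01-06,Presolicitation\n",
    "date,notice_type\n2013-01-07,Award\n2013-01-06,Award\n",
]

awards_by_date = Counter()
for contents in files:
    for row in csv.DictReader(io.StringIO(contents)):
        if row["notice_type"] == "Award":   # filter on 'Notice Type'
            awards_by_date[row["date"]] += 1

print(awards_by_date.most_common())
```

The counts then feed directly into a calendar or heatmap view, which is where the series goes next.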

The most compelling lesson from this series is that data doesn’t always easily give up its secrets.

If you make it to the end of the series, you will find the government, on occasion, does the right thing. I’ll admit it, I was very surprised. ;-)

Medicare Spending Data…

Sunday, January 19th, 2014

Medicare Spending Data May Be Publicly Available Under New Policy by Gavin Baker.

From the post:

On Jan. 14, the Centers for Medicare & Medicaid Services (CMS) announced a new policy that could bring greater transparency to Medicare, one of the largest programs in the federal government. CMS revoked its long-standing policy not to release publicly any information about Medicare’s payments to doctors. Under the new policy, the agency will evaluate requests for such information on a case-by-case basis. Although the impact of the change is not yet clear, it creates an opportunity for a welcome step forward for data transparency and open government.

Medicare’s tremendous size and impact – expending an estimated $551 billion and covering roughly 50 million beneficiaries in 2012 – mean that increased transparency in the program could have big effects. Better access to Medicare spending data could permit consumers to evaluate doctor quality, allow journalists to identify waste or fraud, and encourage providers to improve health care delivery.

Until now, the public hasn’t been able to learn how much Medicare pays to particular medical businesses. In 1979, a court blocked Medicare from releasing such information after doctors fought to keep it secret. However, the court lifted the injunction in May 2013, freeing CMS to consider whether to release the data.

In turn, CMS asked for public comments about what it should do and received more than 130 responses. The Center for Effective Government was among the organizations that filed comments, calling for more transparency in Medicare spending and urging CMS to revoke its previous policy implementing the injunction. After considering those comments, CMS adopted its new policy.

The change may allow the public to examine the reimbursement amounts paid to medical providers under Medicare. Under the new approach, CMS will not release those records wholesale. Instead, the agency will wait for specific requests for the data and then evaluate each to consider if disclosure would invade personal privacy. While information about patients is clearly off-limits, it’s not clear what kind of information about doctors CMS will consider private, so it remains to be seen how much information is ultimately disclosed under the new policy. It should be noted, however, that the U.S. Supreme Court has held that businesses don’t have “personal privacy” under the Freedom of Information Act (FOIA), and the government already discloses the amounts it pays to other government contractors.

The announcement from CMS: Modified Policy on Freedom of Information Act Disclosure of Amounts Paid to Individual Physicians under the Medicare Program

The case by case determination of a physician’s privacy rights is an attempt to discourage requests for public information.

If all physician payment data, say by procedure, were available in state-by-state data sets, local residents in a town of 500 would know that 2,000 x-rays a year is on the high side, without ever knowing any patient’s identity.
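The x-ray example takes only a few lines to check in practice. A sketch with invented counts, assuming a crude three-times-the-mean screen:

```python
# With per-physician procedure counts in hand, flagging outliers needs
# no patient data at all. All numbers below are invented.

counts = {"Dr. A": 180, "Dr. B": 210, "Dr. C": 2000, "Dr. D": 195}

mean = sum(counts.values()) / len(counts)
# Crude screen: flag anyone billing more than 3x the local mean.
flagged = [name for name, n in counts.items() if n > 3 * mean]
print(mean, flagged)
```

A real screen would adjust for specialty and practice size, but even this toy shows that transparency and patient privacy are not in conflict here.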

If you are a U.S. resident, take this opportunity to push for greater transparency in Medicare spending. Be polite and courteous but also be persistent. You need no more reason than an interest in how Medicare is being spent.

Let’s have an FOIA (Freedom of Information Act) request pending for every physician in the United States within 90 days of the CMS rule becoming final.

It’s not final yet, but when it is, let slip the leash on the dogs of FOIA.

Data Analytic Recidivism Tool (DART) [DAFT?]

Sunday, December 29th, 2013

Data Analytic Recidivism Tool (DART)

From the website:

The Data Analytic Recidivism Tool (DART) helps answer questions about recidivism in New York City.

  • Are people that commit a certain type of crime more likely to be re-arrested?
  • What about people in a certain age group or those with prior convictions?

DART lets users look at recidivism rates for selected groups defined by characteristics of defendants and their cases.

A direct link to the DART homepage.

After looking at the interface, which groups recidivists in groups of 250, I’m not sure DART is all that useful.

It did spark an idea that might help with the federal government’s acquisition problems.

Why not create the equivalent of DART but call it:

Data Analytic Failure Tool (DAFT).

And in DAFT track federal contractors, their principals, contracts, and the program officers who play any role in those contracts.

So that when contractors fail, as so many of them do, it will be easy to track the individuals involved on both sides of the failure.

And every contract will have a preamble that recites any prior history of failure and the people involved in that failure, on all sides.

Such that any subsequent supervisor has to sign off with full knowledge of the prior lack of performance.

If criminal recidivism is to be avoided, shouldn’t failure recidivism be avoided as well?
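A toy sketch of what DAFT’s core lookup might be, with invented contractors, officers and contract numbers:

```python
# Index past contract outcomes by contractor, then generate the "prior
# history" preamble proposed above for a new contract. Records invented.

failures = [
    {"contractor": "Acme Systems", "officer": "J. Smith",
     "contract": "IT-2011-042", "outcome": "failed"},
    {"contractor": "Acme Systems", "officer": "R. Jones",
     "contract": "IT-2012-117", "outcome": "delivered"},
]

def prior_failures(contractor):
    return [f["contract"] for f in failures
            if f["contractor"] == contractor and f["outcome"] == "failed"]

def preamble(contractor):
    prior = prior_failures(contractor)
    if not prior:
        return f"{contractor}: no recorded failures."
    return f"{contractor}: prior failed contracts: {', '.join(prior)}."

print(preamble("Acme Systems"))
```

The data model is trivial; the hard part would be getting agencies to record outcomes honestly in the first place.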

Discover Your Neighborhood with Census Explorer

Wednesday, December 25th, 2013

Discover Your Neighborhood with Census Explorer by Michael Ratcliffe.

From the post:

Our customers often want to explore neighborhood-level statistics and see how their communities have changed over time. Our new Census Explorer interactive mapping tool makes this easier than ever. It provides statistics on a variety of topics, such as percent of people who are foreign-born, educational attainment and homeownership rate. Statistics from the 2008 to 2012 American Community Survey power Census Explorer.

While you may be familiar with other ways to find neighborhood-level statistics, Census Explorer provides an interactive map for states, counties and census tracts. You can even look at how these neighborhoods have changed over time because the tool includes information from the 1990 and 2000 censuses in addition to the latest American Community Survey statistics. Seeing these changes is possible because the annual American Community Survey replaced the decennial census long form, giving communities throughout the nation more timely information than just once every 10 years.

Topics currently available in Census Explorer:

  • Total population
  • Percent 65 and older
  • Foreign-born population percentage
  • Percent of the population with a high school degree or higher
  • Percent with a bachelor’s degree or higher
  • Labor force participation rate
  • Home ownership rate
  • Median household income

Fairly coarse (census tract level) data but should be useful for any number of planning purposes.

For example, you could cross this data with traffic ticket and arrest data to derive “police presence” statistics.

Or add “citizen watcher” data from tweets about police car # and locations.

Different data sets often use different boundaries for areas.

Consider creating topic map based filters so when the boundaries change (a favorite activity of local governments) so will your summaries of that data.
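One common way to implement such a filter is a crosswalk table that apportions old areas into new ones. A sketch with invented tract IDs and population shares:

```python
# When a tract is split, a crosswalk lets old-boundary statistics be
# re-aggregated against the new geography. IDs and weights invented.

# Each old tract maps to (new_tract, share_of_population) pairs.
crosswalk = {
    "101": [("101.01", 0.6), ("101.02", 0.4)],
    "102": [("102", 1.0)],
}

old_population = {"101": 5000, "102": 3200}

new_population = {}
for old_id, parts in crosswalk.items():
    for new_id, share in parts:
        new_population[new_id] = (new_population.get(new_id, 0)
                                  + old_population[old_id] * share)

print(new_population)
```

A topic map would additionally record *why* each old tract maps to its successors, so the crosswalk survives the next boundary change too.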

…2013 World Ocean Database…

Sunday, December 22nd, 2013

NOAA releases 2013 World Ocean Database: The largest collection of scientific information about the oceans

From the post:

NOAA has released the 2013 World Ocean Database, the largest, most comprehensive collection of scientific information about the oceans, with records dating as far back as 1772. The 2013 database updates the 2009 version and contains nearly 13 million temperature profiles, compared with 9.1 million in the 2009 database, and just fewer than six million salinity measurements, compared with 3.5 million in the previous database. It integrates ocean profile data from approximately 90 countries around the world, collected from buoys, ships, gliders, and other instruments used to measure the “pulse” of the ocean.

Profile data of the ocean are measurements taken at many depths, from the surface to the floor, at a single location, during the time it takes to lower and raise the measuring instruments through the water. “This product is a powerful tool being used by scientists around the globe to study how changes in the ocean can impact weather and climate,” said Tim Boyer, an oceanographer with NOAA’s National Oceanographic Data Center.

In addition to using the vast amount of temperature and salinity measurements to monitor changes in heat and salt content, the database captures other measurements, including: oxygen, nutrients, chlorofluorocarbons and chlorophyll, which all reveal the oceans’ biological structure.

For the details on this dataset see: WOD Introduction.

The introduction notes under 1.1.5 Data Fusion:

It is not uncommon in oceanography that measurements of different variables made from the same sea water samples are often maintained as separate databases by different principal investigators. In fact, data from the same oceanographic cast may be located at different institutions in different countries. From its inception, NODC recognized the importance of building oceanographic databases in which as much data from each station and each cruise as possible are placed into standard formats, accompanied by appropriate metadata that make the data useful to future generations of scientists. It was the existence of such databases that allowed the International Indian Ocean Expedition Atlas (Wyrtki, 1971) and Climatological Atlas of the World Ocean (Levitus, 1982) to be produced without the time-consuming, laborious task of gathering data from many different sources. Part of the development of WOD13 has been to expand this data fusion activity by increasing the number of variables that NODC/WDC makes available as part of standardized databases.

As the NODC (National Oceanographic Data Center) demonstrates, it is possible to curate data sources in order to present a uniform data collection.

But a curated data set remains inconsistent with data sets not curated by the same authority.

And combining curated data with non-curated data requires the curation effort all over again.

Hard to map towards a destination without knowing its location.

Topic maps can capture the basis for curation, which will enable faster and more accurate integration of foreign data sets in the future.
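A sketch of what “capturing the basis for curation” can mean in code: an explicit, reusable mapping from a foreign dataset’s variables and units into the curated schema. The formats and field names here are invented, not NODC’s:

```python
# Instead of silently renaming a foreign dataset's variables, keep an
# explicit mapping (with unit conversions) that later integrations of
# other foreign data sets can reuse.

CURATED_SCHEMA = {"temperature_c": "degrees Celsius",
                  "salinity_psu": "practical salinity units"}

# Documented mapping from a hypothetical foreign cast format.
FOREIGN_MAPPING = {
    "temp_f": ("temperature_c", lambda v: round((v - 32) * 5 / 9, 2)),
    "sal": ("salinity_psu", lambda v: v),
}

def curate(record):
    """Apply the recorded mapping to one foreign record."""
    out = {}
    for foreign, (curated, convert) in FOREIGN_MAPPING.items():
        if foreign in record:
            out[curated] = convert(record[foreign])
    return out

cast = curate({"temp_f": 68.0, "sal": 35.1})
print(cast)
```

Because the mapping is data rather than one-off script logic, the basis for each curation decision is preserved and can be merged with other mappings later.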

UNESCO Open Access Publications [Update]

Thursday, December 19th, 2013

UNESCO Open Access Publications

From the webpage:

Building peaceful, democratic and inclusive knowledge societies across the world is at the heart of UNESCO’s mandate. Universal access to information is one of the fundamental conditions to achieve global knowledge societies. This condition is not a reality in all regions of the world.

In order to help reduce the gap between industrialized countries and those in the emerging economy, UNESCO has decided to adopt an Open Access Policy for its publications by making use of a new dimension of knowledge sharing – Open Access.

Open Access means free access to scientific information and unrestricted use of electronic data for everyone. With Open Access, expensive prices and copyrights will no longer be obstacles to the dissemination of knowledge. Everyone is free to add information, modify contents, translate texts into other languages, and disseminate an entire electronic publication.

For UNESCO, adopting an Open Access Policy means to make thousands of its publications freely available to the public. Furthermore, Open Access is also a way to provide the public with an insight into the work of the Organization so that everyone is able to discover and share what UNESCO is doing.

You can access and use our resources for free by clicking here.

In May of 2013 UNESCO announced its Open Access policy.

Many organizations profess a belief in “Open Access.”

The real test is whether they practice “Open Access.”


Thursday, December 19th, 2013


I don’t know enough about the Brazilian economy to say if the visualizations are helpful or not.

What I can tell you is the visualizations are impressive!

Thoughts on the site as an interface to open data?

PS: This appears to be a government-supported website, so not all government-sponsored websites are poor performers.

Aberdeen – 1398 to Present

Sunday, December 15th, 2013

A Text Analytic Approach to Rural and Urban Legal Histories

From the post:

Aberdeen has the earliest and most complete body of surviving records of any Scottish town, running in near-unbroken sequence from 1398 to the present day. Our central focus is on the ‘provincial town’, especially its articulations and interactions with surrounding rural communities, infrastructure and natural resources. In this multi-disciplinary project, we apply text analytical tools to digitised Aberdeen Burgh Records, which are a UNESCO listed cultural artifact. The meaningful content of the Records is linguistically obscured, so must be interpreted. Moreover, to extract and reuse the content with Semantic Web and linked data technologies, it must be machine readable and richly annotated. To accomplish this, we develop a text analytic tool that specifically relates to the language, content, and structure of the Records. The result is an accessible, flexible, and essential precursor to the development of Semantic Web and linked data applications related to the Records. The applications will exploit the artifact to promote Aberdeen Burgh and Shire cultural tourism, curriculum development, and scholarship.

The scholarly objective of this project is to develop the analytic framework, methods, and resource materials to apply a text analytic tool to annotate and access the content of the Burgh records. Amongst the text analytic issues to address in historical perspective are: the identification and analysis of legal entities, events, and roles; and the analysis of legal argumentation and reasoning. Amongst the legal historical issues are: the political and legal culture and authority in the Burgh and Shire, particularly pertaining to the management and use of natural resources. Having an understanding of these issues and being able to access them using Semantic Web/linked data technologies will then facilitate exploitation in applications.

This project complements a distinct, existing collaboration between the Aberdeen City & Aberdeenshire Archives (ACAA) and the University (Connecting and Projecting Aberdeen’s Burgh Records, jointly led by Andrew Mackillop and Jackson Armstrong) (the RIISS Project), which will both make a contribution to the project (see details on application form). This multi-disciplinary application seeks funding from Dot.Rural chiefly for the time of two specialist researchers: a Research Fellow to interpret the multiple languages, handwriting scripts, archaic conventions, and conceptual categories emerging from these records; and subcontracting the A-I to carry out the text analytic and linked data tasks on a given corpus of previously transcribed council records, taking the RF’s interpretation as input.

Now there’s a project for tracking changing semantics over the hills and valleys of time!

Will be interesting to see how they capture semantics that are alien to our own.

Or how they preserve relationships between ancient semantic concepts.
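As a first approximation of the annotation task, here is a sketch that tags years and role terms in a transcribed line. The sample text and term list are invented, and real burgh records, with their multiple languages and archaic conventions, would demand far more linguistic care:

```python
# Toy annotator: find plausible years (1300-1999) and known legal role
# terms in a transcribed record line. Sample line and terms invented.

import re

ROLE_TERMS = ["baillie", "burgess", "provost", "alderman"]

def annotate(text):
    years = re.findall(r"\b1[3-9]\d\d\b", text)
    roles = [t for t in ROLE_TERMS if t in text.lower()]
    return {"years": years, "roles": roles}

line = "In 1442 the provost and baillies ordained the common mills..."
print(annotate(line))
```

Even this crude pass shows why a specialist Research Fellow is in the budget: spelling variants alone would defeat a fixed term list.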

Requesting Datasets from the Federal Government

Friday, December 13th, 2013

Requesting Datasets from the Federal Government by Eruditio Loginquitas.

From the post:

Much has been made of “open government” of late, with the U.S.’s federal government releasing tens of thousands of data sets from pretty much all public-facing offices. Many of these sets are available off of their respective websites. Many are offered in a centralized way at Data.gov. I finally spent some time on this site in search of datasets with location data to continue my learning of Tableau Public (with an eventual planned move to ArcMap).

I’ve been appreciating how much data are required to govern effectively but also how much data are created in the work of governance, particularly in an open and transparent society. There are literally billions of records and metrics required to run an efficient modern government. In a democracy, the tendency is to make information available—through sunshine laws and open meetings laws and data requests. The openness is particularly pronounced in cases of citizen participation, academic research, and journalistic requests. These are all aspects of a healthy interchange between citizens and their government…and further, digital government.

Public Requests for Data

One of the more charming aspects of the site involves a public thread which enables people to make requests for the creation of certain data sets by developers. People would make the case for the need for certain information. Some would offer “trades” by making promises about how they would use the data and what they would make available to the larger public. Others would simply make a request for the data. Still others would just post “requests,” which were actually just political or personal statements.

What datasets would you like to see?

The rejected requests can be interesting, for example:

Properties Owned by Congressional Members Rejected

Congressional voting records Rejected

I don’t think the government has detailed information sufficient to answer the one about property owned by members of Congress.

On the other hand, there are only 535 members, so manual data mining in each state should turn up most of the public information fairly easily. The non-public information could be more difficult.

The voting records request is puzzling since that is public record. And various rant groups publish their own analyses of voting records.

I don’t know, given the number of requests “Under Review,” if it would be a good use of time, but requesting the data behind opaque reports might illuminate the areas being hidden from transparency.

Scout [NLP, Move up from Twitter Feeds to Court Opinions]

Tuesday, December 3rd, 2013


From the about page:

Scout is a free service that provides daily insight to how our laws and regulations are shaped in Washington, DC and our state capitols.

These days, you can receive electronic alerts to know when a company is in the news, when a TV show is scheduled to air or when a sports team wins. Now, you can also be alerted when our elected officials take action on an issue you care about.

Scout allows anyone to subscribe to customized email or text alerts on what Congress is doing around an issue or a specific bill, as well as bills in the state legislature and federal regulations. You can also add external RSS feeds to complement a Scout subscription, such as press releases from a member of Congress or an issue-based blog.

Anyone can create a collection of Scout alerts around a topic, for personal organization or to make it easy for others to easily follow a whole topic at once.

Researchers can use Scout to see when Congress talks about an issue over time. Members of the media can use Scout to track when legislation important to their beat moves ahead in Congress or in state houses. Non-profits can use Scout as a tool to keep tabs on how federal and state lawmakers are making policy around a specific issue.

Early testing of Scout during its open beta phase alerted Sunlight and allies in time to successfully stop an overly broad exemption to the Freedom of Information Act from being applied to legislation that was moving quickly in Congress. Read more about that here.

Thank you to the Stanton Foundation, who contributed generous support to Scout’s development.

What kind of alerts?

If your manager suggests a Twitter feed to test NLP, classification, sentiment, etc. code, ask to use the Federal Court (U.S.) Opinion Feed instead.

Not all data is written in one hundred and forty (140) character chunks. ;-)
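A quick stdlib-only sketch shows why the difference matters. The two snippets below are hypothetical, not drawn from the Scout feed: a naive sentence splitter that looks passable on short social-media text trips over docket-number abbreviations, and opinion-length prose stresses it further.

```python
import re

def text_stats(text):
    """Rough corpus statistics: token count, average token length,
    and sentence count via a naive period/question/exclamation split."""
    tokens = re.findall(r"\w+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "tokens": len(tokens),
        "avg_token_len": sum(len(t) for t in tokens) / len(tokens),
        "sentences": len(sentences),
    }

# Hypothetical examples of the two registers.
tweet = "SCOTUS denies cert in No. 12-345. Big win for petitioners!"
opinion = (
    "The petition for a writ of certiorari is denied. Justice Sotomayor, "
    "with whom Justice Ginsburg joins, dissenting from the denial of "
    "certiorari, would hold that the Court of Appeals erred in affirming "
    "the judgment below."
)

for name, text in [("tweet", tweet), ("opinion excerpt", opinion)]:
    print(name, text_stats(text))
```

Note the splitter already miscounts the tweet (it breaks "No. 12-345" into pieces), and legal prose multiplies such abbreviation and citation hazards, which is exactly why court opinions make a more honest test corpus than tweets.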

PS: Be sure to support/promote the Sunlight Foundation for making this data available.

Casualty Count for Obamacare (0)

Wednesday, November 20th, 2013

5 lessons IT leaders can learn from Obamacare rollout mistakes by Teena Hammond.

Teena reports on five lessons to be learned from the rollout:

  1. If you’re going to launch a new website, decide whether to use in-house talent or outsource. If you opt to outsource, hire a good contractor.
  2. Follow the right steps to hire the best vendor for the project, and properly manage the relationship.
  3. Have one person in charge of the project with absolute veto power.
  4. Do not gloss over any problems along the way. Be open and honest about the progress of the project. And test the site.
  5. Be ready for success or failure. Hope for the best but prepare for the worst and have guidelines to manage any potential failure.

There is a sixth lesson that emerges from Vaughn Bullard, CEO and founder of Build.Automate Inc., who is quoted in part saying:

“The contractor telling the government that it was ready despite the obvious major flaws in the system is just baffling to me. If I had an employee that did something similar, I would have terminated their employment. It’s pretty simple.”

What it comes down to in the end, Bullard said, is that, “Quality and integrity count in all things.”

To avoid repeated failures in the future (sixth lesson), terminate those responsible for the current failure.

All contractors and their staffs. Track the staffs in order to avoid the same staff moving to other contractors.

Terminate all appointed or hired staff who were responsible for the contract and/or management of the project.

Track former staff employment by contractors and refuse contracts wherever they are employed.

You may have noticed that the reported casualty count for the Obamacare failure has been zero.

What incentive exists for the next group of contract/project managers and/or contractors for “quality and integrity?”

That would be the same as the casualty count, zero.

PS: Before you protest the termination and ban of failures as cruel, consider its advantages as a wealth redistribution program.

The government may not get better service but it will provide opportunities for fraud and poor quality work from new participants.

Not to mention there are IT service providers who exhibit quality and integrity. Absent traditional mis-management, the government could happen upon one of those.

The tip for semantic technologies is to under-promise and over-deliver. Always.

Free Access to EU Satellite Data

Thursday, November 14th, 2013

Free Access to EU Satellite Data (Press Release, Brussels, 13 November 2013).

From the release:

The European Commission will provide free, full and open access to a wealth of important environmental data gathered by Copernicus, Europe’s Earth observation system. The new open data dissemination regime, which will come into effect next month, will support the vital task of monitoring the environment and will also help Europe’s enterprises, creating new jobs and business opportunities. Sectors positively stimulated by Copernicus are likely to be services for environmental data production and dissemination, as well as space manufacturing. Indirectly, a variety of other economic segments will see the advantages of accurate earth observation, such as transport, oil and gas, insurance and agriculture. Studies show that Copernicus – which includes six dedicated satellite missions, the so-called Sentinels, to be launched between 2014 and 2021 – could generate a financial benefit of some € 30 billion and create around 50.000 jobs by 2030. Moreover, the new open data dissemination regime will help citizens, businesses, researchers and policy makers to integrate an environmental dimension into all their activities and decision making procedures.

To make maximum use of this wealth of information, researchers, citizens and businesses will be able to access Copernicus data and information through dedicated Internet-based portals. This free access will support the development of useful applications for a number of different industry segments (e.g. agriculture, insurance, transport, and energy). Other examples include precision agriculture or the use of data for risk modelling in the insurance industry. It will fulfil a crucial role, meeting societal, political and economic needs for the sustainable delivery of accurate environmental data.

More information is available on the Copernicus web site.

The “€ 30 billion” financial benefit seems a bit soft after looking at the study reports on the economic value of Copernicus.

For example, if Copernicus is used to monitor illegal dumping (D. Drimaco, Waste monitoring service to improve waste management practices and detect illegal landfills), how is a financial benefit calculated for illegal dumping prevented?

If you are the Office of Management and Budget (U.S.), you could simply make up the numbers and report them in near indecipherable documents. (Free Sequester Data Here!)

I don’t doubt there will be economic benefits from Copernicus but questions remain: how much and for whom?

I first saw this in a tweet by Stefano Bertolo.

Implementations of Data Catalog Vocabulary

Tuesday, November 5th, 2013

Implementations of Data Catalog Vocabulary

From the post:

The Government Linked Data (GLD) Working Group today published the Data Catalog Vocabulary (DCAT) as a Candidate Recommendation. DCAT allows governmental and non-governmental data catalogs to publish their entries in a standard machine-readable format so they can be managed, aggregated, and presented in other catalogs.

Originally developed at DERI, DCAT has evolved with input from a variety of stakeholders and is now stable and ready for widespread use. If you have a collection of data sources, please consider publishing DCAT metadata for it, and if you run a data catalog or portal, please consider making use of DCAT metadata you find. The Working Group is eager to receive comments and reports of use, and is maintaining an Implementation Report.
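For a sense of what a DCAT catalog entry looks like, here is a minimal sketch that renders one as Turtle via string templating. The dataset URI, title, and publisher below are made up for illustration, and a real catalog would typically use a resource URI (e.g. a foaf:Agent) for the publisher rather than a bare literal.

```python
# Minimal sketch: render a catalog entry as DCAT Turtle via string templating.
TEMPLATE = """\
@prefix dcat: <> .
@prefix dct:  <> .

<{uri}> a dcat:Dataset ;
    dct:title "{title}" ;
    dct:publisher "{publisher}" ;
    dcat:keyword {keywords} .
"""

def dcat_entry(uri, title, publisher, keywords):
    """Fill the template; keywords become a comma-separated literal list."""
    kw = ", ".join(f'"{k}"' for k in keywords)
    return TEMPLATE.format(uri=uri, title=title, publisher=publisher, keywords=kw)

print(dcat_entry(
    "",
    "Quarterly Budget Outlays",
    "Example Agency",
    ["budget", "spending"],
))
```

The point of the standard is that any harvester knowing the dcat: and dct: vocabularies can aggregate entries like this across catalogs without per-catalog scrapers.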

If you know anyone in the United States government, please suggest this to them.

The more time the U.S. government spends on innocuous data, the less time it has to spy on its citizens and the citizens and governments of other countries.

I say innocuous data because I have yet to see any government release information that would discredit the current regime.

Wasn’t true for the Pentagon Papers, the Watergate tapes or the Snowden releases.

Can you think of any voluntary release of data by any government that discredited a current regime?

The reason for secrecy isn’t to protect techniques or sources.

Guess whose incompetence would be exposed by transparency?

Open Data Index

Monday, November 4th, 2013

Open Data Index by Armin Grossenbacher.

From the post:

There are lots of indexes.

The most famous one may be the Index Librorum Prohibitorum, listing books prohibited by the Catholic church. It contained eminent scientists and intellectuals (see the list in Wikipedia) and was abolished only in 1966, after more than 400 years.

Open Data Index

One index everybody would like to be registered in, and with a high rank, is the Open Data Index.

‘An increasing number of governments have committed to open up data, but how much key information is actually being released? …. Which countries are the most advanced and which are lagging in relation to open data? The Open Data Index has been developed to help answer such questions by collecting and presenting information on the state of open data around the world – to ignite discussions between citizens and governments.’

I haven’t seen the movie review guide that appeared in Our Sunday Visitor in years but when I was in high school it was the best movie guide around. Just pick the ones rated as morally condemned. ;-)

There are two criteria I don’t see mentioned for rating open data:

  1. How easy/hard is it to integrate a particular data set with other data from the same source or organization?
  2. Is the data supportive, neutral or negative with regard to established government policies?

Do you know of any open data sets where those questions are used to rate them?

the /unitedstates project

Tuesday, October 29th, 2013

the /unitedstates project

From the webpage:

/unitedstates is a shared commons of data and tools for the United States. Made by the public, used by the public.

There you will find:

bill-nicknames Tiny spreadsheet of common nicknames for bills and laws.

citation Stand-alone legal citation detector. Text in, citations out.

congress-legislators Detailed data on members of Congress, past and present.

congress Scrapers and parsers for the work of Congress, all day, every day.

glossary A public domain glossary for the United States.

licensing Policy guidelines for the licensing of US government information.

uscode Parser for the US Code.

wish-list Post ideas for new projects.

Can you guess what the #1 wish on the project list is?

Campaign finance donor de-duplicator
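That wish is a classic record-linkage problem: the same donor appears under "Smith, John A." and "John Smith" with inconsistently capitalized employers. A minimal blocking-key sketch follows (the records are hypothetical and stdlib-only; a real de-duplicator would add fuzzy string comparison, employer normalization, and address matching):

```python
import re
from collections import defaultdict

def donor_key(name, employer=""):
    """Blocking key for donor records: lowercase, strip punctuation,
    drop single-letter middle initials, sort name parts, append employer."""
    name = re.sub(r"[^a-z ]", "", name.lower())
    parts = [p for p in name.split() if len(p) > 1]  # drop initials
    return (" ".join(sorted(parts)), employer.lower().strip())

# Hypothetical donor records in the inconsistent styles filings use.
records = [
    {"name": "Smith, John A.", "employer": "Acme Corp"},
    {"name": "John Smith", "employer": "ACME CORP"},
    {"name": "Jane Doe", "employer": "Acme Corp"},
]

clusters = defaultdict(list)
for r in records:
    clusters[donor_key(r["name"], r["employer"])].append(r["name"])

for key, names in clusters.items():
    print(key, "->", names)
```

Here the two John Smith variants collapse into one cluster while Jane Doe stays separate; blocking keys like this are usually a first pass that narrows the candidate pairs a more expensive matcher then scores.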

Semantics and Delivery of Useful Information [Bills Before the U.S. House]

Monday, October 21st, 2013

Lars Marius Garshol pointed out in Semantic Web adoption and the users the question of “What do semantic technologies do better than non-semantic technologies?” has yet to be answered.

Tim O’Reilly tweeted about Madison Federal today, a resource that raises the semantic versus non-semantic technology question.

In a nutshell, Madison Federal has all the bills pending before the U.S. House of Representatives online.

If you login with Facebook, you can:

  • Add a bill edit / comment
  • Enter a community suggestion
  • Enter a community comment
  • Subscribe to future edits/comments on a bill

So far, so good.

You can pick any bill but the one I chose as an example is: Postal Executive Accountability Act.

I will quote just a few lines of the bill:

2. Limits on executive pay

    (a) Limitation on compensation Section 1003 of title 39, United States Code, 
         is amended:

         (1) in subsection (a), by striking the last sentence; and
         (2) by adding at the end the following:

                  (1) Subject to paragraph (2), an officer or employee of the Postal 
                      Service may not be paid at a rate of basic pay that exceeds 
                      the rate of basic pay for level II of the Executive Schedule 
                      under section 5312 of title 5.

What would be the first thing you want to know?

Hmmm, what about subsection (a) of section 1003 of title 39 of the United States Code, since we are striking its last sentence?

39 USC § 1003 – Employment policy [Legal Information Institute], which reads:

(a) Except as provided under chapters 2 and 12 of this title, section 8G of the Inspector General Act of 1978, or other provision of law, the Postal Service shall classify and fix the compensation and benefits of all officers and employees in the Postal Service. It shall be the policy of the Postal Service to maintain compensation and benefits for all officers and employees on a standard of comparability to the compensation and benefits paid for comparable levels of work in the private sector of the economy. No officer or employee shall be paid compensation at a rate in excess of the rate for level I of the Executive Schedule under section 5312 of title 5.

OK, so now we know that (1) is striking:

No officer or employee shall be paid compensation at a rate in excess of the rate for level I of the Executive Schedule under section 5312 of title 5.

Semantics? No, just a hyperlink.

For the added text, we want to know what is meant by:

… rate of basic pay that exceeds the rate of basic pay for level II of the Executive Schedule under section 5312 of title 5.

The Legal Information Institute is already ahead of Congress because their system provides the hyperlink we need: 5312 of title 5.

If you notice something amiss when you follow that link, congratulations! You have discovered your first congressional typo and/or error.

5312 of title 5 defines Schedule I of the Executive Schedule, which includes the Secretary of State, Secretary of the Treasury, Secretary of Defense, Attorney General and others. Base rate for Executive Schedule Level I is $199,700.

On the other hand, 5313 of title 5 defines Schedule II of the Executive Schedule, which includes Department of Agriculture, Deputy Secretary of Agriculture; Department of Defense, Deputy Secretary of Defense, Secretary of the Army, Secretary of the Navy, Secretary of the Air Force, Under Secretary of Defense for Acquisition, Technology and Logistics; Department of Education, Deputy Secretary of Education; Department of Energy, Deputy Secretary of Energy and others. Base rate for Executive Schedule Level II is $178,700.

Assuming someone catches or comments that 5312 should be 5313, top earners at the Postal Service may be about to take a $21,000.00 pay reduction.

We got all that from mechanical hyperlinks, no semantic technology required.

Where you might need semantic technology is when reading 39 USC § 1003 – Employment policy [Legal Information Institute] where it says (in part):

…It shall be the policy of the Postal Service to maintain compensation and benefits for all officers and employees on a standard of comparability to the compensation and benefits paid for comparable levels of work in the private sector of the economy….

Some questions:

Question: What are “comparable levels of work in the private sector of the economy?”

Question: On what basis is work for the Postal Service compared to work in the private economy?

Question: Examples of comparable jobs in the private economy and their compensation?

Question: What policy or guideline documents have been developed by the Postal Service for evaluation of Postal Service vs. work in the private economy?

Question: What studies have been done, by who, using what practices, on comparing compensation for Postal Service work to work in the private economy?

That would be a considerable amount of information with what I suspect would be a large amount of duplication as reports or studies are cited by numerous sources.

Semantic technology would be necessary for the purpose of deduping and navigating such a body of information effectively.

Pick a bill. Where would you put the divide between mechanical hyperlinks and semantic technologies?

PS: You may remember that the House of Representatives had their own “post office” which they ran as a slush fund. The thought of the House holding someone “accountable” is too bizarre for words.

G8 countries must work harder to open up essential data

Monday, June 17th, 2013

G8 countries must work harder to open up essential data by Rufus Pollock.

From the post:

Open data and transparency will be one of the three main topics at the G8 Summit in Northern Ireland next week. Today transparency campaigners released preview results from the global Open Data Census showing that G8 countries still have a long way to go in releasing essential information as open data.

The Open Data Census is run by the Open Knowledge Foundation, with the help of a network of local data experts around the globe. It measures the openness of data in ten key areas including those essential for transparency and accountability (such as election results and government spending data), and those vital for providing critical services to citizens (such as maps and transport timetables). Full results for the 2013 Open Data Census will be released later this year.

[Graphic: Open Data Census preview results for G8 countries]

The preview results show that while both the UK and the US (who top the table of G8 countries) have made significant progress towards opening up key datasets, both countries still have work to do. Postcode data, which is required for almost all location-based applications and services, remains a major issue for all G8 countries except Germany. No G8 country scored the top mark for company registry data. Russia is the only G8 country not to have published any of the information included in the census as open data. The full results for G8 countries are online.

Apologies for the graphic, it is too small to read. See the original post for a more legible version.

The U.S. came in first with a score of 54 out of a possible 60.

I assume this evaluation was done prior to the revelation of the NSA data snooping?

The U.S. government has massive collections of data that not only aren’t visible; their very existence is denied.

How is that for government transparency?

The most disappointing part is that the other major players (China, Russia, take your pick) have largely the same secret data as the United States. Probably not full sets of the day-to-day memos, but the data that really counts they all have.

So, who is it they are keeping information from?

Ah, that would be their citizens.

Who might not approve of their privileges, goals, tactics, and favoritism.

For example, despite the U.S. government’s disapproval/criticism of many other countries (or rather their governments), I can’t think of any reason for me to dislike unknown citizens of another country.

Whatever goals the U.S. government is pursuing in disadvantaging citizens of another country, it’s not on my behalf.

If the public knew who was benefiting from U.S. policy, perhaps new officials would change those policies.

But that isn’t the goal of the specter of government transparency that the United States leads.

The Banality of ‘Don’t Be Evil’

Monday, June 3rd, 2013

The Banality of ‘Don’t Be Evil’ by Julian Assange.

From the post:

“THE New Digital Age” is a startlingly clear and provocative blueprint for technocratic imperialism, from two of its leading witch doctors, Eric Schmidt and Jared Cohen, who construct a new idiom for United States global power in the 21st century. This idiom reflects the ever closer union between the State Department and Silicon Valley, as personified by Mr. Schmidt, the executive chairman of Google, and Mr. Cohen, a former adviser to Condoleezza Rice and Hillary Clinton who is now director of Google Ideas.

The authors met in occupied Baghdad in 2009, when the book was conceived. Strolling among the ruins, the two became excited that consumer technology was transforming a society flattened by United States military occupation. They decided the tech industry could be a powerful agent of American foreign policy.

The book proselytizes the role of technology in reshaping the world’s people and nations into likenesses of the world’s dominant superpower, whether they want to be reshaped or not. The prose is terse, the argument confident and the wisdom — banal. But this isn’t a book designed to be read. It is a major declaration designed to foster alliances.

“The New Digital Age” is, beyond anything else, an attempt by Google to position itself as America’s geopolitical visionary — the one company that can answer the question “Where should America go?” It is not surprising that a respectable cast of the world’s most famous warmongers has been trotted out to give its stamp of approval to this enticement to Western soft power. The acknowledgments give pride of place to Henry Kissinger, who along with Tony Blair and the former C.I.A. director Michael Hayden provided advance praise for the book.

In the book the authors happily take up the white geek’s burden. A liberal sprinkling of convenient, hypothetical dark-skinned worthies appear: Congolese fisherwomen, graphic designers in Botswana, anticorruption activists in San Salvador and illiterate Masai cattle herders in the Serengeti are all obediently summoned to demonstrate the progressive properties of Google phones jacked into the informational supply chain of the Western empire.


I am less concerned with privacy and more concerned with the impact of technological imperialism.

I see no good coming from the infliction of Western TV and movies on other cultures.

Or in making local farmers part of the global agriculture market.

Or infecting Iraq with sterile wheat seeds.

Compared to those results, privacy is a luxury of the bourgeois who worry about such issues.

I first saw this at Chris Blattman’s Links I liked.

CIA, Solicitation and Government Transparency

Monday, June 3rd, 2013

IBM battles Amazon over $600M CIA cloud deal by Frank Konkel, reports that IBM has protested a contract award for cloud computing by the CIA to Amazon.

The “new age” of government transparency looks a lot like the old age in that:

  • How Amazon obtained the award is not public.
  • The nature of the cloud to be built by Amazon is not public.
  • Whether Amazon has started construction on the proposed cloud is not public.
  • The basis for the protest by IBM is not public.

“Not public” means opportunities for incompetence in contract drafting and/or fraud by contractors.

How are members of the public or less well-heeled potential bidders supposed to participate in this discussion?

Or should I say “meaningfully participate” in the discussion over the cloud computing award to Amazon?

And what if others know the terms of the contract? CIA CTO Gus Hunt is reported as saying:

It is very nearly within our grasp to be able to compute on all human generated information,

If the proposed system is supposed to “compute on all human generated information,” so what?

How does knowing that aid any alleged enemies of the United States?

Other than the comfort that the U.S. makes bad technology decisions?

Keeping the content of such a system secret might disadvantage enemies of the U.S.

Keeping the contract for such a system secret disadvantages the public and other contractors.


White House Releases New Tools… [Bank Robber’s Defense]

Sunday, June 2nd, 2013

White House Releases New Tools For Digital Strategy Anniversary by Caitlin Fairchild.

From the post:

The White House marked the one-year anniversary of its digital government strategy Thursday with a slate of new releases, including a catalog of government APIs, a toolkit for developing government mobile apps and a new framework for ensuring the security of government mobile devices.

Those releases correspond with three main goals for the digital strategy: make more information available to the public; serve customers better; and improve the security of federal computing.

Just scanning down the API list, it is a very mixed bag.

For example, there are four hundred and ten (410) individual APIs listed, the National Library of Medicine has twenty-four (24) and the U.S. Senate has one (1).

Defenders of this release will say we should not talk about the lack of prior efforts but focus on what’s coming.

I call that the bank robber’s defense.

All prosecutors want to talk about is what a bank robber did in the past. They never want to focus on the future.

Bank robbers would love to have the “let’s talk about tomorrow” defense.

As far as I know, it isn’t allowed anywhere.

Question: Why do we allow avoidance of responsibility with the “let’s talk about tomorrow” defense for government and others?

If you review the APIs for semantic diversity I would appreciate a pointer to your paper/post.


Cato’s “Deepbills” Project Advances Government Transparency

Saturday, May 25th, 2013

Cato’s “Deepbills” Project Advances Government Transparency by Jim Harper.

From the post:

But there’s no sense in sitting around waiting for things to improve. Given the incentives, transparency is something that we will have to force on government. We won’t receive it like a gift.

So with software we acquired and modified for the purpose, we’ve been adding data to the bills in Congress, making it possible to learn automatically more of what they do. The bills published by the Government Printing Office have data about who introduced them and the committees to which they were referred. We are adding data that reflects:

- What agencies and bureaus the bills in Congress affect;

- What laws the bills in Congress affect: by popular name, U.S. Code section, Statutes at Large citation, and more;

- What budget authorities bills include, the amount of this proposed spending, its purpose, and the fiscal year(s).

We are capturing proposed new bureaus and programs, proposed new sections of existing law, and other subtleties in legislation. Our “Deepbills” project is documented online.

This data can tell a more complete story of what is happening in Congress. Given the right Web site, app, or information service, you will be able to tell who proposed to spend your taxpayer dollars and in what amounts. You’ll be able to tell how your member of Congress and senators voted on each one. You might even find out about votes you care about before they happen!

Two important points:

First, transparency must be forced upon government (I would add businesses).

Second, transparency is up to us.

Do you know something the rest of us should know?

On your mark!

Get set!


I first saw this at: Harper: Cato’s “Deepbills” Project Advances Government Transparency.

US rendition map: what it means, and how to use it

Wednesday, May 22nd, 2013

US rendition map: what it means, and how to use it by James Ball.

From the post:

The Rendition Project, a collaboration between UK academics and the NGO Reprieve, has produced one of the most detailed and illuminating research projects shedding light on the CIA’s extraordinary rendition project to date. Here’s how to use it.

A truly remarkable project to date, but it could be even more successful with your assistance.

Not likely that any of the principals will wind up in the dock at the Hague.

On the other hand, exposing their crimes may deter others from similar adventures.

U.S. Senate Panel Discovers Nowhere Man [Apple As Tax Dodger]

Monday, May 20th, 2013

Forty-seven years after Nowhere Man by the Beatles, a U.S. Senate panel discovers several nowhere men.

A Wall Street Journal Technology Alert:

Apple has set up corporate structures that have allowed it to pay little or no corporate tax–in any country–on much of its overseas income, according to the findings of a U.S. Senate examination.

The unusual result is possible because the iPhone maker’s key foreign subsidiaries argue they are residents of nowhere, according to the investigators’ report, which will be discussed at a hearing Tuesday where Apple CEO Tim Cook will testify. The finding comes from a lengthy investigation into the technology giant’s tax practices by the Senate Permanent Subcommittee on Investigations, led by Sens. Carl Levin (D., Mich.) and John McCain (R., Ariz.).

In additional coverage, Apple says:

Apple’s testimony also includes a call to overhaul: “Apple welcomes an objective examination of the US corporate tax system, which has not kept pace with the advent of the digital age and the rapidly changing global economy.”

Tax reform will be useful only if “transparent” tax reform.

Transparent tax reform means every provision with more than a $100,000 impact on any taxpayer names all the taxpayers impacted, whether they pay more or less tax.

We have the data, we need the will to apply the analysis.

A tax-impact topic map anyone?

UNESCO Publications and Data (Open Access)

Sunday, May 19th, 2013

UNESCO to make its publications available free of charge as part of a new Open Access policy

From the post:

The United Nations Education Scientific and Cultural Organisation (UNESCO) has announced that it is making available to the public free of charge its digital publications and data. This comes after UNESCO has adopted an Open Access Policy, becoming the first agency within the United Nations to do so.

The new policy implies that anyone can freely download, translate, adapt, and distribute UNESCO’s publications and data. The policy also states that from July 2013, hundreds of downloadable digital UNESCO publications will be available to users through a new Open Access Repository with a multilingual interface. The policy seeks also to apply retroactively to works that have been published.

There’s a treasure trove of information for mapping, say against the New York Times historical archives.

If presidential libraries weren’t concerned with helping former administration officials avoid accountability, digitizing presidential libraries for complete access would be another great treasure trove.