Archive for the ‘Government Data’ Category

Accessing IRS 990 Filings (Old School)

Monday, July 25th, 2016

Like many others, I was glad to see: IRS 990 Filings on AWS.

From the webpage:

Machine-readable data from certain electronic 990 forms filed with the IRS from 2011 to present are available for anyone to use via Amazon S3.

Form 990 is the form used by the United States Internal Revenue Service to gather financial information about nonprofit organizations. Data for each 990 filing is provided in an XML file that contains structured information that represents the main 990 form, any filed forms and schedules, and other control information describing how the document was filed. Some non-disclosable information is not included in the files.

This data set includes Forms 990, 990-EZ and 990-PF which have been electronically filed with the IRS and is updated regularly in an XML format. The data can be used to perform research and analysis of organizations that have electronically filed Forms 990, 990-EZ and 990-PF. Forms 990-N (e-Postcard) are not available within this data set. Forms 990-N can be viewed and downloaded from the IRS website.

I could use AWS but I’m more interested in deep analysis of a few returns than analysis of the entire dataset.

Fortunately the webpage continues:

An index listing all of the available filings is available at s3://irs-form-990/index.json. This file includes basic information about each filing including the name of the filer, the Employer Identification Number (EIN) of the filer, the date of the filing, and the path to download the filing.

All of the data is publicly accessible via the S3 bucket’s HTTPS endpoint at No authentication is required to download data over HTTPS. For example, the index file can be accessed at and the example filing mentioned above can be accessed at (emphasis in original).

I open a terminal window and type:


which as of today, results in:

-rw-rw-r-- 1 patrick patrick 1036711819 Jun 16 10:23 index.json

A trial grep:

grep "NATIONAL RIFLE" index.json > nra.txt

Which produces:

{"EIN": "530116130", "SubmittedOn": "2014-11-25", "TaxPeriod": "201312", "DLN": "93493309004174", "LastUpdated": "2016-03-21T17:23:53", "URL": "", "FormType": "990", "ObjectId": "201423099349300417", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},
{"EIN": "530116130", "SubmittedOn": "2013-12-20", "TaxPeriod": "201212", "DLN": "93493260005203", "LastUpdated": "2016-03-21T17:23:53", "URL": "", "FormType": "990", "ObjectId": "201302609349300520", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},
{"EIN": "530116130", "SubmittedOn": "2012-12-06", "TaxPeriod": "201112", "DLN": "93493311011202", "LastUpdated": "2016-03-21T17:23:53", "URL": "", "FormType": "990", "ObjectId": "201203119349301120", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},
{"EIN": "396056607", "SubmittedOn": "2011-05-12", "TaxPeriod": "201012", "FormType": "990EZ", "LastUpdated": "2016-06-14T01:22:09.915971Z", "OrganizationName": "EAU CLAIRE NATIONAL RIFLE CLUB", "IsElectronic": false, "IsAvailable": false},
{"EIN": "530116130", "SubmittedOn": "2011-11-09", "TaxPeriod": "201012", "DLN": "93493270005081", "LastUpdated": "2016-03-21T17:23:53", "URL": "", "FormType": "990", "ObjectId": "201132709349300508", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},
{"EIN": "530116130", "SubmittedOn": "2016-01-11", "TaxPeriod": "201412", "DLN": "93493259005035", "LastUpdated": "2016-04-29T13:40:20", "URL": "", "FormType": "990", "ObjectId": "201532599349300503", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},

We have one errant result, the “EAU CLAIRE NATIONAL RIFLE CLUB,” so let’s delete that and re-order by year. The NATIONAL RIFLE ASSOCIATION OF AMERICA results then read (most recent to oldest):

{"EIN": "530116130", "SubmittedOn": "2016-01-11", "TaxPeriod": "201412", "DLN": "93493259005035", "LastUpdated": "2016-04-29T13:40:20", "URL": "", "FormType": "990", "ObjectId": "201532599349300503", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},
{"EIN": "530116130", "SubmittedOn": "2014-11-25", "TaxPeriod": "201312", "DLN": "93493309004174", "LastUpdated": "2016-03-21T17:23:53", "URL": "", "FormType": "990", "ObjectId": "201423099349300417", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},
{"EIN": "530116130", "SubmittedOn": "2013-12-20", "TaxPeriod": "201212", "DLN": "93493260005203", "LastUpdated": "2016-03-21T17:23:53", "URL": "", "FormType": "990", "ObjectId": "201302609349300520", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},
{"EIN": "530116130", "SubmittedOn": "2012-12-06", "TaxPeriod": "201112", "DLN": "93493311011202", "LastUpdated": "2016-03-21T17:23:53", "URL": "", "FormType": "990", "ObjectId": "201203119349301120", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},
{"EIN": "530116130", "SubmittedOn": "2011-11-09", "TaxPeriod": "201012", "DLN": "93493270005081", "LastUpdated": "2016-03-21T17:23:53", "URL": "", "FormType": "990", "ObjectId": "201132709349300508", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},
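
Deleting the stray line by hand works for six results. For a longer list, matching on the EIN rather than the name sidesteps the problem, and sort can do the re-ordering. A minimal sketch, assuming the one-record-per-line layout shown above with EIN and SubmittedOn as the first two fields; the sample file names and the trimmed records are made up for illustration:

```shell
# Hypothetical, trimmed sample in the one-record-per-line layout shown above.
cat > sample-index.json <<'EOF'
{"EIN": "530116130", "SubmittedOn": "2014-11-25", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA"},
{"EIN": "396056607", "SubmittedOn": "2011-05-12", "OrganizationName": "EAU CLAIRE NATIONAL RIFLE CLUB"},
{"EIN": "530116130", "SubmittedOn": "2016-01-11", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA"},
EOF

# Match on the EIN instead of the name, so similarly named filers drop out.
# Every surviving line then starts with the same EIN followed by SubmittedOn,
# so a reverse byte-order sort puts the most recent filing first.
grep -F '"EIN": "530116130"' sample-index.json | LC_ALL=C sort -r > sample-sorted.txt
cat sample-sorted.txt
```

Against the real data, the same pipeline would read index.json in place of the sample file. If the field order ever differs between records, the plain sort no longer orders by date and the date would have to be extracted first.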

Of course, now you want the XML 990 returns, so extract the URLs for the 990s to a file, here nra-urls.txt (I would use awk if it is more than a handful):
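
That extraction can be sketched with awk alone. This assumes each record sits on one line and carries a "URL" field as in the index excerpt above; the example.org addresses and the sample file names are placeholders, not real filing locations:

```shell
# Hypothetical sample: two records with a URL field, one without.
cat > sample-records.txt <<'EOF'
{"EIN": "530116130", "URL": "https://example.org/filings/a_public.xml", "FormType": "990"},
{"EIN": "396056607", "FormType": "990EZ", "IsAvailable": false},
{"EIN": "530116130", "URL": "https://example.org/filings/b_public.xml", "FormType": "990"},
EOF

# Split each line on the literal text  "URL": "  so that $2 begins with the URL.
# Lines without a URL field have NF == 1 and are skipped.
# The URL ends at the next double quote; empty URLs are not printed.
awk -F'"URL": "' 'NF > 1 { split($2, rest, "\""); if (rest[1] != "") print rest[1] }' \
    sample-records.txt > sample-urls.txt
cat sample-urls.txt
```

For the real run, read nra.txt and write nra-urls.txt, one URL per line, which is the format wget -i expects.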

Back to wget:

wget -i nra-urls.txt


-rw-rw-r-- 1 patrick patrick 111798 Mar 21 16:12 201132709349300508_public.xml
-rw-rw-r-- 1 patrick patrick 123490 Mar 21 19:47 201203119349301120_public.xml
-rw-rw-r-- 1 patrick patrick 116786 Mar 21 22:12 201302609349300520_public.xml
-rw-rw-r-- 1 patrick patrick 122071 Mar 21 15:20 201423099349300417_public.xml
-rw-rw-r-- 1 patrick patrick 132081 Apr 29 10:10 201532599349300503_public.xml

Ooooh, it’s in XML! 😉

For the XML you are going to need: Current Valid XML Schemas and Business Rules for Exempt Organizations Modernized e-File, not to mention a means of querying the data (may I suggest XQuery?).

Once you have the index.json file, with grep, a little awk and wget, you can quickly explore IRS 990 filings for further analysis or to prepare queries for running on AWS (such as discovery of common directors, etc.).


What’s the “CFR” and Why Is It So Important to Me?

Wednesday, July 20th, 2016

What’s the “CFR” and Why Is It So Important to Me? Government Printing Office (GPO) blog, GovernmentBookTalk.

From the post:

If you’re a GPO Online Bookstore regular or public official you probably know we’re speaking about the “Code of Federal Regulations.” CFRs are produced routinely by all federal departments and agencies to inform the public and government officials of regulatory changes and updates for literally every subject that the federal government has jurisdiction to manage.

For the general public these constantly updated federal regulations can spell fantastic opportunity. Farmer, lawyer, construction owner, environmentalist, it makes no difference. Within the 50 codes are a wide variety of regulations that impact citizens from all walks of life. Federal Rules, Regulations, Processes, or Procedures on the surface can appear daunting, confusing, and even may seem to impede progress. In fact, the opposite is true. By codifying critical steps to anyone who operates within the framework of any of these sectors, the CFR focused on a particular issue can clarify what’s legal, how to move forward, and how to ultimately successfully translate one’s projects or ideas into reality.

Without CFR documentation the path could be strewn with uncertainty, unknown liabilities, and lost opportunities, especially regarding federal development programs, simply because an interested party wouldn’t know where or how to find what’s available within their area of interest.

The authors of CFRs are immersed in the technical and substantive issues associated within their areas of expertise. For a private sector employer or entrepreneur who becomes familiar with the content of CFRs relative to their field of work, it’s like having an expert staff on board.

I like the CFRs but I stumbled on:

For a private sector employer or entrepreneur who becomes familiar with the content of CFRs relative to their field of work, it’s like having an expert staff on board.

I don’t doubt the expertise of the CFR authors, but their writing often requires an expert for accurate interpretation. If you doubt that statement, test your reading skills on any section of CFR Title 26, Internal Revenue.

Try your favorite NLP parser out on any of the CFRs.

The post lists a number of ways to acquire the CFRs but personally I would use the free Electronic Code of Federal Regulations unless you need to impress clients with the paper version.


IRS E-File Bucket – Internet Archive

Saturday, June 18th, 2016

IRS E-File Bucket courtesy of Carl Malamud and Public.Resource.Org.

From the webpage:

This bucket contains a mirror of the IRS e-file release as of June 16, 2016. You may access the source files at The present bucket may or may not be updated in the future.

To access this bucket, use the download links.

Note that tarballs of image scans from 2002-2015 are also available in this IRS 990 Forms collection.

Many thanks to the Internal Revenue Service for making this information available. Here is their announcement on June 16, 2016. Here is a statement from Public.Resource.Org congratulating the IRS on a job well done.

As I noted in IRS 990 Filing Data (2001 to date):

990* disclosures aren’t detailed enough to pinch but when combined with other data, say leaked data, the results can be remarkable.

It’s up to you to see that public disclosures pinch.

IRS 990 Filing Data (2001 to date)

Thursday, June 16th, 2016

IRS 990 Filing Data Now Available as an AWS Public Data Set

From the post:

We are excited to announce that over one million electronic IRS 990 filings are available via Amazon Simple Storage Service (Amazon S3). Filings from 2011 to the present are currently available and the IRS will add new 990 filing data each month.

Form 990 is the form used by the United States Internal Revenue Service (IRS) to gather financial information about nonprofit organizations. By making electronic 990 filing data available, the IRS has made it possible for anyone to programmatically access and analyze information about individual nonprofits or the entire nonprofit sector in the United States. This also makes it possible to analyze it in the cloud without having to download the data or store it themselves, which lowers the cost of product development and accelerates analysis.

Each electronic 990 filing is available as a unique XML file in the “irs-form-990” S3 bucket in the AWS US East (N. Virginia) region. Information on how the data is organized and what it contains is available on the IRS 990 Filings on AWS Public Data Set landing page.

Some of the forms and instructions that will help you make sense of the data reported:

990 – Form 990 Return of Organization Exempt from Income Tax, Annual Form 990 Requirements for Tax-Exempt Organizations

990-EZ – 2015 Form 990-EZ, Instructions for IRS 990 EZ – Internal Revenue Service

990-PF – 2015 Form 990-PF, 2015 Instructions for Form 990-PF

As always, use caution with law related data as words may have unusual nuances and/or unexpected meanings.

These forms and instructions are only a tiny part of a vast iceberg of laws, regulations, rulings, court decisions and the like.

990* disclosures aren’t detailed enough to pinch but when combined with other data, say leaked data, the results can be remarkable.

Breaking Californication (An Act Performed On The Public)

Monday, June 6th, 2016

Law Enforcement Lobby Succeeds In Killing California Transparency Bill by Kit O’Connell.

From the post:

A California Senate committee killed a bill to increase transparency in police misconduct investigations, hampering victims’ efforts to obtain justice.

Chauncee Smith, legislative advocate at the ACLU of California, told MintPress News that the state Legislature “caved to the tremendous influence and power of the law enforcement lobby” and “failed to listen to the demands and concerns of everyday Californian people.”

California has some of the most secretive rules in the country when it comes to investigations into police misconduct and excessive use of force. Records are kept sealed, regardless of the outcome, as the ACLU of Northern California explains on its website:

“In places like Texas, Kentucky, and Utah, peace officer records are made public when an officer is found guilty of misconduct. Other states make records public regardless of whether misconduct is found. This is not the case in California.”

“Right now, there is a tremendous cloud of secrecy that is unparalleled compared to many other states,” Smith added. “California is in the minority in which the public do not know basic information when someone is killed or potentially harmed by those who are sworn to serve and protect them.”

In February, Sen. Mark Leno, a Democrat from San Francisco, introduced SB 1286, the “Enhance Community Oversight on Police Misconduct and Serious Uses of Force” bill. It would have allowed “public access to investigations, findings and discipline information on serious uses of force by police” and would have increased transparency in other cases of police misconduct, according to an ACLU fact sheet. Polling data cited by the ACLU suggests about 80 percent of Californians would support the measure.

But the bill’s progress through the legislature ended on May 27, when it failed to pass out of the Senate Appropriations committee.

“Today is a sad day for transparency, accountability, and justice in California,” said Peter Bibring, police practices director for the ACLU of California, in a May 27 press release.

Mistrust between police officers and citizens makes the job of police officers more difficult and dangerous, while denying citizens the full advantages of a trained police force, paid for by their tax dollars.

The state legislature, finding that sowing and fueling mistrust between police officers and citizens has electoral upsides, fans those flames with secrecy over police misconduct investigations.

Open, not secret (read grand jury) proceedings where witnesses can be fairly examined (unlike the deliberately thrown Michael Brown investigation), can go a long way to re-establishing trust between the police and the public.

Members of the community know when someone was a danger to police officers and others, whether their family members will admit it or not. Likewise, police officers know which officers are far too quick to escalate to deadly force. Want better community policing? Want better citizen cooperation? That’s not going to happen with completely secret police misconduct investigations.

So the State of California is going to collect the evidence, statements, etc., in police misconduct investigations, but won’t share that information with the public. At least not willingly.

Official attempts to break illegitimate government secrecy failed. Even if they had succeeded, you’d be paying at least $0.25 per page plus a service fee.

Two observations about government networks:

  • Secret (and otherwise) government documents are usually printed on networked printers.
  • Passively capturing Ethernet traffic (network tap) captures printer traffic too.

Whistleblowers don’t have to hack heavily monitored systems or steal logins/passwords. Leaking illegally withheld documents is within the reach of anyone who can plug in an Ethernet cable.

There’s a bit more to it than that, but remember all those network cables running through the ceiling, walls and closets the next time your security consultant assures you of your network’s security.

As a practical matter, if you start leaking party menus and football pools, someone will start looking for a network tap.

Leak when it makes a significant difference to public discussion and/or legal proceedings. Even then, look for ways to attribute the leak to factions within the government.

Remember the DoD’s amused reaction to State’s huffing and puffing over the Afghan diplomatic cables? That sort of rivalry exists at every level of government. You should use it to your advantage.

The State of California would have you believe that government information sharing is at its sufferance.

I beg to differ.

So should you.

11 Million Pages of CIA Files [+ Allen Dulles, war criminal]

Thursday, March 3rd, 2016

11 Million Pages of CIA Files May Soon Be Shared By This Kickstarter by Joseph Cox.

From the post:

Millions of pages of CIA documents are stored in Room 3000. The CIA Records Search Tool (CREST), the agency’s database of declassified intelligence files, is only accessible via four computers in the National Archives Building in College Park, MD, and contains everything from Cold War intelligence, research and development files, to images.

Now one activist is aiming to get those documents more readily available to anyone who is interested in them, by methodically printing, scanning, and then archiving them on the internet.

“It boils down to freeing information and getting as much of it as possible into the hands of the public, not to mention journalists, researchers and historians,” Michael Best, analyst and freedom of information activist told Motherboard in an online chat.

Best is trying to raise $10,000 on Kickstarter in order to purchase the high speed scanner necessary for such a project, a laptop, office supplies, and to cover some other costs. If he raises more than the main goal, he might be able to take on the archiving task full-time, as well as pay for FOIAs to remove redactions from some of the files in the database. As a reward, backers will help to choose what gets archived first, according to the Kickstarter page.

“Once those “priority” documents are done, I’ll start going through the digital folders more linearly and upload files by section,” Best said. The files will be hosted on the Internet Archive, which converts documents into other formats too, such as for Kindle devices, and sometimes text-to-speech for e-books. The whole thing has echoes of Cryptome—the freedom of information duo John Young and Deborah Natsios, who started off scanning documents for the infamous cypherpunk mailing list in the 1990s.

Good news! Kickstarter has announced this project funded!

Additional funding will help make this archive of documents available sooner rather than later.

As opposed to an attempt to boil the ocean of 11 million pages of CIA files, what about smaller topic mapping/indexing projects that focus on bounded sub-sets of documents of interest to particular communities?

I don’t have any interest in the STAR GATE project (clairvoyance, precognition, or telepathy, continued now by the DHS at airport screening facilities) but would be very interested in the records of Allen Dulles, a war criminal of some renown.

Just so you know, Michael has already uploaded documents on Allen Dulles from the CIA Records Search Tool (CREST):

History of Allen Welsh Dulles as CIA Director – Volume I: The Man

History of Allen Welsh Dulles as CIA Director – Volume II: Coordination of Intelligence

History of Allen Welsh Dulles as CIA Director – Volume III: Covert Activities

History of Allen Welsh Dulles as CIA Director – Volume IV: Congressional Oversight and Internal Administration

History of Allen Welsh Dulles as CIA Director – Volume V: Intelligence Support of Policy

To describe Allen Dulles as a war criminal is no hyperbole. The overthrow of President Jacobo Arbenz Guzman of Guatemala (think United Fruit Company) and the removal of Mohammad Mossadeq, prime minister of Iran (think Shah of Iran), are only two of his crimes, the full extent of which will probably never be known.

Files are being uploaded to That 1 Archive.

Scripting FOIA Requests

Thursday, March 3rd, 2016

An Activist Wrote a Script to FOIA the Files of 7,000 Dead FBI Officials by Joseph Cox.

From the post:

One of the best times to file a Freedom of Information request with the FBI is when someone dies; after that, any files that the agency holds on them can be requested. Asking for FBI files on the deceased is therefore pretty popular, with documents released on Steve Jobs, Malcolm X and even the Insane Clown Posse.

One activist is turning this back onto the FBI itself, by requesting files on nearly 7,000 dead FBI employees en masse, and releasing a script that allows anyone else to do the same.

“At the very least, it’ll be like having an extensive ‘Who’s Who in the FBI’ to consult, without worrying that anyone in there is still alive and might face retaliation for being in law enforcement,” Michael Best told Motherboard in an online chat. “For some folks, they’ll probably show allegations of wrongdoing while others probably highlight some of the FBI’s best and brightest.”

On Monday, Best will file FOIAs for FBI records and files relating to 6,912 employees named in the FBI’s own “Dead List,” a list of people that the FBI understands to be deceased. A recent copy of the list, which includes special agents and section chiefs, was FOIA’d by MuckRock editor JPat Brown in January.

Points to remember:

  • Best’s script works for any FOIA office that accepts email requests (not just the FBI, be creative)
  • Get 3 or more people to file the same FOIA requests
  • Publicize your spreadsheet of FOIA targets

Don’t forget the need to scan, OCR and index (topic map) the results of your FOIA requests.

Information that cannot be found may as well still be concealed by the FBI (and others).

Earthdata Search – Smells Like A Topic Map?*

Sunday, February 28th, 2016

Earthdata Search

From the webpage:

Search NASA Earth Science data by keyword and filter by time or space.

After choosing tour:

Keyword Search

Here you can enter search terms to find relevant data. Search terms can be science terms, instrument names, or even collection IDs. Let’s start by searching for Snow Cover NRT to find near real-time snow cover data. Type Snow Cover NRT in the keywords box and press Enter.

Which returns a screen in three sections, left to right: Browse Collections, 21 Matching Collections (Add collections to your project to compare and retrieve their data), and the third section displays a world map (navigate by grabbing the view)

Under Browse Collections:

In addition to searching for keywords, you can narrow your search through this list of terms. Click Platform to expand the list of platforms (still in a tour box)

Next step:

Now click Terra to select the Terra satellite.

Comment: Wondering how I will know which “platform” or “instrument” to select? There may be more/better documentation but I haven’t seen it yet.

The data follows the Unified Metadata Model (UMM):

NASA’s Common Metadata Repository (CMR) is a high-performance, high-quality repository for earth science metadata records that is designed to handle metadata at the Concept level. Collections and Granules are common metadata concepts in the Earth Observation (EO) world, but this can be extended out to Visualizations, Parameters, Documentation, Services, and more. The CMR metadata records are supplied by a diverse array of data providers, using a variety of supported metadata standards, including:


Initially, designers of the CMR considered standardizing all CMR metadata to a single, interoperable metadata format – ISO 19115. However, NASA decided to continue supporting multiple metadata standards in the CMR — in response to concerns expressed by the data provider community over the expense involved in converting existing metadata systems to systems capable of generating ISO 19115. In order to continue supporting multiple metadata standards, NASA designed a method to easily translate from one supported standard to another and constructed a model to support the process. Thus, the Unified Metadata Model (UMM) for EOSDIS metadata was born as part of the EOSDIS Metadata Architecture Studies (MAS I and II) conducted between 2012 and 2013.

What is the UMM?

The UMM is an extensible metadata model which provides a ‘Rosetta stone’ or cross-walk for mapping between CMR-supported metadata standards. Rather than create mappings from each CMR-supported metadata standard to each other, each standard is mapped centrally to the UMM model, thus reducing the number of translations required from n x (n-1) to 2n.
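
To put numbers on that reduction, a small sketch. The values of n are arbitrary illustrations, not counts of the standards the CMR actually supports:

```shell
# Direct translators between every ordered pair of n standards: n * (n - 1).
# Routing through a central model: 2 * n (one mapping in and one out per standard).
for n in 3 5 10; do
  echo "n=$n pairwise=$((n * (n - 1))) via_hub=$((2 * n))"
done > umm-counts.txt
cat umm-counts.txt
```

The two counts are equal at n = 3 (6 each); beyond that the central model pulls ahead, with 10 mappings instead of 20 at n = 5 and 20 instead of 90 at n = 10.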

Here the mapping graphic:


Granted, the profiles don’t make the basis for the mappings explicit, but the mappings have the same impact post-mapping as a topic map would post-merging.

The site could use better documentation for the interface and data, at least in the view of this non-expert in the area.

Thoughts on documentation for the interface or making the mapping more robust via use of a topic map?

I first saw this in a tweet by Kirk Borne.

*Smells Like A Topic Map – Sorry, culture bound reference to a routine on the first Cheech & Chong album. No explanation would do it justice.

No Patriotic Senators, Senate Staffers, Agency Heads – CIA Torture Report

Saturday, February 20th, 2016

I highly recommend your reading The CIA torture report belongs to the public by Lauren Harper.

I have quoted from Lauren’s introduction below as an inducement to read the article in full, but she fails to explore why a patriotic senator, staffer or agency head has not already leaked the CIA Torture Report.

It has already been printed, bound, etc., and who knows how many people were involved in every step of that process.

Do you seriously believe that report has gone unread except for its authors?

So far as I know, members of Congress, that “other” branch of the government, are entitled to make their own decisions about the handling of their reports.

What America needs now is a Senator or even a Senate staffer with more loyalty to the USA than to the bed wetters and torturers (same group) in the DoJ.

If you remember a part of the constitution that grants the DoJ the role of censor for the American public, please point it out in comments below.

From the post:

The American public’s ability to read the Senate Intelligence Committee’s full, scathing report on the Central Intelligence Agency’s torture program is in danger because David Ferriero, the archivist of the United States, will not call the report what it is, a federal record. He is refusing to use his clear statutory authority to label the report a federal record, which would be subject to Freedom of Information Act (FOIA) disclosure requirements, because the Justice Department has told the National Archives and Records Administration (NARA) not to. The DOJ has a long history of breaking the law to avoid releasing information in response to FOIA requests. The NARA does not have such a legacy and should not allow itself to be bullied by the DOJ.

The DOJ instructed the NARA not to make any determination on the torture report’s status as a federal record, ostensibly because it would jeopardize the government’s position in a FOIA lawsuit seeking the report’s release. The DOJ, however, has no right to tell the NARA not to weigh in on the record’s status, and the Presidential and Federal Records Act Amendments of 2014 gives the archivist of the United States the binding legal authority to make precisely that determination.

Democratic Sens. Patrick Leahy of Vermont and Dianne Feinstein of California revealed the DOJ’s insistence that the archivist of the United States not faithfully fulfill his duty in a Nov. 5, 2015, letter to Attorney General Loretta Lynch. They protested the DOJ’s refusal to allow its officials as well as those of the Defense Department, the CIA and the State Department to read the report. Leahy and Feinstein’s letter notes that “personnel at the National Archives and Records Administration have stated that, based on guidance from the Department of Justice, they will not respond to questions about whether the study constitutes a federal record under the Federal Records Act because the FOIA case is pending.” Rather than try to win the FOIA case on a technicality and step on the NARA’s statutory toes, the DOJ should allow the FOIA review process to determine on the case’s merits whether the document may be released.

Not even officials with security clearances may read the document while its status as a congressional or federal record is debated. The New York Times reported in November 2015 that in December of the previous year, a Senate staffer delivered envelopes containing the 6,700-page top secret report to the DOJ, the State Department, the Federal Bureau of Investigation and the Pentagon. Yet a year later, none of the envelopes had been opened, and none of the country’s top officials had read the report’s complete findings. This is because the DOJ, the Times wrote, “prohibited officials from the government agencies that possess it from even opening the report, effectively keeping the people in charge of America’s counterterrorism future from reading about its past.” The DOJ contends that if any agency officials read the report, it could alter the outcome of the FOIA lawsuit.

American war criminals who are identified or who can be discovered because of the CIA Torture Report should be prosecuted to the full extent of national and international law.

Anyone who has participated in attempts to conceal those events or to prevent disclosure of the CIA Torture Report, should be tried as accomplices after the fact to those war crimes.

The facility at Guantanamo Bay can be converted into a holding facility for DoJ staffers who tried to conceal war crimes. Poetic justice I would say.

For What It’s Worth: CIA Releases Declassified Documents to National Archives

Wednesday, February 17th, 2016

CIA Releases Declassified Documents to National Archives

From the webpage:

Today, CIA released about 750,000 pages of declassified intelligence papers, records, research files and other content which are now accessible through CIA’s Records Search Tool (CREST) at the National Archives in College Park, MD. This release will include nearly 100,000 pages of analytic intelligence publication files, and about 20,000 pages of research and development files from CIA’s Directorate of Science and Technology, among others.

The newly available documents are being released in partnership with the National Geospatial Intelligence Agency (NGA) and are available by accessing CREST at the National Archives. This release continues CIA’s efforts to systematically review and release documents under Executive Order 13526. With this release, the CIA collection of records on the CREST system increases to nearly 13 million declassified pages.

That was posted on 16 February 2016.

Disclaimer: No warranty express or implied is made with regards to the accuracy of the notice quoted above or as to the accuracy of anything you may or may not find in the released documents, if they in fact exist.

I merely report that the quoted material was posted to the CIA website at the location and on the date recited.

Sunlight launches Hall of Justice… [ Topic Map “like” features?]

Tuesday, February 2nd, 2016

Sunlight launches Hall of Justice, a massive data inventory on criminal justice across the U.S. by Josh Stewart.

From the post:

Today, Sunlight is launching Hall of Justice, a robust, searchable data inventory of nearly 10,000 datasets and research documents from across all 50 states, the District of Columbia and the federal government. Hall of Justice is the culmination of 18 months of work gathering data and refining technology.

The process was no easy task: Building Hall of Justice required manual entry of publicly available data sources from a multitude of locations across the country.

Sunlight’s team went from state to state, meeting and calling local officials to inquire about and find data related to criminal justice. Some states like California have created a data portal dedicated to making criminal justice data easily accessible to the public; others had their data buried within hard to find websites. We also found data collected by state departments of justice, police forces, court systems, universities and everything in between.

“Data is shaping the future of how we address some of our most pressing problems,” said John Wonderlich, executive director of the Sunlight Foundation. “This new resource is an experiment in how a robust snapshot of data can inform policy and research decisions.”

In addition to being a great data collection, the Hall of Justice attempts to deliver topic map like capability for searches:

The resource attempts to consolidate different terminology across multiple states, which is far from uniform or standardized. For example, if you search solitary confinement you will return results for data around solitary confinement, but also for the terms “segregated housing unit,” “SHU,” “administrative segregation” and “restrictive housing.” This smart search functionality makes finding datasets much easier and accessible.


Looking at all thirteen results for a search on “solitary confinement,” I don’t see the mapping in question. Or certainly no mapping based on characteristics of the subject, “solitary confinement.”

The closest result is Georgia’s 2013 Juvenile Justice Reform, which uses the word “restrictive,” as in:

Create a two-class system within the Designated Felony Act. Designated felony offenses are divided into two classes, based on severity—Class A and Class B—that continue to allow restrictive custody while also adjusting available sanctions to account for both offense severity and risk level.

Restrictive custody is what jail systems are about so that doesn’t trip the wire for “solitary confinement.”

Of course, the links are to entire reports/documents/data sets so each researcher will have to extract and collate content individually. When that happens, a means to contribute that collation/mapping to the Hall of Justice would be a boon for other researchers. (Can you say “topic map?”)
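What a contributed mapping might look like, as a minimal sketch: the alternate terms come from the Sunlight post quoted above, but the SUBJECTS structure and the function names are my own assumptions, not anything the Hall of Justice publishes.

```python
import re

# Hypothetical mapping: one subject, many names. The terms are the ones
# listed in the Sunlight post; the data structure is an assumption.
SUBJECTS = {
    "solitary confinement": {
        "solitary confinement",
        "segregated housing unit",
        "SHU",
        "administrative segregation",
        "restrictive housing",
    },
}


def expand_query(term, subjects=SUBJECTS):
    """Return every known name for the subject that `term` names."""
    wanted = term.casefold()
    for names in subjects.values():
        if wanted in {n.casefold() for n in names}:
            return names
    return {term}  # unknown term: search for it as given


def search(term, documents, subjects=SUBJECTS):
    """Return titles of documents that mention any name of the subject."""
    names = expand_query(term, subjects)
    # Word boundaries keep "SHU" from matching inside words like "shut".
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in sorted(names)) + r")\b",
        re.IGNORECASE,
    )
    return [title for title, text in documents.items() if pattern.search(text)]
```

Note that “restrictive custody” would not match under this mapping, which is the point: the match is on names of the subject, not on a shared word.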

As I write this, you will need to prefer Mozilla over Chrome, at least on Ubuntu.

Trigger Warning: If you are sensitive to traumatic events and/or reports of traumatic events, you may want to ask someone less sensitive to review these data sources.

The only difference between a concentration camp and American prisons is the lack of mass gas chambers. Every horror and abuse that you can imagine, and some you probably can’t, are visited on people in U.S. prisons every day.

Sunlight’s Hall of Justice is a great step forward in documenting the chambers of horror we call American prisons.

As Joan Baez says in Prison Trilogy:

And we’re gonna raze, raze the prisons

To the ground

Help us raze, raze the prisons

To the ground

Are you ready?

Voter Record Privacy? WTF?

Monday, December 28th, 2015

Leaky database tramples privacy of 191 million American voters by Dell Cameron.

From the post:

The voter information of more than 191 million Americans—including full names, dates of birth, home addresses, and more—was exposed online for anyone who knew the right IP address.

The misconfigured database, which was reportedly shut down at around 7pm ET Monday night, was discovered by security researcher Chris Vickery. Less than two weeks ago, Vickery also exposed a flaw in MacKeeper’s database, similarly exposing 13 million customer records.

What amazes me about this “leak” is that the outrage is focused on the 191+ million records being online.


What about the six or seven organizations who denied being the owners of the IP address in question?

I take it none of them denied having possession of the same or essentially the same data, just that they didn’t “leak” it.

Quick question: Was voter privacy breached when these six or seven organizations got the same data or when it went online?

I would say when the Gang of Six or Seven got the same data.

You don’t have any meaningful voter privacy, aside from your actual ballot, and with your credit record (also for sale), your voting behavior can be nailed too.

You don’t have privacy but the Gang of Six or Seven do.

Attempting to protect lost privacy is pointless.

Making corporate overlords lose their privacy as well has promise.

PS: Torrents of corporate overlord data? Much more interesting than voter data.

Enhancements: Quick Search, Congressional Record Index, and More

Monday, December 14th, 2015

New End of Year Enhancements: Quick Search, Congressional Record Index, and More by Andrew Weber.

From the post:

In our quest to retire THOMAS, we have made many enhancements to this year.  Our first big announcement was the addition of email alerts, which notify users of the status of legislation, new issues of the Congressional Record, and when Members of Congress sponsor and cosponsor legislation.  That development was soon followed by the addition of treaty documents and better default bill text in early spring; improved search, browse, and accessibility in late spring; user driven feedback in the summer; and Senate Executive Communications and a series of Two-Minute Tip videos in the fall.

Today’s update on end of year enhancements includes a new Quick Search for legislation, the Congressional Record Index (back to 1995), and the History of Bills from the Congressional Record Index (available from the Actions tab).  We have also brought over the State Legislature Websites page from THOMAS, which has links to state level websites similar to

Text of legislation from the 101st and 102nd Congresses (1989-1992) has been migrated to The Legislative Process infographic that has been available from the homepage as a JPG and PDF is now available in Spanish as a JPG and PDF (translated by Francisco Macías). Margaret and Robert added Fiscal Year 2003 and 2004 to the Appropriations Table. There is also a new About page on the site for XML Bulk Data.

The Quick Search provides a form-based search with fields similar to those available from the Advanced Legislation Search on THOMAS.  The Advanced Search on is still there with many additional fields and ways to search for those who want to delve deeper into the data.  We are providing the new Quick Search interface based on user feedback, which highlights selected fields most likely needed for a search.

There’s an impressive summary of changes!

Speaking of practicing programming, are you planning on practicing XQuery on congressional data in the coming year?

Why the Open Government Partnership Needs a Reboot [Governments Too]

Saturday, December 12th, 2015

Why the Open Government Partnership Needs a Reboot by Steve Adler.

From the post:

The Open Government Partnership was created in 2011 as an international forum for nations committed to implementing Open Government programs for the advancement of their societies. The idea of open government started in the 1980s after CSPAN was launched to broadcast U.S. Congressional proceedings and hearings to the American public on TV. While the galleries above the House of Representatives and Senate had been “open” to the “public” (if you got permission from your representative to attend) for decades, never before had all public democratic deliberations been broadcast on TV for the entire nation to behold at any time they wished to tune in.

I am a big fan of OGP and feel that the ideals and ambition of this partnership are noble and essential to the survival of democracy in this millennium. But OGP is a startup, and every startup business or program faces a chasm it must cross from early adopters and innovators to early majority market implementation and OGP is very much at this crossroads today. It has expanded membership at a furious pace the past three years and it’s clear to me that expansion is now far more important to OGP than the delivery of the benefits of open government to the hundreds of millions of citizens who need transparent transformation.

OGP needs a reboot.

The structure of a system produces its own behavior. OGP needs a new organizational structure with new methods for evaluating national commitments. But that reboot needs to happen within its current mission. We should see clearly that the current structure is straining due to the rapid expansion of membership. There aren’t enough support unit resources to manage the expansion. We have to rethink how we manage national commitments and how we evaluate what it means to be an open government. It’s just not right that countries can celebrate baby steps at OGP events while at the same time passing odious legislation, sidestepping OGP accomplishments, buckling to corruption, and cracking down on journalists.

Unlike Steve, I didn’t and don’t have a lot of faith in governments being voluntarily transparent.

As I pointed out in Congress: More XQuery Fodder, sometime in 2016, full bill status data will be available for all legislation before the United States Congress.

A lot more data than is easy to access now but it is more smoke than fire.

With legislation status data, you can track the civics lesson progression of a bill through Congress, but that leaves you at least 3 to 4 degrees short of knowing who was behind the legislation.

Just a short list of what more would be useful:

  • Visitor/caller list for everyone who spoke to a member of Congress and their staff. With date and subject of the call.
  • All visitors and calls tied to particular legislation and/or classes of legislation
  • All fund raising calls made by members of Congress and/or their staffs, date, results, substance of call.
  • Representative conversations with reconciliation committee members or their staffers about legislation and requested “corrections.”
  • All conversations between a representative or member of their staff and agency staff, identifying all parties and the substance of the conversation
  • Notes, proposals, discussion notes for all agency decisions

Current transparency proposals are sufficient to confuse the public with mounds of nearly useless data. None of it reflects the real decision making processes of government.

Before someone shouts “privacy,” I would point out that no citizen has a right to privacy when their request is for a government representative to favor them over other citizens of the same government.

Real government transparency will require breaking the mini-star chamber proceedings from the lowest to the highest levels of government.

What we need is a rebooting of governments.

Congress: More XQuery Fodder

Tuesday, December 8th, 2015

Congress Poised for Leap to Open Up Legislative Data by Daniel Schuman.

From the post:

Following bills in Congress requires three major pieces of information: the text of the bill, a summary of what the bill is about, and the status information associated with the bill. For the last few years, Congress has been publishing the text and summaries for all legislation moving in Congress, but has not published bill status information. This key information is necessary to identify the bill author, where the bill is in the legislative process, who introduced the legislation, and so on.

While it has been in the works for a while, this week Congress confirmed it will make “Bill Statuses in XML format available through the GPO’s Federal Digital System (FDsys) Bulk Data repository starting with the 113th Congress,” (i.e. January 2013). In “early 2016,” bill status information will be published online in bulk– here. This should mean that people who wish to use the legislative information published on and THOMAS will no longer need to scrape those websites for current legislative information, but instead should be able to access it automatically.

Congress isn’t just going to pull the plug without notice, however. Through the good offices of the Bulk Data Task Force, Congress will hold a public meeting with power users of legislative information to review how this will work. Eight sample bill status XML files and draft XML User Guides were published on GPO’s GitHub page this past Monday. Based on past positive experiences with the Task Force, the meeting is a tremendous opportunity for public feedback to make sure the XML files serve their intended purposes. It will take place next Tuesday, Dec. 15, from 1-2:30. RSVP details below.

If all goes as planned, this milestone has great significance.

  • It marks the publication of essential legislative information in a format that supports unlimited public reuse, analysis, and republication. It will be possible to see much of a bill’s life cycle.
  • It illustrates the positive relationship that has grown between Congress and the public on access to legislative information, where there is growing open dialog and conversation about how to best meet our collective needs.
  • It is an example of how different components within the legislative branch are engaging with one another on a range of data-related issues, sometimes for the first time ever, under the aegis of the Bulk Data Task Force.
  • It means the Library of Congress and GPO will no longer be tied to the antiquated THOMAS website and can focus on more rapid technological advancement.
  • It shows how a diverse community of outside organizations and interests came together and built a community to work with Congress for the common good.

To be sure, this is not the end of the story. There is much that Congress needs to do to address its antiquated technological infrastructure. But considering where things were a decade ago, the bulk publication of information about legislation is a real achievement, the culmination of a process that overcame high political barriers and significant inertia to support better public engagement with democracy and smarter congressional processes.

Much credit is due in particular to leadership in both parties in the House who have partnered together to push for public access to legislative information, as well as the staff who worked tireless to make it happen.

If you look at the sample XML files, pay close attention to the <bioguideID> element and its contents. It is the same value as you will find for roll-call votes, but there the value appears in the name-id attribute of the <legislator> element. See: and do view source.

Oddly, the <bioguideID> element does not appear in the documentation on GitHub; you just have to know the correspondence to the name-id attribute of the <legislator> element.
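To make that correspondence concrete, here is a minimal sketch in Python (the same join is a few lines of XQuery). The two XML strings are trimmed, hypothetical stand-ins: the surrounding element names, the identifiers and the vote values are invented for illustration, and only <bioguideID>, <legislator> and name-id come from the files discussed above.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a bill status file. Only <bioguideID> is taken
# from the post; the wrapper elements and the ID are invented.
BILL_STATUS = """<billStatus><bill><sponsors><item>
  <bioguideID>X000001</bioguideID>
</item></sponsors></bill></billStatus>"""

# Hypothetical stand-in for a roll-call vote file. IDs and votes are invented.
ROLL_CALL = """<rollcall-vote><vote-data>
  <recorded-vote><legislator name-id="X000001">First</legislator><vote>Yea</vote></recorded-vote>
  <recorded-vote><legislator name-id="X000002">Second</legislator><vote>Nay</vote></recorded-vote>
</vote-data></rollcall-vote>"""


def votes_by_sponsors(bill_xml, vote_xml):
    """Map each bioguideID found in the bill to that member's recorded vote."""
    bill = ET.fromstring(bill_xml)
    votes = ET.fromstring(vote_xml)
    sponsor_ids = {e.text.strip() for e in bill.iter("bioguideID") if e.text}
    result = {}
    for recorded in votes.iter("recorded-vote"):
        legislator = recorded.find("legislator")
        # The join: bioguideID element text == name-id attribute value.
        if legislator is not None and legislator.get("name-id") in sponsor_ids:
            result[legislator.get("name-id")] = recorded.findtext("vote")
    return result


print(votes_by_sponsors(BILL_STATUS, ROLL_CALL))  # {'X000001': 'Yea'}
```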

As I said in the title, this is going to be XQuery fodder.

Beta Testing eFOIA (FBI)

Thursday, December 3rd, 2015

Want to Obtain FBI Records a Little Quicker? Try New eFOIA System

From the post:

The FBI recently began open beta testing of eFOIA, a system that puts Freedom of Information Act (FOIA) requests into a medium more familiar to an ever-increasing segment of the population. This new system allows the public to make online FOIA requests for FBI records and receive the results from a website where they have immediate access to view and download the released information.

Previously, FOIA requests have only been made through regular mail, fax, or e-mail, and all responsive material was sent to the requester through regular mail either in paper or disc format. “The eFOIA system,” says David Hardy, chief of the FBI’s Record/Information Dissemination Section, “is for a new generation that’s not paper-based.” Hardy also notes that the new process should increase FBI efficiency and decrease administrative costs.

The eFOIA system continues in an open beta format to optimize the process for requesters. The Bureau encourages requesters to try eFOIA and to e-mail with any questions or difficulties encountered while using it. In several months, the FBI plans to move eFOIA into full production mode.

The post gives a list of things you need to know/submit in order to help with beta testing of the eFOIA system.

Why help the FBI?

It’s true, I often chide the FBI for its padding of terrorism statistics by framing the mentally ill and certainly its project management skills are nothing to write home about.

Still, there are men and women in the FBI who do capture real criminals and not just the gullible or people who have offended the recording or movie industries. There are staffers, like the ones behind the eFOIA project, who are trying to do a public service, despite the bad apples in the FBI barrel.

Let’s give them a hand, even though decisions on particular FOIA requests may be quite questionable. Not the fault of the technology or the people who are trying to make it work.

What are you going to submit an FOIA about?

I first saw this in a tweet by Nieman Lab.

Progress on Connecting Votes and Members of Congress (XQuery)

Tuesday, December 1st, 2015

Not nearly to the planned end point but I have corrected a file I generated with XQuery that provides the name-id numbers for members of the House and a link to their websites at

It is a rough draft but you can find it at:

While I was casting about for the resources for this posting, I had the sinking feeling that I had wasted a lot of time and effort when I found:

But, if you read that file carefully, what is the one thing it lacks?

A link to every member’s website at “…”

Isn’t that interesting?

Of all the things to omit, why that one?

Especially since you can’t auto-generate the website names from the member names. What appear to be older names use just the last name of members. But, that strategy must have fallen pretty quickly when members with the same last names appeared.

The conflicting names and even some non-conflicting names follow a new naming protocol that appears to be

That will work for a while until the next generation starts inheriting positions in the House.

Anyway, that is as far as I got today but at least it is a useful list for invoking the name-id of members of the House and obtaining their websites.

The next step will be hitting the websites to extract contact information.

Yes, I know that has the “official” contact information, along with their forms for email, etc.

If I wanted to throw my comment into a round file I could do that myself.

No, what I want to extract is their local office data so when they are “back home” meeting with constituents, the average voter has a better chance of being one of those constituents. Not just those who maxed out on campaign donation limits.

Connecting Roll Call Votes to Members of Congress (XQuery)

Monday, November 30th, 2015

Apologies for the lack of posting today but I have been trying to connect up roll call votes in the House of Representatives to additional information on members of Congress.

In case you didn’t know, roll call votes are reported in XML and have this form:

<recorded-vote><legislator name-id="A000374" sort-field="Abraham" 
unaccented-name="Abraham" party="R" state="LA"
<recorded-vote><legislator name-id="A000370" sort-field="Adams" 
unaccented-name="Adams" party="D" state="NC" 
<recorded-vote><legislator name-id="A000055" sort-field="Aderholt" 
unaccented-name="Aderholt" party="R" state="AL" 
<recorded-vote><legislator name-id="A000371" sort-field="Aguilar" 
unaccented-name="Aguilar" party="D" state="CA"

For a full example:

With the name-id attribute value, I can automatically construct URIs to the Biographical Directory of the United States Congress, for example, the entry on Abraham, Ralph.
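A minimal sketch of that construction, in Python rather than XQuery. The legislator attributes are the ones shown above (closed so they parse); the URI template is a placeholder of my own, so substitute the real Biographical Directory pattern before using it.

```python
import xml.etree.ElementTree as ET

# Assumption: placeholder pattern, NOT the real Biographical Directory URI.
BIO_TEMPLATE = "https://bioguide.example/{name_id}"

# The four legislators from the excerpt above, closed so the XML parses.
SAMPLE = """<vote-data>
  <recorded-vote><legislator name-id="A000374" sort-field="Abraham" party="R" state="LA"/></recorded-vote>
  <recorded-vote><legislator name-id="A000370" sort-field="Adams" party="D" state="NC"/></recorded-vote>
  <recorded-vote><legislator name-id="A000055" sort-field="Aderholt" party="R" state="AL"/></recorded-vote>
  <recorded-vote><legislator name-id="A000371" sort-field="Aguilar" party="D" state="CA"/></recorded-vote>
</vote-data>"""


def bio_uris(vote_xml, template=BIO_TEMPLATE):
    """Return {name-id: URI} for every <legislator> in a roll-call file."""
    root = ET.fromstring(vote_xml)
    return {
        leg.get("name-id"): template.format(name_id=leg.get("name-id"))
        for leg in root.iter("legislator")
        if leg.get("name-id")
    }


for name_id, uri in bio_uris(SAMPLE).items():
    print(name_id, uri)
```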

More information than a poke with a sharp stick would give you, but it’s only self-serving cant.

One of the things that would be nice to link up with roll call votes would be the homepages of those voting.

Continuing with Ralph Abraham, mapping A000374 to would be helpful in gathering other information, such as the various offices where Representative Abraham can be contacted.

If you are reading the URIs, you might think just prepending the last name of each representative to “” would be sufficient. Well, it would be except that there are eighty-three cases where representatives share last names and/or a new naming scheme has more than the last name +

After I was satisfied that there wasn’t a direct mapping between the current uses of name-id and House member websites, I started creating such a mapping that you can drop into XQuery as a lookup table and/or use as an external file.
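The shape of that mapping, as a minimal sketch in Python. Every row below is hypothetical (placeholder name-ids, last names and .example URLs); it only shows why last names alone fail and what the lookup table buys you.

```python
from collections import Counter

# Hypothetical rows: (name-id, last name, website). All values are
# placeholders, not real members or real House URLs.
MEMBERS = [
    ("X000001", "Smith", "https://smith.house.example"),
    ("X000002", "Smith", "https://jsmith.house.example"),
    ("X000003", "Jones", "https://jones.house.example"),
]


def shared_last_names(members):
    """Last names used by more than one member; these defeat name-based URLs."""
    counts = Counter(last for _, last, _ in members)
    return {name: n for name, n in counts.items() if n > 1}


def build_lookup(members):
    """The lookup table itself: name-id -> website, with no guessing."""
    return {name_id: site for name_id, _, site in members}


def website_for(name_id, lookup):
    """Return the website for a name-id, or None if it is not in the table."""
    return lookup.get(name_id)


lookup = build_lookup(MEMBERS)
print(shared_last_names(MEMBERS))      # {'Smith': 2}
print(website_for("X000002", lookup))  # https://jsmith.house.example
```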

The lookup table should be finished tomorrow so check back.

PS: Yes, I am aware there are tables of contact information for members of Congress but I have yet to see one that lists all their local offices. Moreover, a lookup table for XQuery may encourage people to connect more data to their representatives. Such as articles in local newspapers, property deeds and other such material.

Now over 1,000,000 Items to Search on [Cause to Celebrate?]

Wednesday, October 7th, 2015

Now over 1,000,000 Items to Search on Communications and More Added by Andrew Weber.

From the post:

This has been a great year as we continue our push to develop and refine  There were email alerts added in February, treaties and better default text in March, the Federalist Papers and more browse options in May, and accessibility and user requested features in July.  With this October update, Senate Executive Communications from THOMAS have migrated to  There is an About Executive Communications page that provides more detail about the scope of coverage, searching, viewing, and obtaining copies.

Not to mention a new video “help” series, Legislative Subject Terms and Popular and Short Titles.

All good and from one of the few government institutions that merits respect, the Library of Congress.

Why the “Cause to Celebrate?”

This is an excellent start and certainly has shown itself to be far more responsive to user requests than vendors are to reports of software vulnerabilities.

But we are still at the higher level of data, legislation, regulations, etc.

Where needs to follow is a dive downward to identify who obtains the benefits of legislation/regulations? Who obtains permits, for what and at what market value? Who obtains benefits, credits, allowances? Who wins contracts and where does that money go as it tracks down the prime contractor -> sub-prime contractor -> etc. pipeline?

It is ironic that when candidates for president talk about tax reform they tend to focus on the tax tables. Which are two (2) pages out of the current 6,455 pages of the IRC (in pdf,

Knowing who benefits and by how much for the rest of the pages of the IRC isn’t going to make government any cleaner.

But, when paired with campaign contributions, it will give everyone an even footing on buying favors from the government.

Much as public disclosure enables a relatively fair stock exchange, in the case of government it will enable relative fairness in corruption.

Disclosing Government Contracts

Friday, August 21st, 2015

The More the Merrier? How much information on government contracts should be published and who will use it by Gavin Hayman.

From the post:

A huge bunch of flowers to Rick Messick for his excellent post asking two key questions about open contracting. And some luxury cars, expensive seafood and a vat or two of cognac.

Our lavish offerings all come from Slovakia, where in 2013 the Government Public Procurement Office launched a new portal publishing all its government contracts. All these items were part of the excessive government contracting uncovered by journalists, civil society and activists. In the case of the flowers, teachers investigating spending at the Department of Education uncovered florists’ bills for thousands of euros. Spending on all of these has subsequently declined: a small victory for fiscal probity.

The flowers, cars, and cognac help to answer the first of two important questions that Rick posed: Will anyone look at contracting information? In the case of Slovakia, it is clear that lowering the barriers to access information did stimulate some form of response and oversight.

The second question was equally important: “How much contracting information should be disclosed?”, especially in commercially sensitive circumstances.

These are two of key questions that we have been grappling with in our strategy at the Open Contracting Partnership. We thought that we would share our latest thinking below, in a post that is a bit longer than usual. So grab a cup of tea and have a read. We’ll be definitely looking forward to your continued thoughts on these issues.

Not a short read so do grab some coffee (outside of Europe) and settle in for a good read.

Disclosure: I’m financially interested in government disclosure in general and contracts in particular. With openness there comes more effort to conceal semantics and increase the need for topic maps to pierce the darkness.

I don’t think openness reduces the amount of fraud and misconduct in government, it only gives an alignment between citizens and the career interests of a prosecutor a sporting chance to catch someone out.

Disclosure should be as open as possible and what isn’t disclosed voluntarily, well, one hopes for brave souls who will leak the remainder.

Support disclosure of government contracts and leakers of the same.

If you need help “connecting the dots,” consider topic maps.

(beta)

Wednesday, July 15th, 2015

(beta)

From the announcement post:

In February this year we announced that we will be iteratively improving the user experience. Today we are launching the new Beta site. There are many changes and we hope you will like them.

  • Dataset pages have been greatly simplified so that you can get to your data within two clicks.
  • We have re-written many of the descriptions to simply explanations.
  • We have launched which is aimed at non-developers to search and then download data.
  • We have also greatly improved and revised our API documentation. For example have a look here
  • We have added content from our blog and twitter feeds into the home page and I hope you agree that we are now presenting a more cohesive offering.

We are still working on datasets, and those in the pipeline waiting for release imminently are

  • Bills meta-data for bills going through Parliamentary process.
  • Commons Select Committee meta-data.
  • Deposited Papers
  • Lords Attendance data

Let us know what you think.

There could be some connection between what the government says publicly and what it does privately. As they say, “anything is possible.”

Curious, what do you make of the Thesaurus?

Typing the “related” link to say how they are related would be a step in the right direction. Apparently there is an organization with the title: “‘Sdim Curo Plant!” (other sources report Welsh for “Children are Unbeatable”.) Which turns out to be the preferred label.

The entire set has 107,337 records and can be downloaded, albeit in 500 record chunks. That should improve over time according to: Downloading data from data.parliament.
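For a sense of what 500 record chunks means in practice, a quick back-of-the-envelope sketch. The two totals come from the paragraph above; the zero-based page numbering is an assumption of mine, not the documented paging scheme of the site.

```python
import math

TOTAL_RECORDS = 107_337  # from the post
CHUNK_SIZE = 500         # from the post


def page_ranges(total, size):
    """Yield (page, first_record, last_record), zero-based and inclusive."""
    pages = math.ceil(total / size)
    for page in range(pages):
        first = page * size
        last = min(first + size, total) - 1
        yield page, first, last


ranges = list(page_ranges(TOTAL_RECORDS, CHUNK_SIZE))
print(len(ranges))  # 215 requests
print(ranges[-1])   # (214, 107000, 107336), a final chunk of 337 records
```

So the full thesaurus is 215 requests, which is worth scripting rather than clicking.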

I have always been interested in what terms other people use and this looks like an interesting data set, that is part of a larger interesting data set.


Nominations by the U.S. President

Monday, July 13th, 2015

Nominations by the U.S. President

The Library of Congress created this resource which enables you to search for nominations by U.S. Presidents starting in 1981. There is information about the nomination process, the records and related nomination resources at About Nominations of the U.S. Congress.

Unfortunately I did not find a link to bulk data for presidential nominations nor an API for the search engine behind this webpage.

I say that because matching up nominees and/or their sponsors with campaign contributions would help get a price range on becoming the ambassador to Uruguay, etc.

I wrote to Ask a Law Librarian to check on the status of bulk data and/or an API. Will amend this post when I get a response.

Oh, there will be a response. For all the ills and failures of the U.S. government, which are legion, it is capable of assembling vast amounts of information and training people to perform research on it. Not in every case but if it falls within the purview of the Law Library of Congress, I am confident of a useful answer.

World Factbook 2015 (paper, online, downloadable)

Wednesday, June 24th, 2015

World Factbook 2015 (GPO)

From the webpage:

The Central Intelligence Agency’s World Factbook provides brief information on the history, geography, people, government, economy, communications, transportation, military, and transnational issues for 267 countries and regions around world.

The CIA’s World Factbook also contains several appendices and maps of major world regions, which are located at the very end of the publication. The appendices cover abbreviations, international organizations and groups, selected international environmental agreements, weights and measures, cross-reference lists of country and hydrographic data codes, and geographic names.

For maps, it provides a country map for each country entry and a total of 12 regional reference maps that display the physical features and political boundaries of each world region. It also includes a pull-out Flags of the World, a Physical Map of the World, a Political Map of the World, and a Standard Time Zones of the World map.

Who should read The World Factbook? It is a great one-stop reference for anyone looking for an expansive body of international data on world statistics, and has been a must-have publication for:

  • US Government officials and diplomats
  • News organizations and researchers
  • Corporations and geographers
  • Teachers, professors, librarians, and students
  • Anyone who travels abroad or who is interested in foreign countries

The print version is $89.00 (U.S.), is 923 pages long and weighs in at 5.75 lb. in paperback.

A convenient and frequently updated alternative is the online CIA World Factbook.

I can’t compare the two versions because I am not going to spend $89.00 for an arm wrecker. 😉

You can also download a copy of the HTML version.

I downloaded and unzipped the file, only to find that the last update was in June, 2014.

That may be updated soon or it may not. I really don’t know.

If you just need background information that is unlikely to change or you want to avoid surveillance on what countries you look at and for how long, download the 2014 HTML version or pony up for the 2015 paper version.

Saudi Cables (or file dump?)

Saturday, June 20th, 2015

WikiLeaks publishes the Saudi Cables

From the post:

Today, Friday 19th June at 1pm GMT, WikiLeaks began publishing The Saudi Cables: more than half a million cables and other documents from the Saudi Foreign Ministry that contain secret communications from various Saudi Embassies around the world. The publication includes “Top Secret” reports from other Saudi State institutions, including the Ministry of Interior and the Kingdom’s General Intelligence Services. The massive cache of data also contains a large number of email communications between the Ministry of Foreign Affairs and foreign entities. The Saudi Cables are being published in tranches of tens of thousands of documents at a time over the coming weeks. Today WikiLeaks is releasing around 70,000 documents from the trove as the first tranche.

Julian Assange, WikiLeaks publisher, said: “The Saudi Cables lift the lid on a increasingly erratic and secretive dictatorship that has not only celebrated its 100th beheading this year, but which has also become a menace to its neighbours and itself.

The Kingdom of Saudi Arabia is a hereditary dictatorship bordering the Persian Gulf. Despite the Kingdom’s infamous human rights record, Saudi Arabia remains a top-tier ally of the United States and the United Kingdom in the Middle East, largely owing to its globally unrivalled oil reserves. The Kingdom frequently tops the list of oil-producing countries, which has given the Kingdom disproportionate influence in international affairs. Each year it pushes billions of petro-dollars into the pockets of UK banks and US arms companies. Last year it became the largest arms importer in the world, eclipsing China, India and the combined countries of Western Europe. The Kingdom has since the 1960s played a major role in the Organization of Petroleum Exporting Countries (OPEC) and the Cooperation Council for the Arab States of the Gulf (GCC) and dominates the global Islamic charity market.

For 40 years the Kingdom’s Ministry of Foreign Affairs was headed by one man: Saud al Faisal bin Abdulaziz, a member of the Saudi royal family, and the world’s longest-serving foreign minister. The end of Saud al Faisal’s tenure, which began in 1975, coincided with the royal succession upon the death of King Abdullah in January 2015. Saud al Faisal’s tenure over the Ministry covered its handling of key events and issues in the foreign relations of Saudi Arabia, from the fall of the Shah and the second Oil Crisis to the September 11 attacks and its ongoing proxy war against Iran. The Saudi Cables provide key insights into the Kingdom’s operations and how it has managed its alliances and consolidated its position as a regional Middle East superpower, including through bribing and co-opting key individuals and institutions. The cables also illustrate the highly centralised bureaucratic structure of the Kingdom, where even the most minute issues are addressed by the most senior officials.

Since late March 2015 the Kingdom of Saudi Arabia has been involved in a war in neighbouring Yemen. The Saudi Foreign Ministry in May 2015 admitted to a breach of its computer networks. Responsibility for the breach was attributed to a group calling itself the Yemeni Cyber Army. The group subsequently released a number of valuable “sample” document sets from the breach on file-sharing sites, which then fell under censorship attacks. The full WikiLeaks trove comprises thousands of times the number of documents and includes hundreds of thousands of pages of scanned images of Arabic text. In a major journalistic research effort, WikiLeaks has extracted the text from these images and placed them into our searchable database. The trove also includes tens of thousands of text files and spreadsheets as well as email messages, which have been made searchable through the WikiLeaks search engine.

By coincidence, the Saudi Cables release also marks two other events. Today marks three years since WikiLeaks founder Julian Assange entered the Ecuadorian Embassy in London seeking asylum from US persecution, having been held for almost five years without charge in the United Kingdom. Also today Google revealed that it had been been forced to hand over more data to the US government in order to assist the prosecution of WikiLeaks staff under US espionage charges arising from our publication of US diplomatic cables.

A searcher with good Arabic skills is going to be necessary to take full advantage of this release.

I am unsure about the title: “Saudi Cables” because some of the documents I retrieved searching for “Bush,” were public interviews and statements. Hardly the burning secrets that are hinted at by “cables.” See for example, Exclusive Interview with Daily Telegraph 27-2-2005.doc or Interview with Wall Street Joutnal 26-4-2004.doc.

Putting “public document” in the words to exclude filter doesn’t eliminate the published interviews.

This has the potential, particularly out of more than 500,000 documents, to have some interesting tidbits. The first step would be to winnow out all published and/or public statements, in English and/or Arabic. Not discarded but excluded from search results until you need to make connections between secret statements and public ones.

A second step would be to identify the author/sender/receiver of each document so they can be matched to known individuals and events.
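
A minimal sketch of that first winnowing pass, assuming the documents have been downloaded and that already-public material can be guessed from keywords in the file name. The keyword list and the `partition_documents` helper are my own illustration, not anything WikiLeaks provides, and file names alone will miss plenty:

```python
# Hypothetical keyword list; extend it with Arabic equivalents as needed.
PUBLIC_MARKERS = ("interview", "press release", "statement", "speech")


def is_probably_public(filename: str) -> bool:
    """Guess from the file name alone whether a document was already public."""
    name = filename.lower()
    return any(marker in name for marker in PUBLIC_MARKERS)


def partition_documents(filenames):
    """Split file names into (set_aside, keep) without discarding anything."""
    set_aside, keep = [], []
    for name in filenames:
        (set_aside if is_probably_public(name) else keep).append(name)
    return set_aside, keep


sample = [
    "Exclusive Interview with Daily Telegraph 27-2-2005.doc",
    "Interview with Wall Street Joutnal 26-4-2004.doc",
    "untitled-scan-0001.pdf",  # made-up name, for illustration only
]
set_aside, keep = partition_documents(sample)
print(len(set_aside), "set aside;", len(keep), "kept for search")
```

The set-aside list is kept rather than deleted, so the public statements are still on hand when it is time to compare them with the secret ones.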

This is a great opportunity to practice your Arabic NLP skills. Or Arabic for that matter.

Hopefully Wikileaks will not decide to act as public censor with regard to these documents.

Governments do enough withholding of the truth. They don’t need the assistance of Wikileaks.

The Political One Percent of the One Percent:…

Wednesday, June 10th, 2015

The Political One Percent of the One Percent: Megadonors fuel rising cost of elections in 2014 by Peter Olsen-Phillips, Russ Choma, Sarah Bryner, and Doug Weber.

From the post:

In the 2014 elections, 31,976 donors — equal to roughly one percent of one percent of the total population of the United States — accounted for an astounding $1.18 billion in disclosed political contributions at the federal level. Those big givers — what we have termed the “Political One Percent of the One Percent” — have a massively outsized impact on federal campaigns.

They’re mostly male, tend to be city-dwellers and often work in finance. Slightly more of them skew Republican than Democratic. A small subset — barely five dozen — earned the (even more) rarefied distinction of giving more than $1 million each. And a minute cluster of three individuals contributed more than $10 million apiece.

The last election cycle set records as the most expensive midterms in U.S. history, and the country’s most prolific donors accounted for a larger portion of the total amount raised than in either of the past two elections.

The $1.18 billion they contributed represents 29 percent of all fundraising that political committees disclosed to the Federal Election Commission in 2014. That’s a greater share of the total than in 2012 (25 percent) or in 2010 (21 percent).

It’s just one of the main takeaways in the latest edition of the Political One Percent of the One Percent, a joint analysis of elite donors in America by the Center for Responsive Politics and the Sunlight Foundation.

BTW, although the report says conservatives “edged their liberal opponents,” the Republicans raised $553 million and Democrats raised $505 million from donors on the one percent of the one percent list. The $48 million difference isn’t rounding-error size, but once you break one-half billion dollars, it doesn’t seem as large as it might otherwise.
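
To put that difference in proportion, here is a quick back-of-the-envelope check using only the figures quoted above (the variable names are mine):

```python
# Figures as quoted above, in millions of dollars.
republican = 553
democratic = 505
total_disclosed = 1180  # the $1.18 billion from all donors on the list

gap = republican - democratic
partisan_total = republican + democratic

print(f"gap: ${gap} million")
print(f"gap as share of the two-party total: {gap / partisan_total:.1%}")
print(f"gap as share of all megadonor giving: {gap / total_disclosed:.1%}")
```

Either way you slice it, the gap is under five percent of the money involved.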

As far as I can tell, the report does not reproduce the addresses of the one percent of one percent donors. For that you need to use the advanced search option at the FEC and put 8810 (no dollar sign needed) in the first “amount range” box, set the date range to 2014 to 2015 and then search. Quite a long list so you may want to do it by state.

To get the individual location information, follow the transaction number at the end of each record returned by your query, which returns a PDF page. Somewhere on that page will be the address information for the donor.
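
If you would rather work from saved results than page through them, a minimal sketch of the filtering step follows. It assumes you have already exported the records as (donor, state, amount) rows, and that the $8810 minimum applies to a donor's total giving rather than to each gift. The `megadonors` helper and the sample rows are hypothetical, not anything the FEC supplies:

```python
from collections import defaultdict

THRESHOLD = 8810  # the one percent of one percent minimum cited in the report


def megadonors(rows, threshold=THRESHOLD):
    """Sum contributions per (donor, state) and keep totals at or above threshold."""
    totals = defaultdict(int)
    for donor, state, amount in rows:
        totals[(donor, state)] += amount
    return {key: total for key, total in totals.items() if total >= threshold}


# Made-up records, for illustration only.
rows = [
    ("DONOR A", "GA", 5000),
    ("DONOR A", "GA", 4000),
    ("DONOR B", "NY", 2600),
]
for (donor, state), total in sorted(megadonors(rows).items()):
    print(state, donor, total)
```

Summing before filtering catches donors who reach the minimum through several smaller gifts, and keying on state keeps the by-state breakdown suggested above.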

As far as campaign finance goes, the report indicates you need to find another way to influence the political process. Any donation much below the one percent of one percent minimum, i.e., $8810, isn’t going to buy you any influence. In fact, you are subsidizing the cost of a campaign that benefits the big donors the most. If big donors want to buy those campaigns, let them support the entire campaign.

In a sound bite: Don’t subsidize major political donors with small contributions.

Once you have identified the one percent of one percent donors, you can start to work out the other relationships between those donors and the levers of power.

Fast Track to the Corporate Wish List [Is There A Hacker In The House?]

Tuesday, June 9th, 2015

Fast Track to the Corporate Wish List by David Dayen.

From the post:

Some time in the next several days, the House will likely vote on trade promotion authority, enabling the Obama administration to proceed with its cherished Trans-Pacific Partnership (TPP). Most House Democrats want no part of the deal, which was crafted by and for corporations. And many Tea Party Republicans don’t want to hand the administration any additional powers, even in service of a victory dearly sought by the GOP’s corporate allies. The vote, which has been repeatedly delayed as both the White House and House GOP leaders try to round up support, is expected to be extremely close.

The Obama administration entered office promising to renegotiate unbalanced trade agreements, which critics believe have cost millions of manufacturing jobs in the past 20 years. But they’ve spent more than a year pushing the TPP, a deal with 11 Pacific Rim nations that mostly adheres to the template of corporate favors masquerading as free trade deals. Of the 29 TPP chapters, only five include traditional trade measures like reducing tariffs and opening markets. Based on leaks and media reports—the full text remains a well-guarded secret—the rest appears to be mainly special-interest legislation.

Pharmaceutical companies, software makers, and Hollywood conglomerates get expanded intellectual property enforcement, protecting their patents and their profits. Some of this, such as restrictions on generic drugs, is at the expense of competition and consumers. Firms get improved access to poor countries with nonexistent labor protections, like Vietnam or Brunei, to manufacture their goods. TPP provides assurances that regulations, from food safety to financial services, will be “harmonized” across borders. In practice, that means a regulatory ceiling. In one of the most contested provisions, corporations can use the investor-state dispute settlement (ISDS) process, and appeal to extra-judicial tribunals that bypass courts and usual forms of due process to seek monetary damages equaling “expected future profits.”

How did we reach this point—where “trade deals” are Trojan horses for fulfilling corporate wish lists, and where all presidents, Democrat or Republican, ultimately pay fealty to them? One place to look is in the political transfer of power, away from Congress and into a relatively obscure executive branch office, the Office of the United States Trade Representative (USTR).

USTR has become a way station for hundreds of officials who casually rotate between big business and the government. Currently, Michael Froman, former Citigroup executive and chief of staff to Robert Rubin, runs USTR, and his actions have lived up to the agency’s legacy as the white-shoe law firm for multinational corporations. Under Froman’s leadership, more ex-lobbyists have funneled through USTR, practically no enforcement of prior trade violations has taken place, and new agreements like TPP are dubiously sold as progressive achievements, laced with condescension for anyone who disagrees.

David does a great job of sketching the background for both the Trans-Pacific Partnership and the U.S. Trade Representative.

Given the hundreds of people, nation states and corporations that have access to the text of the Trans-Pacific Partnership, don’t you wonder why it remains secret?

I don’t think President Obama and his business cronies realize that secrecy of an agreement that will affect the vast majority of American citizens strikes at the legitimacy of government itself. True enough, corporations that own entire swaths of Congress are going to get more benefits than the average American. Those benefits are out in the open and citizens can press for benefits as well.

The benefits that accrue to corporations under the Trans-Pacific Partnership will be gained in secret, with little or no opportunity for the average citizen to object. There is something fundamentally unfair about the secret securing of benefits for corporations.

I hope that Obama doesn’t complain about “illegal” activity that foils his plan to secretly favor corporations. I won’t be listening. Will you?

Yemen Cyber Army will release 1M of records per week to stop Saudi Attacks

Sunday, May 31st, 2015

Yemen Cyber Army will release 1M of records per week to stop Saudi Attacks by Pierluigi Paganini.

From the post:

Hackers of the Yemen Cyber Army (YCA) had dumped another 1,000,000 records obtained by violating systems at the Saudi Ministry of Foreign Affairs.

The hacking crew known as the Yemen Cyber Army is continuing its campaign against the Government of Saudi Arabia.

The Yemen Cyber Army (YCA) has released other data from the stolen archived belonging to the Saudi Ministry of Foreign Affairs. The data breach was confirmed by the authorities, Osama bin Ahmad al-Sanousi, a senior official at the kingdom’s Foreign Ministry, made the announcement last week.

Now the hackers have released a new data dump containing 1,000,000 Records ff Saudi VISA Database, they also announced that every week they will release a new lot of 1M records. The Yemen Cyber Army have also shared secret documents of the Private Saudi MOFA with Wikileaks.

The hackers of the Yemen Cyber Army have released 10 records from the archive including a huge amount of data.

The Website has published a detailed analysis of the dump published by the Yemen Cyber Army. reports that the latest dump is mostly visa data.

Good to know that the Yemen Cyber Army is backing up their data with Wikileaks but I don’t think of Wikileaks as a transparent source of government documents. For reasons best known to themselves, Wikileaks has taken on the role of government censor with regard to the information it releases. Acknowledging the critical role Wikileaks has played in recent public debates doesn’t blind me to their arrogation of the role of public censor.

Speaking of data dumps, where are the diplomatic records from Iraq? Before or since becoming a puppet government for the United States?

In the meantime, keep watching for more data dumps from the Yemen Cyber Army.

Open Data: Getting Started/Finding

Friday, May 8th, 2015

Data Science – Getting Started With Open Data

23 Resources for Finding Open Data

Ryan Swanstrom has put together two posts that will have you using and finding open data.

“Open data” can be a boon to researchers and others, but you should ask the following questions (among others) of any data set:

  1. Who collected the data?
  2. Why was the data collected?
  3. How was the recorded data selected?
  4. How large was the potential data pool?
  5. Was the original data cleaned after collection?
  6. If the original data was cleaned, by what criteria?
  7. How was the accuracy of the data measured?
  8. What instruments were used to collect the data?
  9. How were the instruments used to collect the data developed?
  10. How were the instruments used to collect the data validated?
  11. What publications have relied upon the data?
  12. How did you determine the semantics of the data?

That’s not a complete set but a good starting point.

Just because data is available, open, free, etc. doesn’t mean that it is useful. The best example is the still-in-print Budge translation The book of the dead : the papyrus of Ani in the British Museum. The original was published in 1895, making the current reprints more than a century out of date.

It is a very attractive reproduction (it is rare to see hieroglyphic text with inter-linear transliteration and translation in modern editions) of the papyrus of Ani, but it gives a misleading impression of the state of modern knowledge and translation of Middle Egyptian.

Of course, some readers are satisfied with century-old encyclopedias as well, but I would not rely upon them or their sources for advice.

Open But Recorded Access

Thursday, May 7th, 2015

Search Airmen Certificate Information

Registry of certified pilots.

From the search page:


I didn’t perform a search so I don’t have a feel for what, if any, validation is done on the requested searcher information.

If you are on Tor, you might want to consider using the address for Wrigley Field, 1060 W Addison St, Chicago, IL 60613, to see if it complains.

Bureau of Transportation Statistics

Thursday, May 7th, 2015

Bureau of Transportation Statistics

I discovered this site while looking for “official” statistics to debunk claims about air travel and screening for terrorists. (Begging National Security Questions #1)

I didn’t find it an easy site to navigate but that probably reflects my lack of familiarity with the data being collected. A short guide with a very good index would be quite useful.

A real treasure trove of transportation information (from the about page):

Major Programs of the Bureau of Transportation Statistics (BTS)

It is important to remember that federal agencies (and their equivalents under other governments) have distinct agendas. When confronting outlandish claims from one of the security agencies, it helps to have contradictory data gathered by other, “disinterested,” agencies of the same government.

Security types can dismiss your evidence and analysis as “that’s what you think.” After all, their world is nothing but suspicion and conjecture. Why shouldn’t that be true for others?

Not as easy to dismiss data and analysis by other government agencies.