Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

June 20, 2015

Saudi Cables (or file dump?)

Filed under: Government,Government Data — Patrick Durusau @ 1:41 pm

WikiLeaks publishes the Saudi Cables

From the post:

Today, Friday 19th June at 1pm GMT, WikiLeaks began publishing The Saudi Cables: more than half a million cables and other documents from the Saudi Foreign Ministry that contain secret communications from various Saudi Embassies around the world. The publication includes “Top Secret” reports from other Saudi State institutions, including the Ministry of Interior and the Kingdom’s General Intelligence Services. The massive cache of data also contains a large number of email communications between the Ministry of Foreign Affairs and foreign entities. The Saudi Cables are being published in tranches of tens of thousands of documents at a time over the coming weeks. Today WikiLeaks is releasing around 70,000 documents from the trove as the first tranche.

Julian Assange, WikiLeaks publisher, said: “The Saudi Cables lift the lid on an increasingly erratic and secretive dictatorship that has not only celebrated its 100th beheading this year, but which has also become a menace to its neighbours and itself.

The Kingdom of Saudi Arabia is a hereditary dictatorship bordering the Persian Gulf. Despite the Kingdom’s infamous human rights record, Saudi Arabia remains a top-tier ally of the United States and the United Kingdom in the Middle East, largely owing to its globally unrivalled oil reserves. The Kingdom frequently tops the list of oil-producing countries, which has given the Kingdom disproportionate influence in international affairs. Each year it pushes billions of petro-dollars into the pockets of UK banks and US arms companies. Last year it became the largest arms importer in the world, eclipsing China, India and the combined countries of Western Europe. The Kingdom has since the 1960s played a major role in the Organization of Petroleum Exporting Countries (OPEC) and the Cooperation Council for the Arab States of the Gulf (GCC) and dominates the global Islamic charity market.

For 40 years the Kingdom’s Ministry of Foreign Affairs was headed by one man: Saud al Faisal bin Abdulaziz, a member of the Saudi royal family, and the world’s longest-serving foreign minister. The end of Saud al Faisal’s tenure, which began in 1975, coincided with the royal succession upon the death of King Abdullah in January 2015. Saud al Faisal’s tenure over the Ministry covered its handling of key events and issues in the foreign relations of Saudi Arabia, from the fall of the Shah and the second Oil Crisis to the September 11 attacks and its ongoing proxy war against Iran. The Saudi Cables provide key insights into the Kingdom’s operations and how it has managed its alliances and consolidated its position as a regional Middle East superpower, including through bribing and co-opting key individuals and institutions. The cables also illustrate the highly centralised bureaucratic structure of the Kingdom, where even the most minute issues are addressed by the most senior officials.

Since late March 2015 the Kingdom of Saudi Arabia has been involved in a war in neighbouring Yemen. The Saudi Foreign Ministry in May 2015 admitted to a breach of its computer networks. Responsibility for the breach was attributed to a group calling itself the Yemeni Cyber Army. The group subsequently released a number of valuable “sample” document sets from the breach on file-sharing sites, which then fell under censorship attacks. The full WikiLeaks trove comprises thousands of times the number of documents and includes hundreds of thousands of pages of scanned images of Arabic text. In a major journalistic research effort, WikiLeaks has extracted the text from these images and placed them into our searchable database. The trove also includes tens of thousands of text files and spreadsheets as well as email messages, which have been made searchable through the WikiLeaks search engine.

By coincidence, the Saudi Cables release also marks two other events. Today marks three years since WikiLeaks founder Julian Assange entered the Ecuadorian Embassy in London seeking asylum from US persecution, having been held for almost five years without charge in the United Kingdom. Also today Google revealed that it had been forced to hand over more data to the US government in order to assist the prosecution of WikiLeaks staff under US espionage charges arising from our publication of US diplomatic cables.

A searcher with good Arabic skills is going to be necessary to take full advantage of this release.

I am unsure about the title “Saudi Cables” because some of the documents I retrieved searching for “Bush” were public interviews and statements. Hardly the burning secrets that are hinted at by “cables.” See, for example, Exclusive Interview with Daily Telegraph 27-2-2005.doc or Interview with Wall Street Joutnal 26-4-2004.doc.

Putting “public document” in the “words to exclude” filter doesn’t eliminate the published interviews.

With more than 500,000 documents, this release has the potential to contain some interesting tidbits. The first step would be to winnow out all published and/or public statements, in English and/or Arabic. Not discarded, but excluded from search results until you need to make connections between secret statements and public ones.
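As a first pass, filename screening alone can separate obviously public material from the rest. A minimal sketch in Python, assuming you have a local list of document filenames (the patterns and the cable filename below are illustrative assumptions, not drawn from the actual trove):

import re

# Filename markers that suggest a document is already public
# (interviews, press statements, speeches). Illustrative only.
PUBLIC_PATTERNS = [
    re.compile(r"interview", re.IGNORECASE),
    re.compile(r"statement", re.IGNORECASE),
    re.compile(r"press", re.IGNORECASE),
    re.compile(r"speech", re.IGNORECASE),
]

def is_probably_public(filename):
    """Flag documents whose names suggest prior publication."""
    return any(p.search(filename) for p in PUBLIC_PATTERNS)

def partition(filenames):
    """Split documents into (public, remainder) without discarding either."""
    public = [f for f in filenames if is_probably_public(f)]
    remainder = [f for f in filenames if not is_probably_public(f)]
    return public, remainder

docs = [
    "Exclusive Interview with Daily Telegraph 27-2-2005.doc",
    "Interview with Wall Street Joutnal 26-4-2004.doc",
    "riyadh-cable-0001.pdf",  # hypothetical filename
]
public, remainder = partition(docs)
print(len(public), "probably public;", len(remainder), "to search first")

Filename screening will of course miss public documents with uninformative names; matching document content against published interview transcripts would be the obvious refinement.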

A second step would be to identify the author/sender/receiver of each document so they can be matched to known individuals and events.

This is a great opportunity to practice your Arabic NLP skills. Or your Arabic, for that matter.
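Even before tackling morphology or named entities, simply deciding which files are primarily Arabic script is useful for routing documents to the right pipeline. A minimal sketch using Unicode ranges (the 0.5 threshold is an arbitrary assumption):

def arabic_ratio(text):
    """Fraction of alphabetic characters in the basic Arabic block (U+0600-U+06FF)."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for c in letters if "\u0600" <= c <= "\u06ff")
    return arabic / len(letters)

def is_mostly_arabic(text, threshold=0.5):
    """Route a document to the Arabic pipeline if most of its letters are Arabic."""
    return arabic_ratio(text) >= threshold

print(is_mostly_arabic("وزارة الخارجية"))               # True
print(is_mostly_arabic("Ministry of Foreign Affairs"))  # False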

Hopefully Wikileaks will not decide to act as public censor with regard to these documents.

Governments do enough withholding of the truth. They don’t need the assistance of Wikileaks.

June 10, 2015

The Political One Percent of the One Percent:…

Filed under: Government,Government Data,Politics — Patrick Durusau @ 1:21 pm

The Political One Percent of the One Percent: Megadonors fuel rising cost of elections in 2014 by Peter Olsen-Phillips, Russ Choma, Sarah Bryner, and Doug Weber.

From the post:

In the 2014 elections, 31,976 donors — equal to roughly one percent of one percent of the total population of the United States — accounted for an astounding $1.18 billion in disclosed political contributions at the federal level. Those big givers — what we have termed the “Political One Percent of the One Percent” — have a massively outsized impact on federal campaigns.

They’re mostly male, tend to be city-dwellers and often work in finance. Slightly more of them skew Republican than Democratic. A small subset — barely five dozen — earned the (even more) rarefied distinction of giving more than $1 million each. And a minute cluster of three individuals contributed more than $10 million apiece.

The last election cycle set records as the most expensive midterms in U.S. history, and the country’s most prolific donors accounted for a larger portion of the total amount raised than in either of the past two elections.

The $1.18 billion they contributed represents 29 percent of all fundraising that political committees disclosed to the Federal Election Commission in 2014. That’s a greater share of the total than in 2012 (25 percent) or in 2010 (21 percent).

It’s just one of the main takeaways in the latest edition of the Political One Percent of the One Percent, a joint analysis of elite donors in America by the Center for Responsive Politics and the Sunlight Foundation.

BTW, although the report says conservatives “edged their liberal opponents,” the Republicans raised $553 million and Democrats raised $505 million from donors on the one percent of the one percent list. The $48 million difference isn’t rounding-error size, but once you break one-half billion dollars, it doesn’t seem as large as it might otherwise.

As far as I can tell, the report does not reproduce the addresses of the one percent of one percent donors. For that you need to use the advanced search option at the FEC and put 8810 (no dollar sign needed) in the first “amount range” box, set the date range to 2014 to 2015 and then search. Quite a long list so you may want to do it by state.

To get the individual location information, you need to follow the transaction number at the end of each record returned by your query, which leads to a PDF page. Somewhere on that page will be the address information for the donor.
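If you would rather not page through the web interface, the same cut can be made against the FEC’s bulk individual-contributions download. A minimal sketch, assuming the pipe-delimited layout described in the header file that accompanies the download (the column positions below are assumptions to verify against that header):

import csv

# Assumed column positions in the FEC bulk individual-contributions file;
# check these against the FEC's published header description.
NAME, CITY, STATE, ZIP_CODE, TRANSACTION_AMT = 7, 8, 9, 10, 14
THRESHOLD = 8810  # the one percent of the one percent minimum

def megadonor_rows(path, state=None):
    """Yield contributions at or above the threshold, optionally for one state."""
    with open(path, newline="", encoding="latin-1") as f:
        for row in csv.reader(f, delimiter="|"):
            try:
                amount = float(row[TRANSACTION_AMT])
            except (IndexError, ValueError):
                continue  # skip malformed rows
            if amount >= THRESHOLD and (state is None or row[STATE] == state):
                yield row[NAME], row[CITY], row[STATE], row[ZIP_CODE], amount

for name, city, st, zipc, amt in megadonor_rows("itcont.txt", state="IL"):
    print(f"{name}\t{city}, {st} {zipc}\t${amt:,.0f}")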

As far as campaign finance goes, the report indicates you need to find another way to influence the political process. Any donation much below the one percent of one percent minimum, i.e., $8,810, isn’t going to buy you any influence. In fact, you are subsidizing the cost of a campaign that benefits the big donors the most. If big donors want to buy those campaigns, let them support the entire campaign.

In a sound bite: Don’t subsidize major political donors with small contributions.

Once you have identified the one percent of one percent donors, you can start to work out the other relationships between those donors and the levers of power.

June 9, 2015

Fast Track to the Corporate Wish List [Is There A Hacker In The House?]

Filed under: Government,Government Data,Law,Politics — Patrick Durusau @ 6:19 pm

Fast Track to the Corporate Wish List by David Dayen.

From the post:

Some time in the next several days, the House will likely vote on trade promotion authority, enabling the Obama administration to proceed with its cherished Trans-Pacific Partnership (TPP). Most House Democrats want no part of the deal, which was crafted by and for corporations. And many Tea Party Republicans don’t want to hand the administration any additional powers, even in service of a victory dearly sought by the GOP’s corporate allies. The vote, which has been repeatedly delayed as both the White House and House GOP leaders try to round up support, is expected to be extremely close.

The Obama administration entered office promising to renegotiate unbalanced trade agreements, which critics believe have cost millions of manufacturing jobs in the past 20 years. But they’ve spent more than a year pushing the TPP, a deal with 11 Pacific Rim nations that mostly adheres to the template of corporate favors masquerading as free trade deals. Of the 29 TPP chapters, only five include traditional trade measures like reducing tariffs and opening markets. Based on leaks and media reports—the full text remains a well-guarded secret—the rest appears to be mainly special-interest legislation.

Pharmaceutical companies, software makers, and Hollywood conglomerates get expanded intellectual property enforcement, protecting their patents and their profits. Some of this, such as restrictions on generic drugs, is at the expense of competition and consumers. Firms get improved access to poor countries with nonexistent labor protections, like Vietnam or Brunei, to manufacture their goods. TPP provides assurances that regulations, from food safety to financial services, will be “harmonized” across borders. In practice, that means a regulatory ceiling. In one of the most contested provisions, corporations can use the investor-state dispute settlement (ISDS) process, and appeal to extra-judicial tribunals that bypass courts and usual forms of due process to seek monetary damages equaling “expected future profits.”

How did we reach this point—where “trade deals” are Trojan horses for fulfilling corporate wish lists, and where all presidents, Democrat or Republican, ultimately pay fealty to them? One place to look is in the political transfer of power, away from Congress and into a relatively obscure executive branch office, the Office of the United States Trade Representative (USTR).

USTR has become a way station for hundreds of officials who casually rotate between big business and the government. Currently, Michael Froman, former Citigroup executive and chief of staff to Robert Rubin, runs USTR, and his actions have lived up to the agency’s legacy as the white-shoe law firm for multinational corporations. Under Froman’s leadership, more ex-lobbyists have funneled through USTR, practically no enforcement of prior trade violations has taken place, and new agreements like TPP are dubiously sold as progressive achievements, laced with condescension for anyone who disagrees.

David does a great job of sketching the background both for the Trans-Pacific Partnership and for the U.S. Trade Representative.

Given the hundreds of people, nation states and corporations that have access to the text of the Trans-Pacific Partnership, don’t you wonder why it remains secret?

I don’t think President Obama and his business cronies realize that secrecy of an agreement that will affect the vast majority of American citizens strikes at the legitimacy of government itself. True enough, corporations that own entire swaths of Congress are going to get more benefits than the average American. Those benefits are out in the open and citizens can press for benefits as well.

The benefits that accrue to corporations under the Trans-Pacific Partnership will be gained in secret, with little or no opportunity for the average citizen to object. There is something fundamentally unfair about the secret securing of benefits for corporations.

I hope that Obama doesn’t complain about “illegal” activity that foils his plan to secretly favor corporations. I won’t be listening. Will you?

May 31, 2015

Yemen Cyber Army will release 1M of records per week to stop Saudi Attacks

Filed under: Cybersecurity,Government,Government Data,Security — Patrick Durusau @ 7:08 am

Yemen Cyber Army will release 1M of records per week to stop Saudi Attacks by Pierluigi Paganini.

From the post:

Hackers of the Yemen Cyber Army (YCA) had dumped another 1,000,000 records obtained by violating systems at the Saudi Ministry of Foreign Affairs.

The hacking crew known as the Yemen Cyber Army is continuing its campaign against the Government of Saudi Arabia.

The Yemen Cyber Army (YCA) has released more data from the stolen archive belonging to the Saudi Ministry of Foreign Affairs. The data breach was confirmed by the authorities; Osama bin Ahmad al-Sanousi, a senior official at the kingdom’s Foreign Ministry, made the announcement last week.

Now the hackers have released a new data dump containing 1,000,000 records of the Saudi VISA database; they also announced that every week they will release a new lot of 1M records. The Yemen Cyber Army has also shared secret documents of the Private Saudi MOFA with Wikileaks.

The hackers of the Yemen Cyber Army have released 10 records from the archive, including a huge amount of data.

http://pastebin.com/VRGh3imf
http://quickleak.org/3vShKvD4
http://paste.yt/p3418.html

Mirror #1 : http://mymusicexpert.com/images/upload/VISA-1M-W1.rar
Mirror #2 : http://distant.voipopp.vn.ua/PEAR/upload/VISA-1M-W1.rar
Mirror #3 : http://intrasms.com/css/VISA-1M-W1.rar

The Website databreaches.net has published a detailed analysis of the dump published by the Yemen Cyber Army.

Databreaches.net reports that the latest dump is mostly visa data.

Good to know that the Yemen Cyber Army is backing up their data with Wikileaks, but I don’t think of Wikileaks as a transparent source of government documents. For reasons best known to themselves, Wikileaks has taken on the role of government censor with regard to the information it releases. Acknowledging the critical role Wikileaks has played in recent public debates doesn’t blind me to their arrogation of the role of public censor.

Speaking of data dumps, where are the diplomatic records from Iraq? Before or since becoming a puppet government for the United States?

In the meantime, keep watching for more data dumps from the Yemen Cyber Army.

May 8, 2015

Open Data: Getting Started/Finding

Filed under: Government Data,Open Data — Patrick Durusau @ 8:23 pm

Data Science – Getting Started With Open Data

23 Resources for Finding Open Data

Ryan Swanstrom has put together two posts that will have you finding and using open data.

“Open data” can be a boon to researchers and others, but you should ask the following questions (among others) of any data set:

  1. Who collected the data?
  2. Why was the data collected?
  3. How was the recorded data selected?
  4. How large was the potential data pool?
  5. Was the original data cleaned after collection?
  6. If the original data was cleaned, by what criteria?
  7. How was the accuracy of the data measured?
  8. What instruments were used to collect the data?
  9. How were the instruments used to collect the data developed?
  10. How were the instruments used to collect the data validated?
  11. What publications have relied upon the data?
  12. How did you determine the semantics of the data?

That’s not a complete set, but it is a good starting point.

Just because data is available, open, free, etc. doesn’t mean that it is useful. The best example is the still-in-print Budge translation, The Book of the Dead: The Papyrus of Ani in the British Museum. The original was published in 1895, making the current reprints more than a century out of date.

It is a very attractive reproduction (it is rare to see hieroglyphic text with inter-linear transliteration and translation in modern editions) of the papyrus of Ani, but it gives a misleading impression of the state of modern knowledge and translation of Middle Egyptian.

Of course, some readers are satisfied with century-old encyclopedias as well, but I would not rely upon them or their sources for advice.

May 7, 2015

Open But Recorded Access

Filed under: Government,Government Data — Patrick Durusau @ 7:44 pm

Search Airmen Certificate Information

Registry of certified pilots.

From the search page:

[Image: screenshot of the airmen certificate search form]

I didn’t perform a search so I don’t have a feel for what, if any, validation is done on the requested searcher information.

If you are on Tor, you might want to consider using the address for Wrigley Field, 1060 W Addison St, Chicago, IL 60613, to see if it complains.

Bureau of Transportation Statistics

Filed under: Government Data,Politics,Travel — Patrick Durusau @ 4:57 pm

Bureau of Transportation Statistics

I discovered this site while looking for “official” statistics to debunk claims about air travel and screening for terrorists. (Begging National Security Questions #1)

I didn’t find it an easy site to navigate but that probably reflects my lack of familiarity with the data being collected. A short guide with a very good index would be quite useful.

A real treasure trove of transportation information (from the about page):

Major Programs of the Bureau of Transportation Statistics (BTS)

It is important to remember that federal agencies (and their equivalents under other governments) have distinct agendas. When confronting outlandish claims from one of the security agencies, it helps to have contradictory data gathered by other, “disinterested,” agencies of the same government.

Security types can dismiss your evidence and analysis as “that’s what you think.” After all, their world is nothing but suspicion and conjecture. Why shouldn’t that be true for others?

Not as easy to dismiss data and analysis by other government agencies.

April 26, 2015

NOAA weather data – Valuing Open Data – Guessing – History Repeats

Filed under: Cloud Computing,Government Data,NOAA — Patrick Durusau @ 4:02 pm

Tech titans ready their clouds for NOAA weather data by Greg Otto.

From the post:

It’s fitting that the 20 terabytes of data the National Oceanic and Atmospheric Administration produces every day will now live in the cloud.

The Commerce Department took a step Tuesday to make NOAA data more accessible as Commerce Secretary Penny Pritzker announced a collaboration among some of the country’s top tech companies to give the public a range of environmental, weather and climate data to access and explore.

Amazon Web Services, Google, IBM, Microsoft and the Open Cloud Consortium have entered into a cooperative research and development agreement with the Commerce Department that will push NOAA data into the companies’ respective cloud platforms to increase the quantity of and speed at which the data becomes publicly available.

“The Commerce Department’s data collection literally reaches from the depths of the ocean to the surface of the sun,” Pritzker said during a Monday keynote address at the American Meteorological Society’s Washington Forum. “This announcement is another example of our ongoing commitment to providing a broad foundation for economic growth and opportunity to America’s businesses by transforming the department’s data capabilities and supporting a data-enabled economy.”

According to Commerce, the data used could come from a variety of sources: Doppler radar, weather satellites, buoy networks, tide gauges, and ships and aircraft. Commerce expects this data to launch new products and services that could benefit consumer goods, transportation, health care and energy utilities.

The original press release has this cheery note on the likely economic impact of this data:

So what does this mean to the economy? According to a 2013 McKinsey Global Institute Report, open data could add more than $3 trillion in total value annually to the education, transportation, consumer products, electricity, oil and gas, healthcare, and consumer finance sectors worldwide. If more of this data could be efficiently released, organizations will be able to develop new and innovative products and services to help us better understand our planet and keep communities resilient from extreme events.

Ah, yes, that would be Open data: Unlocking innovation and performance with liquid information, whose summary page says:

Open data can help unlock $3 trillion to $5 trillion in economic value annually across seven sectors.

But you need to read the full report (PDF) in order to find footnote 3 on “economic value:”

3. Throughout this report we express value in terms of annual economic surplus in 2013 US dollars, not the discounted value of future cash flows; this valuation represents estimates based on initiatives where open data are necessary but not sufficient for realizing value. Often, value is achieved by combining analysis of open and proprietary information to identify ways to improve business or government practices. Given the interdependence of these factors, we did not attempt to estimate open data’s relative contribution; rather, our estimates represent the total value created.

That is a disclosure that the estimate of $3 to $5 trillion is a guess and/or speculation.

Odd how the guess/speculation disclosure drops out of the Commerce Department press release, and by the time it gets to Greg’s story it reads:

open data could add more than $3 trillion in total value annually to the education, transportation, consumer products, electricity, oil and gas, healthcare, and consumer finance sectors worldwide.

From guess/speculation to no mention to fact, all in the short space of three publications.

Does the valuing of open data remind you of:

[Image: 1609 advertisement promoting settlement of Virginia, promising “Excellent Fruites by Planting”]

(Image from: http://civics.sites.unc.edu/files/2012/06/EarlyAmericanSettlements1.pdf)

The date of 1609 is important. Wikipedia has an article on Virginia, 1609-1610, titled, Starving Time. That year, only sixty (60) out of five hundred (500) colonists survived.

Does “Excellent Fruites by Planting” sound a lot like “new and innovative products and services?”

It does to me.

I first saw this in a tweet by Kirk Borne.

April 22, 2015

A Scary Earthquake Map – Oklahoma

Filed under: Environment,Government,Government Data — Patrick Durusau @ 8:15 pm

Earthquakes in Oklahoma – Earthquake Map

[Image: screenshot of the interactive Oklahoma earthquakes map]

Great example of how visualization can make the case that “standard” industry practices are in fact damaging the public.

The map is interactive and the screen shot above is only one example.

The main site is located at: http://earthquakes.ok.gov/.

From the homepage:

Oklahoma experienced 585 magnitude 3+ earthquakes in 2014 compared to 109 events recorded in 2013. This rise in seismic events has the attention of scientists, citizens, policymakers, media and industry. See what information and research state officials and regulators are relying on as the situation progresses.

The next stage of data mapping should be identifying the owners or those who profited from the waste water disposal wells and their relationships to existing oil and gas interests, as well as their connections to members of the Oklahoma legislature.

What is it that Republicans call it? Ah, accountability, as in holding teachers and public agencies “accountable.” Looks to me like it is time to hold some oil and gas interests and their owners, “accountable.”

PS: Said to not be a “direct” result of fracking but of the disposal of water used for fracking. Close enough for my money. You?

April 12, 2015

Research Reports by U.S. Congress and UK House of Commons

Filed under: Government,Government Data,Research Methods — Patrick Durusau @ 4:27 pm

Research Reports by U.S. Congress and UK House of Commons by Gary Price.

Gary’s post covers the Congressional Research Service (CRS) (US) and the House of Commons Library Research Service (UK).

Truly amazing, I know, for an open and transparent government like the United States Government, but CRS reports are not routinely made available to the public, so we have to rely on the kindness of strangers to make them available. Gary reports:

The good news is that Steven Aftergood, director of the Government Secrecy Project at the Federation of American Scientists (FAS), gets ahold of many of these reports and shares them on the FAS website.

The House of Commons Library Research Service appears to not mind officially sharing its research with anyone with web access.

Unlike some government agencies and publications, the CRS and LRS enjoy reputations for high-quality scholarship and accuracy. You still need to evaluate their conclusions and the evidence cited (or not cited), but outright deception and falsehood aren’t part of their traditions.

February 22, 2015

Unleashing the Power of Data to Serve the American People

Filed under: Data Science,Government,Government Data,Politics — Patrick Durusau @ 11:56 am

Unleashing the Power of Data to Serve the American People by Dr. DJ Patil.

You can read (and listen) to Patil’s high level goals as the first ever U.S. Chief Data Scientist at his post.

His goals are too abstract and general to attract meaningful disagreement and that isn’t the purpose of this post.

I posted the link to his comments to urge you to contact Patil (or rather his office) with concrete plans for how his office can assist you in finding and using data. The sooner the better.

No doubt some areas are already off-limits for improved data access and some priorities are already set.

That said, contacting Patil before he and his new office have solidified in place can play an important role in establishing the scope of his office. On a lesser scale, it is the same situation that confronted George Washington as the first U.S. President: nothing was set in stone and every act established a precedent for those who came after him.

Now is the time to press for an expansive and far reaching role for the U.S. Chief Data Scientist within the federal bureaucracy.

February 19, 2015

Congress.gov offers email alerts

Filed under: Government,Government Data — Patrick Durusau @ 1:38 pm

Congress.gov offers email alerts

From the post:

Beginning today [5 February 2015], the free legislative information website Congress.gov offers users a new optional email-alerts system that makes tracking legislative action even easier. Users can elect to receive email alerts for tracking:

  • A specific bill in the current Congress: Receive an email when there are updates to a specific bill (new cosponsors, committee action, vote taken, etc.); emails are sent once a day if there has been a change in a particular bill’s status since the previous day.
  • A specific member’s legislative activity: Receive an email when a specific member introduces or cosponsors a bill; emails are sent once a day if a member has introduced or cosponsored a bill since the previous day.
  • Congressional Record: Receive an email as soon as a new issue of the Congressional Record is available on Congress.gov.

The alerts system is a new feature available to anyone who creates a free account on the Congress.gov site. Creating an account also enables users to save searches. Create an account and sign up for alerts at congress.gov/account.

If you are interested in legislation or in influencing those who vote on it, you should sign up for these alerts. No promises other than if you aren’t heard, your opinion won’t be considered.

You should also use Congress.gov to verify the content of legislation when you get “…the world is ending as we know it…” emails from interest groups. You are not well-informed if you are completely reliant on the opinions of others. Misguided perhaps, but not well-informed.

February 15, 2015

The US Patent and Trademark Office should switch from documents to data

Filed under: Government Data,Patents — Patrick Durusau @ 2:00 pm

The US Patent and Trademark Office should switch from documents to data by Justin Duncan.

From the post:

The debate over patent reform — one of Silicon Valley’s top legislative priorities — is once again in focus with last week’s introduction of the Innovation Act (H.R. 9) by House Judiciary Committee Chairman Bob Goodlatte (R-Va.), Rep. Peter DeFazio (D-Ore.), Subcommittee on Courts, Intellectual Property, and the Internet Chairman Darrell Issa (R-Calif.) and Ranking Member Jerrold Nadler (D-N.Y.), and 15 other original cosponsors.

The Innovation Act largely takes aim at patent trolls (formally “non-practicing entities”), who use patent litigation as a business strategy and make money by threatening lawsuits against other companies. While cracking down on litigious patent trolls is important, that challenge is only one facet of what should be a larger context for patent reform.

The need to transform patent information into open data deserves some attention, too.

The United States Patent and Trademark Office (PTO), the agency within the Department of Commerce that grants patents and registers trademarks, plays a crucial role in empowering American innovators and entrepreneurs to create new technologies. Ironically, many of the PTO’s own systems and technologies are out of date.

Last summer, Data Transparency Coalition advisor Joel Gurin and his colleagues organized an Open Data Roundtable with the Department of Commerce, co-hosted by the Governance Lab at New York University (GovLab) and the White House Office of Science and Technology Policy (OSTP). The roundtable focused on ways to improve data management, dissemination, and use at the Department of Commerce. It shed some light on problems faced by the PTO.

According to GovLab’s report of the day’s findings and recommendations, the PTO is currently working to improve the use and availability of some patent data by putting it in a more centralized, easily searchable form.

To make patent applications easier to navigate – for inventors, investors, the public, and the agency itself – the PTO should more fully embrace the use of structured data formats, like XML, to express the information currently collected as PDFs or text documents.

Justin’s post is a brief history of efforts to improve access to patent and trademark information, mostly focusing on the need for the USPTO (US Patent and Trademark Office) to stop relying on PDF as its default format.

Other potential improvements:

Additional GovLab recommendations included:

  • PTO [should] make more information available about the scope of patent rights, including expiration dates, or decisions by the agency and/or courts about patent claims.
  • PTO should add more context to its data to make it usable by non-experts – e.g. trademark transaction data and trademark assignment.
  • Provide Application Programming Interfaces (APIs) to enable third parties to build better interfaces for the existing legacy systems. Access to Patent Application Information Retrieval (PAIR) and Patent Trial and Appeal Board (PTAB) data are most important here.
  • Improve access to Cooperative Patent Classification (CPC)/U.S. Patent Classification (USPC) harmonization data; tie this data more closely to economic data to facilitate analysis.

Tying in related information, as the first and last recommendations on the GovLab list suggest, is another step in the right direction.

But only a step.

If you have ever searched the USPTO patent database, you know making the data “searchable” is only a nod and wink towards accessibility. Making the data searchable is nothing to sneeze at, but USPTO reform should have a higher target than simply being “searchable.”

Outside of patent search specialists (and not all of them), what ordinary citizen is going to be able to navigate the terms of art across domains when searching patents?

The USPTO should go beyond making patents literally “searchable” and instead make patents “reliably” searchable. By “reliable” searching I mean searching that returns all the relevant patents. A safe harbor, if you will, that protects inventors, investors and implementers from costly suits arising out of the murky wood filled with traps, intellectual quicksand and formulaic chants that is the USPTO patent database.
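One ingredient of “reliable” searching would be expanding a lay query into the terms of art used across domains, so fewer relevant patents slip through. A minimal sketch of thesaurus-based query expansion (the synonym table is a toy illustration, not a real patent thesaurus):

# Toy mapping from lay terms to patent terms of art; a real system would
# draw on curated vocabularies such as classification definitions.
TERMS_OF_ART = {
    "fastener": ["fastener", "attachment means", "coupling member"],
    "spring": ["spring", "resilient member", "biasing element"],
    "screen": ["screen", "display", "visual output device"],
}

def expand_query(query):
    """Expand each query word into an OR-group of its known synonyms."""
    groups = []
    for word in query.lower().split():
        synonyms = TERMS_OF_ART.get(word, [word])
        groups.append("(" + " OR ".join('"%s"' % s for s in synonyms) + ")")
    return " AND ".join(groups)

print(expand_query("spring fastener"))
# ("spring" OR "resilient member" OR "biasing element") AND
# ("fastener" OR "attachment means" OR "coupling member")

Expansion raises recall at the cost of precision, which is the right trade when the goal is a safe harbor: missing a relevant patent is what exposes you to suit.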

I first saw this in a tweet by Joel Gurin.

Federal Spending Data Elements

Filed under: Government Data,Transparency — Patrick Durusau @ 10:43 am

Federal Spending Data Elements

From the webpage:

The data elements in the below list represent the existing Federal Funding Accountability and Transparency Act (FFATA) data elements currently displayed on USAspending.gov and the additional data elements that will be posted pursuant to the DATA Act. These elements are currently being deliberated on and discussed by the Federal community as a part of DATA Act implementation. At this point, this list is exhaustive. However, additional data elements may be standardized for transparency reporting in the future based on agency or community needs.

Join the Conversation

At this time, we are asking for comments in response to the following questions:

  1. Which data elements are most crucial to your current reporting and/or analysis?
  2. In setting standards, what are industry standards the Treasury and OMB should be considering?
  3. What are some of the considerations that Treasury and OMB should take into account when establishing data standards?

Just reading the responses to the questions on GitHub will give you a sense of what other community members are thinking about.

What responses are you going to contribute?

I first saw this in a tweet by Hudson Hollister.

February 14, 2015

Mercury [March 5, 2015, Washington, DC]

Filed under: Government,Government Data,Intelligence — Patrick Durusau @ 7:47 pm

Mercury Registration Deadline: February 17, 2015.

From the post:

The Intelligence Advanced Research Projects Activity (IARPA) will host a Proposers’ Day Conference for the Mercury Program on March 5, in anticipation of the release of a new solicitation in support of the program. The Conference will be held from 8:30 AM to 5:00 PM EST in the Washington, DC metropolitan area. The purpose of the conference will be to provide introductory information on Mercury and the research problems that the program aims to address, to respond to questions from potential proposers, and to provide a forum for potential proposers to present their capabilities and identify potential team partners.

Program Description and Goals

Past research has found that publicly available data can be used to accurately forecast events such as political crises and disease outbreaks. However, in many cases, relevant data are not available, have significant lag times, or lack accuracy. Little research has examined whether data from foreign Signals Intelligence (SIGINT) can be used to improve forecasting accuracy in these cases.

The Mercury Program seeks to develop methods for continuous, automated analysis of SIGINT in order to anticipate and/or detect political crises, disease outbreaks, terrorist activity, and military actions. Anticipated innovations include: development of empirically driven sociological models for population-level behavior change in anticipation of, and response to, these events; processing and analysis of streaming data that represent those population behavior changes; development of data extraction techniques that focus on volume, rather than depth, by identifying shallow features of streaming SIGINT data that correlate with events; and development of models to generate probabilistic forecasts of future events. Successful proposers will combine cutting-edge research with the ability to develop robust forecasting capabilities from SIGINT data.

Mercury will not fund research on U.S. events, or on the identification or movement of specific individuals, and will only leverage existing foreign SIGINT data for research purposes.

The Mercury Program will consist of both unclassified and classified research activities and expects to draw upon the strengths of academia and industry through collaborative teaming. It is anticipated that teams will be multidisciplinary, and might include social scientists, mathematicians, statisticians, computer scientists, content extraction experts, information theorists, and SIGINT subject matter experts with applied experience in the U.S. SIGINT System.

Attendees must register no later than 6:00 pm EST, February 27, 2015 at http://events.SignUp4.com/MercuryPDRegistration_March2015. Directions to the conference facility and other materials will be provided upon registration. No walk-in registrations will be allowed.

I might be interested if you can hide me under a third or fourth level sub-contractor. 😉

Seriously, it isn’t that I despair of the legitimate missions of intelligence agencies, but I do despise waste on methods known not to work. Government funding, even unlimited funding, isn’t going to magically confer the correct semantics on data or enable analysts to meaningfully share their work products across domains.

You would think that going on fourteen (14) years post-9/11 without being one step closer to preventing a similar event would be a “wake-up” call to someone. If not in the U.S. intelligence community, perhaps in intelligence communities who tire of aping the U.S. community with no better results.

OpenGov Voices: Bringing transparency to earmarks buried in the budget

Filed under: Government,Government Data,Politics,Transparency — Patrick Durusau @ 7:29 pm

OpenGov Voices: Bringing transparency to earmarks buried in the budget by Matthew Heston, Madian Khabsa, Vrushank Vora, Ellery Wulczyn and Joe Walsh.

From the post:

Last week, President Obama kicked off the fiscal year 2016 budget cycle by unveiling his $3.99 trillion budget proposal. Congress has the next eight months to write the final version, leaving plenty of time for individual senators and representatives, state and local governments, corporate lobbyists, bureaucrats, citizens groups, think tanks and other political groups to prod and cajole for changes. The final bill will differ from Obama’s draft in major and minor ways, and it won’t always be clear how those changes came about. Congress will reveal many of its budget decisions after voting on the budget, if at all.

We spent this past summer with the Data Science for Social Good program trying to bring transparency to this process. We focused on earmarks – budget allocations to specific people, places or projects – because they are “the best known, most notorious, and most misunderstood aspect of the congressional budgetary process” — yet remain tedious and time-consuming to find. Our goal: to train computers to extract all the earmarks from the hundreds of pages of mind-numbing legalese and numbers found in each budget.

Watchdog groups such as Citizens Against Government Waste and Taxpayers for Common Sense have used armies of human readers to sift through budget documents, looking for earmarks. The White House Office of Management and Budget enlisted help from every federal department and agency, and the process still took three months. In comparison, our software is free and transparent and generates similar results in only 15 minutes. We used the software to construct the first publicly available database of earmarks that covers every year back to 1995.

Despite our success, we barely scratched the surface of the budget. Not only do earmarks comprise a small portion of federal spending but senators and representatives who want to hide the money they budget for friends and allies have several ways to do it:

I was checking the Sunlight Foundation Blog for any updated information on the soon-to-be-released indexes of federal data holdings when I encountered this jewel on earmarks.

Important to read/support because:

  1. By dramatically reducing the human time investment to find earmarks, it frees up that time to be spent gathering deeper information about each earmark.
  2. It represents a major step forward in the ability to discover relationships between players in the data (what the NSA wants to do but with a rationally chosen data set).
  3. It will educate you on earmarks and their hiding places.
  4. It is an inspirational example of how darkness can be replaced with transparency, some of it anyway.
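The post describes the software as free and transparent. Purely as an illustration of the general approach, and not the DSSG team’s actual method, a pattern-based first pass over budget text might look like this (the regular expressions are illustrative assumptions):

import re

# Weak signals only: a dollar amount near a place- or institution-like
# phrase marks a sentence for human review, not a confirmed earmark.
AMOUNT = re.compile(r"\$[\d,]+(?:\.\d+)?(?:\s*(?:million|billion))?", re.IGNORECASE)
PLACE_HINT = re.compile(r"\b(?:County|City of|University of|District)\b")

def earmark_candidates(text):
    """Yield sentences containing both a dollar amount and a place-like phrase."""
    for sentence in re.split(r"(?<=[.;])\s+", text):
        if AMOUNT.search(sentence) and PLACE_HINT.search(sentence):
            yield sentence.strip()

sample = ("Provided, that $2,500,000 shall be made available to the "
          "University of Example for facility construction. General "
          "administrative expenses are reduced by $1,000,000.")
for candidate in earmark_candidates(sample):
    print(candidate)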

Will transparency reduce earmarks? I rather doubt it because a sense of shame doesn’t seem to motivate elected and appointed officials.

What transparency can do is create a more level playing field for those who want to buy government access and benefits.

For example, if I knew what it cost to have the following exemption in the FOIA:

Exemption 9: Geological information on wells.

it might be possible to raise enough funds to purchase the deletion of:

Exemption 5: Information that concerns communications within or between agencies which are protected by legal privileges, that include but are not limited to:

  3. Deliberative Process Privilege

Which is where some staffers hide their negotiations with former staffers as they prepare to exit the government.

I don’t know that matching what Big Oil paid for the geological information on wells exemption would be enough but it would set a baseline for what it takes to start the conversation.

I say “Big Oil paid…” assuming that most of us don’t equate matters of national security with geological information. Do you have another explanation for such an offbeat provision?

If government is (and I think it is) for sale, then let’s open up the bidding process.

A big win for open government: Sunlight gets U.S. to…

Filed under: Government,Government Data,Transparency — Patrick Durusau @ 6:58 pm

A big win for open government: Sunlight gets U.S. to release indexes of federal data by Matthew Rumsey, Sean Vitka and John Wonderlich.

From the post:

For the first time, the United States government has agreed to release what we believe to be the largest index of government data in the world.

On Friday, the Sunlight Foundation received a letter from the Office of Management and Budget (OMB) outlining how they plan to comply with our FOIA request from December 2013 for agency Enterprise Data Inventories. EDIs are comprehensive lists of a federal agency’s information holdings, providing an unprecedented view into data held internally across the government. Our FOIA request was submitted 14 months ago.

These lists of the government’s data were not public, however, until now. More than a year after Sunlight’s FOIA request and with a lawsuit initiated by Sunlight about to be filed, we’re finally going to see what data the government holds.

Since 2013, federal agencies have been required to construct a list of all of their major data sets, subject only to a few exceptions detailed in President Obama’s executive order as well as some information exempted from disclosure under the FOIA.

Many kudos to the Sunlight Foundation!

As to using the word “win,” don’t we need to wait and see what Enterprise Data Inventories are in fact produced?

I say that because the executive order of President Obama that is cited in the post provides these exemptions from disclosure:

4(d) Nothing in this order shall compel or authorize the disclosure of privileged information, law enforcement information, national security information, personal information, or information the disclosure of which is prohibited by law.

Will that be taken as an excuse to not list the data collections at all?

Or, will the NSA say:

one (1) collection of telephone metadata, timeSpan: 4 (d) exempt, size: 4 (d) exempt, metadataStructure: 4 (d) exempt, source: 4 (d) exempt

Do they mean internal NSA phone logs? Do they mean some other source?

Or will they simply not list telephone metadata at all?

What’s exempt under FOIA? (From FOIA.gov):

Not all records can be released under the FOIA. Congress established certain categories of information that are not required to be released in response to a FOIA request because release would be harmful to governmental or private interests. These categories are called "exemptions" from disclosure. Still, even if an exemption applies, agencies may use their discretion to release information when there is no foreseeable harm in doing so and disclosure is not otherwise prohibited by law. There are nine categories of exempt information and each is described below.

Exemption 1: Information that is classified to protect national security.  The material must be properly classified under an Executive Order.

Exemption 2: Information related solely to the internal personnel rules and practices of an agency.

Exemption 3: Information that is prohibited from disclosure by another federal law. Additional resources on the use of Exemption 3 can be found on the Department of Justice FOIA Resources page.

Exemption 4: Information that concerns business trade secrets or other confidential commercial or financial information.

Exemption 5: Information that concerns communications within or between agencies which are protected by legal privileges, that include but are not limited to:

  1. Attorney-Work Product Privilege
  2. Attorney-Client Privilege
  3. Deliberative Process Privilege
  4. Presidential Communications Privilege

Exemption 6: Information that, if disclosed, would invade another individual’s personal privacy.

Exemption 7: Information compiled for law enforcement purposes if one of the following harms would occur.  Law enforcement information is exempt if it: 

  • 7(A). Could reasonably be expected to interfere with enforcement proceedings
  • 7(B). Would deprive a person of a right to a fair trial or an impartial adjudication
  • 7(C). Could reasonably be expected to constitute an unwarranted invasion of personal privacy
  • 7(D). Could reasonably be expected to disclose the identity of a confidential source
  • 7(E). Would disclose techniques and procedures for law enforcement investigations or prosecutions
  • 7(F). Could reasonably be expected to endanger the life or physical safety of any individual

Exemption 8: Information that concerns the supervision of financial institutions.

Exemption 9: Geological information on wells.

And the exclusions:

Congress has provided special protection in the FOIA for three narrow categories of law enforcement and national security records. The provisions protecting those records are known as “exclusions.” The first exclusion protects the existence of an ongoing criminal law enforcement investigation when the subject of the investigation is unaware that it is pending and disclosure could reasonably be expected to interfere with enforcement proceedings. The second exclusion is limited to criminal law enforcement agencies and protects the existence of informant records when the informant’s status has not been officially confirmed. The third exclusion is limited to the Federal Bureau of Investigation and protects the existence of foreign intelligence or counterintelligence, or international terrorism records when the existence of such records is classified. Records falling within an exclusion are not subject to the requirements of the FOIA. So, when an office or agency responds to your request, it will limit its response to those records that are subject to the FOIA.

You can spot, as well as I can, the truck-sized holes that may prevent disclosure.

One analytic challenge upon the release of the Enterprise Data Inventories will be to determine what is present and what is missing but should be present. Another will be to assist the Sunlight Foundation in its pursuit of additional FOIA requests to obtain data listed but not available. Perhaps I should call this an important victory, although of a battle and not the long-term war for government transparency.
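When the inventories do arrive, they should be machine readable: agencies have been building them in the data.json format used under the administration’s open data policy, with an accessLevel field per dataset, which makes the present-versus-missing analysis scriptable. A minimal sketch (the filename is hypothetical, and the handling of both list- and object-shaped files is an assumption to verify per agency):

import json
from collections import Counter

def access_level_summary(path):
    """Tally datasets in an Enterprise Data Inventory by accessLevel
    (public / restricted public / non-public) to see what is withheld."""
    with open(path, encoding="utf-8") as f:
        inventory = json.load(f)
    if isinstance(inventory, dict):
        inventory = inventory.get("dataset", [])
    return Counter(d.get("accessLevel", "unspecified") for d in inventory)

for level, count in access_level_summary("nsa-edi.json").most_common():
    print(level, count)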

Thoughts?

February 11, 2015

FBI Records: The Vault

Filed under: Government,Government Data — Patrick Durusau @ 7:53 pm

FBI Records: The Vault

From the webpage:

The Vault is our new FOIA Library, containing 6,700 documents and other media that have been scanned from paper into digital copies so you can read them in the comfort of your home or office. 

Included here are many new FBI files that have been released to the public but never added to this website; dozens of records previously posted on our site but removed as requests diminished; files from our previous FOIA Library, and new, previously unreleased files.

The Vault includes several new tools and resources for your convenience:

  • Searching for Topics: You can browse or search for specific topics or persons (like Al Capone or Marilyn Monroe) by viewing our alphabetical listing, by using the search tool in the upper right of this site, or by checking the different category lists that can be found in the menu on the right side of this page. In the search results, click on the folder to see all of the files for that particular topic.
  • Searching for Key Words: Thanks to new technology we have developed, you can now search for key words or phrases within some individual files. You can search across all of our electronic files by using the search tool in the upper right of this site, or you can search for key words within a specific document by typing in terms in the search box in the upper right hand of the file after it has been opened and loaded. Note: since many of the files include handwritten notes or are not always in optimal condition due to age, this search feature does not always work perfectly.
  • Viewing the Files: We are now using an open source web document viewer, so you no longer need your own file software to view our records. When you click on a file, it loads in a reader that enables you to view one or two pages at a time, search for key words, shrink or enlarge the size of the text, use different scroll features, and more. In many cases, the quality and clarity of the individual files has also been improved.
  • Requesting a Status Update: Use our new Check the Status of Your FOI/PA Request tool to determine where your request stands in our process. Status information is updated weekly. Note: You need your FOI/PA request number to use this feature.

Please note: the content of the files in the Vault encompasses all time periods of Bureau history and does not always reflect the current views, policies, and priorities of the FBI.

New files will be added on a regular basis, so please check back often.

This may be meant as a distraction, but I don’t know from what.

I suppose there is some value in knowing that ineffectual law enforcement investigations did not begin with 9/11.

February 7, 2015

Encouraging open data usage…

Filed under: Government Data,Open Data — Patrick Durusau @ 7:04 pm

Encouraging open data usage by commercial developers: Report

From the post:

The second Share-PSI workshop was very different from the first. Apart from presentations in two short plenary sessions, the majority of the two days was spent in facilitated discussions around specific topics. This followed the success of the bar camp sessions at the first workshop, that is, sessions proposed and organised in an ad hoc fashion, enabling people to discuss whatever subject interests them.

Each session facilitator was asked to focus on three key questions:

  1. What X is the thing that should be done to publish or reuse PSI?
  2. Why does X facilitate the publication or reuse of PSI?
  3. How can one achieve X and how can you measure or test it?

This report summarises the 7 plenary presentations, 17 planned sessions and 7 bar camp sessions. As well as the Share-PSI project itself, the workshop benefited from sessions led by 8 other projects. The agenda for the event includes links to all papers, slides and notes, with many of those notes being available on the project wiki. In addition, the #sharepsi tweets from the event are archived, as are a number of photo albums from Makx Dekkers, Peter Krantz and José Luis Roda. The event received a generous write-up on the host’s Web site (in Portuguese). The spirit of the event is captured in this video by Noël Van Herreweghe of CORVe.

To avoid confusion, PSI in this context means Public Sector Information, not Published Subject Identifier (PSI).

Amazing coincidence that the W3C has smudged yet another name. You may recall the W3C decided to confuse URIs and IRIs in its latest attempt to re-write history, calling both by the acronym URI:

Within this specification, the term URI refers to a Universal Resource Identifier as defined in [RFC 3986] and extended in [RFC 2987] [RFC 3987] with the new name IRI. The term URI has been retained in preference to IRI to avoid introducing new names for concepts such as “Base URI” that are defined or referenced across the whole family of XML specifications. (Corrected the RFC listing as shown.) (XQuery and XPath Data Model 3.1 , N. Walsh, J. Snelson, Editors, W3C Candidate Recommendation (work in progress), 18 December 2014, http://www.w3.org/TR/2014/CR-xpath-datamodel-31-20141218/ . Latest version available at http://www.w3.org/TR/xpath-datamodel-31/.)

Interesting discussion, but I would pay very close attention to market demand, perhaps I should say commercial market demand, before planning a start-up based on government data. There is unlimited demand for free data, or even better, free enhanced data, but that should not be confused with enhanced data that can be sold to support a start-up on an ongoing basis.

To give you an idea of the uncertainty of conditions for start-ups relying on open data, let me quote the final bullet points of this article:

  • There is a lack of knowledge of what can be done with open data which is hampering uptake.
  • There is a need for many examples of success to help show what can be done.
  • Any long term re-use of PSI must be based on a business plan.
  • Incubators/accelerators should select projects to support based on the business plan.
  • Feedback from re-users is an important component of the ecosystem and can be used to enhance metadata.
  • The boundary between what the public and private sectors can, should and should not do needs to be better defined to allow the public sector to focus on its core task and businesses to invest with confidence.
  • It is important to build an open data infrastructure, both legal and technical, that supports the sharing of PSI as part of normal activity.
  • Licences and/or rights statements are essential and should be machine readable. This is made easier if the choice of licences is minimised.
  • The most valuable data is the data that the public sector already charges for.
  • Include domain experts who can articulate real problems in hackathons (whether they write code or not).
  • Involvement of the user community and timely response to requests is essential.
  • There are valid business models that should be judged by their effectiveness and/or social impact rather than financial gain.

Just so you know, that last point:

There are valid business models that should be judged by their effectiveness and/or social impact rather than financial gain.

that is not a business model, unless you have ongoing financing from some source other than financial gain. That is a charity model, where you are the object of the charity.

February 5, 2015

Forty and Seven Inspector Generals Hit a Stone Wall

Filed under: Government,Government Data,Transparency — Patrick Durusau @ 3:16 pm

Inspectors general testify against agency ‘stonewalling’ before Congress by Sarah Westwood.

From the post:

Frustration with federal agencies that block probes from their inspectors general bubbled over Tuesday in a congressional hearing that dug into allegations of obstruction from a number of government watchdogs.

The Peace Corps, Environmental Protection Agency and Justice Department inspectors general each argued to members of the House Oversight and Government Reform Committee that some of their investigations had been thwarted or stalled by officials who refused to release necessary information to their offices.

Committee members from both parties doubled down on criticisms of the Justice Department’s lack of transparency and called for solutions to the government-wide problem during their first official hearing of the 114th Congress.

“If you can’t do your job, then we can’t do our job in Congress,” Chairman Jason Chaffetz, R-Utah, told the three witnesses and the scores of agency watchdogs who also attended, including the Department of Homeland Security and General Service Administration inspectors general.

Michael Horowitz, the Justice Department’s inspector general, testified that the FBI began reviewing requested documents in 2010 in what he said was a clear violation of federal law that is supposed to grant watchdogs unfettered access to agency records.

The FBI’s process, which involves clearing the release of documents with the attorney general or deputy attorney general, “seriously impairs inspector general independence, creates excessive delays, and may lead to incomplete, inaccurate or significantly delayed findings or recommendations,” Horowitz said.

Perhaps no surprise that the FBI shows up in the non-transparency column. But given the number of inspectors general with similar problems (47), it seems to be part of a larger herd.

If you are interested in going further into this issue, there was a hearing last August (2014), Obstructing Oversight: Concerns from Inspectors General, which is available here in ASCII and here with video and witness statements in PDF.

Both sources omit the following documents:

  • Sept. 9, 2014, letter to Chairman Issa from OMB, submitted by Chairman Issa (p. 58)
  • Aug. 5, 2014, letter to Reps. Issa, Cummings, Carper, and Coburn from 47 IGs, submitted by Rep. Chaffetz (p. 61)
  • Aug. 8, 2014, letter to OMB from Reps. Carper, Coburn, Issa and Cummings, submitted by Rep. Walberg (p. 69)
  • Statement for the record from The Institute of Internal Auditors (p. 71)

Isn’t that rather lame? The items are listed in the table of contents but omitted from the ASCII version, and they are not even included with the witness statements.

I’m curious who the other forty-four (44) inspectors general might be. Aren’t you?

If you know where to find these appendix materials, please send me a pointer.

I think it will be more effective to list all of the inspectors general who have encountered this stone wall treatment than to treat them as an anonymous mass.

Chairman Jason Chaffetz suggests that by controlling funding Congress can force transparency. I would use a finer knife. Cut all funding for health care and retirement benefits in the agencies/departments in question. See how the rank and file in the agencies like them apples.

Assuming transparency results, I would not restore those benefits retroactively. Staff chose to support, explicitly or implicitly, illegal behavior. Making bad choices has negative consequences. It would be a teaching opportunity for all future federal staff members.

February 4, 2015

[U.S.] President’s Fiscal Year 2016 Budget

Filed under: Government,Government Data,Politics — Patrick Durusau @ 7:45 pm

Data for the President’s Fiscal Year 2016 Budget

From the webpage:

Each year, after the President’s State of the Union address, the Office of Management and Budget releases the Administration’s Budget, offering proposals on key priorities and newly announced initiatives. This year we are releasing all of the data included in the President’s Fiscal Year 2016 Budget in a machine-readable format here on GitHub. The Budget process should be a reflection of our values as a country, and we think it’s important that members of the public have as many tools at their disposal as possible to see what is in the President’s proposals. And, if they’re motivated to create their own visualizations or products from the data, they should have that chance as well.

You can see the full Budget on Medium.

About this Repository

This repository includes three data files that contain an extract of the Office of Management and Budget (OMB) budget database. These files can be used to reproduce many of the totals published in the Budget and examine unpublished details below the levels of aggregation published in the Budget.

The user guide file contains detailed information about this data, its format, and its limitations. In addition, OMB provides additional data tables, explanations and other supporting documents in XLS format on its website.

Feedback and Issues

Please submit any feedback or comments on this data, or the Budget process here.

Before you start cheering too loudly, spend a few minutes with the User Guide. Not impenetrable but not an easy stroll either. I suspect the additional data tables, etc. are going to be necessary for interpretation of the main files.

Writing up how to use this data set would be a large but worthwhile undertaking.
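As a concrete starting point, here is a minimal sketch of loading the extract with Python’s pandas. The file name (outlays.csv) and the column names (“Agency Name”, “2016”) are my assumptions for illustration; the user guide documents the actual layout.

    import pandas as pd

    # File and column names below are assumptions; consult the user guide
    # in the repository for the actual layout of the three data files.
    outlays = pd.read_csv("outlays.csv", thousands=",")

    # Total proposed FY2016 outlays by agency, largest first.
    by_agency = (outlays.groupby("Agency Name")["2016"]
                        .sum()
                        .sort_values(ascending=False))
    print(by_agency.head(10))

Once an aggregation like this works, re-running it over successive revisions of the data is the natural way to start the tracking project described below.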

A larger in scope but also worthwhile project would be to track how the initial allocations in the budget change through the legislative process. That is, to know on a day-to-day basis which departments, programs, etc. are up or down. Tied to votes in Congress and particular amendments, that could prove to be very interesting.


Update: A tweet from Aaron Kirschenfeld directed us to The U.S. Tax Code Is a Travesty by John Cassidy. Cassidy says to take a look at table S-9 in the numbers section under “Loophole closers.” The trick to the listed loopholes is that very few people qualify for them. See Cassidy’s post for the details.

Other places that merit special attention?


Update: DHS Budget Justification 2016 (3906 pages, PDF). First saw this in a tweet by Dave Maass.

January 22, 2015

Project Blue Book Collection (UFO’s)

Filed under: Government,Government Data — Patrick Durusau @ 11:18 am

Project Blue Book Collection

From the webpage:

This site was created by The Black Vault to house 129,491 pages, comprising more than 10,000 cases of the Project Blue Book, Project Sign and Project Grudge files declassified. Project Blue Book (along with Sign and Grudge) was the name given to the official investigation by the United States military to determine what the Unidentified Flying Object (UFO) phenomenon was. It lasted from 1947 to 1969. Below you will find the case files compiled for research, and available free to download.

The CNN report Air Force UFO files land on Internet by Emanuella Grinberg notes that Roswell is omitted from these files.

You won’t find anything new here; the files have been available on microfilm for years. But being searchable and on the Internet is a step forward in terms of accessibility.

When I say “searchable,” the site notes:

1) A search is a good start — but is not 100% — There are more than 10,000 .pdf files here and although all of them are indexed in the search engine, the quality of the original documents, given the fact that many of them are more than 6 decades old, is very poor. This means that when they are converted to text for searching, many of the words are not readable to a computer. As a tip: make your search as basic as possible. Searching for a location? Just search a city, then the state, to see what comes up. Searching for a type of UFO? Use “saucer” vs. “flying saucer” or a longer expression. It will increase the chances of finding what you are looking for.

2) The text may look garbled on the search results page (but not the .pdf!) — This is normal. For the same reason as above: a sentence that reads OK to the human eye may be gibberish to a computer, due to the decades-old state of many of the records. Don’t let that discourage you. Load the .PDF and see what you find. If you searched for “Hollywood” and a .pdf hit came up for Rome, New York, there is a reason why. The word “Hollywood” does appear in the file…so check it out!

3) Not everything was converted to .pdfs — There are a few case files in the Blue Book system that were simply too large to convert. They are:

undated/xxxx-xx-9667997-[BLANK][ 8,198 Pages ]
undated/xxxx-xx-9669100-[ILLEGIBLE]-[ILLEGIBLE]-/ [ 1,450 Pages ]
undated/xxxx-xx-9669191-[ILLEGIBLE]/ [ 3,710 Pages ]

These files will be sorted at a later date. If you are interested in helping, please email contact@theblackvault.com

I tried to access the files not yet processed but was redirected. I will find out what is required to view them.

If you are interested in trying your skills at PDF conversion/improvement, the main data set should be more than sufficient.

If you are interested in automatic discovery of what or who was blacked out of government reports, this is also an interesting data set. Personally I think blacking out passages should be forbidden. People should have to accept the consequences of their actions, good or bad. We require that of citizens; why not of government staff?

I assume crowd-sourcing corrections has already been considered. 130K pages is a fairly small number when it comes to crowd sourcing. Surely there are more than 10,000 people interested in the data set, which works out to about 13 pages each. If each volunteer did 100 pages instead, you would have more than enough overlap to use simple statistics to choose the best corrections, as sketched below.
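Here is a minimal sketch of that statistics step, assuming each page collects several independent transcriptions, the majority version wins, and low-agreement pages are flagged for human review:

    from collections import Counter

    def consensus(transcriptions):
        """Return the majority transcription for one page, plus a flag
        when agreement is too low and a human should review the page."""
        counts = Counter(t.strip() for t in transcriptions)
        best, votes = counts.most_common(1)[0]
        needs_review = votes <= len(transcriptions) / 2
        return best, needs_review

    versions = ["flying saucer seen", "flying saucer seen", "flying sauces seen"]
    print(consensus(versions))  # ('flying saucer seen', False)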

For those of you who see patterns in UFO reports, a good way to reach across the myriad sightings and reports would be to topic map the entire collection.

Personally I suspect at least some of the reports do concern alien surveillance and the absence in the intervening years indicates they have lost interest. Given our performance since the 1940’s, that’s not hard to understand.

January 16, 2015

Key Court Victory Closer for IRS Open-Records Activist

Filed under: Government Data,Open Data — Patrick Durusau @ 8:12 pm

Key Court Victory Closer for IRS Open-Records Activist by Suzanne Perry.

From the post:

The open-records activist Carl Malamud has moved a step closer to winning his legal battle to give the public greater access to the wealth of information on Form 990 tax returns that nonprofits file.

During a hearing in San Francisco on Wednesday, U.S. District Judge William Orrick said he tentatively planned to rule in favor of Mr. Malamud’s group, Public.Resource.Org, which filed a lawsuit to force the Internal Revenue Service to release nonprofit tax forms in a format that computers can read. That would make it easier to conduct online searches for data about organizations’ finances, governance, and programs.

“It looks like a win for Public.Resource and for the people who care about electronic access to public documents,” said Thomas Burke, the group’s lawyer.

The suit asks the IRS to release Forms 990 in machine-readable format for nine nonprofits that had submitted their forms electronically. Under current practice, the IRS converts all Forms 990 to unsearchable image files, even those that have been filed electronically.

That’s a step in the right direction but not all that will be required.

Suzanne goes on to note that the IRS removes donor lists from the 990 forms.

Any number of organizations will object but I think the donor lists should be public information as well.

Making all donors public may discourage some people from donating to unpopular causes, but that’s a hit I would be willing to take to know who owns the political non-profits, and/or who funds the NRA, for example.

Data that isn’t open enough to know who is calling the shots at organizations isn’t open data, it’s an open data tease.

What Counts: Harnessing Data for America’s Communities

Filed under: Data Management,Finance Services,Government,Government Data,Politics — Patrick Durusau @ 5:44 pm

What Counts: Harnessing Data for America’s Communities Senior Editors: Naomi Cytron, Kathryn L.S. Pettit, & G. Thomas Kingsley. (new book, free pdf)

From: A Roadmap: How To Use This Book

This book is a response to the explosive interest in and availability of data, especially for improving America’s communities. It is designed to be useful to practitioners, policymakers, funders, and the data intermediaries and other technical experts who help transform all types of data into useful information. Some of the essays—which draw on experts from community development, population health, education, finance, law, and information systems—address high-level systems-change work. Others are immensely practical, and come close to explaining “how to.” All discuss the incredibly exciting opportunities and challenges that our ever-increasing ability to access and analyze data provide.

As the book’s editors, we of course believe everyone interested in improving outcomes for low-income communities would benefit from reading every essay. But we’re also realists, and know the demands of the day-to-day work of advancing opportunity and promoting well-being for disadvantaged populations. With that in mind, we are providing this roadmap to enable readers with different needs to start with the essays most likely to be of interest to them.

For everyone, but especially those who are relatively new to understanding the promise of today’s data for communities, the opening essay is a useful summary and primer. Similarly, the final essay provides both a synthesis of the book’s primary themes and a focus on the systems challenges ahead.

Section 2, Transforming Data into Policy-Relevant Information (Data for Policy), offers a glimpse into the array of data tools and approaches that advocates, planners, investors, developers and others are currently using to inform and shape local and regional processes.

Section 3, Enhancing Data Access and Transparency (Access and Transparency), should catch the eye of those whose interests are in expanding the range of data that is commonly within reach and finding ways to link data across multiple policy and program domains, all while ensuring that privacy and security are respected.

Section 4, Strengthening the Validity and Use of Data (Strengthening Validity), will be particularly provocative for those concerned about building the capacity of practitioners and policymakers to employ appropriate data for understanding and shaping community change.

The essays in section 5, Adopting More Strategic Practices (Strategic Practices), examine the roles that practitioners, funders, and policymakers all have in improving the ways we capture the multi-faceted nature of community change, communicate about the outcomes and value of our work, and influence policy at the national level.

There are of course interconnections among the essays in each section. We hope that wherever you start reading, you’ll be inspired to dig deeper into the book’s enormous richness, and will join us in an ongoing conversation about how to employ the ideas in this volume to advance policy and practice.

Thirty-one (31) essays by dozens of authors on data and its role in public policy making.

From the acknowledgements:

This book is a joint project of the Federal Reserve Bank of San Francisco and the Urban Institute. The Robert Wood Johnson Foundation provided the Urban Institute with a grant to cover the costs of staff and research that were essential to this project. We also benefited from the field-building work on data from Robert Wood Johnson grantees, many of whom are authors in this volume.

If you are pitching data and/or data projects where the Federal Reserve Bank of San Francisco and the Urban Institute set the tone of policy-making conversations, this is a must read. It is likely to have an impact on other policy discussions as well, adjusted for local concerns and conventions. You could also use it to shape your local policy discussions.

I first saw this in There is no seamless link between data and transparency by Jennifer Tankard.

January 15, 2015

Open Addresses

Filed under: Government,Government Data,Mapping — Patrick Durusau @ 10:23 am

Open Addresses

From the homepage:

At Open Addresses, we are bringing together information about the places where we live, work and go about our daily lives. By gathering information provided to us by people about their own addresses, and from open sources on the web, we are creating an open address list for the UK, available to everyone.

Do you want to enter our photography competition?

Or do you want to get involved by submitting an address?

It’s as simple as entering it below.

Addresses are a vital part of the UK’s National Information Infrastructure. Open Addresses will be used by a whole range of individuals and organisations (academics, charities, public sector and private sector). By having accurate information about addresses, we’ll all benefit from getting more of the things we want, and less of the things we don’t.

Datasets as of 10 December 2014 are available for download now, via BitTorrent, so I assume the complete datasets are fairly large. Has anyone downloaded them?

If you do download all or part of the records, I am curious: what other public data sets would you combine with them? One possibility is sketched below.
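For instance, a minimal sketch of joining the address list to a postcode-keyed dataset such as property price records, using Python’s pandas. The file and column names here are my guesses for illustration, not the actual Open Addresses schema:

    import pandas as pd

    # File and column names are illustrative, not the actual schemas.
    addresses = pd.read_csv("open-addresses.csv")
    prices = pd.read_csv("price-paid.csv")

    # Normalise postcodes before joining: uppercase, no internal spaces.
    for df in (addresses, prices):
        df["postcode"] = df["postcode"].str.upper().str.replace(" ", "", regex=False)

    combined = addresses.merge(prices, on="postcode", how="left")
    print(combined.head())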

January 14, 2015

SODA Developers

Filed under: Government Data,Open Data,Programming — Patrick Durusau @ 7:50 pm

SODA Developers

From the webpage:

The Socrata Open Data API allows you to programmatically access a wealth of open data resources from governments, non-profits, and NGOs around the world.

I have mentioned Socrata and their Open Data efforts more than once on this blog but I don’t think I have ever pointed to their developer site.

Very much worth spending time here if you are interested in governmental data.
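To give you a feel for the API, here is a minimal sketch of a SODA query in Python. The domain and dataset identifier are placeholders to be replaced with values from a real Socrata-hosted portal; $limit is one of the standard SoQL query parameters the API accepts.

    import requests

    # Placeholder domain and dataset id; substitute values from a real
    # Socrata-hosted portal before running.
    url = "https://data.example.gov/resource/abcd-1234.json"

    # $limit is a standard SoQL parameter; others include $where and $select.
    params = {"$limit": 5}
    resp = requests.get(url, params=params, timeout=30)
    resp.raise_for_status()
    for row in resp.json():
        print(row)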

Not that I take any data, government or otherwise, at face value. Data is created and released/leaked for reasons that may or may not coincide with your assumptions or goals. Access to data is just the first step in uncovering whose interests the data represents.

January 4, 2015

Project Open Data Dashboard

Filed under: Government,Government Data,Open Data — Patrick Durusau @ 1:54 pm

Project Open Data Dashboard

From the about page:

This website shows how Federal agencies are performing on the latest Open Data Policy (M-13-13) using the guidance provided by Project Open Data. It also provides many other tools and resources to help agencies and other interested parties implement their open data programs. Features include:

  • A dashboard to track the progress of agencies implementing Project Open Data on a quarterly basis
  • Automated analysis of URLs provided within metadata to see if the links work as expected
  • A validator for v1.0 and v1.1 of the Project Open Data Metadata Schema
  • A converter to transform CSV files into JSON as defined by the Project Open Data Metadata Schema. (Link broken as of 4 January 2015; site notified.)
  • An export API to export from the CKAN API and transform the metadata into JSON as defined by the Project Open Data Metadata Schema
  • A changeset viewer to compare a data.json file to the metadata currently available in CKAN (e.g., catalog.data.gov)

You can learn more by reading the main documentation page.

The main documentation defines the “Number of Datasets” on the dashboard as:

This element accounts for the total number of all datasets listed in the Enterprise Data Inventory. This includes those marked as “Public”, “Non-Public” and “Restricted”.

If you compare the “Milestone – May 31st 2014” to November, the number of data sets increases in most cases, as you would expect. However, both the Department of Commerce and the Department of Health and Human Services had decreases in the number of available data sets.

On May 31st, the Department of Commerce listed 20,488 data sets but on November 30th, only 372, a decrease of more than 20,000 data sets.

On May 31st, the Department of Health and Human Services listed 1,507 data sets but on November 30th, only 1,064, a decrease of 443 data sets.

Looking further, the sudden decrease for both agencies occurred between Milestone 3 and Milestone 4 (August 31st 2014).

Sounds exciting! Yes?

Yes, but this illustrates why you should “drill down” in data whenever possible. And if that is not possible in the interface, check other sources.

I followed the Department of Commerce link (the first column on the left) to the details of the crawl and thence the data link to determine the number of publicly available data sets.

As of today, 04 January 2015, the Department of Commerce has 23,181 datasets, not the 372 reported for Milestone 5 or the 268 reported for Milestone 4.

As of today, 04 January 2015, the Department of Health and Human Services has 1,672 datasets, not the 1,064 reported for Milestone 5 or the 1,088 reported for Milestone 4.

The reason(s) for the differences are unclear and the dashboard itself offers no explanation for the disparate figures. I suspect there is some glitch in the automatic harvesting of the information and/or in the representation of those results in the dashboard.
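One way to drill down yourself is to pull an agency’s Enterprise Data Inventory directly, since agencies publish it as a data.json file. This sketch counts datasets by accessLevel; the URL is illustrative, and the code allows for both the v1.0 layout (a bare list) and the v1.1 layout (datasets under a “dataset” key).

    import json
    from collections import Counter
    from urllib.request import urlopen

    # Illustrative URL; agencies publish their inventories at /data.json.
    with urlopen("http://www.commerce.gov/data.json") as f:
        catalog = json.load(f)

    # Schema v1.1 nests datasets under "dataset"; v1.0 files are a bare list.
    datasets = catalog["dataset"] if isinstance(catalog, dict) else catalog

    print(len(datasets), "datasets")
    print(Counter(d.get("accessLevel", "unspecified") for d in datasets))

Counting for yourself is exactly the kind of cross-check the dashboard figures above seem to need.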

Always remember that just because a representation* claims some “fact,” that doesn’t necessarily make it so.

*Representation: Bear in mind that anything you see on a computer screen is a “representation.” There isn’t anything in data storage that has any resemblance to what you see on the screen. Choices have been made out of your sight as to how information will be represented to you.

As I mentioned yesterday, there is a common and naive assumption that data as represented to us has a reliable correspondence with data held in storage. And that the data held in storage has a reliable correspondence to data as entered or obtained from other sources.

Those assumptions aren’t unreasonable, at least until they are. Can you think of ways to illustrate those principles? I ask because at least one such illustration makes an excellent case for open source software. More on that anon.

December 31, 2014

U.S. Appropriations by Fiscal Year

Filed under: Government,Government Data — Patrick Durusau @ 5:39 pm

U.S. Appropriations by Fiscal Year

Congressdotgov tweeted about this resource earlier today.

It’s a great starting place for research on U.S. appropriations but it is more of a bulk resource than a granular one.

You will have to wade through this resource and many others to piece together the details on any particular line item in the budget. Not surprisingly, anyone interested in the same line item will have to repeat that mechanical process, for every line in the budget.

There are collected resources on different aspects of the budget process, hearing documents, campaign donation records, etc. but they are for the most part all separated and not easily collated. Perhaps that is due to lack of foresight. Perhaps.

In any event, it is a starting place if you have a particular line item in mind. Think about creating a result that can be re-used and shared if at all possible.

December 19, 2014

Collection of CRS reports released to the public

Filed under: Government,Government Data — Patrick Durusau @ 4:07 pm

Collection of CRS reports released to the public by Kevin Kosar.

From the post:

Something rare has occurred—a collection of reports authored by the Congressional Research Service has been published and made freely available to the public. The 400-page volume, titled “The Evolving Congress,” was produced in conjunction with CRS’s celebration of its 100th anniversary this year. Congress, not CRS, published it. (Disclaimer: Before departing CRS in October, I helped edit a portion of the volume.)

The Congressional Research Service does not release its reports publicly. CRS posts its reports at CRS.gov, a website accessible only to Congress and its staff. The agency has a variety of reasons for this policy, not least that its statute does not assign it this duty. Congress, with ease, could change this policy. Indeed, it already makes publicly available the bill digests (or “summaries”) CRS produces at Congress.gov.

“The Evolving Congress” is a remarkable collection of essays that cover a broad range of topics. Readers would be advised to start from the beginning. Walter Oleszek provides a lengthy essay on how Congress has changed over the past century. Michael Koempel then assesses how the job of Congressman has evolved (or devolved depending on one’s perspective). “Over time, both Chambers developed strategies to reduce the quantity of time given over to legislative work in order to accommodate Members’ other duties,” Koempel observes.

The NIH (National Institutes of Health) requires that NIH-funded research be made available to the public. Other government agencies are following suit. Isn’t it time for the Congressional Research Service to make its publicly funded research available to the public that paid for it?

Congress needs to require it. Contact your member of Congress today. Ask that all Congressional Research Service reports, past, present and future, be made available to the public.

You have already paid for the reports, why shouldn’t you be able to read them?

Senate Joins House In Publishing Legislative Information In Modern Formats [No More Sneaking?]

Filed under: Government,Government Data,Law,Law - Sources — Patrick Durusau @ 3:29 pm

Senate Joins House In Publishing Legislative Information In Modern Formats by Daniel Schuman.

From the post:

There’s big news from today’s Legislative Branch Bulk Data Task Force meeting. The United States Senate announced it would begin publishing text and summary information for Senate legislation, going back to the 113th Congress, in bulk XML. It would join the House of Representatives, which already does this. Both chambers also expect to have bill status information available online in XML format as well, but a little later on in the year.

This move goes a long way to meet the request made by a coalition of transparency organizations, which asked that legislative information be made available online, in bulk, in machine-processable formats. These changes, once implemented, will hopefully put an end to screen scraping and empower users to build impressive tools with authoritative legislative data. A meeting to spec out publication methods will be hosted by the Task Force in late January/early February.

The Senate should be commended for making the leap into the 21st century with respect to providing the American people with crucial legislative information. We will watch closely to see how this is implemented and hope to work with the Senate as it moves forward.

In addition, the Clerk of the House announced significant new information will soon be published online in machine-processable formats. This includes data on nominees, election statistics, and members (such as committee assignments, bioguide IDs, start date, preferred name, etc.) Separately, House Live has been upgraded so that all video is now in H.264 format. The Clerk’s website is also undergoing a redesign.

The Office of Law Revision Counsel, which publishes the US Code, has further upgraded its website to allow pinpoint citations for the US Code. Users can drill down to the subclause level simply by typing the information into their search engine. This is incredibly handy.

This is great news!

Law is a notoriously opaque domain and the process of creating it even more so. Getting the data is a great first step; parsing out the steps in the process and their meaning is another, to say nothing of the content of the laws themselves.
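As a first parsing step, here is a minimal sketch that pulls a bill number and official title out of House bill XML using Python’s standard library. The file name is illustrative, and while <legis-num> and <official-title> appear in the House bill DTD, element names should be verified against the published schema.

    import xml.etree.ElementTree as ET

    # Illustrative file name; bulk bill XML is published by the GPO.
    tree = ET.parse("BILLS-113hr1enr.xml")
    root = tree.getroot()

    # Element names per the House bill DTD; verify against the schema.
    legis_num = (root.findtext(".//legis-num") or "").strip()
    title = (root.findtext(".//official-title") or "").strip()
    print(legis_num, "-", title)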

Still, progress is progress and always welcome!

Perhaps citizen review will stop the Senate from sneaking changes past sleepy members of the House.

