Archive for the ‘Public Data’ Category

Gotta Minute To Help @WikiCommons?

Sunday, May 21st, 2017

Wikimedia NYC tweeted and Michael Peter Edson retweeted:

I know. Moving images from one silo to another.

But, it does increase the odds of @WikiCommons users finding the additional images. That’s a good thing.

Take a minute to visit,, select the public domain facet and grab an image to upload to Wikimedia Commons.

The process is quite painless. I uploaded The Pit of Acheron, or the Birth of the Plagues of England today.

With practice it should take less than a minute but I got diverted looking for more background on the image.

Rowlandson the Caricaturist: A Selection from His Works, with Anecdotal Descriptions of His Famous Caricatures and a Sketch of His Life, Times, and Contemporaries, Volume 1 by Joseph Grego, J. W. Bouton, New York, 1880, page 112:

January 1. 1784. The Pit of Acheron, or the Birth of the Plagues of England. —

The Pit of Acheron, if we may trust the satirist, is not situated at any considerable distance from Westminster; the precincts of that city appear through the smoke of the incantations which are carried on in the Pit. Three weird sisters, like the Witches in ‘Macbeth,’ are working the famous charm; a monstrous cauldron is supported by death’s-heads and harpies; the ingredients of the broth are various; a crucifix, a rosary, Deceit, Loans, Lotteries, and Pride, together with a fox’s head, cards, dice, daggers, and an executioner’s axe, &c., form portions of the accessories employed in these uncanny rites. Three heads are rising from the flames—the good-natured face of Lord North, the spectacled and incisive outline of Burke, and Fox’s ‘gunpowder jowl,’ which is drifting Westminster-wards. One hag, who is dropping Rebellion into the brew, is demanding, ‘Well, sister, what hast thou got for the ingredients of our charm’d pot?’ To this her fellow-witch, who is turning out certain mischievous ingredients which she has collected in her bag, is responding, ‘A beast from Scotland called an Erskine, famous for duplicity, low art, and cunning; the other a monster who’d spurn even at Charter’s Rights.’ Erskine is shot out of the bag, crying, ‘I am like a Proteus, can turn any shape, from a sailor to a lawyer, and always lean to the strongest side!’ The other member, whose tail is that of a serpent, is singing, ‘Over the water and over the lee, thro’ hell I would follow my Charlie.’

I remain uncertain about the facts and circumstances surrounding the Westminster election of 1784 that would further explain this satire. Perhaps another day.

If you can’t wait, consider reading History of the Westminster Election, containing Every Material Occurrence, from its commencement On the First of April to the Close of the Poll, on the 17th of May, to which is prefixed A Summary Account of the Proceedings of the Late Parliament by James Hartley. (562 pages)

Rowlandson was also noted for his erotica: collection of erotica by Rowlandson.

New York Public Library – 180K Hi-Res Images/Metadata

Thursday, January 7th, 2016

NYPL Releases Hi-Res Images, Metadata for 180,000 Public Domain Items in its Digital Collections

from the post:

JANUARY 6, 2016 — The New York Public Library has expanded access to more than 180,000 items with no known U.S. copyright restrictions in its Digital Collections database, releasing hi-res images, metadata, and tools facilitating digital creation and reuse. The release represents both a simplification and an enhancement of digital access to a trove of unique and rare materials: a removal of administration fees and processes from public domain content, and also improvements to interfaces — popular and technical — to the digital assets themselves. Online users of the NYPL Digital Collections website will find more prominent download links and filters highlighting restriction-free content; while more technically inclined users will also benefit from updates to the Library’s collections API enabling bulk use and analysis, as well as data exports and utilities posted to NYPL’s GitHub account. These changes are intended to facilitate sharing, research and reuse by scholars, artists, educators, technologists, publishers, and Internet users of all kinds. All subsequently digitized public domain collections will be made available in the same way, joining a growing repository of open materials.

“The New York Public Library is committed to giving our users access to information and resources however possible,” said Tony Marx, president of the Library. “Today, we are going beyond providing our users with digital facsimiles that give only an impression of something we have in our physical collection. By making our highest-quality assets freely available, we are truly giving our users the greatest access possible to our collections in the digital environment.”

To encourage novel uses of its digital resources, NYPL is also now accepting applications for a new Remix Residency program. Administered by the Library’s digitization and innovation team, NYPL Labs, the residency is intended for artists, information designers, software developers, data scientists, journalists, digital researchers, and others to make transformative and creative uses of digital collections and data, and the public domain assets in particular. Two projects will be selected, receiving financial and consultative support from Library curators and technologists.

To provide further inspiration for reuse, the NYPL Labs team has also released several demonstration projects delving into specific collections, as well as a visual browsing tool allowing users to explore the public domain collections at scale. These projects — which include a then-and-now comparison of New York’s Fifth Avenue, juxtaposing 1911 wide angle photographs with Google Street View, and a “trip planner” using locations extracted from mid-20th century motor guides that listed hotels, restaurants, bars, and other destinations where black travelers would be welcome — suggest just a few of the myriad investigations made possible by fully opening these collections.

The public domain release spans the breadth and depth of NYPL’s holdings, from the Library’s rich New York City collection, historic maps, botanical illustrations, unique manuscripts, photographs, ancient religious texts, and more. Materials include:

Visit for information about the materials related to the public domain update and links to all of the projects demonstrating creative reuse of public domain materials.

The New York Public Library’s Rights and Information Policy team has carefully reviewed Items and collections to determine their copyright status under U.S. law. As a U.S.-based library, NYPL limits its determinations to U.S. law and does not analyze the copyright status of an item in every country. However, when speaking more generally, the Library uses terms such as “public domain” and “unrestricted materials,” which are used to describe the aggregate collection of items it can offer to the public without any restrictions on subsequent use.

If you are looking for content for a topic map or inspiration to pass on to other institutions about opening up their collections, take a look at the New York Public Library’s Digital Collections.

Content designed for re-use. Imagine that, re-use of content.

The exact time/place of the appearance of seamless re-use of content will be debated by future historians but for now, this is a very welcome step in that direction.

Bloggers! Help Defend The Public Domain – Prepare To Host/Repost “Baby Blue”

Wednesday, December 30th, 2015

Harvard Law Review Freaks Out, Sends Christmas Eve Threat Over Public Domain Citation Guide by Mike Masnick.

From the post:

In the fall of 2014, we wrote about a plan by public documents guru Carl Malamud and law professor Chris Sprigman, to create a public domain book for legal citations (stay with me, this isn’t as boring as it sounds!). For decades, the “standard” for legal citations has been “the Bluebook” put out by Harvard Law Review, and technically owned by four top law schools. Harvard Law Review insists that this standard of how people can cite stuff in legal documents is covered by copyright. This seems nuts for a variety of reasons. A citation standard is just a method for how to cite stuff. That shouldn’t be copyrightable. But the issue has created ridiculous flare-ups over the years, with the fight between the Bluebook and the open source citation tool Zotero representing just one ridiculous example.

In looking over all of this, Sprigman and Malamud realized that the folks behind the Bluebook had failed to renew the copyright properly on the 10th edition of the book, which was published in 1958, meaning that that version of the book was in the public domain. The current version is the 19th edition, but there is plenty of overlap from that earlier version. Given that, Malamud and Sprigman announced plans to make an alternative to the Bluebook called Baby Blue, which would make use of the public domain material from 1958 (and, I’d assume, some of their own updates — including, perhaps, citations that it appears the Bluebook copied from others).

As soon as “Baby Blue” drops, one expects the Harvard Law Review with its hired thugs Ropes & Gray to swing into action against Carl Malamud and Chris Sprigman.

What if the world of bloggers even those odds just a bit?

What if as soon as Baby Blue hits the streets, law bloggers, law librarian bloggers, free speech bloggers, open access bloggers, and any other bloggers all post Baby Blue to their sites and post it to file repositories?

I’m game.

Are you?

PS: If you think this sounds risky, ask yourself how much racial change would have happened in the South in the 1960s if Martin Luther King had marched alone.

Disclosing Government Contracts

Friday, August 21st, 2015

The More the Merrier? How much information on government contracts should be published and who will use it by Gavin Hayman.

From the post:

A huge bunch of flowers to Rick Messick for his excellent post asking two key questions about open contracting. And some luxury cars, expensive seafood and a vat or two of cognac.

Our lavish offerings all come from Slovakia, where in 2013 the Government Public Procurement Office launched a new portal publishing all its government contracts. All these items were part of the excessive government contracting uncovered by journalists, civil society and activists. In the case of the flowers, teachers investigating spending at the Department of Education uncovered florists’ bills for thousands of euros. Spending on all of these has subsequently declined: a small victory for fiscal probity.

The flowers, cars, and cognac help to answer the first of two important questions that Rick posed: Will anyone look at contracting information? In the case of Slovakia, it is clear that lowering the barriers to access information did stimulate some form of response and oversight.

The second question was equally important: “How much contracting information should be disclosed?”, especially in commercially sensitive circumstances.

These are two of the key questions that we have been grappling with in our strategy at the Open Contracting Partnership. We thought that we would share our latest thinking below, in a post that is a bit longer than usual. So grab a cup of tea and have a read. We’ll be definitely looking forward to your continued thoughts on these issues.

Not a short read so do grab some coffee (outside of Europe) and settle in for a good read.

Disclosure: I’m financially interested in government disclosure in general and contracts in particular. With openness comes more effort to conceal semantics, which increases the need for topic maps to pierce the darkness.

I don’t think openness reduces the amount of fraud and misconduct in government; it only gives an alignment between citizens and the career interests of a prosecutor a sporting chance to catch someone out.

Disclosure should be as open as possible and what isn’t disclosed voluntarily, well, one hopes for brave souls who will leak the remainder.

Support disclosure of government contracts and leakers of the same.

If you need help “connecting the dots,” consider topic maps.

Early English Books Online – Good News and Bad News

Friday, January 2nd, 2015

Early English Books Online

The very good news is that 25,000 volumes from the Early English Books Online collection have been made available to the public!

From the webpage:

The EEBO corpus consists of the works represented in the English Short Title Catalogue I and II (based on the Pollard & Redgrave and Wing short title catalogs), as well as the Thomason Tracts and the Early English Books Tract Supplement. Together these trace the history of English thought from the first book printed in English in 1475 through to 1700. The content covers literature, philosophy, politics, religion, geography, science and all other areas of human endeavor. The assembled collection of more than 125,000 volumes is a mainstay for understanding the development of Western culture in general and the Anglo-American world in particular. The STC collections have perhaps been most widely used by scholars of English, linguistics, and history, but these resources also include core texts in religious studies, art, women’s studies, history of science, law, and music.

Even better news from Sebastian Rahtz (Chief Data Architect, IT Services, University of Oxford):

The University of Oxford is now making this collection, together with Gale Cengage’s Eighteenth Century Collections Online (ECCO), and Readex’s Evans Early American Imprints, available in various formats (TEI P5 XML, HTML and ePub) initially via the University of Oxford Text Archive at, and offering the source XML for community collaborative editing via Github. For the convenience of UK universities who subscribe to JISC Historic Books, a link to page images is also provided. We hope that the XML will serve as the base for enhancements and corrections.

This catalogue also lists EEBO Phase 2 texts, but the HTML and ePub versions of these can only be accessed by members of the University of Oxford.

[Technical note]
Those interested in working on the TEI P5 XML versions of the texts can check them out of Github, via, where each of the texts is in its own repository (eg There is a CSV file listing all the texts at, and a simple Linux/OSX shell script to clone all 32853 unrestricted repositories at
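The note above describes a CSV listing of the texts and a shell script that clones every unrestricted repository. As a rough illustration of the same idea, here is a minimal Python sketch. The base URL, the column names (`TCP`, `Status`) and the `Free` marker are all placeholders I made up for the example, since the real locations are not given above; check the actual CSV before relying on any of them.

```python
import csv
import io

# Hypothetical placeholder: the real repository base URL is not given above.
BASE_URL = "https://example.org/tcp-texts"


def clone_commands(csv_text, id_column="TCP", status_column="Status",
                   free_value="Free"):
    """Build one `git clone` command per unrestricted text in the listing.

    The column names and the "Free" marker are assumptions made for this
    sketch, not the documented layout of the real CSV file.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    commands = []
    for row in reader:
        # Skip texts that are not marked as unrestricted.
        if row[status_column] == free_value:
            text_id = row[id_column]
            commands.append(["git", "clone", f"{BASE_URL}/{text_id}.git"])
    return commands


# Made-up sample rows standing in for the real listing.
sample = "TCP,Status\nA00001,Free\nA00002,Restricted\nA00003,Free\n"

for cmd in clone_commands(sample):
    print(" ".join(cmd))
```

To actually clone, each command list could be handed to `subprocess.run(cmd, check=True)`. With tens of thousands of repositories you would probably want to pause between clones and skip directories that already exist.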

Now for the BAD NEWS:

An additional 45,000 books:

Currently, EEBO-TCP Phase II texts are available to authorized users at partner libraries. Once the project is done, the corpus will be available for sale exclusively through ProQuest for five years. Then, the texts will be released freely to the public.

Can you guess why the public is barred from what are obviously public domain texts?

Because our funding is limited, we aim to key as many different works as possible, in the language in which our staff has the most expertise.

Academic projects are supposed to fund themselves and be self-sustaining. When anyone asks about the sustainability of an academic project, ask them when your country’s military was last “self-sustaining.” The U.S. has spent $2.6 trillion on a “war on terrorism” and has nothing to show for it other than dead and injured military personnel, perversion of budgetary policies, and loss of privacy on a worldwide scale.

It is hard to imagine what sort of lifetime access for everyone on Earth could be secured for less than $1 trillion. No more special pricing and contracts if you are in countries A to Zed. Eliminate all that paperwork for publishers; to get access, all you need is a connection to the Internet. The publishers would have a guaranteed income stream, less overhead from sales personnel, administrative staff, etc. And people would have access (whether used or not) to educate themselves, to make new discoveries, etc.

My proposal does not involve payments to large military contractors or subversion of legitimate governments or imposition of American values on other cultures. Leaving those drawbacks to one side, what do you think about it otherwise?

Exemplar Public Health Datasets

Friday, November 14th, 2014

Exemplar Public Health Datasets Editor: Jonathan Tedds.

From the post:

This special collection contains papers describing exemplar public health datasets published as part of the Enhancing Discoverability of Public Health and Epidemiology Research Data project commissioned by the Wellcome Trust and the Public Health Research Data Forum.

The publication of the datasets included in this collection is intended to promote faster progress in improving health, better value for money and higher quality science, in accordance with the joint statement made by the forum members in January 2011.

Submission to this collection is by invitation only, and papers have been peer reviewed. The article template and instructions for submission are available here.

Data for analysis as well as examples of best practices for public health datasets.


I first saw this in a tweet by Christophe Lalanne.

analyze survey data for free

Friday, October 24th, 2014

Anthony Damico has “unlocked” a number of public survey data sets with blog posts that detail how to analyze those sets with R.

Forty-six (46) data sets are covered so far:

unlocked public-use data sets

An impressive donation of value to R and public data and an example that merits emulation! Pass this along.

I first saw this in a tweet by Sharon Machlis.

Access vs. Understanding

Monday, July 7th, 2014

In Do doctors understand test results? William Kremer covers Risk savvy : how to make good decisions, a recent book on understanding risk statistics by Gerd Gigerenzer.

When you finish Kremer’s article, you will have little doubt that doctors don’t know the correct risk statistics for very common medical issues (breast cancer screening) and that, even when supplied with the correct information, they are incapable of interpreting it correctly.
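To make the interpretation problem concrete, here is a small sketch of the arithmetic behind a positive screening result. The three input numbers are illustrative assumptions of mine (a 1% prevalence, a 90% detection rate and a 9% false positive rate), not figures quoted from the article; the function simply applies Bayes’ rule.

```python
def positive_predictive_value(prevalence, sensitivity, false_positive_rate):
    """Chance that a person with a positive test actually has the condition."""
    # Share of the whole population that is sick and tests positive.
    true_positives = prevalence * sensitivity
    # Share of the whole population that is healthy but tests positive anyway.
    false_positives = (1 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)


# Illustrative, assumed inputs (not taken from the article).
ppv = positive_predictive_value(
    prevalence=0.01,           # 1 in 100 people screened has the condition
    sensitivity=0.90,          # the test catches 90% of real cases
    false_positive_rate=0.09,  # 9% of healthy people test positive
)
print(f"Chance a positive result is a real case: {ppv:.1%}")
```

Under those assumptions, most positive results are false alarms, because healthy people vastly outnumber sick ones. Stating the same numbers as counts (out of 1,000 people, about 9 true positives against about 89 false positives) is usually easier to follow than the percentages.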

And the public?

Unsurprisingly, patients’ misconceptions about health risks are even further off the mark than doctors’. Gigerenzer and his colleagues asked over 10,000 men and women across Europe about the benefits of PSA screening and breast cancer screening respectively. Most overestimated the benefits, with respondents in the UK doing particularly badly – 99% of British men and 96% of British women overestimated the benefit of the tests. (Russians did the best, though Gigerenzer speculates that this is not because they get more good information, but because they get less misleading information.)

What does that suggest to you about the presentation/interpretation of data encoded with a topic map or not?

To me it says that beyond testing an interface for usability and meeting the needs of users, we need to start testing users’ understanding of the data presented by interfaces. Delivery of great information that leaves a user mis-informed (unless that is intentional) doesn’t seem all that helpful.

I am looking forward to reading Risk savvy : how to make good decisions. I don’t know that I will make “better” decisions but I will know when I am ignoring the facts. 😉

I first saw this in a tweet by Alastair Kerr.

Is It in the Public Domain?

Wednesday, June 11th, 2014

The Samuelson Clinic releases “Is it in the Public Domain?” handbook

From the post:

The Samuelson Clinic is excited to release a handbook, “Is it in the Public Domain?,” and accompanying visuals. These educational tools help users to evaluate the copyright status of a work created in the United States between January 1, 1923 and December 31, 1977—those works that were created before today’s 1976 Copyright Act. Many important works—from archival materials to family photos and movies—were created during this time, and it can be difficult to tell whether they are still under copyright.

The handbook walks readers through a series of questions—illustrated by accompanying charts—to help readers explore whether a copyrighted work from that time is in the public domain, and therefore free to be used without permission from a copyright owner. Knowing whether a work is in the public domain or protected by copyright is an important first step in any decision regarding whether or how to make use of a work.

The handbook was originally developed for the Student Nonviolent Coordinating Committee Legacy Project (“SLP”), a nonprofit organization run by civil rights movement veterans that is creating a digital archive of historical materials.


This is the resource to reference when questions of “public domain” come up in project discussions.

If you need more advice than you find here, get legal counsel. Intellectual property law isn’t a good area for learning experiences. That is to say the experiences can be quite unpleasant and expensive.

I first saw this in a tweet by Michael Peter Edson.

PLOS’ Bold Data Policy

Tuesday, March 4th, 2014

PLOS’ Bold Data Policy by David Crotty.

From the post:

If you pay any attention at all to scholarly publishing, you’re likely aware of the current uproar over PLOS’ recent announcement requiring all article authors to make their data publicly available. This is a bold move, and a forward-looking policy from PLOS. It may, for many reasons, have come too early to be effective, but ultimately, that may not be the point.

Perhaps the biggest practical problem with PLOS’ policy is that it puts an additional time and effort burden on already time-short, over-burdened researchers. I think I say this in nearly every post I write for the Scholarly Kitchen, but will repeat it again here: Time is a researcher’s most precious commodity. Researchers will almost always follow the path of least resistance, and not do anything that takes them away from their research if it can be avoided.

When depositing NIH-funded papers in PubMed Central was voluntary, only 3.8% of eligible papers were deposited, not because people didn’t want to improve access to their results, but because it wasn’t required and took time and effort away from experiments. Even now, with PubMed Central deposit mandatory, only 20% of what’s deposited comes from authors. The majority of papers come from journals depositing on behalf of authors (something else for which no one seems to give publishers any credit, Kent, one more for your list). Without publishers automating the process on the author’s behalf, compliance would likely be vastly lower. Lightening the burden of the researcher in this manner has become a competitive advantage for the journals that offer this service.

While recognizing the goal of researchers to do more experiments, isn’t this reminiscent of the lack of documentation for networks and software?

Creators of networks and software want to get on with the work they enjoy, and documentation is not part of that work.

The problem with the semantics of research data, much as it is with network and software semantics, is that there is no one else to ask about its semantics. If researchers don’t document those semantics as they perform experiments, then they will have to spend the time at publication to gather that information together.

I sense an opportunity here for software to assist researchers in capturing semantics as they perform experiments, so that production of semantically annotated data at the end of an experiment can be largely a clerical task, subject to review by the actual researchers.

The minimal semantics that need to be captured for different types of research will vary. That is all the more reason to research and document those semantics before anyone writes a complex monolith of semantics into which existing semantics must be shoehorned.

The reasoning: if we don’t know the semantics of data, it is more cost-effective to pipe it to /dev/null.

I first saw this in a tweet by ChemConnector.

Data Access for the Open Access Literature: PLOS’s Data Policy

Wednesday, February 26th, 2014

Data Access for the Open Access Literature: PLOS’s Data Policy by Theo Bloom.

From the post:

Data are any and all of the digital materials that are collected and analyzed in the pursuit of scientific advances. In line with Open Access to research articles themselves, PLOS strongly believes that to best foster scientific progress, the underlying data should be made freely available for researchers to use, wherever this is legal and ethical. Data availability allows replication, reanalysis, new analysis, interpretation, or inclusion into meta-analyses, and facilitates reproducibility of research, all providing a better ‘bang for the buck’ out of scientific research, much of which is funded from public or nonprofit sources. Ultimately, all of these considerations aside, our viewpoint is quite simple: ensuring access to the underlying data should be an intrinsic part of the scientific publishing process.

PLOS journals have requested data be available since their inception, but we believe that providing more specific instructions for authors regarding appropriate data deposition options, and providing more information in the published article as to how to access data, is important for readers and users of the research we publish. As a result, PLOS is now releasing a revised Data Policy that will come into effect on March 1, 2014, in which authors will be required to include a data availability statement in all research articles published by PLOS journals; the policy can be found below. This policy was developed after extensive consultation with PLOS in-house professional and external Academic Editors and Editors in Chief, who are practicing scientists from a variety of disciplines.

We now welcome input from the larger community of authors, researchers, patients, and others, and invite you to comment before March. We encourage you to contact us collectively at; feedback via Twitter and other sources will also be monitored. You may also contact individual PLOS journals directly.

That is a large step towards verifiable research and was taken by PLOS in December of 2013.

That has been supplemented with details that do not change the December announcement in: PLOS’ New Data Policy: Public Access to Data by Liz Silva, which reads in part:

A flurry of interest has arisen around the revised PLOS data policy that we announced in December and which will come into effect for research papers submitted next month. We are gratified to see a huge swell of support for the ideas behind the policy, but we note some concerns about how it will be implemented and how it will affect those preparing articles for publication in PLOS journals. We’d therefore like to clarify a few points that have arisen and once again encourage those with concerns to check the details of the policy or our FAQs, and to contact us with concerns if we have not covered them.

I think the bottom line is: Don’t Panic, Ask.

There are always going to be unanticipated details or concerns but as time goes by and customs develop for how to solve those issues, the questions will become fewer and fewer.

Over time, and not that much time, our history of arrangements other than open access is going to puzzle present and future generations of researchers.

Free Access to EU Satellite Data

Thursday, November 14th, 2013

Free Access to EU Satellite Data (Press Release, Brussels, 13 November 2013).

From the release:

The European Commission will provide free, full and open access to a wealth of important environmental data gathered by Copernicus, Europe’s Earth observation system. The new open data dissemination regime, which will come into effect next month, will support the vital task of monitoring the environment and will also help Europe’s enterprises, creating new jobs and business opportunities. Sectors positively stimulated by Copernicus are likely to be services for environmental data production and dissemination, as well as space manufacturing. Indirectly, a variety of other economic segments will see the advantages of accurate earth observation, such as transport, oil and gas, insurance and agriculture. Studies show that Copernicus – which includes six dedicated satellite missions, the so-called Sentinels, to be launched between 2014 and 2021 – could generate a financial benefit of some € 30 billion and create around 50.000 jobs by 2030. Moreover, the new open data dissemination regime will help citizens, businesses, researchers and policy makers to integrate an environmental dimension into all their activities and decision making procedures.

To make maximum use of this wealth of information, researchers, citizens and businesses will be able to access Copernicus data and information through dedicated Internet-based portals. This free access will support the development of useful applications for a number of different industry segments (e.g. agriculture, insurance, transport, and energy). Other examples include precision agriculture or the use of data for risk modelling in the insurance industry. It will fulfil a crucial role, meeting societal, political and economic needs for the sustainable delivery of accurate environmental data.

More information on the Copernicus web site at:

The “€ 30 billion” financial benefit seems a bit soft after looking at the study reports on the economic value of Copernicus.

For example, if Copernicus is used to monitor illegal dumping (D. Drimaco, Waste monitoring service to improve waste management practices and detect illegal landfills), how is a financial benefit calculated for illegal dumping prevented?

If you are the Office of Management and Budget (U.S.), you could simply make up the numbers and report them in near indecipherable documents. (Free Sequester Data Here!)

I don’t doubt there will be economic benefits from Copernicus but questions remain: how much and for whom?

I first saw this in a tweet by Stefano Bertolo.

Enigma

Friday, May 10th, 2013


I suppose it had to happen: with all the noise about public data sets, someone would create a startup to search them. 😉

Not a lot of detail at the site but you can sign up for a free trial.


100,000+ Public Data Sources: Access everything from import bills of lading, to aircraft ownership, lobbying activity, real estate assessments, spectrum licenses, financial filings, liens, government spending contracts and much, much more.

Augment Your Data: Get a more complete picture of investments, customers, partners, and suppliers. Discover unseen correlations between events, geographies and transactions.

API Access: Get direct access to the data sets, relational engine and NLP technologies that power Enigma.

Request Custom Data: Can’t find a data set anywhere else? Need to synthesize data from disparate sources? We are here to help.

Discover While You Work: Never miss a critical piece of information. Enigma uncovers entities in context, adding intelligence and insight to your daily workflow.

Powerful Context Filters: Our vast collection of public data sits atop a proprietary data ontology. Filter results by topics, tags and source to quickly refine and scope your query.

Focus on the Data: Immerse yourself in the details. Data is presented in its raw form, full screen and without distraction.

Curated Metadata: Source data is often unorganized and poorly documented. Our domain experts focus on sanitizing, organizing and annotating the data.

Easy Filtering: Rapidly prototype hypotheses by refining and shaping data sets in context. Filter tools allow the sorting, refining, and mathematical manipulation of data sets.

The “proprietary data ontology” jumps out at me as an obvious question. Do users get to know what the ontology is?

Not to mention the “our domain experts focus on sanitizing,….” That works for some cases; take legal research, for example. I am not sure that “your” experts work as well as “my” experts for less focused areas.

Looking forward to learning more about Enigma!

Scenes from a Dive

Wednesday, March 20th, 2013

Scenes from a Dive – what’s big data got to do with fighting poverty and fraud? by Prasanna Lal Das.

From the post:

A more detailed recap will follow soon but here’s a very quick hats off to the about 150 data scientists, civic hackers, visual analytics savants, poverty specialists, and fraud/anti-corruption experts that made the Big Data Exploration at Washington DC over the weekend such an eye-opener. We invite you to explore the work that the volunteers did (these are rough documents and will likely change as you read them so it’s okay to hold off if you would rather wait for a ‘final’ consolidated document). The projects that the volunteers worked on include:

Here are some visualizations that some project teams built. A few photos from the event are here (thanks @neilfantom). More coming soon (and yes, videos too!). Thanks @francisgagnon for the first blog about the event. The event hashtag was #data4good (follow @datakind and @WBopenfinances for more updates on Twitter).

Great meeting and projects, but I would suggest a different sort of “big data”:

A better start would be to require recipients to grant reporting access to all bank accounts where funds will be transferred, and to require the same of any entity paid out of those accounts, down to the point where transfers over 90 days are less than $1,000 for any entity (or related entity).

With the exception of the “related entity” information, banks already keep transfer of funds information as a matter of routine business. It would be “big data” that is rich in potential for spotting fraud and waste.

The reporting banks should also be required to deliver other banking records they have on the accounts where funds are transferred and other activity in those accounts.

Before crying “invasion of privacy,” remember World Bank funding is voluntary.

As is acceptance of payment from World Bank funded projects. Anyone and everyone is free to decline such funding and avoid the proposed reporting requirements.

“Big data” to track fraud and waste is already collected by the banking industry.

The question is whether we will use that “big data” to effectively track fraud and waste or wait for particularly egregious cases to come to light.

February NYC DataKind Meetup (video)

Friday, March 15th, 2013

February NYC DataKind Meetup (video)

From the post:

A video of our February NYC DataKind Meetup is online for those of you who couldn’t join us in New York. Hear about the projects our amazing Data Ambassadors are working on with Medic Mobile, Sunlight Foundation, and Refugees United as well as listen to Anoush Tatevossian from the UN Global Pulse talk about how the UN is using data for the greater good. It was a fantastic event and we’re thrilled to get to share it with all of you.

A great pre-meeting format, beer first and during the presentations.

Need to recommend that format to Balisage.

None for the speaker, though; they could be the “designated driver” before and during their presentation.

New Army Guide to Open-Source Intelligence

Sunday, September 16th, 2012

New Army Guide to Open-Source Intelligence

If you don’t know Full Text Reports, you should.

A top-tier research professional’s hand-picked selection of documents from academe, corporations, government agencies, interest groups, NGOs, professional societies, research institutes, think tanks, trade associations, and more.

You will winnow some chaff but also find jewels like Open Source Intelligence (PDF).

From the post:

  • Provides fundamental principles and terminology for Army units that conduct OSINT exploitation.
  • Discusses tactics, techniques, and procedures (TTP) for Army units that conduct OSINT exploitation.
  • Provides a catalyst for renewing and emphasizing Army awareness of the value of publicly available information and open sources.
  • Establishes a common understanding of OSINT.
  • Develops systematic approaches to plan, prepare, collect, and produce intelligence from publicly available information from open sources.

Impressive intelligence overview materials.

Would be nice to re-work into a topic map intelligence approach document with the ability to insert a client’s name and industry-specific examples. It has that militaristic tone that is hard to capture with civilian writers.

Importing public data with SAS instructions into R

Wednesday, July 11th, 2012

Importing public data with SAS instructions into R by David Smith.

From the post:

Many public agencies release data in a fixed-format ASCII (FWF) format. But with the data all packed together without separators, you need a “data dictionary” defining the column widths (and metadata about the variables) to make sense of them. Unfortunately, many agencies make such information available only as a SAS script, with the column information embedded in a PROC IMPORT statement.

David reports on the SAScii package from Anthony Damico.

You still have to parse the files but it gets you one step closer to having useful information.
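To make the “data dictionary” idea concrete, here is a minimal sketch in Python of slicing a fixed-width record by column widths. It is not the SAScii package (which is for R) and does not read SAS scripts; the layout, field names and sample records are all made up for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical data dictionary: (column name, width in characters).
# In practice these widths would come from the agency's documentation.
LAYOUT: List[Tuple[str, int]] = [("state", 2), ("year", 4), ("amount", 6)]


def parse_fixed_width(line: str, layout: List[Tuple[str, int]]) -> Dict[str, str]:
    """Slice one fixed-width record into named fields using the column widths."""
    record: Dict[str, str] = {}
    start = 0
    for name, width in layout:
        # Each field occupies exactly `width` characters; padding is stripped.
        record[name] = line[start:start + width].strip()
        start += width
    return record


# Made-up sample records, 12 characters each, with no separators.
sample_lines = ["NY2012  1500", "CA2011 20375"]
rows = [parse_fixed_width(line, LAYOUT) for line in sample_lines]
```

Without the layout, the two sample lines are just runs of characters, which is the problem the quoted post describes.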

Data-gov Wiki

Monday, June 27th, 2011

Data-gov Wiki

From the wiki:

The Data-gov Wiki is a project being pursued in the Tetherless World Constellation at Rensselaer Polytechnic Institute. We are investigating open government datasets using semantic web technologies. Currently, we are translating such datasets into RDF, getting them linked to the linked data cloud, and developing interesting applications and demos on linked government data. Most of the datasets shown on this page come from the US government’s Web site, although some are from other countries or non-government sources.

Try out their Drupal site with new demos:

Linking Open Government Data

Setting my misgivings about the “openness” that releasing government data brings to one side, the Drupal site is a job well done and merits your attention.
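For readers wondering what “translating such datasets into RDF” might look like at its simplest, here is a rough Python sketch that turns one tabular row into N-Triples statements. It is an assumption about the general idea, not the Data-gov Wiki’s actual conversion pipeline; the namespace, the row identifier and the sample row are hypothetical.

```python
from typing import Dict, Iterator
from urllib.parse import quote

# Hypothetical namespace; a real project would use its own published URIs.
BASE = "http://example.org/dataset/"


def escape_literal(value: str) -> str:
    """Escape backslashes, double quotes and newlines for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def row_to_ntriples(row_id: str, row: Dict[str, str]) -> Iterator[str]:
    """Yield one N-Triples statement per column of a tabular row."""
    subject = f"<{BASE}row/{quote(row_id)}>"
    for column, value in row.items():
        # Column names become predicates under the hypothetical namespace.
        predicate = f"<{BASE}prop/{quote(column)}>"
        yield f'{subject} {predicate} "{escape_literal(value)}" .'


# Made-up sample row.
triples = list(row_to_ntriples("1", {"agency": "EPA", "budget": "100"}))
```

Every value here stays a plain string literal. Getting such output “linked to the linked data cloud” would additionally mean replacing values like the agency name with URIs shared with other datasets, which this sketch does not attempt.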

Open Government Data 2011 wrap-up

Sunday, June 19th, 2011

Open Government Data 2011 wrap-up by Lutz Maicher.

From the post:

On June 16, 2011 the OGD 2011 – the first Open Data Conference in Austria – took place. Thanks to a lot of preliminary work of the Semantic Web Company the topic open (government) data is very hot in Austria, especially in Vienna and Linz. Hence 120 attendees (see the list here) for the first conference is a real success. Congrats to the organizers. And congrats to the community which made the conference to a very vital and interesting event.

If there is a Second Open Data Conference, it is a venue where topic maps should put in an appearance.

PublicData.EU Launched During DAA

Sunday, June 19th, 2011

PublicData.EU Launched During DAA

From the post:

During the Digital Agenda Assembly this week in Brussels the new portal PublicData.EU was launched in beta. This is a step aimed to make public data easier to find across the EU. As it says on the ‘about’ page:

“In order to unlock the potential of digital public sector information, developers and other prospective users must be able to find datasets they are interested in reusing. PublicData.EU will provide a single point of access to open, freely reusable datasets from numerous national, regional and local public bodies throughout Europe.

Information about European public datasets is currently scattered across many different data catalogues, portals and websites in many different languages, implemented using many different technologies. The kinds of information stored about public datasets may vary from country to country, and from registry to registry. PublicData.EU will harvest and federate this information to enable users to search, query, process, cache and perform other automated tasks on the data from a single place. This helps to solve the “discoverability problem” of finding interesting data across many different government websites, at many different levels of government, and across the many governments in Europe.

In addition to providing access to official information about datasets from public bodies, PublicData.EU will capture (proposed) edits, annotations, comments and uploads from the broader community of public data users. In this way, PublicData.EU will harness the social aspect of working with data to create opportunities for mass collaboration. For example, a web developer might download a dataset, convert it into a new format, upload it and add a link to the new version of the dataset for others to use. From fixing broken URLs or typos in descriptions to substantive comments or supplementary documentation about using the datasets, PublicData.EU will provide up to date information for data users, by data users.”

PublicData.EU is built by the Open Knowledge Foundation as part of the LOD2 project. “PublicData.EU is powered by CKAN, a data catalogue system used by various institutions and communities to manage open data. CKAN and all its components are open source software and used by a wide community of catalogue operators from across Europe, including the UK Government’s portal.”

Here’s a European marketing opportunity for topic maps. How would a topic map solution be different from what is offered here? (There are similar opportunities in the US as well.)
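Since the portal is described as CKAN-powered, a dataset search against a CKAN catalogue can be sketched as below. The sketch only builds the request URL for CKAN’s `package_search` action and makes no network call. The catalogue address is a placeholder, not the real portal, and whether a particular site exposes version 3 of the Action API is an assumption to check.

```python
from urllib.parse import urlencode


def package_search_url(base_url: str, query: str, rows: int = 10) -> str:
    """Build a URL for CKAN's package_search action (Action API v3 assumed)."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


# Placeholder catalogue address; substitute the portal you actually want to query.
url = package_search_url("https://catalogue.example.org/", "air quality", rows=5)
```

Fetching that URL from a live CKAN site should return JSON describing the matching datasets. A topic map solution would presumably start from the same records and then merge entries that describe the same subject across catalogues.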