Archive for the ‘Census Data’ Category

Australian Census Data and Same Sex Marriage

Tuesday, December 5th, 2017

Combining Australian Census data with the Same Sex Marriage Postal Survey in R by Miles McBain.

Last week I put out a post that showed you how to tidy the Same Sex Marriage Postal Survey Data in R. In this post we’ll visualise that data in combination with the 2016 Australian Census. Note to people just here for the R — the main challenge here is actually just navigating the ABS’s Census DataPack, but I’ve tried to include a few pearls of wisdom on joining datasets to keep things interesting for you.

Decoding the “datapack” is an early task:

The datapack consists of 59 encoded csv files and 3 metadata excel files that will help us decode their meaning. What? You didn’t think this was going to be straight forward did you?

When I say encoded, I mean the csv’s have inscrutable names like ‘2016Census_G09C.csv’ and contain column names like ‘Se_d_r_or_t_h_t_Tot_NofB_0_ib’ (H.T. @hughparsonage).

Two of the metadata files in /Metadata/ have useful applications for us. ‘2016Census_geog_desc_1st_and_2nd_release.xlsx’ will help us resolve encoded geographic areas to federal electorate names. ‘Metadata_2016_GCP_DataPack.xlsx’ lists the topics of each of the 59 tables and will allow us to replace a short and uninformative column name with a much longer, and slightly more informative name….

Followed by the joys of joining and analyzing the data sets.

McBain develops original analysis of the data that demonstrates a relationship between having children and opinions on the impact of same sex marriage on children.

No, I won’t repeat his insight. Read his post, it’s quite entertaining.

OECD Gender Data Portal

Wednesday, March 9th, 2016

OECD Gender Data Portal

From the webpage:

The OECD Gender Data Portal includes selected indicators shedding light on gender inequalities in education, employment, entrepreneurship, health and development, showing how far we are from achieving gender equality and where actions is most needed. The data cover OECD member countries, as well as Brazil, China, India, Indonesia, and South Africa. (emphasis in the original)

Data indicators are grouped into employment, education, entrepreneurship, health and development.

A major data source for those who feel the need to demonstrate gender discrimination, but it seems to be lacking for those of us who take gender discrimination as given.

That is, those of us who acknowledge that gender discrimination exists and should be changed, and who are looking for solutions, not more documentation that it exists.

Sadly, pointing out that gender discrimination exists has proven largely ineffectual in reducing its presence in American society, for example.

Gender and gender discrimination, being extremely complex human issues, can be summarized but not addressed by macro-level data.

Changing hearts and minds is more a matter of personal interaction at a micro-level. To that extent the macro-level data may motivate us to more personal interaction but isn’t the answer to what is a person-to-person issue.

4 Tips to Learn More About ACS Data [$400 Billion Market, 3X Big Data]

Saturday, November 14th, 2015

4 Tips to Learn More About ACS Data by Ari Lamstein.

From the post:

One of the highlights of my recent east coast trip was meeting Ezra Haber Glenn, the author of the acs package in R. The acs package is my primary tool for accessing census data in R, and I was grateful to spend time with its author. My goal was to learn how to “take the next step” in working with the census bureau’s American Community Survey (ACS) dataset. I learned quite a bit during our meeting, and I hope to share what I learned over the coming weeks on my blog.

Today I’ll share 4 tips to help you get started in learning more. Before doing that, though, here is some interesting trivia: did you know that the ACS impacts how over $400 billion is allocated each year?

If the $400 billion got your attention, follow the tips in Ari’s post first, look for more posts in that series second, then visit the American Community Survey (ACS) website.

For comparison purposes, keep in mind that Forbes projects the Big Data Analytics market in 2015 to be a paltry $125 Billion.

The ACS data market is over 3 times larger: $400 Billion (ACS) versus $125 Billion (Big Data) for 2015.

Suddenly, ACS data and R look quite attractive.

Using Graph Structure Record Linkage on Irish Census

Friday, October 9th, 2015

Using Graph Structure Record Linkage on Irish Census Data with Neo4j by Brian Underwood.

From the post:

For just over a year I’ve been obsessed on-and-off with a project ever since I stayed in the town of Skibbereen, Ireland. Taking data from the 1901 and 1911 Irish censuses, I hoped I would be able to find a way to reliably link resident records from the two together to identify the same residents.

Since then I’ve learned a bit about master data management and record linkage and so I thought I would give it another stab.

Here I’d like to talk about how I’ve been matching records based on the local data space around objects to improve my record linkage scoring.

An interesting issue that has currency with intelligence agencies slurping up digital debris at every opportunity. So you have trillions of records. Which ones have you been able to reliably match up?

From a topic map perspective, I could not help but notice that in the 1901 census, the categories for Marriage were:

  • Married
  • Widower
  • Widow
  • Not Married

Whereas the 1911 census records:

  • Married
  • Widower
  • Widow
  • Single

As you know, one of the steps in record linkage is normalization of the data headings and values before you apply the standard techniques to link records together.

In traditional record linkage, the shift from “not married” to “single” is lost in the normalization.

May not be meaningful for your use case but could be important for someone studying shifts in marital relationship language. Or shifts in religious, ethnic, or racist language.

Or for that matter, shifts in the names of database column headers and/or tables. (Like anyone thinks those are stable.)
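One way to keep that shift visible is to normalize into a separate field instead of overwriting the recorded value. A minimal Python sketch, assuming a hypothetical mapping table and record layout (neither comes from Brian's post):

```python
# Hypothetical mapping from recorded census values to a normalized label.
# The recorded categories are the ones listed above; the normalized
# labels are illustrative only.
MARITAL_STATUS_MAP = {
    "married": "married",
    "widower": "widower",
    "widow": "widow",
    "not married": "single",  # 1901 wording
    "single": "single",       # 1911 wording
}


def normalize_marital_status(record):
    """Return a copy of the record with a normalized status added.

    The value as recorded is kept under 'marital_status_recorded', so the
    1901 'Not Married' versus 1911 'Single' wording survives normalization.
    """
    recorded = record["marital_status"]
    normalized = MARITAL_STATUS_MAP.get(recorded.strip().lower(), "unknown")
    out = dict(record)
    out["marital_status_recorded"] = recorded
    out["marital_status"] = normalized
    return out


r1901 = normalize_marital_status({"census": 1901, "marital_status": "Not Married"})
r1911 = normalize_marital_status({"census": 1911, "marital_status": "Single"})
```

The two records now compare equal on the normalized field for linkage purposes, while anyone studying the change in language can still query the recorded field.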

Pay close attention to how Brian models similarity candidates.

Once you move beyond string equivalent identifiers (TMDM), you are going to be facing the same issues.

American FactFinder

Saturday, February 14th, 2015

American FactFinder

From the webpage:

American FactFinder provides access to data about the United States, Puerto Rico and the Island Areas. The data in American FactFinder come from several censuses and surveys. For more information see Using FactFinder and What We Provide.

As I was writing this post I returned to CensusReporter (2013) which reported on an effort to make U.S. census data easier to use. Essentially a common toolkit.

At that time CensusReporter was in “beta” but has long passed that stage! Whether you will prefer American FactFinder or CensusReporter will depend upon you and your requirements.

I can say that CensusReporter is working on a tool to aggregate American Community Survey data to non-census geographies. That could prove to be quite useful.


Think Big Challenge 2014 [Census Data – Anonymized]

Monday, October 27th, 2014

Think Big Challenge 2014 [Census Data – Anonymized]

The Think Big Challenge 2014 closed October 19, 2014, but the data sets for that challenge remain available.

From the data download page:

This subdirectory contains a small extract of the data set (1,000 records). There are two data sets provided:

A complete set of records from after the year 1820 is available for download from Amazon S3 as a 127MB gzip file.

A sample of records pre-1820 for use in the data science “Learning of Common Ancestors” challenge. This can be downloaded as a 4MB gzip file.

The records have been pre-processed:

The contest data set includes both publicly available records (e.g., census data) and user-contributed submissions. To preserve user privacy, all surnames present in the data have been obscured with a hash function. The hash is constructed such that all occurrences of the same string will result in the same hash code.

Reader exercise: You can find multiple ancestors of yours in these records with different surnames and compare those against the hash function results. How many will you need to reverse the hash function and recover all the surnames? Use other ancestors of yours to check your results.
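A related route is a plain dictionary attack: because the same string always gives the same hash code, anyone who can guess the function can hash a list of likely surnames and look the results up. The sketch below is only an illustration. The contest's actual hash function is not stated here, so an unsalted SHA-256 stands in for it, and the surname lists are made up.

```python
import hashlib


def obscure(surname):
    """Stand-in for the contest's hash: unsalted SHA-256 (an assumption)."""
    return hashlib.sha256(surname.encode("utf-8")).hexdigest()


def build_lookup(candidate_surnames):
    """Map hash code -> surname for every surname we are able to guess."""
    return {obscure(name): name for name in candidate_surnames}


def recover(hashed_values, lookup):
    """Return {hash code: surname} for each hashed value found in the lookup."""
    return {h: lookup[h] for h in hashed_values if h in lookup}


# Hypothetical data: hash codes as they might appear in the records ...
published = [obscure("Smith"), obscure("O'Brien"), obscure("Zzyzx")]
# ... and a guess list, for example taken from a public surname frequency table.
guesses = ["Smith", "Jones", "O'Brien"]

recovered = recover(published, build_lookup(guesses))
```

Here two of the three obscured surnames are recovered, and the third stays hidden only because it was not on the guess list.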

Take a look at the original contest tasks for inspiration. What other online records would you want to merge with these? Thinking local newspapers? What about law reporters?


I first saw this mentioned on Danny Bickson’s blog as: Interesting dataset from

Update: I meant to mention Risks of Not Understanding a One-Way Function by Bruce Schneier, to get you started on the deanonymization task. Apologies for the omission.

If you are interested in cryptography issues, following Bruce Schneier’s blog should be on your regular reading list.

New Data Sets Available in Census Bureau API

Monday, June 9th, 2014

New Data Sets Available in Census Bureau API

From the post:

Today the U.S. Census Bureau added several data sets to its application programming interface, including 2013 population estimates and 2012 nonemployer statistics.

The Census Bureau API allows developers to create a variety of apps and tools, such as ones that allow homebuyers to find detailed demographic information about a potential new neighborhood. By combining Census Bureau statistics with other data sets, developers can create tools for researchers to look at a variety of topics and how they impact a community.

Data sets now available in the API are:

  • July 1, 2013, national, state, county and Puerto Rico population estimates
  • 2012-2060 national population projections
  • 2007 Economic Census national, state, county, place and region economy-wide key statistics
  • 2012 Economic Census national economy-wide key statistics
  • 2011 County Business Patterns at the national, state and county level (2012 forthcoming)
  • 2012 national, state and county nonemployer statistics (businesses without paid employees)

The API also includes three decades (1990, 2000 and 2010) of census statistics and statistics from the American Community Survey covering one-, three- and five-year periods of data collection. Developers can access the API online and share ideas through the Census Bureau’s Developers Forum. Developers can use the Discovery Tool to examine the variables available in each dataset.

In case you are looking for census data to crunch!
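If you want a starting point, a request to the API is an ordinary URL with a list of variables and a geography. The sketch below only builds such a URL and does not call the service. The base URL, the dataset path `2010/dec/sf1`, and the variable name `P001001` (total population) are assumptions to check against the Discovery Tool before use, and `build_query_url` is a hypothetical helper, not part of any Census Bureau library.

```python
from urllib.parse import urlencode

# Assumed base URL; confirm it against the Census Bureau developer pages.
BASE_URL = "https://api.census.gov/data"


def build_query_url(dataset_path, variables, geography, api_key=None):
    """Build a query URL for one dataset.

    dataset_path: e.g. "2010/dec/sf1" (assumed path for the 2010 census)
    variables:    list of variable names to request
    geography:    a "for" clause such as "state:*"
    api_key:      optional key, sent as the "key" parameter
    """
    params = {"get": ",".join(variables), "for": geography}
    if api_key:
        params["key"] = api_key
    # Keep commas, colons and asterisks readable instead of percent-encoding them.
    return f"{BASE_URL}/{dataset_path}?{urlencode(params, safe=',:*')}"


url = build_query_url("2010/dec/sf1", ["NAME", "P001001"], "state:*")
```

Passing the resulting URL to any HTTP client should return the requested rows, assuming the path and variable names are valid for the dataset you pick.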


Precision from Disaggregation

Friday, April 18th, 2014

Building Precise Maps with Disser by Brandon Martin-Anderson.

From the post:

Spatially aggregated statistics are pretty great, but what if you want more precision? Here at Conveyal we built a utility to help with that: aggregate-disser. Let me tell you how it works.

Let’s start with a classic aggregated data set – the block-level population counts from the US Census. Here’s a choropleth map of total population for blocks around lower Manhattan and Brooklyn. The darkest shapes contain about five thousand people.

Brandon combines census data with other data sets to go from census blocks of up to 5,000 people to placing every job and bed in Manhattan in an individual building.

Very cool!

Not to mention instructive when you encounter group subjects that need to be disaggregated before being combined with other data.
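The basic move can be sketched in a few lines of Python. This is not Brandon's code: it assumes that a block total is split across buildings in proportion to some weight (floor area here, with invented figures) and uses the largest-remainder method so the parts stay whole numbers.

```python
def disaggregate(total, weights):
    """Split an integer total across units in proportion to their weights.

    Uses the largest-remainder method so that every part is a whole number
    and the parts sum exactly to the total.
    """
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive number")
    exact = {key: total * w / weight_sum for key, w in weights.items()}
    parts = {key: int(value) for key, value in exact.items()}
    leftover = total - sum(parts.values())
    # Hand the remaining units to the largest fractional remainders.
    by_remainder = sorted(exact, key=lambda k: exact[k] - parts[k], reverse=True)
    for key in by_remainder[:leftover]:
        parts[key] += 1
    return parts


# Hypothetical block of 5,000 people and three buildings with floor areas.
block_population = 5000
floor_area = {"building_a": 12000.0, "building_b": 6000.0, "building_c": 2000.0}
people = disaggregate(block_population, floor_area)
```

The choice of weight carries all the real information. Floor area is just one plausible proxy, and a different weight (units, bedrooms, job counts) would give a different map.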

I first saw this in a tweet by The O.C.R.

1939 Register

Thursday, March 27th, 2014

1939 Register

From the webpage:

The 1939 Register is being digitised and will be published within the next two years.

It will provide valuable information about over 30 million people living in England and Wales at the start of World War Two.

What is the 1939 Register?

The British government took a record of the civilian population shortly after the outbreak of World War Two. The information was used to issue identity cards and organise rationing. It was also used to set up the National Health Service.

Explanations are one of the perils of picking very obvious/intuitive names for projects. 😉

The data should include:

Data will be provided only where the individual is recorded as deceased (or where clear evidence of death can be provided by the applicant) and will include;

  • National Registration number
  • Address
  • Surname
  • First Forename
  • Other Forename(s)/Initial(s)
  • Date of Birth
  • Sex
  • Marital Status
  • Occupation

As per the 1939 Register Service, a government office that charges money to search what one assumes are analog records. (Yikes!)

The reason I mention the 1939 Register Service is the statement:

Is any other data available?

If you wish to request additional information under the Freedom of Information Act 2000, please email or contact us using the postal address below, marking the letter for the Higher Information Governance Officer (Southport).

Which implies to me there is more data to be had, but the Service says not.

Well, assuming you don’t include:

“If member of armed forces or reserves,” which was column G on the original form.

Hard to say why that would be omitted.

It will be interesting to see if the original and then “updated” cards are digitized.

In some of the background reading I did on this data, some mothers omitted their sons from the registration cards (one assumes to avoid military service) but when rationing began based on the registration cards, they filed updated cards to include their sons.

I suspect the 1939 data will be mostly of historical interest but wanted to mention it because people will be interested in it.

Open Census Data (UK)

Monday, January 6th, 2014

Open Census Data

From the post:

First off, congratulations to Jeni Tennison OBE and Keith Dugmore MBE on their gongs for services to Open Data. As we release our Census Data as Open Data it is worth remembering how ‘bad’ things were before Keith’s tireless campaign for Open Census data. Young data whippersnappers may not believe this, but when I first started working with Census data a corporate license for the ED boundaries (just the boundaries, no actual flippin’ data) was £80,000. In the late 90′s a simple census reporting tool in a GIS usually involved license fees of more than £250K. Today using QGIS, POSTGIS, opendata and a bit of imagination you could have such a thing for £0K license costs

Talking of Census data, we’ve released our full UK census data pack today as Open Data. You can access it here.

Good news on all fronts!

However, I am waiting for “open data” to trickle down to the drafts of agency budgets and details of purchases and other expenditures with the payees being identified.

With that data you could draw boundaries around the individuals and groups favored by an agency.

I don’t know what the results would be in the UK but I would wager considerable sums on the results if applied in Washington, D.C.

You would find out where retirees from federal “service” go when they retire. (Hint, it’s not Florida.)

Discover Your Neighborhood with Census Explorer

Wednesday, December 25th, 2013

Discover Your Neighborhood with Census Explorer by Michael Ratcliffe.

From the post:

Our customers often want to explore neighborhood-level statistics and see how their communities have changed over time. Our new Census Explorer interactive mapping tool makes this easier than ever. It provides statistics on a variety of topics, such as percent of people who are foreign-born, educational attainment and homeownership rate. Statistics from the 2008 to 2012 American Community Survey power Census Explorer.

While you may be familiar with other ways to find neighborhood-level statistics, Census Explorer provides an interactive map for states, counties and census tracts. You can even look at how these neighborhoods have changed over time because the tool includes information from the 1990 and 2000 censuses in addition to the latest American Community Survey statistics. Seeing these changes is possible because the annual American Community Survey replaced the decennial census long form, giving communities throughout the nation more timely information than just once every 10 years.

Topics currently available in Census Explorer:

  • Total population
  • Percent 65 and older
  • Foreign-born population percentage
  • Percent of the population with a high school degree or higher
  • Percent with a bachelor’s degree or higher
  • Labor force participation rate
  • Home ownership rate
  • Median household income

Fairly coarse (census tract level) data but should be useful for any number of planning purposes.

For example, you could cross this data with traffic ticket and arrest data to derive “police presence” statistics.

Or add “citizen watcher” data from tweets about police car # and locations.

Different data sets often use different boundaries for areas.

Consider creating topic map based filters so when the boundaries change (a favorite activity of local governments) so will your summaries of that data.
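Setting the topic map machinery aside, the underlying idea can be shown with a plain crosswalk table: keep the boundary relationships in one place, and recompute summaries from it. The sketch below assumes a hypothetical crosswalk that gives the share of each census tract falling in each target area; all names and numbers are invented.

```python
from collections import defaultdict


def reaggregate(counts, crosswalk):
    """Re-sum counts from source areas into target areas.

    crosswalk maps source area -> {target area: share}, where the shares
    for each source area are expected to sum to 1.0.
    """
    totals = defaultdict(float)
    for source, count in counts.items():
        for target, share in crosswalk[source].items():
            totals[target] += count * share
    return dict(totals)


# Hypothetical tract populations and a tract-to-precinct crosswalk.
tract_population = {"tract_1": 4000, "tract_2": 2500}
crosswalk_2013 = {
    "tract_1": {"precinct_a": 0.75, "precinct_b": 0.25},
    "tract_2": {"precinct_b": 1.0},
}
summary = reaggregate(tract_population, crosswalk_2013)
```

When a boundary moves, only the crosswalk is replaced. The tract data and the summary code stay as they are.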


CensusReporter

Sunday, September 22nd, 2013

Easier Census data browsing with CensusReporter by Nathan Yau.

Nathan writes:

Census data can be interesting and super informative, but getting the data out of the dreaded American FactFinder is often a pain, especially if you don’t know the exact table you want. (This is typically the case.) CensusReporter, currently in beta, tries to make the process easier.

Whatever your need for census data, even the beta interface is worth your attention!

Resources for Mapping Census Data

Friday, June 28th, 2013

Resources for Mapping Census Data by Katy Rossiter.

From the post:

Mapping data makes statistics come to life. Viewing statistics spatially can give you a better understanding, help identify patterns, and answer tough questions about our nation. Therefore, the Census Bureau provides maps, including digital files for use in a Geographic Information System (GIS), and interactive mapping capabilities in order to visualize our statistics.

Here are some of the mapping resources available from the Census Bureau:


That listing is just some of the resources that Katy covers in her post.

Combining your data or public data along with census data could result in a commercially successful map.

Why the Obsession with Tables?

Thursday, May 2nd, 2013

Why the Obsession with Tables? by Robert Kosara.

From the post:

Lots of data are still presented and released as tables. But why, when we know that visual representations are so much easier to read and understand? Eric Newburger from the U.S. Census Bureau has an interesting theory.

In a short talk on visualization at the Census Bureau, he describes how in the 1880s, the Census published maps and charts. Many of those are actually amazingly well done, even by today’s standards. But starting with 1890 census, they were replaced with tables.

This, according to Newburger, was due to an important innovation: the Hollerith Tabulating Machine. The new machines were much faster and could slice and dice the data in a lot of new ways, but their output ended up in tables. Throughout the 20th century, the Census created enormous amount of tables, with only a small fraction of the data shown as maps or charts.

Newburger argues that people don’t bother trying to read tables, whereas visualizations are much more likely to catch their attention and get them interested in the underlying data. We clearly have the means to create any visualization we want today, and there is plenty of data available, so why keep publishing tables? It’s a matter of the attitudes towards data, and these can be hard to change after more than 100 years:

Any suggestions for images of the maps and charts from the Census in the 1880s?

If the Hollerith Tabulating Machine is responsible for the default to tables, is it also responsible for spreadsheets?

Quicker for a machine to produce but less useful to an end user.

Mapping the census…

Sunday, February 10th, 2013

Mapping the census: how one man produced a library for all by Simon Rogers.

From the post:

The census is an amazing resource – so full of data it’s hard to know where to begin. And increasingly where to begin is by putting together web-based interactives – like this one on language and this on transport patterns that we produced this month.

But one academic is taking everything back to basics – using some pretty sophisticated techniques. Alex Singleton, a lecturer in geographic information science (GIS) at Liverpool University has used R to create the open atlas project.

Singleton has basically produced a detailed mapping report – as a PDF and vectored images – on every one of the local authorities of England & Wales. He automated the process and has provided the code for readers to correct and do something with. In each report there are 391 pages, each with a map. That means, for the 354 local authorities in England & Wales, he has produced 127,466 maps.

Check out Simon’s post to see why Singleton has undertaken such a task.

Question: Was the 2011 census more “transparent,” or “useful” after Singleton’s work or before?

I would say more “transparent” after Singleton’s work.


U.S. Census Bureau Offers Public API for Data Apps

Monday, July 30th, 2012

U.S. Census Bureau Offers Public API for Data Apps by Nick Kolakowski.

From the post:

For any software developers with an urge to play around with demographic or socio-economic data: the U.S. Census Bureau has launched an API for Web and mobile apps that can slice that statistical information in all sorts of nifty ways.

The API draws data from two sets: the 2010 Census (statistics include population, age, sex, and race) and the 2006-2010 American Community Survey (offers information on education, income, occupation, commuting, and more). In theory, developers could use those datasets to analyze housing prices for a particular neighborhood, or gain insights into a city’s employment cycles.

The APIs include no information that could identify an individual. (emphasis added)

Suppose it should say: “Some assembly required.”

Similar resources are available at Google Public Data Explorer.

I first saw this at: Dashboard Insight.

1940 US Census Indexing Progress Report—May 18, 2012

Sunday, May 20th, 2012

1940 US Census Indexing Progress Report—May 18, 2012

From the post:

We’re finishing our 7th week of indexing and we are a breath away from having 40% of the entire collection indexed. I hear from so many people words of amazement at the things this indexing community has accomplished. In 7 weeks we’ve collectively indexed more than 55 million names. It is truly amazing. With 111,612 indexers now signed up to index and arbitrate, we have a formidable team making some great things happen. Let’s keep up the great work.

It is a popular data set but isn’t the whole story.

What do you think are the major factors that contribute to their success?

1940 Census (U.S.A.)

Tuesday, April 3rd, 2012

1940 Census (U.S.A.)

From the “about” page:

Census records are the only records that describe the entire population of the United States on a particular day. The 1940 census is no different. The answers given to the census takers tell us, in detail, what the United States looked like on April 1, 1940, and what issues were most relevant to Americans after a decade of economic depression.

The 1940 census reflects economic tumult of the Great Depression and President Franklin D. Roosevelt’s New Deal recovery program of the 1930s. Between 1930 and 1940, the population of the Continental United States increased 7.2% to 131,669,275. The territories of Alaska, Puerto Rico, American Samoa, Guam, Hawaii, the Panama Canal, and the American Virgin Islands comprised 2,477,023 people.

Besides name, age, relationship, and occupation, the 1940 census included questions about internal migration; employment status; participation in the New Deal Civilian Conservation Corps (CCC), Works Progress Administration (WPA), and National Youth Administration (NYA) programs; and years of education.

Great for ancestry and demographic studies. What other data would you use with this census information?

MPC – Minnesota Population Center

Tuesday, March 6th, 2012

MPC – Minnesota Population Center

I mentioned the Integrated Public Use Microdata Series (IPUMS-USA) data set last year, which describes itself as:

IPUMS-USA is a project dedicated to collecting and distributing United States census data. Its goals are to:

  • Collect and preserve data and documentation
  • Harmonize data
  • Disseminate the data absolutely free!

Use it for GOOD — never for EVIL

There is also international data and more U.S. data that may be of interest.

Statistics Finland is making further utilisation of statistical data easier

Sunday, January 29th, 2012

Statistics Finland is making further utilisation of statistical data easier

From the post:

Statistics Finland has confirmed new Terms of Use for the utilisation of already published statistical data. In them, Statistics Finland grants a universal, irrevocable right to the use of the data published in its website service and in related free statistical databases. The right extends to use for both commercial and non-commercial purposes. The aim is to make further utilisation of the data easier and thereby increase the exploitation and effectiveness of statistics in society.

At the same time, an open interface has been built to the StatFin database. The StatFin database is a general database built with PC-Axis tools that is free-of-charge and contains a wide array of statistical data on a variety of areas in society. It contains data from some 200 sets of statistics, thousands of tables and hundreds of millions of individual data cells. The contents of the StatFin database have been systematically widened in the past few years and its expansion with various information contents and regional divisions will be continued even further.

Curious if the free commercial re-use of government collected data (paid for by taxpayers) favors established re-sellers of data or startups that will combine existing data in interesting ways. Thoughts?

First seen at Christophe Lalanne’s Bag of Tweets for January 2012.

Opening Up the Domesday Book

Thursday, December 22nd, 2011

Opening Up the Domesday Book by Sam Leon.

From the post:

Domesday Book might be one of the most famous government datasets ever created. Which makes it all the stranger that it’s not freely available online – at the National Archives, you have to pay £2 per page to download copies of the text.

Domesday is pretty much unique. It records the ownership of almost every acre of land in England in 1066 and 1086 – a feat not repeated in modern times. It records almost every household. It records the industrial resources of an entire nation, from castles to mills to oxen.

As an event, held in the traumatic aftermath of the Norman conquest, the Domesday inquest scarred itself deeply into the mindset of the nation – and one historian wrote that on his deathbed, William the Conqueror regretted the violence required to complete it. As a historical dataset, it is invaluable and fascinating.

In my spare time, I’ve been working on making Domesday Book available online at Open Domesday. In this, I’ve been greatly aided by the distinguished Domesday scholar Professor John Palmer, and his geocoded dataset of settlements and people in Domesday, created with AHRC funding in the 1990s.

I guess it really is all a matter of perspective. I have never thought of the Domesday Book as a “government dataset….” 😉

Certainly would make an interesting basis for a chronological topic map tracing the ownership and fate of “…almost every acre of land in England….”