Tracking Congressional Whores

Introducing legis-graph – US Congressional Data With Govtrack and Neo4j by William Lyon.

From the post:

Interactions among members of any large organization are naturally a graph, yet the tools we use to collect and analyze data about these organizations often ignore the graphiness of the data and instead map the data into structures (such as relational databases) that make taking advantage of the relationships in the data much more difficult when it comes time to analyze the data. Collaboration networks are a perfect example. So let’s focus on one of the most powerful collaboration networks in the world: the US Congress.

Introducing legis-graph: US Congress in a graph

The goal of legis-graph is to provide a way of converting the data provided by Govtrack into a rich property-graph model that can be inserted into Neo4j for analysis or integrating with other datasets.

The code and instructions are available in this Github repo. The ETL workflow works like this:

  1. A shell script is used to rsync data from Govtrack for a specific Congress (e.g. the 114th Congress). The Govtrack data is a mix of JSON, CSV, and YAML files. It includes information about legislators, committees, votes, bills, and much more.
  2. To extract the pieces of data we are interested in for legis-graph a series of Python scripts are used to extract and transform the data from different formats into a series of flat CSV files.
  3. The third component is a series of Cypher statements that make use of LOAD CSV to efficiently import this data into Neo4j.
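
To make step 2 concrete, here is a minimal sketch of flattening nested JSON records into a CSV file of the kind LOAD CSV can consume. It is not taken from the legis-graph scripts: the sample records, the field names (`id.bioguide`, `name`, `terms`), and the function names are all assumptions chosen for illustration, and the real Govtrack layout may differ.

```python
import csv
import io
import json

# Hypothetical input: these field names are illustrative assumptions,
# not the confirmed Govtrack schema.
SAMPLE_JSON = """
[
  {"id": {"bioguide": "X000001"},
   "name": {"first": "Ada", "last": "Example"},
   "terms": [{"type": "rep", "state": "MT", "party": "Independent"}]},
  {"id": {"bioguide": "X000002"},
   "name": {"first": "Ben", "last": "Sample"},
   "terms": [{"type": "rep", "state": "OH", "party": "Independent"},
             {"type": "sen", "state": "OH", "party": "Independent"}]}
]
"""

FIELDNAMES = ["bioguide_id", "first_name", "last_name", "state", "party"]


def flatten_legislators(records):
    """Yield one flat dict per legislator, using the last listed term."""
    for rec in records:
        latest = rec["terms"][-1]
        yield {
            "bioguide_id": rec["id"]["bioguide"],
            "first_name": rec["name"]["first"],
            "last_name": rec["name"]["last"],
            "state": latest["state"],
            "party": latest["party"],
        }


def legislators_to_csv(json_text):
    """Parse JSON text and return the flattened rows as CSV text."""
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=FIELDNAMES)
    writer.writeheader()
    for row in flatten_legislators(json.loads(json_text)):
        writer.writerow(row)
    return buf.getvalue()


rows = list(flatten_legislators(json.loads(SAMPLE_JSON)))
csv_text = legislators_to_csv(SAMPLE_JSON)
```

Each nested record becomes one row with a header line, which is the flat shape a LOAD CSV statement would then read in step 3.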

To get started with legis-graph in Neo4j you can follow the instructions here. Alternatively, a Neo4j data store dump is available here for use without having to execute the import scripts. We are currently working to streamline the ETL process so this may change in the future, but any updates will appear in the Github README.

This project began during preparation for a graph database focused Meetup presentation. George Lesica and I wanted to provide an interesting dataset in Neo4j for new users to explore.

Whenever the U.S. Congress is mentioned, I am reminded of Obi-Wan Kenobi’s line about Mos Eisley:

You will never find a more wretched hive of scum and villainy. We must be cautious.

The data model for William’s graph:


As you can see from the diagram above, the datamodel is rich and captures quite a bit of information about the actions of legislators, committees, and bills in Congress. Information about what properties are currently included is available here.

A great starting place that can be extended and enriched.

In terms of the data model, note that “subject” is now the title of a bill. Definitely could use some enrichment there.

Another association worth adding for each bill: “who_benefits.”

If you are really ambitious, try developing information on what individuals or groups make donations to the legislator on an annual basis.

To clear noise out of the data set, drop everyone who doesn’t contribute annually and, among those who do, anyone whose total is less than $5,000. Remember that members of Congress depend on regular infusions of cash, so erratic or one-time donors may get a holiday card, but they are not on the ready-access list.
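
A minimal sketch of that filter, under some loud assumptions: donations arrive as hypothetical `(donor, year, amount)` tuples, "annually" means a donation in every year of the window you pick, and the $5,000 floor is read as applying to each year's total (if you read it as an overall total, change the test accordingly). The donor names and amounts are made up for illustration.

```python
from collections import defaultdict

MIN_ANNUAL_TOTAL = 5_000  # assumed to apply per year, not across all years


def steady_donors(donations, years, min_annual_total=MIN_ANNUAL_TOTAL):
    """Return {donor: {year: total}} for donors who meet the floor every year.

    donations: iterable of (donor, year, amount) tuples (hypothetical layout).
    years: the years a donor must appear in to count as "annual".
    """
    totals = defaultdict(lambda: defaultdict(float))
    for donor, year, amount in donations:
        totals[donor][year] += amount

    return {
        donor: dict(by_year)
        for donor, by_year in totals.items()
        # A missing year counts as zero, so one-time donors drop out here.
        if all(by_year.get(year, 0) >= min_annual_total for year in years)
    }


# Made-up sample data for illustration only.
donations = [
    ("Acme PAC", 2013, 3000.0),
    ("Acme PAC", 2013, 2500.0),
    ("Acme PAC", 2014, 6000.0),
    ("Acme PAC", 2015, 5000.0),
    ("One-Time Donor", 2014, 50000.0),
    ("Small Steady", 2013, 100.0),
    ("Small Steady", 2014, 100.0),
    ("Small Steady", 2015, 100.0),
]

result = steady_donors(donations, years=[2013, 2014, 2015])
```

Here only the donor who clears the floor in all three years survives; the large one-time donor and the small steady donor are both dropped as noise.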

The need for annual cash is one reason why episodic movements may make the news but rarely make a difference. Making a difference requires years of steady funding and grooming of members of Congress, which improves both your access and your influence.

Don’t be disappointed if you can “prove” member of Congress X is in the pocket of Y or Z organization/corporation and nothing happens. More likely than not, such proof will increase their credibility for fundraising.

As Leonard Cohen says (Everybody Knows):

Everybody knows the dice are loaded, Everybody rolls with their fingers crossed
