Archive for the ‘Ggplot2’ Category

Predicting Police Cellphone Locations – Weaponizing Open Data

Wednesday, February 8th, 2017

Predicting And Mapping Arrest Types in San Francisco with LightGBM, R, ggplot2 by Max Woolf.

Max does a great job of using open data from SF OpenData to predict arrest types in San Francisco.

It takes only a small step to realize that Max is also predicting the locations of police officers and their cellphones.

Without police officers, you aren’t going to have many arrests. 😉

Anyone operating a cellphone surveillance device can use Max’s predictions to gather data from police cellphones and other electronic gear, whether for particular police officers, for particular types of arrests, or at particular times of day.

From the post:

The new hotness in the world of data science is neural networks, which form the basis of deep learning. But while everyone is obsessing about neural networks and how deep learning is magic and can solve any problem if you just stack enough layers, there have been many recent developments in the relatively nonmagical world of machine learning with boring CPUs.

Years before neural networks were the Swiss army knife of data science, there were gradient-boosted machines/gradient-boosted trees. GBMs/GBTs are machine learning methods which are effective on many types of data, and do not require the traditional model assumptions of linear/logistic regression models. Wikipedia has a good article on the advantages of decision tree learning, and visual diagrams of the architecture:

GBMs, as implemented in the Python package scikit-learn, are extremely popular in Kaggle machine learning competitions. But scikit-learn is relatively old, and new technologies have emerged which implement GBMs/GBTs on large datasets with massive parallelization and in-memory computation. A popular big data machine learning library, H2O, has a famous GBM implementation which, per benchmarks, is over 10x faster than scikit-learn and is optimized for datasets with millions of records. But even faster than H2O is xgboost, which can hit 5x-10x speed-ups relative to H2O, depending on the dataset size.

Enter LightGBM, a new (October 2016) open-source machine learning framework by Microsoft which, per benchmarks on release, was up to 4x faster than xgboost! (xgboost very recently implemented a technique also used in LightGBM, which reduced the relative speedup to just ~2x). As a result, LightGBM allows for very efficient model building on large datasets without requiring cloud computing or nVidia CUDA GPUs.

A year ago, I wrote an analysis of the types of police arrests in San Francisco, using data from the SF OpenData initiative, with a followup article analyzing the locations of these arrests. Months later, the same source dataset was used for a Kaggle competition. Why not give the dataset another look and test LightGBM out?

Cellphone data gathered as a result of Max’s predictions can be tested against arrest and other police records to establish the presence and/or absence of particular police officers at a crime scene.

After a police officer corroborates the presence of a gun in a suspect’s hand, cellphone evidence they were blocks away, in the presence of other police officers, could prove to be inconvenient.

How to weigh a dog with a ruler? [Or Price a US Representative?]

Wednesday, December 14th, 2016

How to weigh a dog with a ruler? (looking for translators)

From the post:

We are working on a series of comic books that introduce statistical thinking and could be used as activity booklets in primary schools. Stories are built around adventures of siblings: Beta (skilled mathematician) and Bit (data hacker).

What is the connection between these comic books and R? All plots are created with ggplot2.

The first story (How to weigh a dog with a ruler?) is translated to English, Polish and Czech. If you would like to help us to translate this story to your native language, just write to me (przemyslaw.biecek at gmail) or create an issue on GitHub. It’s just 8 pages long, translations are available on Creative Commons BY-ND licence.

The key is to chart animals by their height as against their weight.
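As a minimal sketch of that idea (not the comic's own code), the snippet below charts weight against height and fits a straight line, so that a height measurement alone yields a weight estimate. It uses R's built-in `women` data set as a stand-in, since the comic's animal measurements are not reproduced here; the 65-inch query value is likewise just an illustration.

```r
library(ggplot2)

# Stand-in data: R's built-in `women` data set
# (height in inches, weight in pounds).
fit <- lm(weight ~ height, data = women)

# Scatter plot of weight against height with the fitted line overlaid.
p <- ggplot(women, aes(x = height, y = weight)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# "Weighing with a ruler": estimate weight from a measured height alone.
predicted <- predict(fit, newdata = data.frame(height = 65))
```

Printing `p` draws the chart; `predicted` holds the weight estimate read off the fitted line.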

Pricing US Representatives is likely to follow a similar relationship where their price goes up by years of service in Congress.

I haven’t run the data but such a chart would keep “people” (includes corporations in the US) from paying too much or offering too little. To the embarrassment of all concerned.

ggplot2 cheatsheet updated – other R spreadsheets

Wednesday, November 2nd, 2016

RStudio Cheat Sheets

I saw a tweet that the ggplot2 cheatsheet has been updated.

Here’s a list of all the cheatsheets available at RStudio:

  • R Markdown Cheat Sheet
  • RStudio IDE Cheat Sheet
  • Shiny Cheat Sheet
  • Data Visualization Cheat Sheet
  • Package Development Cheat Sheet
  • Data Wrangling Cheat Sheet
  • R Markdown Reference Guide

Contributed Cheatsheets

  • Base R
  • Advanced R
  • Regular Expressions
  • How big is your graph? (base R graphics)

I have deliberately omitted links because, when cheat sheets are updated, the links will break and/or point you to outdated information.

Use and reference the RStudio Cheat Sheets page.


ggplot2 2.2.0 coming soon! [Testers Needed!]

Friday, September 30th, 2016

ggplot2 2.2.0 coming soon! by Hadley Wickham.

From the post:

I’m planning to release ggplot2 2.2.0 in early November. In preparation, I’d like to announce that a release candidate is now available: version Please try it out, and file an issue on GitHub if you discover any problems. I hope we can find and fix any major issues before the official release.

Install the pre-release version with:

# install.packages("devtools")

If you discover a major bug that breaks your plots, please file a minimal reprex, and then roll back to the released version with:


ggplot2 2.2.0 will be a relatively major release including:

The majority of this work was carried out by Thomas Pedersen, who I was lucky to have as my “ggplot2 intern” this summer. Make sure to check out other visualisation packages: ggraph, ggforce, and tweenr.

Just in case you are casual about time, tomorrow is October 1st. Which on most calendars means that “early November” isn’t far off.

Here’s an easy opportunity to test ggplot2 2.2.0 and related visualization packages before the official release.


ggplot2 – Elegant Graphics for Data Analysis – At Last Call

Thursday, June 9th, 2016

ggplot2 – Elegant Graphics for Data Analysis by Hadley Wickham.

Hadley tweeted today that “ggplot2” is still up but will be removed after publication.

If you want/need a digital copy, now would be a good time to acquire one.

How I build up a ggplot2 figure [Class Response To ggplot2 criticism]

Friday, February 19th, 2016

How I build up a ggplot2 figure by John Muschelli.

From the post:

Recently, Jeff Leek at Simply Statistics discussed why he does not use ggplot2. He notes “The bottom line is for production graphics, any system requires work.” and describes a default plot that needs some work:

John responds to perceived issues with using ggplot2 by walking through each issue and providing you with examples of how to solve it.

That doesn’t mean that you will switch to ggplot2, but it does mean you will be better informed of your options.

An example to be copied!

Ggplot2 Quickref

Wednesday, January 6th, 2016

Ggplot2 Quickref by Selva Prabhakaran.

If you use ggplot2, map this to a “hot” key on your keyboard.


Fun with facets in ggplot2 2.0

Saturday, December 26th, 2015

Fun with facets in ggplot2 2.0 by Bob Rudis.

From the post:

ggplot2 2.0 provides new facet labeling options that make it possible to create beautiful small multiple plots or panel charts without resorting to icky grob manipulation.

Very appropriate for this year in Georgia (US) at any rate. Facets are used to display temperature by year and temperature versus Kwh by year.
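For a sense of what those labelling options look like, here is a minimal sketch using the `labeller` argument of `facet_wrap()`. It uses the built-in `mtcars` data rather than the temperature data from Bob's post, and the custom label text is my own choice for illustration.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# label_both prints "variable: value" in each facet strip.
p_both <- base + facet_wrap(~ cyl, labeller = label_both)

# as_labeller() turns a named lookup vector into a labeller,
# which allows arbitrary strip text per facet value.
cyl_names <- c(
  `4` = "Four cylinders",
  `6` = "Six cylinders",
  `8` = "Eight cylinders"
)
p_custom <- base + facet_wrap(~ cyl, labeller = as_labeller(cyl_names))
```

Both plots are small multiples with one panel per value of `cyl`; only the strip labels differ.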

The high today, 26th of December, 2015, is projected to be 77°F.

Sigh, that’s just not December weather.

ggplot2 2.0.0

Monday, December 21st, 2015

ggplot2 2.0.0 by Hadley Wickham.

From the post:

I’m very pleased to announce the release of ggplot2 2.0.0. I know I promised that there wouldn’t be any more updates, but while working on the 2nd edition of the ggplot2 book, I just couldn’t stop myself from fixing some long standing problems.

On the scale of ggplot2 releases, this one is huge with over one hundred fixes and improvements. This might break some of your existing code (although I’ve tried to minimise breakage as much as possible), but I hope the new features make up for any short term hassle. This blog post documents the most important changes:

  • ggplot2 now has an official extension mechanism.
  • There are a handful of new geoms, and updates to existing geoms.
  • The default appearance has been thoroughly tweaked so most plots should look better.
  • Facets have a much richer set of labelling options.
  • The documentation has been overhauled to be more helpful, and require less integration across multiple pages.
  • A number of older and less used features have been deprecated.

These are described in more detail below. See the release notes for a complete list of all changes.
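To give a flavour of the extension mechanism, here is a minimal sketch modelled on the convex hull example from the "Extending ggplot2" vignette. The names `StatChull` and `stat_chull` come from that example and are not part of ggplot2 itself.

```r
library(ggplot2)

# A new stat defined with ggproto: keep only the points on the convex hull.
StatChull <- ggproto("StatChull", Stat,
  compute_group = function(data, scales) {
    data[chull(data$x, data$y), , drop = FALSE]
  },
  required_aes = c("x", "y")
)

# Layer constructor that wires the new stat into a plot.
stat_chull <- function(mapping = NULL, data = NULL, geom = "polygon",
                       position = "identity", na.rm = FALSE,
                       show.legend = NA, inherit.aes = TRUE, ...) {
  layer(
    stat = StatChull, data = data, mapping = mapping, geom = geom,
    position = position, show.legend = show.legend,
    inherit.aes = inherit.aes, params = list(na.rm = na.rm, ...)
  )
}

# Usage: outline the point cloud with its convex hull.
p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  stat_chull(fill = NA, colour = "black")
```

The stat only decides which rows survive; drawing is delegated to the existing polygon geom, which is what keeps extensions like this short.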

It’s one thing to find an error in the statistics of a research paper.

It is quite another to visualize the error in a captivating way.

No guarantees for some random error but ggplot2 2.0.0 is one of the right tools for such a job.

Visualizing Chess Data With ggplot

Monday, November 2nd, 2015

Visualizing Chess Data With ggplot by Joshua Kunst.

Sales of traditional chess sets peak during the holiday season. The following graphic does not include sales of chess gadgets, chess software, or chess computers:


(Source: Terapeak Trends: Which Tabletop Games Sell Best on eBay? by Aron Hsiao.)

Joshua’s post is a guide to using and visualizing chess data under the following topics:

  1. The Data
  2. Piece Movements
  3. Survival rates
  4. Square usage by player
  5. Distributions for the first movement
  6. Who captures whom

Joshua is using public chess data but it’s just a short step to using data from your own chess games or those of friends from your local chess club. 😉

Visualize the play of openings, defenses, players + openings/defenses, you are limited only by your imagination.

Give a chess friend a visualization they can’t buy in any store!

PS: Check out: rchess a Chess Package for R also by Joshua Kunst.

I first saw this in a tweet by Christophe Lalanne.

Visualizing ggplot2 internals…

Tuesday, July 15th, 2014

Visualizing ggplot2 internals with shiny and D3 by Carson Sievert.

From the post:

As I started this project, I became frustrated trying to understand/navigate through the nested list-like structure of ggplot objects. As you can imagine, it isn’t an optimal approach to print out the structure every time you want to check out a particular element. Out of this frustration came an idea to build this tool to help interact with and visualize this structure. Thankfully, my wonderful GSoC mentor Toby Dylan Hocking agreed that this project could bring value to the ggplot2 community and encouraged me to pursue it.

By default, this tool presents a radial Reingold–Tilford Tree of this nested list structure, but also has options to use the collapsible or cartesian versions. It also leverages the shinyAce package which allows users to send arbitrary ggplot2 code to a shiny server that evaluates the results and re-renders the visuals. I’m quite happy with the results as I think this tool is a great way to quickly grasp the internal building blocks of ggplot(s). Please share your thoughts below!
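For comparison, this is roughly what poking at that nested structure by hand looks like. It is a minimal sketch, assuming `$` access to plot components such as `layers` works in your installed version of ggplot2 (the internal representation has changed across releases).

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Print only the top level of the plot object's structure.
str(p, max.level = 1)

# Drill into one element: the layers added so far.
n_layers <- length(p$layers)

# ggplot_build() computes what will actually be drawn; its data element
# holds one data frame per layer.
built <- ggplot_build(p)
layer1 <- built$data[[1]]
head(layer1[, c("x", "y")])
```

Even for a one-layer scatter plot the printout is long, which is the frustration the tool is meant to relieve.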

I started with the blog post about the visualization but seeing the visualization is more powerful:

Visualizing ggplot2 internals (demo)

I rather like the radial layout.

For either topic map design or analysis, this looks like a good technique to explore the properties we assign to subjects.

ggplot2 Choropleth of Supreme Court Decisions: A Tutorial

Saturday, July 13th, 2013

ggplot2 Choropleth of Supreme Court Decisions: A Tutorial

From the post:

I don't do much GIS but I like to. It's rather enjoyable and involves a tremendous skill set. Often you will find yourself grabbing data sets from some site, scraping, data cleaning and reshaping, and graphing. On the ride home from work yesterday I heard an NPR talk about the Supreme Court decisions being very close with this court. This got me wondering if there is a database with this information and the journey began. This tutorial is purely exploratory but you will learn to:

  1. Grab .zip files from a database and read into R
  2. Clean data
  3. Reshape data with reshape2
  4. Merge data sets
  5. Plot a choropleth map in ggplot2
  6. Arrange several grid plots with gridExtra

I'm lazy and like a good challenge. I challenged myself to not manually open a file so I downloaded Biobase from bioconductor to open the pdf files for the codebook. Also I used my own package qdap because it had some functions I like and I'm used to using them. This blog post was created in the dev. version of the reports package using the wordpress_rmd template.

Good R practice and an interesting view of Supreme Court cases.

Why Do the New Orleans Saints Lose?…

Thursday, December 27th, 2012

Why Do the New Orleans Saints Lose? Data Visualization II by Nathan Lemoine.

I’m not a nationalist, apparatchik, school, state, profession, class, religion, language or development approach booster.

I must confess, however, I am a New Orleans Saints fan. Diversity, read other teams, are a necessary evil to give the Saints someone to beat. 😉

An exercise you can repeat/expand with other teams (shudder), in other sports (shudder, shudder), to explore R and visualization of data.

What other stats/information would you want to incorporate/visualize?


Friday, July 13th, 2012


From the webpage:

ggplot2 is a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics.

I have a few posts about ggplot2 but this site is the mother ship of information on it. Use other resources as necessary but this looks like the canonical source. (Plus you can download a local copy for your laptop. For the odd occasion when you are off net.)

ggplot posixct cheat sheet

Monday, March 19th, 2012

ggplot posixct cheat sheet

From the brain of Mat Kelcey, a cheatsheet of common plots using ggplot.

Any common plots in your toolbox?

Slides and replay for “A backstage tour of ggplot2”

Friday, February 10th, 2012

Slides and replay for “A backstage tour of ggplot2”

From the post:

Many thanks to Hadley Wickham for his informative and entertaining webinar yesterday, “A backstage tour of ggplot2”. Thanks also to everyone who submitted questions — with more than 800 attendees live on the line we had many more questions than we had time to answer.

Pointers to lots of goodies and video of the presentation!

Great Maps with ggplot2

Friday, February 3rd, 2012

Great Maps with ggplot2

I have mentioned ggplot2 before but this item caught my eye because of its skillful use with a map of cycle tours of London.

Not that I intend to take a cycle tour of London any time soon but it occurs to me that creating maps to restaurants, entertainment, etc., from conference sites would be a good use of it. Coupled with a topic map, as the conference progresses, reviews/tweets about those locations could become available to other participants.

Other geographic locations/information could be plotted as well.

A Backstage Tour of ggplot2 with Hadley Wickham

Thursday, January 12th, 2012

A Backstage Tour of ggplot2 with Hadley Wickham


Date: Wednesday, February 8, 2012
Time: 11:00AM – 12:00PM Pacific Time
Presenter: Hadley Wickham, Professor of Statistics, Rice University

From the webpage:

GGplot2 is one of R’s most popular, widely used packages, developed by Rice University’s Hadley Wickham. Ggplot2’s exploratory graphics capabilities are driving the use of R as a complement to legacy analytics tools such as SAS. SAS is well-regarded for its strength in data management and “production” statistics, where you know what you want to do and need to do it repeatedly. On the other hand, R is strong in data analysis and exploration in situations where figuring out what is needed is the biggest challenge. In this important way, SAS and R are strong companions.

This webinar will provide an all-access pass to Hadley’s latest work. He’ll discuss:

  • A brief overview of ggplot2, and how it’s different to other plotting systems
  • A sneak peek at some of the new features coming to the next version of ggplot2
  • What’s been learned about good development practices in the 5 years since first starting to develop ggplot
  • Some of the internals of ggplot2, and talk about how he is gradually making it easier for others to contribute

Join this webinar to understand how ggplot2 adds valuable, unmatched capabilities to your analytics toolbox.