Archive for the ‘R’ Category

A Docker tutorial for reproducible research [Reproducible Reporting In The Future?]

Wednesday, November 15th, 2017

R Docker tutorial: A Docker tutorial for reproducible research.

From the webpage:

This is an introduction to Docker designed for participants with knowledge about R and RStudio. The introduction is intended to help people who need Docker for a project. We first explain what Docker is and why it is useful. Then we go into the details of how to use it for a reproducible, transportable project.

Six lessons, instructions for installing Docker, plus zip/tar ball of the materials. What more could you want?
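The lessons center on writing a Dockerfile for an R project. A minimal sketch of the shape such a file takes (the rocker/rstudio base image is real; the package names, file names and commands below are illustrative assumptions, not the tutorial's own):

```dockerfile
# Start from a rocker image that bundles R and RStudio Server
FROM rocker/rstudio:latest

# Install the packages the analysis depends on
RUN R -e "install.packages(c('dplyr', 'ggplot2'))"

# Copy the project into the image so the analysis ships with its environment
COPY . /home/rstudio/project
WORKDIR /home/rstudio/project

# Run the analysis when the container starts
CMD ["R", "-e", "source('analysis.R')"]
```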

Science has paid lip service to the idea of replicating results for centuries, but with the sharing of data and analysis, reproducible research is becoming a reality.

Is reproducible reporting in the near future? Reporters preparing their analysis and releasing raw data and their extraction methods?

Or will selective releases of data, when raw data is released at all, continue to be the norm?

Please let @ICIJorg know how you feel about data hoarding (#ParadisePapers, #PanamaPapers) now that data and code sharing are becoming the norm in science.

Data Munging with R (MEAP)

Monday, November 6th, 2017

Data Munging with R (MEAP) by Dr. Jonathan Carroll.

From the description:

Data Munging with R shows you how to take raw data and transform it for use in computations, tables, graphs, and more. Whether you already have some programming experience or you’re just a spreadsheet whiz looking for a more powerful data manipulation tool, this book will help you get started. You’ll discover the ins and outs of using the data-oriented R programming language and its many task-specific packages. With dozens of practical examples to follow, learn to fill in missing values, make predictions, and visualize data as graphs. By the time you’re done, you’ll be a master munger, with a robust, reproducible workflow and the skills to use data to strengthen your conclusions!
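One of the tasks the description mentions, filling in missing values, takes only a few lines in base R. A minimal sketch with invented data (not an example from the book):

```r
# Fill in missing values: replace each NA with the mean of the
# observed values. The data frame is invented for illustration.
heights <- data.frame(
  name = c("Ann", "Ben", "Cal", "Dee"),
  cm   = c(162, NA, 178, NA)
)

heights$cm[is.na(heights$cm)] <- mean(heights$cm, na.rm = TRUE)

heights$cm  # 162, 170, 178, 170
```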

Five (5) of eleven (11) parts are available now under the Manning Early Access Program (MEAP). Chapter one, “Introducing Data and the R Language,” is free.

Even though everyone writes books from front to back (or at least claims to), it would be nice to see a free “advanced” chapter every now and again. There’s not much you can say about an introductory chapter other than that it’s an introductory chapter. That’s no different here.

I suspect you will get a better idea about Dr. Carroll’s writing from his blog, Irregularly Scheduled Programming or by following him on Twitter: @carroll_jono.

A cRyptic crossword with an R twist

Friday, October 13th, 2017

A cRyptic crossword with an R twist

From the post:

Last week’s R-themed crossword from R-Ladies DC was popular, so here’s another R-related crossword, this time by Barry Rowlingson and published on page 39 of the June 2003 issue of R-news (now known as the R Journal). Unlike the last crossword, this one follows the conventions of a British cryptic crossword: the grid is symmetrical, and eschews 4×4 blocks of white or black squares. Most importantly, the clues are in the cryptic style: rather than being a direct definition, cryptic clues pair wordplay (homonyms, anagrams, etc.) with a hidden definition. (Wikipedia has a good introduction to the types of clues you’re likely to find.) Cryptic crosswords can be frustrating for the uninitiated, but are fun and rewarding once you get into it.

In fact, if you’re unfamiliar with cryptic crosswords, this one is a great place to start. Not only are many (but not all) of the answers related in some way to R, Barry has helpfully provided the answers along with an explanation of how the cryptic clue was formed. There’s no shame in peeking, at least for a few, to help you get your legs with the cryptic style.

Another R crossword for your weekend enjoyment!


A cRossword about R [Alternative to the NYTimes Sunday Crossword Puzzle]

Friday, October 6th, 2017

A cRossword about R by David Smith.

From the post:

The members of the R Ladies DC user group put together an R-themed crossword for a recent networking event. It’s a fun way to test out your R knowledge. (Click to enlarge, or download a printable version here.)

Maybe not a complete alternative to the NYTimes Sunday Crossword Puzzle but R enthusiasts will enjoy it.

I suspect the exercise of writing a crossword puzzle is a greater learning experience than solving it.


Exploratory Data Analysis of Tropical Storms in R

Tuesday, September 26th, 2017

Exploratory Data Analysis of Tropical Storms in R by Scott Stoltzman.

From the post:

The disastrous impact of the recent hurricanes Harvey and Irma generated a large influx of data within the online community. I was curious about the history of hurricanes and tropical storms, so I found a data set and started some basic exploratory data analysis (EDA).

EDA is crucial to starting any project. Through EDA you can start to identify errors & inconsistencies in your data, find interesting patterns, see correlations and start to develop hypotheses to test. For most people, basic spreadsheets and charts are handy and provide a great place to start. They are an easy-to-use method to manipulate and visualize your data quickly. Data scientists may cringe at the idea of using a graphical user interface (GUI) to kick off the EDA process, but those tools are very effective and efficient when used properly. However, if you’re reading this, you’re probably trying to take EDA to the next level. The best way to learn is to get your hands dirty, so let’s get started.

The original source of the data can be found at

Great walk through on exploratory data analysis.
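A generic first EDA pass of the kind Stoltzman describes (structure, missingness, summaries) can be sketched in base R; the columns below are invented, not his storm data:

```r
# A first-pass EDA checklist: structure, missing values, summaries.
storms <- data.frame(
  year = c(1990, 1990, 1991, 1992),
  wind = c(85, NA, 120, 60)
)

str(storms)                           # types and dimensions
colSums(is.na(storms))                # missing values per column
summary(storms$wind)                  # five-number summary plus NA count
aggregate(wind ~ year, storms, mean)  # mean wind speed by year (NAs dropped)
```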

Everyone talks about the weather but did you know there is a forty (40) year climate lag between cause and effect?

The human impact on the environment today won’t be felt for another forty (40) years.

Care to predict the impact of a hurricane in 2057?

Some other data/analysis resources on hurricanes: Climate Prediction Center, Hurricane Forecast Computer Models, National Hurricane Center.

PS: Is a Category 6 Hurricane Possible? by Brian Donegan is an interesting discussion on going beyond category 5 for hurricanes. For reference on speeds, see: Fujita Scale (tornadoes).


RStartHere

Monday, September 18th, 2017

RStartHere by Garrett Grolemund.

R packages organized by their role in data science:

This is very cool! Use and share!

@rstudio Cheatsheets Now B&W Printer Friendly

Saturday, September 9th, 2017

Mara Averick, @dataandme, tweets:

All the @rstudio Cheatsheets have been B&W printer-friendlier-ized

It’s a small thing, but it’s appreciated when documentation is B&W friendly.

PS: The @rstudio cheatsheets are also good examples of layout and clarity.

.Rddj (data journalism with R)

Wednesday, June 21st, 2017

.Rddj: Hand-curated, high-quality resources for doing data journalism with R by Timo Grossenbacher.

From the webpage:

The R Project is a great software environment for doing all sorts of data-driven journalism. It can be used for any of the stages of a typical data project: data collection, cleaning, analysis and even (interactive) visualization. And it’s all reproducible and transparent! Sure, it requires a fair amount of scripting, yet…

Do not fear! With this hand-curated (and opinionated) list of resources, you will be guided through the thick jungle of countless R packages, from learning the basics of R’s syntax, to scraping HTML tables, to a guide on how to make your work comprehensible and reproducible.

Now, enjoy your journey.

Some more efforts at persuasion: As I work in the media, I know how a lot of journalists are turned off by everything that doesn’t have a graphical interface with buttons to click on. However, you don’t need to spend days studying programming concepts in order to get started with R, as most operations can be carried out without applying scary things such as loops or conditionals – and, nowadays, high-level abstractions like dplyr make working with data a breeze. My advice if you’re new to data journalism or data processing in general: Better learn R than Excel, ’cause getting to know Excel (and the countless other tools that each do a single thing) doesn’t come for free, either.

This list is (partially) inspired by R for Journalists by Ed Borasky, which is another great resource for getting to know R.

… (emphasis in original)
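The dplyr abstractions Grossenbacher mentions replace loops and conditionals with a handful of verbs. A small sketch with invented data (assumes the dplyr package is installed):

```r
library(dplyr)  # high-level verbs: filter, group_by, summarise, ...

# Invented example data standing in for a typical newsroom table
votes <- data.frame(
  district = c("A", "A", "B", "B"),
  turnout  = c(0.61, 0.58, 0.44, 0.49)
)

# Mean turnout per district, no loops or conditionals required
votes %>%
  group_by(district) %>%
  summarise(mean_turnout = mean(turnout))
```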

The topics are familiar:

  • RStudio
  • Syntax and basic R programming
  • Collecting Data (from the Web)
  • Data cleaning and manipulation
  • Text mining / natural language processing
  • Exploratory data analysis and plotting
  • Interactive data visualization
  • Publication-quality graphics
  • Reproducibility
  • Examples of using R in (data) journalism
    What makes this list of resources different from search results?

    Hand curation.

    How much of a difference?

    Compare the search results of “R” + any of these categories to the resources here.

    Bookmark .Rddj for data journalism and R, then ping me with the hand-curated list of resources you are creating.

    Save yourself and the rest of us from search. Thanks!

    Copy-n-Paste Security Alert!

    Wednesday, June 7th, 2017

    Security: The Dangers Of Copying And Pasting R Code.

    From the post:

    Most of the time when we stumble across a code snippet online, we blindly copy and paste it into the R console. I suspect almost everyone does this. After all, what’s the harm?

    The post illustrates how innocent appearing R code can conceal unhappy surprises!

    Concealment isn’t limited to R code.

    Any CSS controlled display is capable of concealing code for you to copy-n-paste into a console, terminal window, script or program.

    Endless possibilities for HTML pages/emails with code + a “little something extra.”

    What are your copy-n-paste practices?

    Network analysis of Game of Thrones family ties [A Timeless Network?]

    Monday, May 15th, 2017

    Network analysis of Game of Thrones family ties by Shirin Glander.

    From the post:

    In this post, I am exploring network analysis techniques in a family network of major characters from Game of Thrones.

    Not surprisingly, we learn that House Stark (specifically Ned and Sansa) and House Lannister (especially Tyrion) are the most important family connections in Game of Thrones; they also connect many of the storylines and are central parts of the narrative.

    The basis for this network is Kaggle’s Game of Thrones dataset (character-deaths.csv). Because most family relationships were missing in that dataset, I added the missing information in part by hand (based on A Wiki of Ice and Fire) and by scraping information from the Game of Thrones wiki. You can find the full code for how I generated the network on my Github page.

    Glander improves network data for the Game of Thrones and walks you through the use of R to analyze that network.

    It’s useful work and will repay close study.

    Network analysis can be used with all social groups: activists, bankers, hackers, members of Congress (U.S.), terrorists, etc.

    But just as Ned Stark has no relationship with dire wolves when the story begins, networks of social groups develop, change, evolve if you will, over time.

    Moreover, events, interactions, involving one or more members of the network, occur in time sequence. A social network that fails to capture those events and their sequencing, from one or more points of view, is a highly constrained network.

    A useful network, as Glander demonstrates, but one that cannot answer simple questions about the order in which characters learned that a particular character hurled another character from a very high window.

    If I were investigating say a leak of NSA cybertools, time sequencing like that would be one of my top priorities.
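For readers who want to try the technique themselves, a toy version of such a network analysis might look like the following; the edges are invented, not Glander's data, and the igraph package is assumed to be installed:

```r
library(igraph)

# A toy family-tie network (edges invented for illustration)
ties <- data.frame(
  from = c("Ned", "Ned", "Tyrion", "Tyrion"),
  to   = c("Sansa", "Robb", "Sansa", "Jaime")
)
g <- graph_from_data_frame(ties, directed = FALSE)

sort(degree(g), decreasing = TRUE)       # who has the most ties
sort(betweenness(g), decreasing = TRUE)  # who bridges the storylines
```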


    R Weekly – Update

    Friday, February 24th, 2017

    R Weekly

    A community-based aggregation resource on R.

    Seventy-two (72) links plus R project updates in R Weekly 2017 Issue 8.

    Great way to stay up on R resources and get a sense for the R community.


    PS: The first post of R Weekly that I reviewed had 6 links. R Weekly [Another Word for It post]

    “Tidying” Up Jane Austen (R)

    Thursday, February 16th, 2017

    Text Mining the Tidy Way by Julia Silge.

    Thanks to Julia’s presentation I now know there is an R package with all of Jane Austen’s novels ready for text analysis.

    OK, Austen may not be at the top of your reading list, but the Tidy techniques Julia demonstrates are applicable to a wide range of textual data.

    Among those mentioned in the presentation, NASA datasets!

    Julia, along with Dave Robinson, wrote: Text Mining with R: A Tidy Approach, available online now and later this year from O’Reilly.
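The tidy approach Julia demonstrates reduces to a pipeline of ordinary data-frame verbs. A sketch (assumes the janeaustenr, tidytext and dplyr packages are installed):

```r
library(janeaustenr)  # all six Austen novels as a data frame
library(tidytext)     # unnest_tokens() for one-word-per-row "tidy" text
library(dplyr)

# Most frequent meaningful words across the novels
austen_books() %>%
  unnest_tokens(word, text) %>%            # one word per row
  anti_join(stop_words, by = "word") %>%   # drop "the", "and", ...
  count(word, sort = TRUE) %>%
  head()
```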

    Predicting Police Cellphone Locations – Weaponizing Open Data

    Wednesday, February 8th, 2017

    Predicting And Mapping Arrest Types in San Francisco with LightGBM, R, ggplot2 by Max Woolf.

    Max does a great job of using open source data SF OpenData to predict arrest types in San Francisco.

    It takes only a small step to realize that Max is also predicting the locations of police officers and their cellphones.

    Without police officers, you aren’t going to have many arrests. 😉

    Anyone operating a cellphone surveillance device can use Max’s predictions to gather data from police cellphones and other electronic gear. For particular police officers, for particular types of arrests, or at particular times of day, etc.

    From the post:

    The new hotness in the world of data science is neural networks, which form the basis of deep learning. But while everyone is obsessing about neural networks and how deep learning is magic and can solve any problem if you just stack enough layers, there have been many recent developments in the relatively nonmagical world of machine learning with boring CPUs.

    Years before neural networks were the Swiss army knife of data science, there were gradient-boosted machines/gradient-boosted trees. GBMs/GBTs are machine learning methods which are effective on many types of data, and do not require the traditional model assumptions of linear/logistic regression models. Wikipedia has a good article on the advantages of decision tree learning, and visual diagrams of the architecture:

    GBMs, as implemented in the Python package scikit-learn, are extremely popular in Kaggle machine learning competitions. But scikit-learn is relatively old, and new technologies have emerged which implement GBMs/GBTs on large datasets with massive parallelization and in-memory computation. A popular big data machine learning library, H2O, has a famous GBM implementation which, per benchmarks, is over 10x faster than scikit-learn and is optimized for datasets with millions of records. But even faster than H2O is xgboost, which can hit 5x-10x speed-ups relative to H2O, depending on the dataset size.

    Enter LightGBM, a new (October 2016) open-source machine learning framework by Microsoft which, per benchmarks on release, was up to 4x faster than xgboost! (xgboost very recently implemented a technique also used in LightGBM, which reduced the relative speedup to just ~2x). As a result, LightGBM allows for very efficient model building on large datasets without requiring cloud computing or nVidia CUDA GPUs.

    A year ago, I wrote an analysis of the types of police arrests in San Francisco, using data from the SF OpenData initiative, with a followup article analyzing the locations of these arrests. Months later, the same source dataset was used for a Kaggle competition. Why not give the dataset another look and test LightGBM out?
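The gradient-boosting idea behind GBMs, xgboost and LightGBM can be illustrated in a few lines of base R: fit a weak learner to the residuals, shrink it, repeat. This toy sketch is mine, not Woolf's LightGBM code:

```r
# Toy gradient boosting for squared error: repeatedly fit a depth-1
# "stump" to the current residuals and add it, shrunk by a learning
# rate, to the ensemble. Illustrative only.
set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.1)

# A stump: the split on x that best reduces squared error
fit_stump <- function(x, r) {
  best <- NULL
  best_sse <- Inf
  for (s in quantile(x, probs = seq(0.05, 0.95, by = 0.05))) {
    left  <- r[x <= s]
    right <- r[x > s]
    sse <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
    if (sse < best_sse) {
      best_sse <- sse
      best <- list(split = s, left = mean(left), right = mean(right))
    }
  }
  best
}
predict_stump <- function(m, x) ifelse(x <= m$split, m$left, m$right)

# Boosting loop: 100 stumps, learning rate 0.3
pred <- rep(0, length(y))
for (i in 1:100) {
  m <- fit_stump(x, y - pred)
  pred <- pred + 0.3 * predict_stump(m, x)
}
mean((y - pred)^2)  # training error falls well below var(y)
```

Libraries like LightGBM add histogram tricks, regularization and parallelism, but the accumulate-shrunken-learners loop is the same.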

    Cellphone data gathered as a result of Max’s predictions can be tested against arrest and other police records to establish the presence and/or absence of particular police officers at a crime scene.

    After a police officer testifies to the presence of a gun in a suspect’s hand, cellphone evidence that they were blocks away, in the presence of other police officers, could prove to be inconvenient.

    A Data Driven Exploration of Kung Fu Films

    Tuesday, January 24th, 2017

    A Data Driven Exploration of Kung Fu Films by Jim Vallandingham.

    From the post:

    Recently, I’ve been a bit caught up in old Kung Fu movies. Shorting any technical explorations, I have instead been diving head-first into any and all Netflix accessible martial arts masterpieces from the 70’s and 80’s.

    While I’ve definitely been enjoying the films, I realized recently that I had little context for the movies I was watching. I wondered if some films, like our latest favorite, Executioners from Shaolin, could be enjoyed even more, with better understanding of the context in which these films exist in the Kung Fu universe.

    So, I began a data driven quest for truth and understanding (or at least a semi-interesting dataset to explore) of all Shaw Brothers Kung Fu movies ever made!

    If you’re not familiar with the genre, here is a three-minute final fight collage from YouTube:

    When I saw the title, I was hopeful that Jim had captured the choreography of the movies for comparison.

    No such luck! 😉

    That would be an extremely difficult and labor intensive task.

    Just in case you are curious, there is a Dance Notation Bureau with extensive resources should you decide to capture one or more Kung Fu films in notation.

    Or try Notation Reloaded: eXtensible Dance Scripting Notation by Matthew Gough.

    A search using “xml dance notation” produces a number of interesting resources.

    Three More Reasons To Learn R

    Friday, January 6th, 2017

    Three reasons to learn R today by David Smith.

    From the post:

    If you're just getting started with data science, the Sharp Sight Labs blog argues that R is the best data science language to learn today.

    The blog post gives several detailed reasons, but the main arguments are:

    1. R is an extremely popular (arguably the most popular) data programming language, and ranks highly in several popularity surveys.
    2. Learning R is a great way of learning data science, with many R-based books and resources for probability, frequentist and Bayesian statistics, data visualization, machine learning and more.
    3. Python is another excellent language for data science, but with R it's easier to learn the foundations.

    Once you've learned the basics, Sharp Sight also argues that R is a great data science language to master, even though it's an old language compared to some of the newer alternatives. Every tool has a shelf life, but R isn't going anywhere and learning R gives you a foundation beyond the language itself.

    If you want to get started with R, Sharp Sight labs offers a data science crash course. You might also want to check out the Introduction to R for Data Science course on EdX.

    Sharp Sight Labs: Why R is the best data science language to learn today, and Why you should master R (even if it might eventually become obsolete)

    If you need more reasons to learn R:

    • Unlike Facebook, R isn’t a sinkhole of non-testable propositions.
    • Unlike Instagram, R is rarely NSFW.
    • Unlike Twitter, R is a marketable skill.

    Glad to hear you are learning R!

    How to weigh a dog with a ruler? [Or Price a US Representative?]

    Wednesday, December 14th, 2016

    How to weigh a dog with a ruler? (looking for translators)

    From the post:

    We are working on a series of comic books that introduce statistical thinking and could be used as activity booklets in primary schools. Stories are built around adventures of siblings: Beta (skilled mathematician) and Bit (data hacker).

    What is the connection between these comic books and R? All plots are created with ggplot2.

    The first story (How to weigh a dog with a ruler?) is translated to English, Polish and Czech. If you would like to help us to translate this story to your native language, just write to me (przemyslaw.biecek at gmail) or create an issue on GitHub. It’s just 8 pages long, translations are available on Creative Commons BY-ND licence.

    The key is to chart animals by their height against their weight.

    Pricing US Representatives is likely to follow a similar relationship, where their price goes up with years of service in Congress.

    I haven’t run the data but such a chart would keep “people” (includes corporations in the US) from paying too much or offering too little. To the embarrassment of all concerned.
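The comic's premise translates directly into R: regress weight on height, then read a new animal's weight off the fitted line. A sketch with invented numbers:

```r
# "Weigh a dog with a ruler": log-log regression of weight on height,
# then predict from a height measurement. Data invented for illustration.
animals <- data.frame(
  height_cm = c(25, 40, 60, 80, 110),
  weight_kg = c(4, 12, 30, 65, 160)
)

fit <- lm(log(weight_kg) ~ log(height_cm), data = animals)

# "Weigh" a 50 cm dog using only a ruler measurement
exp(predict(fit, data.frame(height_cm = 50)))
```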

    Identifying Speech/News Writers

    Friday, December 2nd, 2016

    David Smith’s post: Stylometry: Identifying authors of texts using R details the use of R to distinguish tweets by president-elect Donald Trump from his campaign staff. (Hmmm, sharing a Twitter account password, there’s bad security for you.)

    The same techniques may distinguish texts delivered “live” versus those “inserted” into Congressional Record.

    What other texts are ripe for distinguishing authors?

    From the post:

    Few people expect politicians to write every word they utter themselves; reliance on speechwriters and spokespersons is a long-established political practice. Still, it's interesting to know which statements are truly the politician's own words, and which are driven primarily by advisors or influencers.

    Recently, David Robinson established a way of figuring out which tweets from Donald Trump's Twitter account came from him personally, as opposed to from campaign staff, which he verified by comparing the sentiment of tweets from Android vs iPhone devices. Now, Ali Arsalan Kazmi has used stylometric analysis to investigate the provenance of speeches by the Prime Minister of Pakistan.

    A small amount of transparency can go a long way.

    Email archives anyone?
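A bare-bones version of the stylometric idea, comparing how often two texts use common function words, fits in a few lines of base R. The texts here are invented, and real analyses use far more data and markers:

```r
# Fingerprint texts by their relative use of common function words;
# distances between fingerprints are (weak) evidence about authorship.
tokenize <- function(txt) {
  w <- tolower(unlist(strsplit(txt, "[^a-zA-Z']+")))
  w[nchar(w) > 0]
}
freq_of <- function(words, markers) {
  sapply(markers, function(m) mean(words == m))
}

markers <- c("the", "and", "of", "very")
a <- tokenize("The team reviewed the plan and the budget very carefully.")
b <- tokenize("Results of analysis of data of interest and of note.")

rbind(author_a = freq_of(a, markers), author_b = freq_of(b, markers))
```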

    War and Peace & R

    Friday, December 2nd, 2016

    No, not a post about R versus Python but about R and Tolstoy‘s War and Peace.

    Using R to Gain Insights into the Emotional Journeys in War and Peace by Wee Hyong Tok.

    From the post:

    How do you read a novel in record time, and gain insights into the emotional journey of main characters, as they go through various trials and tribulations, as an exciting story unfolds from chapter to chapter?

    I remembered my experiences when starting a novel: I get intrigued by the story and simply cannot wait to get to the last chapter. I also recall many conversations with friends about interesting novels I read awhile back, of which I somehow have only a vague recollection of what happened in a specific chapter. In this post, I’ll work through how we can use R to analyze the English translation of War and Peace.

    War and Peace is a novel by Leo Tolstoy, and captures the salient points of Russian history from the period 1805 to 1812. The novel consists of the stories of five families, and captures the trials and tribulations of various characters (e.g. Natasha and Andre). The novel runs to about 1,400 pages and is one of the longest novels ever written.

    We hypothesize that if we can build a dashboard (shown below), this will allow us to gain insights into the emotional journey undertaken by the characters in War and Peace.

    Impressive work, even though I would not use it as a short-cut to “read a novel in record time.”

    Rather I take this as an alternative way of reading War and Peace, one that can capture insights a casual reader may miss.

    Moreover, the techniques demonstrated here could be used with other works of literature, or even non-fictional works.

    Imagine conducting this analysis over the reportedly more than 7,000-page full CIA Torture Report, for example.

    A heatmap does not connect any dots, but points a user towards places where interesting dots may be found.

    Certainly a tool for exploring large releases/leaks of text data.


    PS: Large, tiresome, obscure-on-purpose government reports to practice on with this method?
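The emotional-journey analysis reduces to scoring text chunks against a sentiment lexicon and plotting the scores in order. A toy sketch with a made-up four-word lexicon (Tok's post uses proper sentiment tooling on the full translation):

```r
# Score each "chapter" against a tiny invented lexicon, then plot the
# trajectory. Real analyses use full sentiment lexicons and real text.
positive <- c("love", "joy", "peace", "hope")
negative <- c("war", "death", "grief", "fear")

score_chapter <- function(txt) {
  w <- tolower(unlist(strsplit(txt, "[^a-zA-Z]+")))
  sum(w %in% positive) - sum(w %in% negative)
}

chapters <- c("Hope and love in peace",
              "War brings death and grief",
              "Joy returns, fear recedes")
scores <- sapply(chapters, score_chapter)

plot(scores, type = "b", xlab = "Chapter", ylab = "Net sentiment")
```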

    Learning R programming by reading books: A book list

    Thursday, November 24th, 2016

    Learning R programming by reading books: A book list by Liang-Cheng Zhang.

    From the post:

    Despite R’s popularity, it is still very daunting to learn R, as R has no point-and-click features like SPSS and learning R usually takes lots of time. No worries! As self-taught R learners, we constantly receive requests about how to learn R. Besides hiring someone to teach you or paying tuition fees for online courses, our suggestion is that you can also pick up some books that fit your current R programming level. Therefore, in this post, we would like to share some good books that teach you programming in R at three levels: elementary, intermediate, and advanced. Each level focuses on one task, so you will know whether these books fit your needs. While the following books do not necessarily focus on the task we define, you should focus on that task when reading these books so you are not lost.

    Books and reading form the core of my most basic prejudice: Literacy is the doorway to unlimited universes.

    A prejudice so strong that I have to work hard at realizing non-literates live in and sense worlds not open to literates. Not less complex, not poorer, just different.

    But book lists in particular appeal to that prejudice and since my blog is read by literates, I’m indulging that prejudice now.

    I do have a title to add to the list: Practical Data Science with R by Nina Zumel and John Mount.

    Judging from the other titles listed, Practical Data Science with R falls in the intermediate range. Should not be your first R book but certainly high on the list for your second R book.

    Avoid the rush! Start working on your Amazon wish list today! 😉

    How to get started with Data Science using R

    Sunday, November 20th, 2016

    How to get started with Data Science using R by Karthik Bharadwaj.

    From the post:

    R, the lingua franca of data science, is one of the most popular language choices for learning data science. Once the choice is made, beginners often find themselves lost working out a learning path and end up at a signboard like the one below.

    In this blog post I would like to lay out a clear structural approach to learning R for data science. This will help you to quickly get started in your data science journey with R.

    You won’t find anything you don’t already know but this is a great short post to pass onto others.

    Point out that R skills will help them expose and/or conceal government corruption.

    The new Tesseract package: High Quality OCR in R

    Thursday, November 17th, 2016

    The new Tesseract package: High Quality OCR in R by Jeroen Ooms.

    From the post:

    Optical character recognition (OCR) is the process of extracting written or typed text from images such as photos and scanned documents into machine-encoded text. The new rOpenSci package tesseract brings one of the best open-source OCR engines to R. This enables researchers or journalists, for example, to search and analyze vast numbers of documents that are only available in printed form.

    People looking to extract text and metadata from pdf files in R should try our pdftools package.

    Reading too quickly at first I thought I had missed a new version of Tesseract (tesseract-ocr Github), an OCR program that I use on a semi-regular basis.

    Reading a little slower, ;-), I discovered Ooms is describing a new package for R, which uses Tesseract for OCR.

    This is great news but be aware that Tesseract (whether called by an R package or standalone) can generate a large amount of output in a fairly short period of time.

    One of the stumbling blocks of OCR is the labor intensive process of cleaning up the inevitable mistakes.

    Depending on how critical accuracy is for searching, for example, you may choose to verify and clean only quotes for use in other publications.

    Best to make those decisions up front and not be faced with a mountain of output that isn’t useful unless and until it has been corrected.
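A minimal round trip with the package looks like this. To keep the example self-contained it renders its own test image rather than starting from a scan, and ocr_data()'s per-word confidence scores are what make the cleanup pass manageable:

```r
library(tesseract)  # rOpenSci wrapper around the Tesseract OCR engine

# Render some text to an image, then OCR it back (a self-contained demo;
# real use cases start from scans or photos)
png("page.png", width = 400, height = 100)
par(mar = c(0, 0, 0, 0))
plot.new()
text(0.5, 0.5, "Reproducible research", cex = 2)
dev.off()

text_out <- ocr("page.png")
cat(text_out)

# Per-word confidence scores help target the inevitable cleanup pass
words <- ocr_data("page.png")
words[order(words$confidence), ]  # least-confident words first
```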

    Useful Listicle: The 5 most downloaded R packages

    Tuesday, November 15th, 2016

    The 5 most downloaded R packages

    From the post:

    Curious which R packages your colleagues and the rest of the R community are using? Thanks to RDocumentation you can now see for yourself! RDocumentation aggregates R documentation and download information from popular repositories like CRAN, BioConductor and GitHub. In this post, we’ll take a look at the top 5 R packages with the most direct downloads!

    Sorry! No spoiler!

    Do check out RDocumentation: it aggregates help documentation for R packages from CRAN, BioConductor, and GitHub – the three most common sources of current R documentation – and goes beyond simply aggregating this information by bringing all of it to your fingertips via the RDocumentation package. The RDocumentation package overwrites the basic help functions from the utils package and gives you access to RDocumentation from the comfort of your RStudio IDE. Look up the newest and most popular R packages, search through documentation and post community examples.

    As they say:

    Create an RDocumentation account today!

    I’m always sympathetic to documentation but more so today because I have wasted hours over the past two or three days on issues that could have been trivially documented.

    I will be posting “corrected” documentation later this week.

    PS: If you have or suspect you have poorly written documentation, I have some time available for paid improvement of the same.

    We Should Feel Safer Than We Do

    Tuesday, November 8th, 2016

    We Should Feel Safer Than We Do by Christian Holmes.

    Christian’s Background and Research Goals:


    Crime is a divisive and important issue in the United States. It is routinely ranked as among the most important issues to voters, and many politicians have built their careers around their perceived ability to reduce crime. Over 70% of Americans believe that crime is increasing, according to a recent Gallup poll, but is that really the case? I seek to answer this question in this post, as well as determine if there is any clear correlation between government spending and crime.

    Research Goals

    -Is crime increasing or decreasing in this country?
    -Is there a clear link between government spending and crime?

    These research goals provide an interesting contrast with his conclusions:

    From the crime data, it is abundantly clear that crime is on the decline, and has been for around 20 years. The reasons behind this decrease are quite nuanced, though, and I found no clear link between either increased education or police spending and decreasing crime rates. This does not mean that such a relationship does not exist. Rather, it merely means that there is no obvious correlation between the two variables over this specific time frame.

    In his background, Christian says:

    Over 70% of Americans believe that crime is increasing, according to a recent Gallup poll, but is that really the case? I seek to answer this question in this post,…

    Christian presumes, without proof, a relationship between public beliefs about crime rates (rising or falling) and crime rates as recorded by government agencies.

    Which also presumes:

    1. The public is aware that government collects crime statistics.
    2. The public is aware of current crime statistics.
    3. Current crime statistics influence public beliefs about the incidence of crime.

    If the central focus of the paper is a comparison of “crime rates” as measured by government with other data on government spending, why even mention the disparity between public “belief” about crime and crime statistics?

    I suspect, just as a rhetorical move, Christian is attempting to draw a favorable inference for his “evidence” by contrasting it with “public belief.” “Public belief” that is contrary to the “evidence” in this instance.

    Christian doesn’t offer us any basis for judgments about public opinion on crime one way or the other. Any number of factors could be influencing public opinion on that issue, the crime rate as measured by government being only one of those.

    The violent crime rate may be very low, statistically speaking, but if you are the victim of a violent crime, from your perspective crime is very prevalent.

    Of R and Relationships

    Christian uses R to compare crime data with government spending on education and policing.

    The unhappy result is that no relationship is evidenced between government spending and a reduction in crime, so Christian cautions:

    …This does not mean that such a relationship does not exist. Rather, it merely means that there is no obvious correlation between the two variables over this specific time frame….

    This is where we switch from relying on data and explore the realms of “the data didn’t prove I was wrong.”

    Since it isn’t possible to prove the absence of a relationship between the “crime rate” and government spending on education/police, no, the evidence didn’t prove Christian to be wrong.

    On the other hand, it clearly shows that Christian has no evidence for that “relationship.”

    The caution here is that using R and “reliable” data may lead to conclusions you would rather avoid.
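    To make that caution concrete, here is a minimal base R sketch of the kind of correlation check at issue. All figures below are invented for illustration; Christian worked from actual government statistics.

    ```r
    # Hypothetical series: per-capita police spending (dollars) and the
    # violent crime rate (per 100,000), one observation per year.
    years    <- 1996:2015
    spending <- c(210, 215, 222, 230, 241, 250, 262, 270, 281, 290,
                  301, 312, 320, 331, 340, 352, 360, 371, 382, 390)
    crime    <- c(637, 611, 567, 523, 507, 504, 494, 476, 463, 469,
                  479, 471, 458, 431, 404, 387, 387, 369, 361, 373)

    # Pearson correlation with a significance test.
    ct <- cor.test(spending, crime)
    ct$estimate  # strongly negative for these invented numbers
    ct$p.value

    # Both series trend over time, so a raw correlation is misleading;
    # correlating the detrended residuals is a more honest check.
    residual_cor <- cor(residuals(lm(spending ~ years)),
                        residuals(lm(crime ~ years)))
    ```

    A strong raw correlation between two trending series says little by itself, which is one reason Christian’s “no obvious correlation” hedge is the right kind of hedge.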

    PS: Crime and the public’s fear of crime are both extremely complex issues. Aggregate data can justify previously chosen positions, but little more.

    How the Ghana Floods animation was created [Animating Your Local Flood Data With R]

    Monday, November 7th, 2016

    How the Ghana Floods animation was created by David Quartey.

    From the post:

    Ghana has always been getting flooded, but it seems that only floods in Accra are getting attention. I wrote about it here, and the key visualization was an animated map showing the floods in Ghana, and built in R. In this post, I document how I did it, hopefully you can do one also!

    David’s focus is on Ghana but the same techniques work for data of greater local interest.
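    As a sketch of the data-munging half of such a workflow (the towns and counts below are invented, not David’s data), the animation reduces to counting events per place per year and then drawing one frame per year:

    ```r
    # Hypothetical flood records: one row per reported flood event.
    floods <- data.frame(
      year = c(2007, 2007, 2009, 2010, 2010, 2010, 2014, 2015, 2015),
      town = c("Accra", "Kumasi", "Accra", "Accra", "Tamale", "Accra",
               "Kumasi", "Accra", "Accra"),
      stringsAsFactors = FALSE
    )

    # Count events per town per year: the table each animation frame draws from.
    counts <- aggregate(list(events = rep(1, nrow(floods))),
                        by = list(year = floods$year, town = floods$town),
                        FUN = sum)

    # One frame per year; David draws his frames as maps with ggplot2
    # before stitching them into an animation.
    for (y in sort(unique(counts$year))) {
      frame <- counts[counts$year == y, ]
      # plot and save the frame here
    }
    ```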

    ggplot2 cheatsheet updated – other R spreadsheets

    Wednesday, November 2nd, 2016

    RStudio Cheat Sheets

    I saw a tweet that the ggplot2 cheatsheet has been updated.

    Here’s a list of all the cheatsheets available at RStudio:

    • R Markdown Cheat Sheet
    • RStudio IDE Cheat Sheet
    • Shiny Cheat Sheet
    • Data Visualization Cheat Sheet
    • Package Development Cheat Sheet
    • Data Wrangling Cheat Sheet
    • R Markdown Reference Guide

    Contributed Cheatsheets

    • Base R
    • Advanced R
    • Regular Expressions
    • How big is your graph? (base R graphics)

    I have deliberately omitted links because when cheat sheets are updated, the links break and/or return outdated information.

    Use and reference the RStudio Cheat Sheets page.


    Data Science for Political and Social Phenomena [Special Interest Search Interface]

    Sunday, October 23rd, 2016

    Data Science for Political and Social Phenomena by Chris Albon.

    From the webpage:

    I am a data scientist and quantitative political scientist. I specialize in the technical and organizational aspects of applying data science to political and social issues.

    Years ago I noticed a gap in the existing data literature. On one side was data science, with roots in mathematics and computer science. On the other side were the social sciences, with hard-earned expertise modeling and predicting complex human behavior. The motivation for this site and ongoing book project is to bridge that gap: to create a practical guide to applying data science to political and social phenomena.

    Chris has organized three hundred and twenty-eight pages on Data Wrangling, Python, R, etc.

    If you like learning from examples, this is the site for you!

    Including this site, what other twelve (12) sites would you include in a Python/R Data Science search interface?

    That is, an interface that has indexed only that baker’s dozen of sites. So you don’t spend time wading through “the G that is not named” search results.

    Serious question.

    Not that I would want to maintain such a beast for external use, but having a local search engine tuned to your particular interests could be nice.

    An introduction to data cleaning with R

    Tuesday, October 4th, 2016

    An introduction to data cleaning with R by Edwin de Jonge and Mark van der Loo.

    From the abstract:

    Data cleaning, or data preparation, is an essential part of statistical analysis. In fact, in practice it is often more time-consuming than the statistical analysis itself. These lecture notes describe a range of techniques, implemented in the R statistical environment, that allow the reader to build data cleaning scripts for data suffering from a wide range of errors and inconsistencies, in textual format. These notes cover technical as well as subject-matter related aspects of data cleaning. Technical aspects include data reading, type conversion and string matching and manipulation. Subject-matter related aspects include topics like data checking, error localization and an introduction to imputation methods in R. References to relevant literature and R packages are provided throughout.

    These lecture notes are based on a tutorial given by the authors at the useR!2013 conference in Albacete, Spain.

    Pure gold!

    Plus this tip (among others):

    Tip. To become an R master, you must practice every day.

    The more data you clean, the better you will become!
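    A tiny base R sketch of the kind of pipeline the notes build up (the column names and values are invented): read raw text, coerce types, then impute a missing value with the column mean:

    ```r
    # Raw input as it might arrive: a non-numeric entry and a missing value.
    raw <- textConnection("age,height
    21,180
    twenty,178
    35,NA")
    d <- read.csv(raw, stringsAsFactors = FALSE)

    # Type conversion: entries that cannot be parsed become NA.
    d$age    <- suppressWarnings(as.numeric(d$age))
    d$height <- as.numeric(d$height)

    # Mean imputation, one of the methods the notes introduce.
    d$height[is.na(d$height)] <- mean(d$height, na.rm = TRUE)
    ```

    Real scripts locate errors before imputing (the notes cover error localization in detail), but the read / convert / impute shape is the same.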


    ggplot2 2.2.0 coming soon! [Testers Needed!]

    Friday, September 30th, 2016

    ggplot2 2.2.0 coming soon! by Hadley Wickham.

    From the post:

    I’m planning to release ggplot2 2.2.0 in early November. In preparation, I’d like to announce that a release candidate is now available. Please try it out, and file an issue on GitHub if you discover any problems. I hope we can find and fix any major issues before the official release.

    Install the pre-release version with:

    # install.packages("devtools")
    devtools::install_github("hadley/ggplot2")

    If you discover a major bug that breaks your plots, please file a minimal reprex, and then roll back to the released version with:

    install.packages("ggplot2")

    ggplot2 2.2.0 will be a relatively major release.

    The majority of this work was carried out by Thomas Pedersen, who I was lucky to have as my “ggplot2 intern” this summer. Make sure to check out his other visualisation packages: ggraph, ggforce, and tweenr.

    Just in case you are casual about time, tomorrow is October 1st. Which on most calendars means that “early November” isn’t far off.

    Here’s an easy opportunity to test ggplot2 2.2.0 and related visualization packages before the official release.


    The Simpsons by the Data [South Park as well]

    Thursday, September 29th, 2016

    The Simpsons by the Data by Todd Schneider.

    From the post:

    The Simpsons needs no introduction. At 27 seasons and counting, it’s the longest-running scripted series in the history of American primetime television.

    The show’s longevity, and the fact that it’s animated, provides a vast and relatively unchanging universe of characters to study. It’s easier for an animated show to scale to hundreds of recurring characters; without live-action actors to grow old or move on to other projects, the denizens of Springfield remain mostly unchanged from year to year.

    As a fan of the show, I present a few short analyses about Springfield, from the show’s dialogue to its TV ratings. All code used for this post is available on GitHub.

    Alert! You must run Flash in order to access Simpsons World, the source of Todd’s data.

    Advice: Treat Flash as malware and run in a VM.

    Todd covers the number of words spoken per character, gender imbalance, focus on characters, viewership, and episode summaries (tf-idf).
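    Once the dialogue is in a data frame, the words-per-character count is a few lines of base R. A sketch with invented lines (Todd’s real dialogue comes from Simpsons World):

    ```r
    # Hypothetical dialogue: one row per spoken line.
    dialogue <- data.frame(
      character = c("Homer", "Marge", "Homer", "Bart"),
      text      = c("Woo hoo", "Homer please", "Mmm donuts forbidden donuts",
                    "Eat my shorts"),
      stringsAsFactors = FALSE
    )

    # Words per line, then totals per character, sorted descending.
    dialogue$words <- lengths(strsplit(dialogue$text, "\\s+"))
    totals <- aggregate(words ~ character, data = dialogue, FUN = sum)
    totals <- totals[order(-totals$words), ]
    ```

    The tf-idf pass over episode summaries is the same munging pattern one level up: tokenize, count, weight.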

    Other analysis awaits your imagination and interest.

    BTW, if you want comedy data a bit closer to the edge, try Text Mining South Park by Kaylin Walker. Kaylin uses R for her analysis as well.

    Other TV programs with R-powered analysis?

    R Weekly

    Monday, September 12th, 2016

    R Weekly

    A new weekly publication of R resources that began on 21 May 2016 with Issue 0.

    Mostly titles of posts and news articles, which is useful, but not as useful as short summaries that include the author’s name would be.