Archive for the ‘R’ Category

21 Recipes for Mining Twitter Data with rtweet

Saturday, January 6th, 2018

21 Recipes for Mining Twitter Data with rtweet by Bob Rudis.

From the preface:

I’m using this as way to familiarize myself with bookdown so I don’t make as many mistakes with my web scraping field guide book.

It’s based on Matthew R. Russell’s book. That book is out of distribution and much of the content is in Matthew’s “Mining the Social Web” book. There will be many similarities between his “21 Recipes” book and this book on purpose. I am not claiming originality in this work, just making an R-centric version of the cookbook.

As he states in his tome, “this intentionally terse recipe collection provides you with 21 easily adaptable Twitter mining recipes”.

Rudis has posted about this editing project at: A bookdown “Hello World” : Twenty-one (minus two) Recipes for Mining Twitter with rtweet, which you should consult if you want to contribute to this project.

Working through 21 Recipes for Mining Twitter Data with rtweet will give you experience proofing a text and if you type in the examples (no cut-n-paste), you’ll develop rtweet muscle memory.


Who’s on everyone’s 2017 “hit list”?

Thursday, January 4th, 2018

Who’s on everyone’s 2017 “hit list”? by Suzan Baert.

From the post:

At the end of the year, everyone is making lists. And radio stations are no exceptions.
Many of our radio stations have a weekly “people’s choice” music chart. Throughout the week, people submit their top 3 recent songs, and every week those votes turn into a music chart. At the end of the year, they collapse all those weekly charts into a larger one covering the entire year.

I find this one quite interesting: it’s not dependent on what music people buy, it’s determined by what the audience of that station wants to hear. So what are the differences between these stations? And do they match up with what I would expect?

What was also quite intriguing: in Dutch we call it a hit lijst and if you translate that word for word you get: hit list. Which at least one radio station seems to do…

Personally, when I hear the word hit list, music is not really what comes to mind, but hey, let’s roll with it: which artists are on everyone’s ‘hit list’?

A delightful scraping of four (4) radio station “hit lists,” which uses rOpenSci robotstxt, rvest, xml2, dplyr, tidyr, ggplot2, phantomJS, and collates the results.

Music doesn’t come to mind for me when I hear “hit list.”

For me, “hit list” means what Google wants you to know about subject N.


Game of Thrones DVDs for Christmas?

Wednesday, December 27th, 2017

Mining Game of Thrones Scripts with R by Gokhan Ciflikli

If you are serious about defeating all comers to Game of Thrones trivia, then you need to know the scripts cold. (sorry)

Ciflikli introduces you to quanteda and an analysis of the Game of Thrones scripts in a single post, saying:

I meant to showcase the quanteda package in my previous post on the Weinstein Effect but had to switch to tidytext at the last minute. Today I will make good on that promise. quanteda is developed by Ken Benoit and maintained by Kohei Watanabe – go LSE! On that note, the first 2018 LondonR meeting will be taking place at the LSE on January 16, so do drop by if you happen to be around. quanteda v1.0 will be unveiled there as well.

Given that I have already used the data I had in mind, I have been trying to identify another interesting (and hopefully less depressing) dataset for this particular calling. Then it snowed in London, and the dire consequences of this supernatural phenomenon were covered extensively by the r/CasualUK/. One thing led to another, and before you know it I was analysing Game of Thrones scripts:

2018, with its mid-term congressional elections, will be a big year for leaked emails and documents, in addition to the usual follies of government.

Text mining/analysis skills you gain with the Game of Thrones scripts will be in high demand by partisans, investigators, prosecutors, just about anyone you can name.

From the quanteda documentation site:

quanteda is principally designed to allow users a fast and convenient method to go from a corpus of texts to a selected matrix of documents by features, after defining what the documents and features. The package makes it easy to redefine documents, for instance by splitting them into sentences or paragraphs, or by tags, as well as to group them into larger documents by document variables, or to subset them based on logical conditions or combinations of document variables. The package also implements common NLP feature selection functions, such as removing stopwords and stemming in numerous languages, selecting words found in dictionaries, treating words as equivalent based on a user-defined “thesaurus”, and trimming and weighting features based on document frequency, feature frequency, and related measures such as tf-idf.
… (emphasis in original)

Once you follow the analysis of the Game of Thrones scripts, what other texts or features of quanteda will catch your eye?
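As a rough illustration of what “a matrix of documents by features” means, here is a base R sketch. It deliberately does not use quanteda or its API; the three sample lines, the tiny stopword list, and the helper names are assumptions made up for the example.

```r
# Three short sample "documents" (illustrative only, not the real scripts).
texts <- c(
  doc1 = "winter is coming",
  doc2 = "the night is dark and full of terrors",
  doc3 = "winter came and the night is long"
)

# A tiny, hypothetical stopword list; real lists are much longer.
stopwords_small <- c("is", "the", "and", "of")

# Lowercase, split on anything that is not a letter, drop stopwords.
tokenize <- function(x) {
  tokens <- strsplit(tolower(x), "[^a-z]+")
  lapply(tokens, function(tok) tok[nzchar(tok) & !tok %in% stopwords_small])
}

tokens <- tokenize(texts)
features <- sort(unique(unlist(tokens)))

# Count how often each feature occurs in one document.
count_features <- function(tok) {
  as.integer(table(factor(tok, levels = features)))
}

# Rows are documents, columns are features.
dfm <- t(vapply(tokens, count_features, integer(length(features))))
dimnames(dfm) <- list(names(texts), features)

# Document frequency: in how many documents does each feature appear?
doc_freq <- colSums(dfm > 0)

print(dfm)
print(doc_freq)
```

quanteda does the same job with far more care (tokenization rules, sparse storage, weighting), so treat this only as a picture of the data structure.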


Geocomputation with R – Open Book in Progress – Contribute

Tuesday, December 26th, 2017

Geocomputation with R by Robin Lovelace, Jakub Nowosad, Jannes Muenchow.

Welcome to the online home of Geocomputation with R, a forthcoming book with CRC Press.


Inspired by bookdown and other open source projects we are developing this book in the open. Why? To encourage contributions, ensure reproducibility and provide access to the material as it evolves.

The book’s development can be divided into four main phases:

  1. Foundations
  2. Basic applications
  3. Geocomputation methods
  4. Advanced applications

Currently the focus is on Part 2, which we aim to be complete by December. New chapters will be added to this website as the project progresses, hosted at and kept up-to-date thanks to Travis….

Speaking of R and geocomputation, I’ve been trying to remember to post about Geocomputation with R since I encountered it a week or more ago. Not what I expect from CRC Press. That got my attention right away!

Part II, Basic Applications has two chapters, 7 Location analysis and 8 Transport applications.

Layering display of data from different sources should be included under Basic Applications. For example, relying on but not displaying topographic data to calculate line of sight between positions. Perhaps the base display is a high-resolution image overlaid with GPS coordinates at intervals, with the line of sight colored on the structures it reaches.
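To make the line-of-sight idea concrete, here is a minimal base R sketch. It assumes the topographic data has already been reduced to evenly spaced elevation samples along the straight path between two positions; the function name, arguments, and sample elevations are all hypothetical.

```r
# Returns TRUE when no intermediate terrain sample rises above the straight
# sight line from the first to the last position.
# Assumption: `profile` holds elevations at evenly spaced points on the path.
line_of_sight <- function(profile, observer_height = 0, target_height = 0) {
  n <- length(profile)
  stopifnot(n >= 2)
  start <- profile[1] + observer_height
  end <- profile[n] + target_height
  # Height of the sight line above each sample point.
  sight_line <- seq(start, end, length.out = n)
  interior <- -c(1, n)
  all(profile[interior] <= sight_line[interior])
}

valley <- c(10, 5, 10)   # terrain dips between the two positions
ridge <- c(10, 20, 10)   # a ridge sits between the two positions

print(line_of_sight(valley))
print(line_of_sight(ridge))
print(line_of_sight(ridge, observer_height = 15, target_height = 15))
```

A real implementation would sample a raster along the path and account for earth curvature; this sketch only shows the comparison at the heart of the calculation.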

Other “basic applications” you would suggest?

Looking forward to progress on this volume!

All targets have spatial-temporal locations.

Tuesday, December 26th, 2017


From the about page: is a website and blog for those interested in using R to analyse spatial or spatio-temporal data.

Posts in the last six months to whet your appetite for this blog:

The budget of a government for spatial-temporal software is no indicator of skill with spatial and spatial-temporal data.

How are yours?

Learn to Write Command Line Utilities in R

Thursday, December 21st, 2017

Learn to Write Command Line Utilities in R by Mark Sellors.

From the post:

Do you know some R? Have you ever wanted to write your own command line utilities, but didn’t know where to start? Do you like Harry Potter?

If the answer to these questions is “Yes!”, then you’ve come to the right place. If the answer is “No”, but you have some free time, stick around anyway, it might be fun!

Sellors invokes the tradition of *nix command line tools saying: “The thing that most [command line] tools have in common is that they do a small number of things really well.”

The question to you is: What small things do you want to do really well?
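As one answer to that question, here is a minimal sketch of a single-purpose utility: counting words in files. It is not Sellors' example; the file name `wc.R` and the function names are assumptions. The only R-specific pieces are the `Rscript` shebang line and `commandArgs(trailingOnly = TRUE)`, which returns the arguments supplied after the script name.

```r
#!/usr/bin/env Rscript
# wc.R (hypothetical name): print a word count for each file named on the
# command line.

# Count whitespace-separated words in a character vector of lines.
count_words <- function(lines) {
  words <- unlist(strsplit(trimws(lines), "[[:space:]]+"))
  sum(nzchar(words))
}

# Process every path given as an argument; do nothing if there are none.
main <- function(args) {
  for (path in args) {
    n <- count_words(readLines(path, warn = FALSE))
    cat(n, " ", path, "\n", sep = "")
  }
}

main(commandArgs(trailingOnly = TRUE))
```

Saved and made executable, it could be run as `Rscript wc.R notes.txt`, where `notes.txt` stands for any text file you have at hand.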

Spatial Microsimulation with R – Public Policy Advocates Take Note

Thursday, December 14th, 2017

Spatial Microsimulation with R by Robin Lovelace and Morgane Dumont.

Apologies for the long quote below but spatial microsimulation is unfamiliar enough that it merited an introduction in the authors’ own prose.

We have all attended public meetings where developers, polluters, landfill operators, etc., had charts, studies, etc., and the public was armed with, well, its opinions.

Spatial Microsimulation with R can put you in a position to offer alternative analysis, meaningfully ask for data used in other studies, in short, arm yourself with weapons long abused in public policy discussions.

From Chapter 1, 1.2 Motivations:

Imagine a world in which data on companies, households and governments were widely available. Imagine, further, that researchers and decision-makers acting in the public interest had tools enabling them to test and model such data to explore different scenarios of the future. People would be able to make more informed decisions, based on the best available evidence. In this technocratic dreamland pressing problems such as climate change, inequality and poor human health could be solved.

These are the types of real-world issues that we hope the methods in this book will help to address. Spatial microsimulation can provide new insights into complex problems and, ultimately, lead to better decision-making. By shedding new light on existing information, the methods can help shift decision-making processes away from ideological bias and towards evidence-based policy.

The ‘open data’ movement has made many datasets more widely available. However, the dream sketched in the opening paragraph is still far from reality. Researchers typically must work with data that is incomplete or inaccessible. Available datasets often lack the spatial or temporal resolution required to understand complex processes. Publicly available datasets frequently miss key attributes, such as income. Even when high quality data is made available, it can be very difficult for others to check or reproduce results based on them. Strict conditions inhibiting data access and use are aimed at protecting citizen privacy but can also serve to block democratic and enlightened decision making.

The empowering potential of new information is encapsulated in the saying that ‘knowledge is power’. This helps explain why methods such as spatial microsimulation, that help represent the full complexity of reality, are in high demand.

Spatial microsimulation is a growing approach to studying complex issues in the social sciences. It has been used extensively in fields as diverse as transport, health and education (see Chapter ), and many more applications are possible. Fundamental to the approach are approximations of individual level data at high spatial resolution: people allocated to places. This spatial microdata, in one form or another, provides the basis for all spatial microsimulation research.

The purpose of this book is to teach methods for doing (not reading about!) spatial microsimulation. This involves techniques for generating and analysing spatial microdata to get the ‘best of both worlds’ from real individual and geographically-aggregated data. Population synthesis is therefore a key stage in spatial microsimulation: generally real spatial microdata are unavailable due to concerns over data privacy. Typically, synthetic spatial microdatasets are generated by combining aggregated outputs from Census results with individual level data (with little or no geographical information) from surveys that are representative of the population of interest.

The resulting spatial microdata are useful in many situations where individual level and geographically specific processes are in operation. Spatial microsimulation enables modelling and analysis on multiple levels. Spatial microsimulation also overlaps with (and provides useful initial conditions for) agent-based models (see Chapter 12).

Despite its utility, spatial microsimulation is little known outside the fields of human geography and regional science. The methods taught in this book have the potential to be useful in a wide range of applications. Spatial microsimulation has great potential to be applied to new areas for informing public policy. Work of great potential social benefit is already being done using spatial microsimulation in housing, transport and sustainable urban planning. Detailed modelling will clearly be of use for planning for a post-carbon future, one in which we stop burning fossil fuels.

For these reasons there is growing interest in spatial microsimulation. This is due largely to its practical utility in an era of ‘evidence-based policy’ but is also driven by changes in the wider research environment inside and outside of academia. Continued improvements in computers, software and data availability mean the methods are more accessible than ever. It is now possible to simulate the populations of small administrative areas at the individual level almost anywhere in the world. This opens new possibilities for a range of applications, not least policy evaluation.

Still, the meaning of spatial microsimulation is ambiguous for many. This book also aims to clarify what the method entails in practice. Ambiguity surrounding the term seems to arise partly because the methods are inherently complex, operating at multiple levels, and partly due to researchers themselves. Some uses of the term ‘spatial microsimulation’ in the academic literature are unclear as to its meaning; there is much inconsistency about what it means. Worse is work that treats spatial microsimulation as a magical black box that just ‘works’ without any need to describe, or more importantly make reproducible, the methods underlying the black box. This book is therefore also about demystifying spatial microsimulation.

If that wasn’t impressive enough, the authors add:

We’ve put Spatial Microsimulation with R on-line because we want to reduce barriers to learning. We’ve made it open source via a GitHub repository because we believe in reproducibility and collaboration. Comments and suggests are most welcome there. If the content of the book helps your research, please cite it (Lovelace and Dumont, 2016).

How awesome is that!

Definitely a model for all of us to emulate!
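The population synthesis step described in the quoted motivation, combining aggregated census totals with individual-level survey records, can be sketched with iterative proportional fitting, a standard reweighting technique. The excerpt above does not show the authors' code, so everything here (the survey rows, the zone totals, the function name) is a made-up assumption for illustration.

```r
# Hypothetical survey microdata: one row per individual, no geography.
survey <- data.frame(
  id = 1:5,
  age = c("young", "young", "old", "old", "old"),
  sex = c("f", "m", "f", "m", "m"),
  stringsAsFactors = FALSE
)

# Hypothetical census totals for a single zone (both sum to 12 people).
age_totals <- c(young = 8, old = 4)
sex_totals <- c(f = 6, m = 6)

# Iterative proportional fitting: alternately rescale the weights so the
# weighted survey matches the age totals, then the sex totals.
ipf_weights <- function(survey, age_totals, sex_totals, iterations = 20) {
  weights <- rep(1, nrow(survey))
  for (i in seq_len(iterations)) {
    weights <- weights * age_totals[survey$age] /
      ave(weights, survey$age, FUN = sum)
    weights <- weights * sex_totals[survey$sex] /
      ave(weights, survey$sex, FUN = sum)
  }
  unname(weights)
}

survey$weight <- ipf_weights(survey, age_totals, sex_totals)

# Weighted totals should now reproduce both sets of census margins.
print(tapply(survey$weight, survey$age, sum))
print(tapply(survey$weight, survey$sex, sum))
```

Each weight says how many people in the zone a survey respondent stands in for; repeating this for every zone yields a synthetic spatial microdataset.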

Connecting R to Keras and TensorFlow

Tuesday, December 12th, 2017

Connecting R to Keras and TensorFlow by Joseph Rickert.

From the post:

It has always been the mission of R developers to connect R to the “good stuff”. As John Chambers puts it in his book Extending R:

One of the attractions of R has always been the ability to compute an interesting result quickly. A key motivation for the original S remains as important now: to give easy access to the best computations for understanding data.

From the day it was announced a little over two years ago, it was clear that Google’s TensorFlow platform for Deep Learning is good stuff. This September (see announcment), J.J. Allaire, François Chollet, and the other authors of the keras package delivered on R’s “easy access to the best” mission in a big way. Data scientists can now build very sophisticated Deep Learning models from an R session while maintaining the flow that R users expect. The strategy that made this happen seems to have been straightforward. But, the smooth experience of using the Keras API indicates inspired programming all the way along the chain from TensorFlow to R.

The Redditor deepfakes, of AI-Assisted Fake Porn fame, mentions Keras as one of his tools. Is that an endorsement?

Rickert’s post is a quick start to Keras and TensorFlow, but he does mention:

the MEAP from the forthcoming Manning Book, Deep Learning with R by François Chollet, the creator of Keras, and J.J. Allaire.

I’ve had good luck with Manning books in general so am looking forward to this one as well.

Introducing Data360R — data to the power of R [On Having an Agenda]

Saturday, December 9th, 2017

Introducing Data360R — data to the power of R

From the post:

Last January 2017, the World Bank launched TCdata360, a new open data platform that features more than 2,000 trade and competitiveness indicators from 40+ data sources inside and outside the World Bank Group. Users of the website can compare countries, download raw data, create and share data visualizations on social media, get country snapshots and thematic reports, read data stories, connect through an application programming interface (API), and more.

The response to the site has been overwhelmingly enthusiastic, and this growing user base continually inspires us to develop better tools to increase data accessibility and usability. After all, open data isn’t useful unless it’s accessed and used for actionable insights.

One such tool we recently developed is data360r, an R package that allows users to interact with the TCdata360 API and query TCdata360 data, metadata, and more using easy, single-line functions.

So long as you remember the World Bank has an agenda and all the data it releases serves that agenda, you should suffer no permanent harm.

Don’t take that as meaning other sources of data have less of an agenda, although you may find their agendas differ from that of the World Bank.

The recent “discovery” that machine learning algorithms can conceal social or racist bias was long overdue.

Anyone who took survey work in social science methodology in the last half of the 20th century would report that data collection itself, much less its processing, is fraught with unavoidable bias.

It is certainly possible, in the physical sense, to give students standardized tests, but what test results mean for any given question, such as teacher competence, is far from clear.

Or to put it differently, just because something can be measured is no guarantee the measurement is meaningful. The same applies to the data that results from any measurement process.

Take advantage of data360r certainly, but keep a wary eye on data from any source.

Building a Telecom Dictionary scraping web using rvest in R [Tunable Transparency]

Tuesday, December 5th, 2017

Building a Telecom Dictionary scraping web using rvest in R by Abdul Majed Raja.

From the post:

One of the biggest problems in Business to carry out any analysis is the availability of Data. That is where in many cases, Web Scraping comes very handy in creating that data that’s required. Consider the following case: To perform text analysis on Textual Data collected in a Telecom Company as part of Customer Feedback or Reviews, primarily requires a dictionary of Telecom Keywords. But such a dictionary is hard to find out-of-box. Hence as an Analyst, the most obvious thing to do when such dictionary doesn’t exist is to build one. Hence this article aims to help beginners get started with web scraping with rvest in R and at the same time, building a Telecom Dictionary by the end of this exercise.

Great for scraping an existing glossary but as always, it isn’t possible to extract information that isn’t captured by the original glossary.

Things like the scope of applicability for the terms, language, author, organization, even characteristics of the subjects the terms represent.

Of course, if your department invested in collecting that information for every subject in the glossary, there is no external requirement that on export all that information be included.

That is, your “data silo” can have tunable transparency: you enable others to use your data with as much or as little semantic friction as the situation merits.

Some data borrowers get opaque spreadsheet field names: column1, column2, etc.

Other data borrowers, perhaps those willing to help defray the cost of semantic annotation, well, they get a more transparent view of the data.

That is one possible method of making semantic annotation and its maintenance a revenue center as opposed to a cost center.
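A minimal base R sketch of what tunable transparency could look like at export time. The glossary rows, the annotation columns, and the `export_view()` function are all hypothetical; the point is only that the same data can leave the silo with more or less of its semantics attached.

```r
# Hypothetical annotated glossary held inside the "data silo".
glossary <- data.frame(
  term = c("LTE", "SIM"),
  definition = c("Long-Term Evolution", "Subscriber Identity Module"),
  scope = c("radio access", "subscriber hardware"),
  language = c("en", "en"),
  stringsAsFactors = FALSE
)

# Export the same data at a chosen level of transparency.
# "opaque": terms and definitions only, with meaningless field names.
# "transparent": everything, with descriptive field names.
export_view <- function(data, transparency = c("opaque", "transparent")) {
  transparency <- match.arg(transparency)
  if (transparency == "transparent") {
    return(data)
  }
  out <- data[, c("term", "definition")]
  names(out) <- paste0("column", seq_along(out))
  out
}

print(export_view(glossary))
print(export_view(glossary, "transparent"))
```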

Australian Census Data and Same Sex Marriage

Tuesday, December 5th, 2017

Combining Australian Census data with the Same Sex Marriage Postal Survey in R by Miles McBain.

Last week I put out a post that showed you how to tidy the Same Sex Marriage Postal Survey Data in R. In this post we’ll visualise that data in combination with the 2016 Australian Census. Note to people just here for the R — the main challenge here is actually just navigating the ABS’s Census DataPack, but I’ve tried to include a few pearls of wisdom on joining datasets to keep things interesting for you.

Decoding the “datapack” is an early task:

The datapack consists of 59 encoded csv files and 3 metadata excel files that will help us decode their meaning. What? You didn’t think this was going to be straight forward did you?

When I say encoded, I mean the csv’s have inscrutable names like ‘2016Census_G09C.csv’ and contain column names like ‘Se_d_r_or_t_h_t_Tot_NofB_0_ib’ (H.T. @hughparsonage).

Two of the metadata files in /Metadata/ have useful applications for us. ‘2016Census_geog_desc_1st_and_2nd_release.xlsx’ will help us resolve encoded geographic areas to federal electorate names. ‘Metadata_2016_GCP_DataPack.xlsx’ lists the topics of each of the 59 tables and will allow us to replace a short and uninformative column name with a much longer, and slightly more informative name….

Followed by the joys of joining and analyzing the data sets.
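The decoding step, swapping short column names for the longer ones listed in a metadata table, might look something like this in base R. This is a sketch under stated assumptions, not McBain's code: the table contents, the column names, and the `decode_names()` helper are invented stand-ins for the real DataPack files.

```r
# Hypothetical stand-in for one encoded DataPack table.
census <- data.frame(
  region_id = c("CED101", "CED102"),
  Tot_P_M = c(100, 120),
  Tot_P_F = c(110, 115),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the metadata sheet: short name -> long name.
metadata <- data.frame(
  short = c("Tot_P_M", "Tot_P_F"),
  long = c("Total_Persons_Males", "Total_Persons_Females"),
  stringsAsFactors = FALSE
)

# Rename every column that has a metadata entry; leave the others alone.
decode_names <- function(data, metadata) {
  idx <- match(names(data), metadata$short)
  found <- !is.na(idx)
  names(data)[found] <- metadata$long[idx[found]]
  data
}

decoded <- decode_names(census, metadata)
print(names(decoded))
```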

McBain develops original analysis of the data that demonstrates a relationship between having children and opinions on the impact of same sex marriage on children.

No, I won’t repeat his insight. Read his post, it’s quite entertaining.

Name a bitch badder than Taylor Swift

Tuesday, December 5th, 2017

It all began innocently enough, a tweet with this image and title by Nutella.

Maëlle Salmon reports in Names of b…..s badder than Taylor Swift, a class in women’s studies? that her first pass on tweets quoting Nutella’s tweet netted 15,653 tweets! (Salmon posted on 05 December 2017 so a later tweet count will be higher.)

Salmon uses rtweet to obtain the tweets, cleanNLP to extract entities, and then enhances those entities with Wikidata.

There’s a lot going on in this one post!

Enjoy the post and remember to follow Maëlle Salmon on Twitter!

Other value-adds for this data set?

Over Thinking Secret Santa ;-)

Thursday, November 30th, 2017

Secret Santa is a graph traversal problem by Tristan Mahr.

From the post:

Last week at Thanksgiving, my family drew names from a hat for our annual game of Secret Santa. Actually, it wasn’t a hat but you know what I mean. (Now that I think about it, I don’t think I’ve ever seen names drawn from a literal hat before!) In our family, the rules of Secret Santa are pretty simple:

  • The players’ names are put in “a hat”.
  • Players randomly draw a name from a hat, become that person’s Secret Santa, and get them a gift.
  • If a player draws their own name, they draw again.

Once again this year, somebody asked if we could just use an app or a website to handle the drawing for Secret Santa. Or I could write a script to do it I thought to myself. The problem nagged at the back of my mind for the past few days. You could just shuffle the names… no, no, no. It’s trickier than that.

In this post, I describe a couple of algorithms for Secret Santa sampling using R and directed graphs. I use the DiagrammeR package which creates graphs from dataframes of nodes and edges, and I liberally use dplyr verbs to manipulate tables of edges.

If you would like a more practical way to use R for Secret Santa, including automating the process of drawing names and emailing players, see this blog post.

If you haven’t done your family Secret Santa yet, you are almost late! (November 30, 2017)
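For the impatient, a minimal base R sketch of why “just shuffle the names” is trickier than it sounds: a plain shuffle can hand someone their own name. This is not Mahr's graph-based approach; it is simple rejection sampling that reshuffles until nobody draws themselves. The function name and the player names are assumptions.

```r
# Assign each player a recipient so that nobody gets themselves.
# Approach: shuffle, check, and reshuffle if anyone drew their own name.
draw_secret_santa <- function(players) {
  stopifnot(length(players) >= 2, !anyDuplicated(players))
  repeat {
    recipients <- sample(players)
    if (all(recipients != players)) break
  }
  data.frame(giver = players, recipient = recipients,
             stringsAsFactors = FALSE)
}

players <- c("Ann", "Bob", "Cat", "Dan")  # hypothetical family
draw <- draw_secret_santa(players)
print(draw)
```

Unlike the family rule, where only the unlucky player draws again, this sketch throws away the whole shuffle on any self-draw, which keeps every valid assignment equally likely.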


A Docker tutorial for reproducible research [Reproducible Reporting In The Future?]

Wednesday, November 15th, 2017

R Docker tutorial: A Docker tutorial for reproducible research.

From the webpage:

This is an introduction to Docker designed for participants with knowledge about R and RStudio. The introduction is intended to be helping people who need Docker for a project. We first explain what Docker is and why it is useful. Then we go into the the details on how to use it for a reproducible transportable project.

Six lessons, instructions for installing Docker, plus zip/tar ball of the materials. What more could you want?

Science has paid lip service to the idea of replication of results for centuries but with the sharing of data and analysis, reproducible research is becoming a reality.

Is reproducible reporting in the near future? Reporters preparing their analysis and releasing raw data and their extraction methods?

Or will selective releases of data, when raw data is released at all, continue to be the norm?

Please let @ICIJorg know how you feel about data hoarding, #ParadisePapers, #PanamaPapers, when data and code sharing are becoming the norm in science.

Data Munging with R (MEAP)

Monday, November 6th, 2017

Data Munging with R (MEAP) by Dr. Jonathan Carroll.

From the description:

Data Munging with R shows you how to take raw data and transform it for use in computations, tables, graphs, and more. Whether you already have some programming experience or you’re just a spreadsheet whiz looking for a more powerful data manipulation tool, this book will help you get started. You’ll discover the ins and outs of using the data-oriented R programming language and its many task-specific packages. With dozens of practical examples to follow, learn to fill in missing values, make predictions, and visualize data as graphs. By the time you’re done, you’ll be a master munger, with a robust, reproducible workflow and the skills to use data to strengthen your conclusions!

Five (5) out of eleven (11) parts available now under the Manning Early Access Program (MEAP). Chapter one, Introducing Data and the R Language, is free.

Even though everyone writes books from front to back (or at least claims to), it would be nice to see a free “advanced” chapter every now and again. There’s not much you can say about an introductory chapter other than it’s an introductory chapter. That’s no different here.

I suspect you will get a better idea about Dr. Carroll’s writing from his blog, Irregularly Scheduled Programming or by following him on Twitter: @carroll_jono.

A cRyptic crossword with an R twist

Friday, October 13th, 2017

A cRyptic crossword with an R twist

From the post:

Last week’s R-themed crossword from R-Ladies DC was popular, so here’s another R-related crossword, this time by Barry Rowlingson and published on page 39 of the June 2003 issue of R-news (now known as the R Journal). Unlike the last crossword, this one follows the conventions of a British cryptic crossword: the grid is symmetrical, and eschews 4×4 blocks of white or black squares. Most importantly, the clues are in the cryptic style: rather than being a direct definition, cryptic clues pair wordplay (homonyms, anagrams, etc) with a hidden definition. (Wikipedia has a good introduction to the types of clues you’re likely to find.) Cryptic crosswords can be frustrating for the uninitiated, but are fun and rewarding once you get to into it.

In fact, if you’re unfamiliar with cryptic crosswords, this one is a great place to start. Not only are many (but not all) of the answers related in some way to R, Barry has helpfully provided the answers along with an explanation of how the cryptic clue was formed. There’s no shame in peeking, at least for a few, to help you get your legs with the cryptic style.

Another R crossword for your weekend enjoyment!


A cRossword about R [Alternative to the NYTimes Sunday Crossword Puzzle]

Friday, October 6th, 2017

A cRossword about R by David Smith.

From the post:

The members of the R Ladies DC user group put together an R-themed crossword for a recent networking event. It’s a fun way to test out your R knowledge. (Click to enlarge, or download a printable version here.)

Maybe not a complete alternative to the NYTimes Sunday Crossword Puzzle but R enthusiasts will enjoy it.

I suspect the exercise of writing a crossword puzzle is a greater learning experience than solving it.


Exploratory Data Analysis of Tropical Storms in R

Tuesday, September 26th, 2017

Exploratory Data Analysis of Tropical Storms in R by Scott Stoltzman.

From the post:

The disastrous impact of recent hurricanes, Harvey and Irma, generated a large influx of data within the online community. I was curious about the history of hurricanes and tropical storms so I found a data set on and started some basic Exploratory data analysis (EDA).

EDA is crucial to starting any project. Through EDA you can start to identify errors & inconsistencies in your data, find interesting patterns, see correlations and start to develop hypotheses to test. For most people, basic spreadsheets and charts are handy and provide a great place to start. They are an easy-to-use method to manipulate and visualize your data quickly. Data scientists may cringe at the idea of using a graphical user interface (GUI) to kick-off the EDA process but those tools are very effective and efficient when used properly. However, if you’re reading this, you’re probably trying to take EDA to the next level. The best way to learn is to get your hands dirty, let’s get started.

The original source of the data was can be found at

Great walk through on exploratory data analysis.
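To give a flavor of the first EDA steps the post describes (spotting errors, looking at distributions, checking correlations), here is a tiny base R sketch. The data frame is entirely made up, including the storm names and the deliberate data-entry error; it is not the data set Stoltzman uses.

```r
# Made-up storm records; the negative wind speed is a planted error.
storms <- data.frame(
  name = c("Alpha", "Beta", "Gamma", "Delta", "Epsilon"),
  year = c(2015, 2015, 2016, 2017, 2017),
  max_wind_kt = c(45, 90, 120, -1, 65),
  min_pressure_mb = c(1000, 965, 930, 990, 985),
  stringsAsFactors = FALSE
)

# Step 1: summaries often expose impossible values (here, a minimum of -1).
print(summary(storms$max_wind_kt))

# Step 2: isolate the suspect rows and set them aside.
suspect <- storms[storms$max_wind_kt < 0, ]
clean <- storms[storms$max_wind_kt >= 0, ]

# Step 3: simple counts and a correlation to suggest hypotheses.
print(table(clean$year))
wind_pressure_cor <- cor(clean$max_wind_kt, clean$min_pressure_mb)
print(wind_pressure_cor)
```

In this toy data stronger winds go with lower central pressure, the kind of pattern EDA surfaces before any formal modeling.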

Everyone talks about the weather but did you know there is a forty (40) year climate lag between cause and effect?

The human impact on the environment today, won’t be felt for another forty (40) years.

Can you predict the impact of a hurricane in 2057?

Some other data/analysis resources on hurricanes: Climate Prediction Center, Hurricane Forecast Computer Models, National Hurricane Center.

PS: Is a Category 6 Hurricane Possible? by Brian Donegan is an interesting discussion on going beyond category 5 for hurricanes. For reference on speeds, see: Fujita Scale (tornadoes).


Monday, September 18th, 2017

RStartHere by Garrett Grolemund.

R packages organized by their role in data science:

This is very cool! Use and share!

@rstudio Cheatsheets Now B&W Printer Friendly

Saturday, September 9th, 2017

Mara Averick, @dataandme, tweets:

All the @rstudio Cheatsheets have been B&W printer-friendlier-ized

It’s a small thing but appreciated when documentation is B&W friendly.

PS: The @rstudio cheatsheets are also good examples of layout and clarity.

.Rddj (data journalism with R)

Wednesday, June 21st, 2017

.Rddj Hand-curated, high quality resources for doing data journalism with R by Timo Grossenbacher.

From the webpage:

The R Project is a great software environment for doing all sorts of data-driven journalism. It can be used for any of the stages of a typical data project: data collection, cleaning, analysis and even (interactive) visualization. And it’s all reproducible and transparent! Sure, it requires a fair amount of scripting, yet…

Do not fear! With this hand-curated (and opinionated) list of resources, you will be guided through the thick jungle of countless R packages, from learning the basics of R’s syntax, to scraping HTML tables, to a guide on how to make your work comprehensible and reproducible.

Now, enjoy your journey.

Some more efforts at persuasion: As I work in the media, I know how a lot of journalists are turned off by everything that doesn’t have a graphical interface with buttons to click on. However, you don’t need to spend days studying programming concepts in order to get started with R, as most operations can be carried out without applying scary things such as loops or conditionals – and, nowadays, high-level abstrations like dplyr make working with data a breeze. My advice if you’re new to data journalism or data processing in general: Better learn R than Excel, ’cause getting to know Excel (and the countless other tools that each do a single thing) doesn’t come for free, either.

This list is (partially) inspired by R for Journalists by Ed Borasky, which is another great resource for getting to know R.

… (emphasis in original)

The topics are familiar:

  • RStudio
  • Syntax and basic R programming
  • Collecting Data (from the Web)
  • Data cleaning and manipulation
  • Text mining / natural language processing
  • Exploratory data analysis and plotting
  • Interactive data visualization
  • Publication-quality graphics
  • Reproducibility
  • Examples of using R in (data) journalism

    What makes this list of resources different from search results?

    Hand curation.

    How much of a difference?

    Compare the search results of “R” + any of these categories to the resources here.

    Bookmark .Rddj for data journalism and R, then ping me with the hand curated list of resources you are creating.

    Save yourself and the rest of us from search. Thanks!

    Copy-n-Paste Security Alert!

    Wednesday, June 7th, 2017

    Security: The Dangers Of Copying And Pasting R Code.

    From the post:

    Most of the time when we stumble across a code snippet online, we often blindly copy and paste it into the R console. I suspect almost everyone does this. After all, what’s the harm?

    The post illustrates how innocent appearing R code can conceal unhappy surprises!

    Concealment isn’t limited to R code.

    Any CSS controlled display is capable of concealing code for you to copy-n-paste into a console, terminal window, script or program.

    Endless possibilities for HTML pages/emails with code + a “little something extra.”
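A minimal sketch of the problem, in Python with only the standard library: it parses an HTML fragment and separates the text a reader would see from text hidden by inline CSS, which a copy operation may still pick up. The sample HTML, the class name `VisibleTextSplitter`, and the short list of hiding styles are all assumptions made for illustration; real pages can hide content in many other ways (external stylesheets, off-screen positioning, zero opacity), so this is not a reliable detector.

```python
from html.parser import HTMLParser

# Inline-style fragments treated as "hidden" in this sketch (an assumption;
# real pages can also hide text via stylesheets or off-screen positioning).
HIDING_STYLES = ("display:none", "visibility:hidden", "font-size:0")

# Elements that never have a closing tag, so they must not affect nesting.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class VisibleTextSplitter(HTMLParser):
    """Collect displayed text separately from text hidden by inline CSS."""

    def __init__(self):
        super().__init__()
        self.hidden_stack = []  # one bool per open element: is it hidden?
        self.visible = []
        self.hidden = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        hides = any(marker in style for marker in HIDING_STYLES)
        # A child of a hidden element is hidden too.
        inherited = bool(self.hidden_stack) and self.hidden_stack[-1]
        self.hidden_stack.append(hides or inherited)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.hidden_stack:
            self.hidden_stack.pop()

    def handle_data(self, data):
        if not data.strip():
            return
        in_hidden = bool(self.hidden_stack) and self.hidden_stack[-1]
        (self.hidden if in_hidden else self.visible).append(data.strip())


# Hypothetical snippet: the reader sees a harmless-looking command, but the
# selection also contains a span that is not displayed.
page = (
    '<p>print("hello")'
    '<span style="display: none">; system("echo surprise")</span></p>'
)

splitter = VisibleTextSplitter()
splitter.feed(page)
print("displayed:", splitter.visible)
print("hidden:   ", splitter.hidden)
```

Pasting into a plain-text editor first, and reading what actually arrived, serves the same purpose without any code.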

    What are your copy-n-paste practices?

    Network analysis of Game of Thrones family ties [A Timeless Network?]

    Monday, May 15th, 2017

    Network analysis of Game of Thrones family ties by Shirin Glander.

    From the post:

    In this post, I am exploring network analysis techniques in a family network of major characters from Game of Thrones.

    Not surprisingly, we learn that House Stark (specifically Ned and Sansa) and House Lannister (especially Tyrion) are the most important family connections in Game of Thrones; they also connect many of the storylines and are central parts of the narrative.

    The basis for this network is Kaggle’s Game of Throne dataset (character-deaths.csv). Because most family relationships were missing in that dataset, I added the missing information in part by hand (based on A Wiki of Ice and Fire) and by scraping information from the Game of Thrones wiki. You can find the full code for how I generated the network on my Github page.

    Glander improves network data for the Game of Thrones and walks you through the use of R to analyze that network.

    It’s useful work and will repay close study.

    Network analysis can be used with all social groups: activists, bankers, hackers, members of Congress (U.S.), terrorists, etc.

    But just as Ned Stark has no relationship with dire wolves when the story begins, networks of social groups develop, change, evolve if you will, over time.

    Moreover, events, interactions, involving one or more members of the network, occur in time sequence. A social network that fails to capture those events and their sequencing, from one or more points of view, is a highly constrained network.

    A useful network, as Glander demonstrates, but one that cannot answer simple questions about the order in which characters learned that a particular character hurled another character from a very high window.
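To make the point about time sequencing concrete, here is a minimal sketch (in Python, standard library only) of a network whose edges carry timestamps. The characters `A` to `D`, the event times, and the function name `knowledge_order` are hypothetical placeholders, not data from Glander's network or from the story.

```python
from typing import Dict, List, Tuple

# Each event is (time, source, target): at `time`, `source` interacts with
# `target` and passes on whatever `source` knows. All names and times here
# are hypothetical placeholders, not data from the Game of Thrones network.
Event = Tuple[int, str, str]


def knowledge_order(events: List[Event], origin: str) -> Dict[str, int]:
    """Return the time at which each character first learns a fact.

    The fact starts with `origin` at time 0 and spreads only along events
    processed in time order, so an interaction that happens before the
    source knows the fact passes nothing on.
    """
    learned = {origin: 0}
    for time, source, target in sorted(events):
        if source in learned and target not in learned:
            learned[target] = time
    return learned


events = [
    (3, "B", "C"),  # B tells C at time 3
    (1, "A", "B"),  # A tells B at time 1
    (2, "C", "D"),  # C talks to D at time 2, before C knows anything
    (5, "C", "D"),  # C talks to D again at time 5
]

order = knowledge_order(events, origin="A")
for name, time in sorted(order.items(), key=lambda item: item[1]):
    print(f"{name} learned at t={time}")
```

A static network built from the same four edges would show C connected to D, but could not reveal that their contact at time 2 came too early to carry the news.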

    If I were investigating say a leak of NSA cybertools, time sequencing like that would be one of my top priorities.


    R Weekly – Update

    Friday, February 24th, 2017

    R Weekly

    A community based aggregation resource on R.

    Seventy-two (72) links plus R project updates in R Weekly 2017 Issue 8.

    Great way to stay up on R resources and get a sense for the R community.


    PS: The first post of R Weekly that I reviewed had 6 links. R Weekly [Another Word for It post]

    “Tidying” Up Jane Austen (R)

    Thursday, February 16th, 2017

    Text Mining the Tidy Way by Julia Silge.

    Thanks to Julia’s presentation I now know there is an R package with all of Jane Austen’s novels ready for text analysis.

    OK, Austen may not be at the top of your reading list, but the Tidy techniques Julia demonstrates are applicable to a wide range of textual data.

    Among those mentioned in the presentation, NASA datasets!

    Julia, along with Dave Robinson, wrote: Text Mining with R: A Tidy Approach, available online now and later this year from O’Reilly.

    Predicting Police Cellphone Locations – Weaponizing Open Data

    Wednesday, February 8th, 2017

    Predicting And Mapping Arrest Types in San Francisco with LightGBM, R, ggplot2 by Max Woolf.

    Max does a great job of using open data from SF OpenData to predict arrest types in San Francisco.

    It takes only a small step to realize that Max is also predicting the locations of police officers and their cellphones.

    Without police officers, you aren’t going to have many arrests. 😉

    Anyone operating a cellphone surveillance device can use Max’s predictions to gather data from police cellphones and other electronic gear. For particular police officers, for particular types of arrests, or at particular times of day, etc.

    From the post:

    The new hotness in the world of data science is neural networks, which form the basis of deep learning. But while everyone is obsessing about neural networks and how deep learning is magic and can solve any problem if you just stack enough layers, there have been many recent developments in the relatively nonmagical world of machine learning with boring CPUs.

    Years before neural networks were the Swiss army knife of data science, there were gradient-boosted machines/gradient-boosted trees. GBMs/GBTs are machine learning methods which are effective on many types of data, and do not require the traditional model assumptions of linear/logistic regression models. Wikipedia has a good article on the advantages of decision tree learning, and visual diagrams of the architecture:

    GBMs, as implemented in the Python package scikit-learn, are extremely popular in Kaggle machine learning competitions. But scikit-learn is relatively old, and new technologies have emerged which implement GBMs/GBTs on large datasets with massive parallelization and in-memory computation. A popular big data machine learning library, H2O, has a famous GBM implementation which, per benchmarks, is over 10x faster than scikit-learn and is optimized for datasets with millions of records. But even faster than H2O is xgboost, which can hit 5x-10x speed-ups relative to H2O, depending on the dataset size.

    Enter LightGBM, a new (October 2016) open-source machine learning framework by Microsoft which, per benchmarks on release, was up to 4x faster than xgboost! (xgboost very recently implemented a technique also used in LightGBM, which reduced the relative speedup to just ~2x). As a result, LightGBM allows for very efficient model building on large datasets without requiring cloud computing or nVidia CUDA GPUs.

    A year ago, I wrote an analysis of the types of police arrests in San Francisco, using data from the SF OpenData initiative, with a followup article analyzing the locations of these arrests. Months later, the same source dataset was used for a Kaggle competition. Why not give the dataset another look and test LightGBM out?
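For readers who have not met gradient boosting before, the idea behind the GBMs/GBTs mentioned in the quoted post can be sketched in a few lines of plain Python. This is a toy for squared-error regression on a single feature, using one-split trees ("stumps"). It is not how LightGBM, xgboost, H2O or scikit-learn implement boosting, and the data, function names, round count and learning rate are made up for illustration.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Find the single split on x that best fits the residuals (least squares).

    Returns (threshold, left_value, right_value). Assumes xs contains at
    least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value, right_value = mean(left), mean(right)
        error = sum((r - left_value) ** 2 for r in left) + sum(
            (r - right_value) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def stump_value(stump, x):
    """Evaluate one stump at x."""
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, rounds=50, learning_rate=0.1):
    """Gradient boosting for squared loss with one-split trees (stumps)."""
    base = mean(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        # For squared loss, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        # Shrink each stump's contribution by the learning rate.
        stump = (threshold, learning_rate * left_value, learning_rate * right_value)
        stumps.append(stump)
        predictions = [p + stump_value(stump, x) for x, p in zip(xs, predictions)]
    return base, stumps


def predict(model, x):
    """Sum the base prediction and every stump's (already shrunk) value."""
    base, stumps = model
    return base + sum(stump_value(stump, x) for stump in stumps)


def mse(model, xs, ys):
    """Mean squared error of the model on the given data."""
    return mean((predict(model, x) - y) ** 2 for x, y in zip(xs, ys))


# Made-up toy data: a step-like relationship that one straight line fits badly.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

baseline = (mean(ys), [])  # zero boosting rounds: always predict the mean
model = boost(xs, ys)
print("MSE, mean only:", round(mse(baseline, xs, ys), 4))
print("MSE, 50 rounds:", round(mse(model, xs, ys), 4))
```

Each round fits a small tree to what the current model still gets wrong and adds a shrunken copy of it, which is why no linear or logistic model assumptions are needed. The libraries named in the post differ mainly in how quickly they find the splits on large datasets.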

    Cellphone data gathered as a result of Max’s predictions can be tested against arrest and other police records to establish the presence and/or absence of particular police officers at a crime scene.

    After a police officer corroborates the presence of a gun in a suspect’s hand, cellphone evidence that they were blocks away, in the presence of other police officers, could prove to be inconvenient.

    A Data Driven Exploration of Kung Fu Films

    Tuesday, January 24th, 2017

    A Data Driven Exploration of Kung Fu Films by Jim Vallandingham.

    From the post:

    Recently, I’ve been a bit caught up in old Kung Fu movies. Shorting any technical explorations, I have instead been diving head-first into any and all Netflix accessible martial arts masterpieces from the 70’s and 80’s.

    While I’ve definitely been enjoying the films, I realized recently that I had little context for the movies I was watching. I wondered if some films, like our latest favorite, Executioners from Shaolin, could be enjoyed even more, with better understanding of the context in which these films exist in the Kung Fu universe.

    So, I began a data driven quest for truth and understanding (or at least a semi-interesting dataset to explore) of all Shaw Brothers Kung Fu movies ever made!

    If you’re not familiar with the genre, here is a three-minute final fight collage from YouTube:

    When I saw the title, I was hopeful that Jim had captured the choreography of the movies for comparison.

    No such luck! 😉

    That would be an extremely difficult and labor intensive task.

    Just in case you are curious, there is a Dance Notation Bureau with extensive resources should you decide to capture one or more Kung Fu films in notation.

    Or try Notation Reloaded: eXtensible Dance Scripting Notation by Matthew Gough.

    A search using “xml dance notation” produces a number of interesting resources.

    Three More Reasons To Learn R

    Friday, January 6th, 2017

    Three reasons to learn R today by David Smith.

    From the post:

    If you're just getting started with data science, the Sharp Sight Labs blog argues that R is the best data science language to learn today.

    The blog post gives several detailed reasons, but the main arguments are:

    1. R is an extremely popular (arguably the most popular) data programming language, and ranks highly in several popularity surveys.
    2. Learning R is a great way of learning data science, with many R-based books and resources for probability, frequentist and Bayesian statistics, data visualization, machine learning and more.
    3. Python is another excellent language for data science, but with R it's easier to learn the foundations.

    Once you've learned the basics, Sharp Sight also argues that R is a great data science language to master, even though it's an old language compared to some of the newer alternatives. Every tool has a shelf life, but R isn't going anywhere and learning R gives you a foundation beyond the language itself.

    If you want to get started with R, Sharp Sight Labs offers a data science crash course. You might also want to check out the Introduction to R for Data Science course on EdX.

    Sharp Sight Labs: Why R is the best data science language to learn today, and Why you should master R (even if it might eventually become obsolete)

    If you need more reasons to learn R:

    • Unlike Facebook, R isn’t a sinkhole of non-testable propositions.
    • Unlike Instagram, R is rarely NSFW.
    • Unlike Twitter, R is a marketable skill.

    Glad to hear you are learning R!

    How to weigh a dog with a ruler? [Or Price a US Representative?]

    Wednesday, December 14th, 2016

    How to weigh a dog with a ruler? (looking for translators)

    From the post:

    We are working on a series of comic books that introduce statistical thinking and could be used as activity booklets in primary schools. Stories are built around adventures of siblings: Beta (skilled mathematician) and Bit (data hacker).

    What is the connection between these comic books and R? All plots are created with ggplot2.

    The first story (How to weigh a dog with a ruler?) is translated to English, Polish and Czech. If you would like to help us to translate this story to your native language, just write to me (przemyslaw.biecek at gmail) or create an issue on GitHub. It’s just 8 pages long, translations are available on Creative Commons BY-ND licence.

    The key is to chart animals by their height as against their weight.
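As a rough sketch of that idea, the following Python fits a straight line to height and weight measurements and then reads a weight off the line for a newly measured height. The straight-line assumption and the numbers are made up for illustration; they are not taken from the comic book, which may use a different relationship.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up illustrative measurements, not data from the comic book:
# heights in centimetres and weights in kilograms.
heights = [20, 40, 60]
weights = [5, 15, 25]

intercept, slope = fit_line(heights, weights)

# "Weigh" a new animal using only a ruler: measure its height, then read
# off the weight the fitted line predicts.
new_height = 50
estimated_weight = intercept + slope * new_height
print(f"estimated weight at {new_height} cm: {estimated_weight:.1f} kg")
```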

    Pricing US Representatives is likely to follow a similar relationship, where their price goes up with years of service in Congress.

    I haven’t run the data but such a chart would keep “people” (includes corporations in the US) from paying too much or offering too little. To the embarrassment of all concerned.

    Identifying Speech/News Writers

    Friday, December 2nd, 2016

    David Smith’s post: Stylometry: Identifying authors of texts using R details the use of R to distinguish tweets by president-elect Donald Trump from those by his campaign staff. (Hmmm, sharing a Twitter account password, there’s bad security for you.)

    The same techniques may distinguish texts delivered “live” versus those “inserted” into the Congressional Record.

    What other texts are ripe for distinguishing authors?

    From the post:

    Few people expect politicians to write every word they utter themselves; reliance on speechwriters and spokespersons is a long-established political practice. Still, it's interesting to know which statements are truly the politician's own words, and which are driven primarily by advisors or influencers.

    Recently, David Robinson established a way of figuring out which tweets from Donald Trump's Twitter account came from him personally, as opposed to from campaign staff, which he verified by comparing the sentiment of tweets from Android vs iPhone devices. Now, Ali Arsalan Kazmi has used stylometric analysis to investigate the provenance of speeches by the Prime Minister of Pakistan
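A very small sketch of one stylometric idea, in Python with only the standard library: represent each text by how often it uses a handful of function words, then compare texts by cosine similarity. This is not the method used in either analysis mentioned in the quoted post. The word list, the writer labels and the sample sentences are invented for illustration, and real studies use far more text and richer features.

```python
import math
import re
from collections import Counter

# A small, arbitrary set of function words; real stylometric studies use
# much longer lists and richer features. This list is an assumption.
FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in", "that", "i", "we", "very"]


def profile(text):
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Invented sample texts for two hypothetical writers and one disputed text.
known = {
    "writer_one": "I think that we did a very good job and I am very proud.",
    "writer_two": "The report of the committee sets out the terms of the plan.",
}
disputed = "I am very sure that we will win and I will be very happy."

target = profile(disputed)
scores = {name: cosine(profile(text), target) for name, text in known.items()}
best_match = max(scores, key=scores.get)
print(scores)
print("closest style:", best_match)
```

With texts this short the result proves nothing about authorship; the point is only that style can be turned into numbers and compared.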

    A small amount of transparency can go a long way.

    Email archives anyone?