Archive for the ‘R’ Category

War and Peace & R

Friday, December 2nd, 2016

No, not a post about R versus Python but about R and Tolstoy’s War and Peace.

Using R to Gain Insights into the Emotional Journeys in War and Peace by Wee Hyong Tok.

From the post:

How do you read a novel in record time, and gain insights into the emotional journey of main characters, as they go through various trials and tribulations, as an exciting story unfolds from chapter to chapter?

I remember my experiences when I start reading a novel: I get intrigued by the story and simply cannot wait to get to the last chapter. I also recall many conversations with friends about interesting novels I read a while back, where somehow I have only a vague recollection of what happened in a specific chapter. In this post, I’ll work through how we can use R to analyze the English translation of War and Peace.

War and Peace is a novel by Leo Tolstoy that captures the salient points of Russian history from 1805 to 1812. It follows the stories of five families and the trials and tribulations of various characters (e.g. Natasha and Andrei). At roughly 1,400 pages, it is one of the longest novels ever written.

We hypothesize that if we can build a dashboard (shown below), this will allow us to gain insights into the emotional journey undertaken by the characters in War and Peace.

Impressive work, even though I would not use it as a short-cut to “read a novel in record time.”

Rather I take this as an alternative way of reading War and Peace, one that can capture insights a casual reader may miss.

Moreover, the techniques demonstrated here could be used with other works of literature, or even non-fictional works.

Imagine conducting this analysis over the full CIA Torture Report, reportedly more than 7,000 pages, for example.

A heatmap does not connect any dots, but points a user towards places where interesting dots may be found.

Certainly a tool for exploring large releases/leaks of text data.
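If you want to try something along these lines yourself, here is a minimal sketch (my own illustration, not Wee Hyong Tok’s code; the file name and the crude chapter split are assumptions) using the syuzhet and ggplot2 packages:

# Chapter-level sentiment for a plain-text novel, drawn as a one-row heatmap.
# "war_and_peace.txt" and the "CHAPTER" split are hypothetical and crude.
library(syuzhet)
library(ggplot2)

text     <- get_text_as_string("war_and_peace.txt")
chapters <- strsplit(text, "CHAPTER", fixed = TRUE)[[1]]
scores   <- vapply(chapters, function(ch) {
  mean(get_sentiment(get_sentences(ch), method = "syuzhet"))
}, numeric(1))

df <- data.frame(chapter = seq_along(scores), sentiment = scores)
ggplot(df, aes(x = chapter, y = 1, fill = sentiment)) +
  geom_tile() +
  scale_fill_gradient2(low = "red", mid = "white", high = "blue") +
  labs(x = "Chapter", y = NULL, title = "Mean sentiment by chapter")

A real analysis would split chapters more carefully and track individual characters, as the post does.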

Enjoy!

PS: Know of any large, tiresome, obscure-on-purpose government reports to practice this method on?

Learning R programming by reading books: A book list

Thursday, November 24th, 2016

Learning R programming by reading books: A book list by Liang-Cheng Zhang.

From the post:

Despite R’s popularity, it is still very daunting to learn R, as R has no click-and-point features like SPSS and learning R usually takes lots of time. No worries! As self-taught R learners ourselves, we constantly receive requests about how to learn R. Besides hiring someone to teach you or paying tuition fees for online courses, our suggestion is that you can also pick up some books that fit your current R programming level. Therefore, in this post, we would like to share some good books that teach you how to program in R at three levels: elementary, intermediate, and advanced. Each level focuses on one task, so you will know whether these books fit your needs. While the following books do not necessarily focus on the task we define, you should focus on that task while reading them so you are not lost along the way.

Books and reading form the core of my most basic prejudice: Literacy is the doorway to unlimited universes.

A prejudice so strong that I have to work hard at realizing non-literates live in and sense worlds not open to literates. Not less complex, not poorer, just different.

But book lists in particular appeal to that prejudice and since my blog is read by literates, I’m indulging that prejudice now.

I do have a title to add to the list: Practical Data Science with R by Nina Zumel and John Mount.

Judging from the other titles listed, Practical Data Science with R falls in the intermediate range. Should not be your first R book but certainly high on the list for your second R book.

Avoid the rush! Start working on your Amazon wish list today! 😉

How to get started with Data Science using R

Sunday, November 20th, 2016

How to get started with Data Science using R by Karthik Bharadwaj.

From the post:

R, the lingua franca of data science, is one of the most popular language choices for learning data science. Once the choice is made, beginners often find themselves lost in working out a learning path and end up facing a signboard like the one below.

In this blog post I would like to lay out a clear structural approach to learning R for data science. This will help you to quickly get started in your data science journey with R.

You won’t find anything you don’t already know but this is a great short post to pass onto others.

Point out that R skills will help them expose and/or conceal government corruption.

The new Tesseract package: High Quality OCR in R

Thursday, November 17th, 2016

The new Tesseract package: High Quality OCR in R by Jeroen Ooms.

From the post:

Optical character recognition (OCR) is the process of extracting written or typed text from images such as photos and scanned documents into machine-encoded text. The new rOpenSci package tesseract brings one of the best open-source OCR engines to R. This enables researchers or journalists, for example, to search and analyze vast numbers of documents that are only available in printed form.

People looking to extract text and metadata from pdf files in R should try our pdftools package.

Reading too quickly, at first I thought I had missed a new version of Tesseract (tesseract-ocr on GitHub), an OCR program that I use on a semi-regular basis.

Reading a little slower, ;-), I discovered Ooms is describing a new package for R, which uses Tesseract for OCR.

This is great news but be aware that Tesseract (whether called by an R package or standalone) can generate a large amount of output in a fairly short period of time.

One of the stumbling blocks of OCR is the labor intensive process of cleaning up the inevitable mistakes.

Depending on how critical accuracy is for searching, for example, you may choose to verify and clean only quotes for use in other publications.

Best to make those decisions up front and not be faced with a mountain of output that isn’t useful unless and until it has been corrected.
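If you want to kick the tires first, a minimal sketch (the file names here are hypothetical) looks like this:

# OCR an image with the rOpenSci tesseract package and pull text out of a PDF
# with pdftools; "scan.png" and "report.pdf" are hypothetical file names.
library(tesseract)
library(pdftools)

text_from_image <- ocr("scan.png")    # returns a character string of recognized text
cat(text_from_image)

pages <- pdf_text("report.pdf")       # one element per page
cat(pages[1])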

Useful Listicle: The 5 most downloaded R packages

Tuesday, November 15th, 2016

The 5 most downloaded R packages

From the post:

Curious which R packages your colleagues and the rest of the R community are using? Thanks to Rdocumentation.org you can now see for yourself! Rdocumentation.org aggregates R documentation and download information from popular repositories like CRAN, BioConductor and GitHub. In this post, we’ll take a look at the top 5 R packages with the most direct downloads!

Sorry! No spoiler!

Do check out:

Rdocumentation.org aggregates help documentation for R packages from CRAN, BioConductor, and GitHub – the three most common sources of current R documentation. RDocumentation.org goes beyond simply aggregating this information, however, by bringing all of this documentation to your fingertips via the RDocumentation package. The RDocumentation package overwrites the basic help functions from the utils package and gives you access to RDocumentation.org from the comfort of your RStudio IDE. Look up the newest and most popular R packages, search through documentation and post community examples.
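A quick sketch of what that looks like in practice (the install command assumes the package is on CRAN; check RDocumentation.org for the current install instructions):

# Install the RDocumentation package and let it take over the standard help calls.
install.packages("RDocumentation")   # assumption: available from CRAN
library(RDocumentation)              # masks utils::help and ?, routing them to RDocumentation.org

?lm                                  # now opens the RDocumentation.org page for lm
help(package = "ggplot2")            # browse a package's aggregated documentation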

As they say:

Create an RDocumentation account today!

I’m always sympathetic to documentation but more so today because I have wasted hours over the past two or three days on issues that could have been trivially documented.

I will be posting “corrected” documentation later this week.

PS: If you have or suspect you have poorly written documentation, I have some time available for paid improvement of the same.

We Should Feel Safer Than We Do

Tuesday, November 8th, 2016

We Should Feel Safer Than We Do by Christian Holmes.

Christian’s Background and Research Goals:

Background

Crime is a divisive and important issue in the United States. It is routinely ranked among the most important issues to voters, and many politicians have built their careers around their perceived ability to reduce crime. Over 70% of Americans believe that crime is increasing, according to a recent Gallup poll, but is that really the case? I seek to answer this question in this post, as well as determine if there is any clear correlation between government spending and crime.

Research Goals

-Is crime increasing or decreasing in this country?
-Is there a clear link between government spending and crime?

provide an interesting contrast with his conclusions:

From the crime data, it is abundantly clear that crime is on the decline, and has been for around 20 years. The reasons behind this decrease are quite nuanced, though, and I found no clear link between either increased education or police spending and decreasing crime rates. This does not mean that such a relationship does not exist. Rather, it merely means that there is no obvious correlation between the two variables over this specific time frame.

In his background, Christian says:

Over 70% of Americans believe that crime is increasing, according to a recent Gallup poll, but is that really the case? I seek to answer this question in this post,…

Christian presumes, without proof, a relationship between: public beliefs about crime rates (rising or falling) and crime rates as recorded by government agencies.

Which also presumes:

  1. The public is aware that government collects crime statistics.
  2. The public is aware of current crime statistics.
  3. Current crime statistics influence public beliefs about the incidence of crime.

If the central focus of the paper is a comparison of “crime rates” as measured by government with other data on government spending, why even mention the disparity between public “belief” about crime and crime statistics?

I suspect, just as a rhetorical move, Christian is attempting to draw a favorable inference for his “evidence” by contrasting it with “public belief.” “Public belief” that is contrary to the “evidence” in this instance.

Christian doesn’t offer us any basis for judgments about public opinion on crime one way or the other. Any number of factors could be influencing public opinion on that issue, the crime rate as measured by government being only one of those.

The violent crime rate may be very low, statistically speaking, but if you are the victim of a violent crime, from your perspective crime is very prevalent.

Of R and Relationships

Christian uses R to compare crime data with government spending on education and policing.

The unhappy result is that no relationship is evidenced between government spending and a reduction in crime so Christian cautions:

…This does not mean that such a relationship does not exist. Rather, it merely means that there is no obvious correlation between the two variables over this specific time frame….

This is where we switch from relying on data to exploring the realm of “the data didn’t prove I was wrong.”

Since it isn’t possible to prove the absence of a relationship between the “crime rate” and government spending on education/police, no, the evidence didn’t prove Christian to be wrong.

On the other hand, it clearly shows that Christian has no evidence for that “relationship.”

The caution here is that using R and “reliable” data may lead to conclusions you would rather avoid.
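For what it’s worth, “no obvious correlation” is the kind of claim you can sanity-check in a couple of lines. A sketch with made-up numbers (not Christian’s data):

# Hypothetical annual figures: police spending ($bn) and violent crime rate (per 100k).
spending   <- c(90, 95, 101, 104, 110, 118, 121, 130)
crime_rate <- c(748, 760, 702, 731, 680, 711, 662, 690)

cor.test(spending, crime_rate)   # Pearson correlation, confidence interval, p-value
plot(spending, crime_rate)       # always look at the scatterplot too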

PS: Crime and the public’s fear of crime are both extremely complex issues. Aggregate data can justify previously chosen positions, but little more.

How the Ghana Floods animation was created [Animating Your Local Flood Data With R]

Monday, November 7th, 2016

How the Ghana Floods animation was created by David Quartey.

From the post:

Ghana has always been getting flooded, but it seems that only floods in Accra get attention. I wrote about it here, and the key visualization was an animated map, built in R, showing the floods in Ghana. In this post, I document how I did it; hopefully you can build one too!

David’s focus is on Ghana but the same techniques work for data of greater local interest.
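David walks through his own code in the post; for a flavor of the general technique, here is a bare-bones sketch (not his code — the coordinates and the package choice are my own) using ggplot2 and gganimate:

# Animate hypothetical flood locations over time with ggplot2 + gganimate.
library(ggplot2)
library(gganimate)

floods <- data.frame(
  lon  = c(-0.20, -0.19, -1.62),
  lat  = c( 5.56,  5.60,  6.69),
  year = c(2010, 2012, 2015)
)

p <- ggplot(floods, aes(lon, lat)) +
  geom_point(colour = "red", size = 3) +
  transition_states(year, transition_length = 0, state_length = 1) +
  ggtitle("Reported floods, {closest_state}")

animate(p)                 # render the frames
anim_save("floods.gif")    # write the animation to disk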

ggplot2 cheatsheet updated – other R spreadsheets

Wednesday, November 2nd, 2016

RStudio Cheat Sheets

I saw a tweet that the ggplot2 cheatsheet has been updated.

Here’s a list of all the cheatsheets available at RStudio:

  • R Markdown Cheat Sheet
  • RStudio IDE Cheat Sheet
  • Shiny Cheat Sheet
  • Data Visualization Cheat Sheet
  • Package Development Cheat Sheet
  • Data Wrangling Cheat Sheet
  • R Markdown Reference Guide

Contributed Cheatsheets

  • Base R
  • Advanced R
  • Regular Expressions
  • How big is your graph? (base R graphics)

I have deliberately omitted links because cheat sheets get updated: saved links break and/or serve up outdated information.

Use and reference the RStudio Cheat Sheets page.

Enjoy!

Data Science for Political and Social Phenomena [Special Interest Search Interface]

Sunday, October 23rd, 2016

Data Science for Political and Social Phenomena by Chris Albon.

From the webpage:

I am a data scientist and quantitative political scientist. I specialize in the technical and organizational aspects of applying data science to political and social issues.

Years ago I noticed a gap in the existing data literature. On one side was data science, with roots in mathematics and computer science. On the other side were the social sciences, with hard-earned expertise modeling and predicting complex human behavior. The motivation for this site and ongoing book project is to bridge that gap: to create a practical guide to applying data science to political and social phenomena.

Chris has organized three hundred and twenty-eight pages on Data Wrangling, Python, R, etc.

If you like learning from examples, this is the site for you!

Including this site, what other twelve (12) sites would you include in a Python/R Data Science search interface?

That is an interface that has indexed only that baker’s dozen of sites. So you don’t spend time wading through “the G that is not named” search results.

Serious question.

Not that I would want to maintain such a beast for external use, but having a local search engine tuned to your particular interests could be nice.

An introduction to data cleaning with R

Tuesday, October 4th, 2016

An introduction to data cleaning with R by Edwin de Jonge and Mark van der Loo.

Summary:

Data cleaning, or data preparation is an essential part of statistical analysis. In fact, in practice it is often more time-consuming than the statistical analysis itself. These lecture notes describe a range of techniques, implemented in the R statistical environment, that allow the reader to build data cleaning scripts for data suffering from a wide range of errors and inconsistencies, in textual format. These notes cover technical as well as subject-matter related aspects of data cleaning. Technical aspects include data reading, type conversion and string matching and manipulation. Subject-matter related aspects include topics like data checking, error localization and an introduction to imputation methods in R. References to relevant literature and R packages are provided throughout.

These lecture notes are based on a tutorial given by the authors at the useR!2013 conference in Albacete, Spain.

Pure gold!

Plus this tip (among others):

Tip. To become an R master, you must practice every day.

The more data you clean, the better you will become!
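A tiny example of the kind of cleaning the notes cover — type conversion plus string normalization — using made-up data:

# Made-up messy data: stray whitespace, mixed codes, and an unparseable age.
raw <- data.frame(
  age    = c(" 21", "42", "unknown", "35 "),
  gender = c("M", "male", "F", "Female"),
  stringsAsFactors = FALSE
)

clean <- within(raw, {
  age    <- as.numeric(trimws(age))                 # "unknown" becomes NA (with a warning)
  gender <- toupper(substr(trimws(gender), 1, 1))   # normalize to "M"/"F"
})

clean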

Enjoy!

ggplot2 2.2.0 coming soon! [Testers Needed!]

Friday, September 30th, 2016

ggplot2 2.2.0 coming soon! by Hadley Wickham.

From the post:

I’m planning to release ggplot2 2.2.0 in early November. In preparation, I’d like to announce that a release candidate is now available: version 2.1.0.9001. Please try it out, and file an issue on GitHub if you discover any problems. I hope we can find and fix any major issues before the official release.

Install the pre-release version with:

# install.packages("devtools")
devtools::install_github("hadley/ggplot2")

If you discover a major bug that breaks your plots, please file a minimal reprex, and then roll back to the released version with:

install.packages("ggplot2")

ggplot2 2.2.0 will be a relatively major release including:

The majority of this work was carried out by Thomas Lin Pedersen, who I was lucky to have as my “ggplot2 intern” this summer. Make sure to check out his other visualisation packages: ggraph, ggforce, and tweenr.

Just in case you are casual about time, tomorrow is October 1st. Which on most calendars means that “early November” isn’t far off.

Here’s an easy opportunity to test ggplot2 2.2.0 and related visualization packages before the official release.

Enjoy!

The Simpsons by the Data [South Park as well]

Thursday, September 29th, 2016

The Simpsons by the Data by Todd Schneider.

From the post:

The Simpsons needs no introduction. At 27 seasons and counting, it’s the longest-running scripted series in the history of American primetime television.

The show’s longevity, and the fact that it’s animated, provides a vast and relatively unchanging universe of characters to study. It’s easier for an animated show to scale to hundreds of recurring characters; without live-action actors to grow old or move on to other projects, the denizens of Springfield remain mostly unchanged from year to year.

As a fan of the show, I present a few short analyses about Springfield, from the show’s dialogue to its TV ratings. All code used for this post is available on GitHub.

Alert! You must run Flash in order to access Simpsons World, the source of Todd’s data.

Advice: Treat Flash as malware and run in a VM.

Todd covers the number of words spoken per character, gender imbalance, focus on characters, viewership, and episode summaries (tf-idf).

Other analysis awaits your imagination and interest.

BTW, if you want comedy data a bit closer to the edge, try Text Mining South Park by Kaylin Walker. Kaylin uses R for her analysis as well.

Other TV programs with R-powered analysis?

R Weekly

Monday, September 12th, 2016

R Weekly

A new weekly publication of R resources that began on 21 May 2016 with Issue 0.

Mostly titles of posts and news articles, which is useful, but not as useful as short summaries that include the author’s name would be.

Predicting American Politics

Saturday, September 3rd, 2016

Presidential Election Predictions 2016 (an ASA competition) by Jo Hardin.

From the post:

In this election year, the American Statistical Association (ASA) has put together a competition for students to predict the exact percentages for the winner of the 2016 presidential election. They are offering cash prizes for the entry that gets closest to the national vote percentage and that best predicts the winners for each state and the District of Columbia. For more details see:

http://thisisstatistics.org/electionprediction2016/

To get you started, I’ve written an analysis of data scraped from fivethirtyeight.com. The analysis uses weighted means and a formula for the standard error (SE) of a weighted mean. For your analysis, you might consider a similar analysis on the state data (what assumptions would you make for a new weight function?). Or you might try some kind of model – either a generalized linear model or a Bayesian analysis with an informed prior. The world is your oyster!
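If you want to play along without entering the contest, the core computation is short. A sketch with made-up polls (not Jo’s scraped data; the bootstrap is just one simple way to get an SE):

# Made-up national poll percentages and weights (e.g. by sample size / recency).
pct    <- c(44.1, 46.3, 45.0, 43.5)
weight <- c(0.9, 1.2, 0.7, 1.0)

wm <- weighted.mean(pct, weight)   # weighted point estimate

# A simple bootstrap standard error for the weighted mean.
boot <- replicate(2000, {
  i <- sample(seq_along(pct), replace = TRUE)
  weighted.mean(pct[i], weight[i])
})
c(weighted_mean = wm, se = sd(boot))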

Interesting contest but it is limited to high school and college students. Separate prizes, one for high school and one for college, $200.00 each. Oh, plus ASA memberships and a 2016 Election Prediction t-shirt.

For adults in the audience, strike up a prediction pool by state and/or for the nation.

The Next Generation R Documentation System [Dynamic R Documentation?]

Wednesday, August 31st, 2016

The R Documentation Task Force: The Next Generation R Documentation System by Joseph Rickert and Hadley Wickham.

From the post:

Andrew Redd received $10,000 to lead a new ISC working group, The R Documentation Task Force, which has a mission to design and build the next generation R documentation system. The task force will identify issues with documentation that currently exist, abstract the current Rd system into an R compatible structure, and extend this structure to include new considerations that were not concerns when the Rd system was first implemented. The goal of the project is to create a system that allows for documentation to exist as objects that can be manipulated inside R. This will make the process of creating R documentation much more flexible, enabling new capabilities such as porting documentation from other languages or creating inline comments. The new capabilities will add rigor to the documentation process and enable the system to operate more efficiently than any current methods allow. For more detail have a look at the R Documentation Task Force proposal (Full Text).

The task force team hopes to complete the new documentation system in time for the International R Users Conference, UseR! 2017, which begins July 4th 2017. If you are interested in participating in this task force, please contact Andrew Redd directly via email (andrew.redd@hsc.utah.edu). Outline your interest in the project, your experience with documentation, and any special skills you may have. The task force team is particularly interested in experience with documentation systems for languages other than R and C/C++.

OK, I have a weakness for documentation projects!

See the full proposal for all the details but:


There are two aspects of R documentation I intend to address which will make R an exemplary system for documentation.

The first aspect is storage. The mechanism of storing documentation in separate Rd files hinders the development process and ties documentation to the packaging system, and this need not be so. Life does not always follow the ideal; code and data are not always distributed via nice packages. Decoupling the documentation from the packaging system will allow for more dynamic and flexible documentation strategies, while also simplifying the process of transitioning to packages distributed through CRAN or other outlets.

The second aspect is flexibility of defining documentation. R is a language of flexibility and preference. There are many paths to the same outcome in R. While this has often been a source of confusion to new users of R, it is also one of R’s greatest strengths. With packages, flexibility has allowed for many contributions; some have fallen out of favor while others have proven superior. Adding flexibility in documentation methods will allow for newer, and ideally improved, methods to be developed.
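For readers who haven’t seen documentation kept next to the code, here is what inline documentation already looks like with roxygen2 (the Task Force’s goal goes further, making the resulting documentation an object you can manipulate from R):

# roxygen2 turns these comments into an Rd help page when the package is built.
#' Add two numbers.
#'
#' @param x A numeric vector.
#' @param y A numeric vector.
#' @return The element-wise sum of \code{x} and \code{y}.
#' @examples
#' add(1, 2)
add <- function(x, y) {
  x + y
}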

Have you seen the timeline?

  • Mid-August 2016 notification of approval.
  • September 1, 2016 Kickoff for the R Documentation Task Force with final members.
  • September 16, 2016 Deadline for submitting posts to the R-consortium blog, the R-announce, Rpackage-devel, and R-devel mailing lists, announcing the project.
  • September 1 through November 27th 2016 The task force conducts bi-weekly meetings via Lync to address issues in documentation.
  • November 27th, 2016 Deadline for preliminary recommendations of documentation extensions. Recommendations and conflicts written up and submitted to the R journal to be published in the December 2016 issue.
  • December 2016 Posts made to the R Consortium blog, and R mailing lists to coincide with the R Journal article to call for public participation.
  • January 27, 2017 Deadline for general comments on recommendations. Work begins to finalize new documentation system.
  • February 2017 Task force meets to finalize decisions after public input.
  • February-May 2017 Task force meets monthly as necessary to monitor progress on code development.
  • May 2017 Article is submitted outlining final recommendations and the subsequent tools developed to the R Journal for review targeting the June 2017 issue.
  • July 4-7, 2017 Developments will be presented at the International R users conference in Brussels, Belgium.

A very ambitious schedule, and one that leaves me wondering: if December 2016 is the first opportunity for public participation, will notes/discussions from the bi-weekly meetings be published before then?

Probably incorrect but I have the impression from the proposal that documentation is regarded as a contiguous mass of text. Yes?

I ask because the “…contiguous mass of text…” model for documentation is a very poor one.

Documentation can be presented to a user as though it were a “…contiguous mass of text…,” but, as I said, that is a very poor model for the documentation itself.

Imagine R documentation that automatically updates itself from R-Bloggers, for example, to include the latest tutorials on a package.

Or that updates to include new data sets, issued since the last documentation update.

Treating documentation as though it must be episodically static should have been abandoned years ago.

The use of R and R development are not static, why should its documentation be?

DataScience+ (R Tutorials)

Monday, August 29th, 2016

DataScience+

From the webpage:

We share R tutorials from scientists at academic and scientific institutions with the goal of giving everyone in the world access to free knowledge. Our tutorials cover different topics including statistics, data manipulation and visualization!

I encountered DataScience+ while running down David Kun’s RDBL post.

As of today, there are 120 tutorials with 451,129 reads.

That’s impressive! Whether you are looking for tutorials or you are looking to post your R tutorial where it will be appreciated.

Enjoy!

RDBL – manipulate data in-database with R code only

Monday, August 29th, 2016

RDBL – manipulate data in-database with R code only by David Kun.

From the post:

In this post I introduce our own package RDBL, the R DataBase Layer. With this package you can manipulate data in-database without writing SQL code. The package interprets the R code and sends out the corresponding SQL statements to the database, fully transparently. To minimize overhead, the data is only fetched when absolutely necessary, allowing the user to create the relevant joins (merge), filters (logical indexing) and groupings (aggregation) in R code before the SQL is run on the database. The core idea behind RDBL is to let R users with little or no SQL knowledge utilize the power of SQL database engines for data manipulation.

It is important to note that the SQL statements generated in the background are not executed unless explicitly requested by the command as.data.frame. Hence, you can merge, filter and aggregate your dataset on the database side and load only the result set into memory for R.

In general the design principle behind RDBL is to keep the models as close as possible to the usual data.frame logic, including (as shown later in detail) commands like aggregate, referencing columns with the $ operator and features like logical indexing using the [] operator.

RDBL supports a connection to any SQL-like data source which supports a DBI interface or an ODBC connection, including but not limited to Oracle, MySQL, SQLite, SQL Server, MS Access and more.

Not as much fun as surfing mall wifi for logins/passwords, but it is something you can use at work.

The best feature is that you load only the resulting data sets into R. RDBL uses databases for what they do well. Odd but efficient practices do happen from time to time.
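RDBL’s own syntax is covered in David’s post; for comparison, the same lazy, in-database idea in the better-known DBI + dplyr stack looks like this (this is dplyr/dbplyr, not RDBL):

# Verbs are translated to SQL and nothing is fetched until collect().
library(DBI)
library(dplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")   # any DBI backend works
dbWriteTable(con, "mtcars", mtcars)

result <- tbl(con, "mtcars") %>%
  filter(cyl == 6) %>%                            # becomes a WHERE clause
  group_by(gear) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE))   # becomes GROUP BY / AVG

collect(result)                                   # only now is the SQL run
dbDisconnect(con)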

I first saw this in a tweet by Christophe Lalanne.

Enjoy!

@rstudio Easter egg: Alt-Shift-K (shows all keyboard shortcuts)

Saturday, August 20th, 2016

Carl Schmertmann asks:

[Screenshot of Carl’s tweet]

Forty-two people have retweeted Carl’s tweet without answering Carl’s question.

If you have an answer, please reply to Carl. Otherwise, remember:

Alt-Shift-K

shows all keyboard shortcuts in RStudio.

Enjoy!

Everybody Discusses The Weather In R (+ Trigger Warning)

Saturday, August 20th, 2016

Well, maybe not everybody but if you are interested in weather statistics, there’s a trio of posts at R-Bloggers made for you.

Trigger Warning: If you are a climate change denier, you won’t like the results presented by the posts cited below. Facts dead ahead.

Tracking Precipitation by Day-of-Year

From the post:

Plotting cumulative day-of-year precipitation can be helpful in assessing how the current year’s rainfall compares with long term averages. This plot shows the cumulative rainfall by day-of-year for the Philadelphia International Airport’s rain gauge.

Checking Historical Precipitation Data Quality

From the post:

I am interested in evaluating potential changes in precipitation patterns caused by climate change. I have been working with daily precipitation data for the Philadelphia International Airport, site id KPHL, for the period 1950 to present time using R.

I originally used the Pennsylvania State Climatologist web site to download a CSV file of daily precipitation data from 1950 to the present. After some fits and starts analyzing this data set, I discovered that data for January was missing for the period 1950 – 1969. This data gap seriously limited the usable time record.

John Yagecic, (Adventures In Data) told me about the weatherData package which provides easy to use functions to retrieve Weather Underground data. I have found several precipitation data quality issues that may be of interest to other investigators.

Access and Analyze 170 Monthly Climate Time Series Using Simple R Scripts

From the post:

Open Mind, a climate trend data analysis blog, has a great Climate Data Service that provides an updated, consolidated CSV file with 170 monthly climate time series. This is a great resource for those interested in studying climate change. Quick, reliable access to 170 up-to-date climate time series will save interested analysts hundreds to thousands of hours of data wrangling work.

This post presents a simple R script to show how a user can select one of the 170 data series and generate a time series plot like this:
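The script itself is in the post; the gist of it is only a few lines. A hedged sketch (the file path and column names below are placeholders, not the actual Climate Data Service layout):

# Read a consolidated CSV of monthly climate series and plot one of them.
climate <- read.csv("climate_series.csv")       # placeholder file name
plot(climate$year_frac, climate$GISTEMP, type = "l",
     xlab = "Year", ylab = "Temperature anomaly (°C)",
     main = "GISTEMP monthly series")            # column names are placeholders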

All of these posts originated at RClimate, a new blog that focuses on R and climate data.

Drop by to say hello to D Kelly O’Day, PE (professional engineer) Retired.

Relevant searches at R-Bloggers (as of today):

Climate – 218 results

Flood – 61 results

Rainfall – 55 results

Weather – 291 results

Caution: These results contain duplicates.

Enjoy!

R Markdown

Wednesday, August 17th, 2016

R Markdown

From the webpage:

R Markdown provides an authoring framework for data science. You can use a single R Markdown file to both

  • save and execute code
  • generate high quality reports that can be shared with an audience

R Markdown documents are fully reproducible and support dozens of static and dynamic output formats. This 1-minute video provides a quick tour of what’s possible with R Markdown:

I started to omit this posting, reasoning that with LaTeX and XML, what other languages for composing documents are really necessary?

😉

I don’t suppose it will hurt to have a third language option for your authoring needs.
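For anyone who hasn’t seen one, a minimal R Markdown file is just YAML metadata plus prose plus code chunks, something like:

---
title: "A minimal report"
output: html_document
---

Some prose, then a chunk whose code and output land in the rendered report:

```{r}
summary(cars)
plot(cars)
```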

Enjoy!

Text [R, Scraping, Text]

Wednesday, August 17th, 2016

Text by Amelia McNamara.

Covers “scraping, text, and timelines.”

Using R, it focuses on scraping and works through some of “…Scott, Karthik, and Garrett’s useR tutorial.”

In case you don’t know the useR tutorial:

Also known as (AKA) Extracting data from the web APIs and beyond:

No matter what your domain of interest or expertise, the internet is a treasure trove of useful data that comes in many shapes, forms, and sizes, from beautifully documented fast APIs to data that need to be scraped from deep inside of 1990s html pages. In this 3 hour tutorial you will learn how to programmatically read in various types of web data from experts in the field (Founders of the rOpenSci project and the training lead of RStudio). By the end of the tutorial you will have a basic idea of how to wrap an R package around a standard API, extract common non-standard data formats, and scrape data into tidy data frames from web pages.

Covers other resources and materials.
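If you want a taste before working through the tutorial, here is a tiny rvest example (my own, not from Amelia’s slides):

# Scrape the link text from a page with rvest.
library(rvest)

page  <- read_html("https://www.r-project.org/")
links <- html_text(html_nodes(page, "a"))
head(links)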

Enjoy!

An analysis of Pokémon Go types, created with R

Thursday, July 21st, 2016

An analysis of Pokémon Go types, created with R by David Smith.

From the post:

As anyone who has tried Pokémon Go recently is probably aware, Pokémon come in different types. A Pokémon’s type affects where and when it appears, and the types of attacks it is vulnerable to. Some types, like Normal, Water and Grass are common; others, like Fairy and Dragon are rare. Many Pokémon have two or more types.

To get a sense of the distribution of Pokémon types, Joshua Kunst used R to download data from the Pokémon API and created a treemap of all the Pokémon types (and for those with more than 1 type, the secondary type). Joshua’s original used the 800+ Pokémon from the modern universe, but I used his R code to recreate the map for the 151 original Pokémon used in Pokémon Go.

If you or your dog need a break from Pokémon Go, check out this post!

You will get some much needed rest, polish up your R skills and perhaps learn something about the Pokémon API.
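If you just want to poke at the API, a minimal call with jsonlite is enough to get started (the endpoint shown is the public PokeAPI; the field access below reflects how jsonlite happens to simplify the response, so treat it as a sketch):

# Fetch one Pokémon from the public PokeAPI and look at its types.
library(jsonlite)

bulbasaur <- fromJSON("https://pokeapi.co/api/v2/pokemon/1/")
bulbasaur$name              # "bulbasaur"
bulbasaur$types$type$name   # e.g. "grass", "poison"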

The Pokémon Go craze brings to mind the potential for the creation of alternative location-based games. Accessing locations which require steady nerves and social engineering skills. That definitely has potential.

Say a spy-vs-spy character at a location near a “secret” military base? 😉

Developing Expert p-Hacking Skills

Saturday, July 2nd, 2016

Introducing the p-hacker app: Train your expert p-hacking skills by Ned Bicare.

Ned’s p-hacker app will be welcomed by everyone who publishes where p-values are accepted.

Publishers should mandate authors and reviewers to submit six p-hacker app results along with any draft that contains, or is a review of, p-values.

The p-hacker app results won’t improve a draft and/or review, but when compared to the draft, will improve the publication in which it might have appeared.

From the post:

My dear fellow scientists!

“If you torture the data long enough, it will confess.”

This aphorism, attributed to Ronald Coase, has sometimes been used in a disrespectful manner, as if it were wrong to do creative data analysis.

In fact, the art of creative data analysis has experienced despicable attacks over the last years. A small but annoyingly persistent group of second-stringers tries to denigrate our scientific achievements. They drag psychological science through the mire.

These people propagate stupid method repetitions; and what was once one of the supreme disciplines of scientific investigation – a creative data analysis of a data set – has been crippled to conducting an empty-headed step-by-step pre-registered analysis plan. (Come on: If I lay out the full analysis plan in a pre-registration, even an undergrad student can do the final analysis, right? Is that really the high-level scientific work we were trained for so hard?).

They broadcast at an annoying frequency that p-hacking leads to more significant results, and that researchers who use p-hacking have higher chances of getting things published.

What are the consequences of these findings? The answer is clear. Everybody should be equipped with these powerful tools of research enhancement!

The art of creative data analysis

Some researchers describe a performance-oriented data analysis as “data-dependent analysis”. We go one step further, and call this technique data-optimal analysis (DOA), as our goal is to produce the optimal, most significant outcome from a data set.

I developed an online app that allows to practice creative data analysis and how to polish your p-values. It’s primarily aimed at young researchers who do not have our level of expertise yet, but I guess even old hands might learn one or two new tricks! It’s called “The p-hacker” (please note that ‘hacker’ is meant in a very positive way here. You should think of the cool hackers who fight for world peace). You can use the app in teaching, or to practice p-hacking yourself.

Please test the app, and give me feedback! You can also send it to colleagues: http://shinyapps.org/apps/p-hacker.

Enjoy!

Computerworld’s advanced beginner’s guide to R

Wednesday, June 29th, 2016

Computerworld’s advanced beginner’s guide to R by David Smith.

From the post:

Many newcomers to R got their start learning the language with Computerworld’s Beginner’s Guide to R, a 6-part introduction to the basics of the language. Now, budding R users who want to take their skills to the next level have a new guide to help them: Computerworld’s Advanced Beginner’s Guide to R. Written by Sharon Machlis, author of the prior Beginner’s guide and regular reporter of R news at Computerworld, this new 72-page guide dives into some trickier topics related to R: extracting data via API, data wrangling, and data visualization.

Well, what are you waiting for?

Either read it or pass it along!

Enjoy!

Integrated R labs for high school students

Tuesday, June 28th, 2016

Integrated R labs for high school students by Amelia McNamara.

From the webpage:

Amelia McNamara, James Molyneux, Terri Johnson

This looks like a very promising approach for capturing the interests of high school students in statistics and R.

From the larger project, Mobilize, curriculum page:

Mobilize centers its curricula around participatory sensing campaigns in which students use their mobile devices to collect and share data about their communities and their lives, and to analyze these data to gain a greater understanding about their world. Mobilize breaks barriers by teaching students to apply concepts and practices from computer science and statistics in order to learn science and mathematics. Mobilize is dynamic: each class collects its own data, and each class has the opportunity to make unique discoveries. We use mobile devices not as gimmicks to capture students’ attention, but as legitimate tools that bring scientific enquiry into our everyday lives.

Mobilize comprises four key curricula: Introduction to Data Science (IDS), Algebra I, Biology, and Mobilize Prime, all focused on preparing students to live in a data-driven world. The Mobilize curricula are a unique blend of computational and statistical thinking subject matter content that teaches students to think critically about and with data. The Mobilize curricula utilize innovative mobile technology to enhance math and science classroom learning. Mobilize brings “Big Data” into the classroom in the form of participatory sensing, a hands-on method in which students use mobile devices to collect data about their lives and community, then use Mobilize Visualization tools to analyze and interpret the data.

I like the approach of having students collect and process their own data. If they learn to question their own data and processes, hopefully they will ask questions about data processing results presented as “facts.” (Since 2016 is a presidential election year in the United States, questioning claimed data results is especially important.)

Enjoy!

ggplot2 – Elegant Graphics for Data Analysis – At Last Call

Thursday, June 9th, 2016

ggplot2 – Elegant Graphics for Data Analysis by Hadley Wickham.

Hadley tweeted today that the “ggplot2” book is still up but will be removed after publication.

If you want/need a digital copy, now would be a good time to acquire one.

Modeling data with functional programming – State based systems

Sunday, May 22nd, 2016

Modeling data with functional programming – State based systems by Brian Lee Yung Rowe.

Brian has just released chapter 8 of his Modeling data with functional programming in R, State based systems.

BTW, Brian mentions that his editor is looking for more proof reviewers.

Enjoy!

Efficient R programming

Thursday, May 5th, 2016

Efficient R programming by Colin Gillespie and Robin Lovelace.

From the present text of Chapter 2:

An efficient computer set-up is analogous to a well-tuned vehicle: its components work in harmony, it is well-serviced, and it is fast. This chapter describes the software decisions that will enable a productive workflow. Starting with the basics and moving to progressively more advanced topics, we explore how the operating system, R version, startup files and IDE can make your R work faster (though IDE could be seen as basic need for efficient programming). Ensuring correct configuration of these elements will have knock-on benefits in many aspects of your R workflow. That’s why we cover them at this early stage (hardware, the other fundamental consideration, is covered in the next chapter). By the end of this chapter you should understand how to set-up your computer and R installation (skip to section 2.3 if R is not already installed on your computer) for optimal computational and programmer efficiency. It covers the following topics:

  • R and the operating systems: system monitoring on Linux, Mac and Windows
  • R version: how to keep your base R installation and packages up-to-date
  • R start-up: how and why to adjust your .Rprofile and .Renviron files
  • RStudio: an integrated development environment (IDE) to boost your programming productivity
  • BLAS and alternative R interpreters: looks at ways to make R faster

For lazy readers, and to provide a taster of what’s to come, we begin with our ‘top 5’ tips for an efficient R set-up. It is important to understand that efficient programming is not simply the result of following a recipe of tips: understanding is vital for knowing when to use a memorised solution to a problem and when to go back to first principles. Thinking about and understanding R in depth, e.g. by reading this chapter carefully, will make efficiency second nature in your R workflow.

Nope, go see Chapter 2 if you want the top 5 tips for efficient R set-up.
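The book’s actual top 5 are worth reading in context; as a flavor of the start-up tweaks the chapter discusses, here is a small, illustrative ~/.Rprofile (my settings, not the authors’):

# Illustrative ~/.Rprofile: runs at the start of every interactive session.
options(
  repos     = c(CRAN = "https://cloud.r-project.org"),  # default CRAN mirror
  max.print = 200                                       # don't flood the console
)
.First <- function() message("Loaded .Rprofile at ", Sys.time())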

The text and code are being developed at the website and the authors welcome “pull requests and general comments.”

Don’t be shy!

NSA Grade – Network Visualization with Gephi

Sunday, April 10th, 2016

Network Visualization with Gephi by Katya Ognyanova.

It’s not possible to cover Gephi in sixteen (16) pages but you will wear out more than one printed copy of these sixteen (16) pages as you become experienced with Gephi.

This version is from a Gephi workshop at Sunbelt 2016.

Katya‘s homepage offers a wealth of network visualization posts and extensive use of R.

Follow her at @Ognyanova.

PS: Gephi equals or exceeds visualization capabilities in use by the NSA, depending upon your skill as an analyst and the quality of the available data.

Advanced Data Mining with Weka – Starts 25 April 2016

Wednesday, April 6th, 2016

Advanced Data Mining with Weka by Ian Witten.

From the webpage:

This course follows on from Data Mining with Weka and More Data Mining with Weka. It provides a deeper account of specialized data mining tools and techniques. Again the emphasis is on principles and practical data mining using Weka, rather than mathematical theory or advanced details of particular algorithms. Students will analyse time series data, mine data streams, use Weka to access other data mining packages including the popular R statistical computing language, script Weka in Python, and deploy it within a cluster computing framework. The course also includes case studies of applications such as classifying tweets, functional MRI data, image classification, and signal peptide prediction.

The syllabus: https://weka.waikato.ac.nz/advanceddataminingwithweka/assets/pdf/syllabus.pdf.

Advanced Data Mining with Weka is open for enrollment and starts 25 April 2016.

Five very intense weeks await!

Will you be there?

I first saw this in a tweet by Alyona Medelyan.