Archive for the ‘R’ Category

DataScience+ (R Tutorials)

Monday, August 29th, 2016

DataScience+

From the webpage:

We share R tutorials from scientists at academic and scientific institutions with a goal to give everyone in the world access to free knowledge. Our tutorials cover different topics including statistics, data manipulation and visualization!

I encountered DataScience+ while running down David Kun’s RDBL post.

As of today, there are 120 tutorials with 451,129 reads.

That's impressive! Visit whether you are looking for tutorials or looking for a place to post your own R tutorial where it will be appreciated.

Enjoy!

RDBL – manipulate data in-database with R code only

Monday, August 29th, 2016

RDBL – manipulate data in-database with R code only by David Kun.

From the post:

In this post I introduce our own package RDBL, the R DataBase Layer. With this package you can manipulate data in-database without writing SQL code. The package interprets the R code and sends out the corresponding SQL statements to the database, fully transparently. To minimize overhead, the data is only fetched when absolutely necessary, allowing the user to create the relevant joins (merge), filters (logical indexing) and groupings (aggregation) in R code before the SQL is run on the database. The core idea behind RDBL is to let R users with little or no SQL knowledge utilize the power of SQL database engines for data manipulation.

It is important to note that the SQL statements generated in the background are not executed unless explicitly requested by the command as.data.frame. Hence, you can merge, filter and aggregate your dataset on the database side and load only the result set into memory for R.

In general the design principle behind RDBL is to keep the models as close as possible to the usual data.frame logic, including (as shown later in detail) commands like aggregate, referencing columns by the $ operator and features like logical indexing using the [] operator.

RDBL supports a connection to any SQL-like data source which supports a DBI interface or an ODBC connection, including but not limited to Oracle, MySQL, SQLite, SQL Server, MS Access and more.
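
To make the deferred-execution idea concrete, here is a rough sketch of the workflow the post describes. The connection and table helpers below are placeholders (I have not checked RDBL's actual function names against its documentation); only the merge / logical-indexing / aggregate / as.data.frame flow is taken from the description above.

  library(RDBL)

  # Placeholder connection and table handles -- consult the package docs for
  # the real constructor names; everything below them mirrors the post.
  con    <- some_rdbl_connection("DSN=mydb")   # hypothetical helper
  orders <- some_rdbl_table(con, "orders")     # hypothetical helper
  items  <- some_rdbl_table(con, "items")

  # Build up the query with ordinary data.frame idioms; no SQL runs yet.
  joined <- merge(orders, items, by = "order_id")                    # becomes a JOIN
  recent <- joined[joined$year == 2016, ]                            # becomes a WHERE
  totals <- aggregate(amount ~ customer, data = recent, FUN = sum)   # becomes a GROUP BY

  # Only now is SQL generated, run on the database, and the result fetched.
  result <- as.data.frame(totals)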

Not as much fun as surfing mall wifi for logins/passwords, but it is something you can use at work.

The best feature is that you load resulting data sets only. RDBL uses databases for what they do well. Odd but efficient practices do happen from time to time.

I first saw this in a tweet by Christophe Lalanne.
Enjoy!

@rstudio Easter egg: Alt-Shift-K (shows all keyboard shortcuts)

Saturday, August 20th, 2016

Carl Schmertmann asks:

[Image: Carl Schmertmann's tweet asking about the @rstudio Alt-Shift-K Easter egg]

Forty-two people have retweeted Carl’s tweet without answering Carl’s question.

If you have an answer, please reply to Carl. Otherwise, remember:

Alt-Shift-K

shows all keyboard shortcuts in RStudio.

Enjoy!

Everybody Discusses The Weather In R (+ Trigger Warning)

Saturday, August 20th, 2016

Well, maybe not everybody but if you are interested in weather statistics, there’s a trio of posts at R-Bloggers made for you.

Trigger Warning: If you are a climate change denier, you won’t like the results presented by the posts cited below. Facts dead ahead.

Tracking Precipitation by Day-of-Year

From the post:

Plotting cumulative day-of-year precipitation can be helpful in assessing how the current year's rainfall compares with long term averages. This plot shows the cumulative rainfall by day-of-year for Philadelphia International Airport's rain gauge.
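
The post works from the actual station record; as a rough stand-in, a cumulative day-of-year plot can be built from any daily precipitation file along these lines (the file and column names here are hypothetical, not the post's, and fake data keeps the sketch self-contained):

  library(ggplot2)

  # daily <- read.csv("phl_daily_precip.csv")   # hypothetical file with date, precip_in columns
  daily <- data.frame(date = seq(as.Date("2015-01-01"), as.Date("2016-08-20"), by = "day"),
                      precip_in = round(rexp(598, rate = 8), 2))   # fake data for illustration

  daily$year <- as.integer(format(daily$date, "%Y"))
  daily$doy  <- as.integer(format(daily$date, "%j"))

  # cumulative precipitation within each year
  daily <- daily[order(daily$date), ]
  daily$cum_precip <- ave(daily$precip_in, daily$year, FUN = cumsum)

  ggplot(daily, aes(doy, cum_precip, colour = factor(year))) +
    geom_line() +
    labs(x = "Day of year", y = "Cumulative precipitation (in)", colour = "Year")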

Checking Historical Precipitation Data Quality

From the post:

I am interested in evaluating potential changes in precipitation patterns caused by climate change. I have been working with daily precipitation data for the Philadelphia International Airport, site id KPHL, for the period 1950 to present time using R.

I originally used the Pennsylvania State Climatologist web site to download a CSV file of daily precipitation data from 1950 to the present. After some fits and starts analyzing this data set, I discovered that data for January was missing for the period 1950 – 1969. This data gap seriously limited the usable time record.

John Yagecic, (Adventures In Data) told me about the weatherData package which provides easy to use functions to retrieve Weather Underground data. I have found several precipitation data quality issues that may be of interest to other investigators.

Access and Analyze 170 Monthly Climate Time Series Using Simple R Scripts

From the post:

Open Mind, a climate trend data analysis blog, has a great Climate Data Service that provides an updated, consolidated CSV file with 170 monthly climate time series. This is a great resource for those interested in studying climate change. Quick, reliable access to 170 up-to-date climate time series will save interested analysts hundreds to thousands of hours of data wrangling.

This post presents a simple R script to show how a user can select one of the 170 data series and generate a time series plot of it.
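
The gist of such a script is just read-then-plot; the URL and column names below are placeholders (the real ones are in the post), with a fake series so the lines run as written:

  # clim <- read.csv("https://.../climate_data.csv")   # the consolidated CSV described in the post
  clim <- data.frame(year_frac = seq(1979, 2015, by = 1/12),
                     anomaly   = cumsum(rnorm(433, 0.001, 0.02)))   # fake series for illustration

  series <- "anomaly"   # pick one of the 170 columns by name
  plot(clim$year_frac, clim[[series]], type = "l",
       xlab = "Year", ylab = series,
       main = paste("Monthly climate series:", series))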

All of these posts originated at RClimate, a new blog that focuses on R and climate data.

Drop by to say hello to D Kelly O’Day, PE (professional engineer) Retired.

Relevant searches at R-Bloggers (as of today):

Climate – 218 results

Flood – 61 results

Rainfall – 55 results

Weather – 291 results

Caution: These results contain duplicates.

Enjoy!

R Markdown

Wednesday, August 17th, 2016

R Markdown

From the webpage:

R Markdown provides an authoring framework for data science. You can use a single R Markdown file to both

  • save and execute code
  • generate high quality reports that can be shared with an audience

R Markdown documents are fully reproducible and support dozens of static and dynamic output formats. A 1-minute video on the site provides a quick tour of what's possible with R Markdown.
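
If you have never written one, a complete R Markdown file is only a few lines: a YAML header, some prose, and fenced R chunks. The sketch below is my own minimal example, not taken from the RStudio site; knitting it produces an HTML report containing the prose, the code and its output.

  ---
  title: "Minimal R Markdown example"
  output: html_document
  ---

  The chunk below runs when the file is knit, and both the code and its
  results appear in the report.

  ```{r cars-summary}
  summary(cars)   # printed output lands in the report
  plot(cars)      # so does the figure
  ```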

I started to omit this posting, reasoning that with LaTeX and XML, what other languages for composing documents are really necessary?

😉

I don’t suppose it will hurt to have a third language option for your authoring needs.

Enjoy!

Text [R, Scraping, Text]

Wednesday, August 17th, 2016

Text by Amelia McNamara.

Covers “scraping, text, and timelines.”

Using R, focuses on scraping, works through some of “…Scott, Karthik, and Garrett’s useR tutorial.”

In case you don’t know the useR tutorial:

Also known as (AKA) Extracting data from the web: APIs and beyond:

No matter what your domain of interest or expertise, the internet is a treasure trove of useful data that comes in many shapes, forms, and sizes, from beautifully documented fast APIs to data that need to be scraped from deep inside of 1990s html pages. In this 3 hour tutorial you will learn how to programmatically read in various types of web data from experts in the field (Founders of the rOpenSci project and the training lead of RStudio). By the end of the tutorial you will have a basic idea of how to wrap an R package around a standard API, extract common non-standard data formats, and scrape data into tidy data frames from web pages.
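
If you want to try the scraping portion before working through the tutorial, the rvest package covers the basic pattern. The tiny in-memory page below keeps the sketch runnable without a network connection; swap the string for read_html("https://...") on a real site and adjust the selectors.

  library(rvest)

  page <- read_html("<html><body>
    <h2>Ratings</h2>
    <table>
      <tr><th>package</th><th>stars</th></tr>
      <tr><td>rvest</td><td>5</td></tr>
      <tr><td>xml2</td><td>4</td></tr>
    </table></body></html>")

  html_text(html_nodes(page, "h2"))             # all second-level headings: "Ratings"
  html_table(html_nodes(page, "table")[[1]])    # first table as a tidy data frame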

Covers other resources and materials.

Enjoy!

An analysis of Pokémon Go types, created with R

Thursday, July 21st, 2016

An analysis of Pokémon Go types, created with R by David Smith.

From the post:

As anyone who has tried Pokémon Go recently is probably aware, Pokémon come in different types. A Pokémon’s type affects where and when it appears, and the types of attacks it is vulnerable to. Some types, like Normal, Water and Grass are common; others, like Fairy and Dragon are rare. Many Pokémon have two or more types.

To get a sense of the distribution of Pokémon types, Joshua Kunst used R to download data from the Pokémon API and created a treemap of all the Pokémon types (and for those with more than 1 type, the secondary type). Joshua's original used the 800+ Pokémon from the modern universe, but I used his R code to recreate the map for the 151 original Pokémon used in Pokémon Go.
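
A rough idea of the pipeline, with the caveat that the JSON field path below is my reading of the PokéAPI response layout (not Joshua's code) and the treemap is far simpler than his:

  library(jsonlite)
  library(treemap)

  # Fetch the first 20 of the original 151 Pokémon and record each one's primary type.
  # (Looping over the full 151 works the same way, just slower.)
  primary_type <- sapply(1:20, function(i) {
    res <- fromJSON(paste0("https://pokeapi.co/api/v2/pokemon/", i))
    res$types$type$name[1]   # assumed path to the type name in the response
  })

  type_counts <- as.data.frame(table(primary_type))
  treemap(type_counts, index = "primary_type", vSize = "Freq",
          title = "Primary types, first 20 Pokémon")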

If you or your dog…

[image via SIZZLE]

…need a break from Pokémon Go, check out this post!

You will get some much needed rest, polish up your R skills and perhaps learn something about the Pokémon API.

The Pokémon Go craze brings to mind the potential for alternative location-based games: games that send players to locations requiring steady nerves and social engineering skills. That definitely has potential.

Say a spy-vs-spy character at a location near a “secret” military base? 😉

Developing Expert p-Hacking Skills

Saturday, July 2nd, 2016

Introducing the p-hacker app: Train your expert p-hacking skills by Ned Bicare.

Ned’s p-hacker app will be welcomed by everyone who publishes where p-values are accepted.

Publishers should require authors and reviewers to submit six p-hacker app results along with any draft that contains, or reviews, p-values.

The p-hacker app results won’t improve a draft and/or review, but when compared to the draft, will improve the publication in which it might have appeared.

From the post:

My dear fellow scientists!

“If you torture the data long enough, it will confess.”

This aphorism, attributed to Ronald Coase, has sometimes been used in a disrespectful manner, as if it were wrong to do creative data analysis.

In fact, the art of creative data analysis has experienced despicable attacks over the last years. A small but annoyingly persistent group of second-stringers tries to denigrate our scientific achievements. They drag psychological science through the mire.

These people propagate stupid method repetitions; and what was once one of the supreme disciplines of scientific investigation – a creative data analysis of a data set – has been crippled to conducting an empty-headed step-by-step pre-registered analysis plan. (Come on: If I lay out the full analysis plan in a pre-registration, even an undergrad student can do the final analysis, right? Is that really the high-level scientific work we were trained for so hard?).

They broadcast at an annoying frequency that p-hacking leads to more significant results, and that researchers who use p-hacking have higher chances of getting things published.

What are the consequences of these findings? The answer is clear. Everybody should be equipped with these powerful tools of research enhancement!

The art of creative data analysis

Some researchers describe a performance-oriented data analysis as “data-dependent analysis”. We go one step further, and call this technique data-optimal analysis (DOA), as our goal is to produce the optimal, most significant outcome from a data set.

I developed an online app that allows you to practice creative data analysis and polish your p-values. It's primarily aimed at young researchers who do not have our level of expertise yet, but I guess even old hands might learn one or two new tricks! It's called "The p-hacker" (please note that 'hacker' is meant in a very positive way here. You should think of the cool hackers who fight for world peace). You can use the app in teaching, or to practice p-hacking yourself.

Please test the app, and give me feedback! You can also send it to colleagues: http://shinyapps.org/apps/p-hacker.
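
If you would rather see the core trick in code than in a Shiny app, here is a small, self-contained simulation of my own (not part of the p-hacker app) of optional stopping under the null hypothesis; the false positive rate climbs well above the nominal 5%:

  set.seed(1)

  # Run one "study" with no true effect, peeking at the p-value and adding
  # more observations until it is significant or the budget runs out.
  phack_once <- function(n_start = 20, n_max = 100, step = 10) {
    x <- rnorm(n_start); y <- rnorm(n_start)
    repeat {
      p <- t.test(x, y)$p.value
      if (p < .05 || length(x) >= n_max) return(p)
      x <- c(x, rnorm(step)); y <- c(y, rnorm(step))   # collect more data, peek again
    }
  }

  mean(replicate(2000, phack_once()) < .05)   # well above the nominal 0.05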

Enjoy!

Computerworld’s advanced beginner’s guide to R

Wednesday, June 29th, 2016

Computerworld’s advanced beginner’s guide to R by David Smith.

From the post:

Many newcomers to R got their start learning the language with Computerworld's Beginner's Guide to R, a 6-part introduction to the basics of the language. Now, budding R users who want to take their skills to the next level have a new guide to help them: Computerworld's Advanced Beginner's Guide to R. Written by Sharon Machlis, author of the prior Beginner's guide and regular reporter of R news at Computerworld, this new 72-page guide dives into some trickier topics related to R: extracting data via API, data wrangling, and data visualization.

Well, what are you waiting for?

Either read it or pass it along!

Enjoy!

Integrated R labs for high school students

Tuesday, June 28th, 2016

Integrated R labs for high school students by Amelia McNamara.

From the webpage:

Amelia McNamara, James Molyneux, Terri Johnson

This looks like a very promising approach for capturing the interests of high school students in statistics and R.

From the larger project, Mobilize, curriculum page:

Mobilize centers its curricula around participatory sensing campaigns in which students use their mobile devices to collect and share data about their communities and their lives, and to analyze these data to gain a greater understanding about their world. Mobilize breaks barriers by teaching students to apply concepts and practices from computer science and statistics in order to learn science and mathematics. Mobilize is dynamic: each class collects its own data, and each class has the opportunity to make unique discoveries. We use mobile devices not as gimmicks to capture students' attention, but as legitimate tools that bring scientific enquiry into our everyday lives.

Mobilize comprises four key curricula: Introduction to Data Science (IDS), Algebra I, Biology, and Mobilize Prime, all focused on preparing students to live in a data-driven world. The Mobilize curricula are a unique blend of computational and statistical thinking subject matter content that teaches students to think critically about and with data. The Mobilize curricula utilize innovative mobile technology to enhance math and science classroom learning. Mobilize brings “Big Data” into the classroom in the form of participatory sensing, a hands-on method in which students use mobile devices to collect data about their lives and community, then use Mobilize Visualization tools to analyze and interpret the data.

I like the approach of having students collect and process their own data. If they learn to question their own data and processes, hopefully they will ask questions about data processing results presented as "facts." (Since 2016 is a presidential election year in the United States, questioning claimed data results is especially important.)

Enjoy!

ggplot2 – Elegant Graphics for Data Analysis – At Last Call

Thursday, June 9th, 2016

ggplot2 – Elegant Graphics for Data Analysis by Hadley Wickham.

Hadley tweeted today that the online version of "ggplot2" is still up but will be removed after publication.

If you want/need a digital copy, now would be a good time to acquire one.

Modeling data with functional programming – State based systems

Sunday, May 22nd, 2016

Modeling data with functional programming – State based systems by Brian Lee Yung Rowe.

Brian has just released chapter 8 of his Modeling data with functional programming in R, State based systems.

BTW, Brian mentions that his editor is looking for more proof reviewers.

Enjoy!

Efficient R programming

Thursday, May 5th, 2016

Efficient R programming by Colin Gillespie and Robin Lovelace.

From the present text of Chapter 2:

An efficient computer set-up is analogous to a well-tuned vehicle: its components work in harmony, it is well-serviced, and it is fast. This chapter describes the software decisions that will enable a productive workflow. Starting with the basics and moving to progressively more advanced topics, we explore how the operating system, R version, startup files and IDE can make your R work faster (though IDE could be seen as basic need for efficient programming). Ensuring correct configuration of these elements will have knock-on benefits in many aspects of your R workflow. That’s why we cover them at this early stage (hardware, the other fundamental consideration, is covered in the next chapter). By the end of this chapter you should understand how to set-up your computer and R installation (skip to section 2.3 if R is not already installed on your computer) for optimal computational and programmer efficiency. It covers the following topics:

  • R and the operating systems: system monitoring on Linux, Mac and Windows
  • R version: how to keep your base R installation and packages up-to-date
  • R start-up: how and why to adjust your .Rprofile and .Renviron files
  • RStudio: an integrated development environment (IDE) to boost your programming productivity
  • BLAS and alternative R interpreters: looks at ways to make R faster

For lazy readers, and to provide a taster of what’s to come, we begin with our ‘top 5’ tips for an efficient R set-up. It is important to understand that efficient programming is not simply the result of following a recipe of tips: understanding is vital for knowing when to use a memorised solution to a problem and when to go back to first principles. Thinking about and understanding R in depth, e.g. by reading this chapter carefully, will make efficiency second nature in your R workflow.
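
As a taste of the "R start-up" material mentioned in the list above, here is a minimal ~/.Rprofile sketch; the particular settings are illustrative choices of mine, not the authors' "top 5" tips:

  # Runs at the start of every R session; keep it short and side-effect free.
  options(
    repos = c(CRAN = "https://cloud.r-project.org"),   # default CRAN mirror
    digits = 4,                                        # shorter printed numbers
    stringsAsFactors = FALSE                           # avoid surprise factors
  )
  .First <- function() message("R session started: ", date())
  .Last  <- function() message("R session finished: ", date())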

Nope, go see Chapter 2 if you want the top 5 tips for efficient R set-up.

The text and code are being developed at the website and the authors welcome “pull requests and general comments.”

Don’t be shy!

NSA Grade – Network Visualization with Gephi

Sunday, April 10th, 2016

Network Visualization with Gephi by Katya Ognyanova.

It's not possible to cover Gephi in sixteen (16) pages, but you will wear out more than one printed copy of those sixteen pages as you gain experience with Gephi.

This version is from a Gephi workshop at Sunbelt 2016.

Katya‘s homepage offers a wealth of network visualization posts and extensive use of R.

Follow her at @Ognyanova.

PS: Gephi equals or exceeds visualization capabilities in use by the NSA, depending upon your skill as an analyst and the quality of the available data.

Advanced Data Mining with Weka – Starts 25 April 2016

Wednesday, April 6th, 2016

Advanced Data Mining with Weka by Ian Witten.

From the webpage:

This course follows on from Data Mining with Weka and More Data Mining with Weka. It provides a deeper account of specialized data mining tools and techniques. Again the emphasis is on principles and practical data mining using Weka, rather than mathematical theory or advanced details of particular algorithms. Students will analyse time series data, mine data streams, use Weka to access other data mining packages including the popular R statistical computing language, script Weka in Python, and deploy it within a cluster computing framework. The course also includes case studies of applications such as classifying tweets, functional MRI data, image classification, and signal peptide prediction.

The syllabus: https://weka.waikato.ac.nz/advanceddataminingwithweka/assets/pdf/syllabus.pdf.

Advanced Data Mining with Weka is open for enrollment and starts 25 April 2016.

Five very intense weeks await!

Will you be there?

I first saw this in a tweet by Alyona Medelyan.

APL in R “The past isn’t dead. It isn’t even past.”*

Monday, March 14th, 2016

APL in R by Jan de Leeuw and Masanao Yajima.

From the introduction:

APL was introduced by Iverson (1962). It is an array language, with many functions to manipulate multidimensional arrays. R also has multidimensional arrays, but not as many functions to work with them.

In R there are no scalars, there are vectors of length one. For a vector x in R we have dim(x) equal to NULL and length(x) > 0. For an array, including a matrix, we have length(dim(x)) > 0. APL is an array language, which means everything is an array. For each array both the shape ⍴A and the rank ⍴⍴A are defined. Scalars are arrays with shape equal to one, vectors are arrays with rank equal to one.

If you want to evaluate APL expressions using a traditional APL virtual keyboard, we recommend the nice webpage at ngn.github.io/apl/web/index.html. EliStudio at fastarray.appspot.com/default.html is essentially an APL interpreter running in a Qt GUI, using ascii symbols and symbol-pairs to replace traditional APL symbols (Chen and Ching (2013)). Eli does not have nested arrays. It does have ecc, which compiles eli to C.

In 1994 one of us coded most APL array operations in XLISP-STAT. The code is still available at gifi.stat.ucla.edu/apl.
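
The shape/rank distinction in the introduction maps onto a few lines of R you can check at the console:

  x <- 1           # in R this is a length-one vector, not a true scalar
  dim(x)           # NULL
  length(x)        # 1

  m <- array(1:24, dim = c(2, 3, 4))
  dim(m)           # 2 3 4  -- the shape, APL's ⍴A
  length(dim(m))   # 3      -- the rank, APL's ⍴⍴A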

I'm certain this will be useful for R programmers, but more generally I'm curious: is there a genealogy of functions across programming languages?

Enjoy!

*Apologies to William Faulkner.

Visualizing the Clinton Email Network in R

Wednesday, February 24th, 2016

Visualizing the Clinton Email Network in R by Bob Rudis.

From the post:

This isn't a post about politics. I do have opinions about the now infamous e-mail server (which will no doubt come out here), but when the WSJ folks made it possible to search the Clinton email releases I thought it would be fun to get the data into R to show how well the igraph and ggnetwork packages could work together, and also show how to use svgPanZoom to make it a bit easier to poke around the resulting hairball network.

NOTE: There are a couple “Assignment” blocks in here. My Elements of Data Science students are no doubt following the blog by now so those are meant for you 🙂 Other intrepid readers can ignore them.
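
The heart of the igraph-plus-ggnetwork combination fits in a few lines; the toy edge list below stands in for the sender/recipient pairs Bob extracts from the releases, and the label column name is worth checking because ggnetwork may fortify igraph objects through the intergraph package:

  library(igraph)
  library(ggplot2)
  library(ggnetwork)

  edges <- data.frame(from = c("A", "B", "C", "A"),
                      to   = c("B", "C", "A", "D"))
  g <- graph_from_data_frame(edges, directed = TRUE)

  n <- ggnetwork(g)   # flat data frame of node/edge coordinates; may require intergraph
  names(n)            # check the vertex label column ("name" or "vertex.names")

  ggplot(n, aes(x = x, y = y, xend = xend, yend = yend)) +
    geom_edges(colour = "grey60") +
    geom_nodes(size = 3) +
    geom_nodetext(aes(label = name), vjust = -1) +   # adjust label column to match names(n)
    theme_blank()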

A great walk-through of importing, analyzing, and visualizing any email archive, not just Hillary's.

You will quickly find that “…connecting the dots…” isn’t as useful as the intelligence community would have you believe.

Yes, yes! There is a call to Papa John’s! Oh, that’s not a code name, that’s a pizza place. (Even suspected terrorists have to eat.)

Great to have the dots. Great to have connections. Not so great if that is all that you have.

I found a number of other interesting posts at Bob’s blog: http://rud.is/b/.

Including: Dairy-free Parker House Rolls! I bake fairly often so am susceptible to this sort of posting. Looks very good!

How I build up a ggplot2 figure [Class Response To ggplot2 criticism]

Friday, February 19th, 2016

How I build up a ggplot2 figure by John Muschelli.

From the post:

Recently, Jeff Leek at Simply Statistics discussed why he does not use ggplot2. He notes that "the bottom line is for production graphics, any system requires work" and describes a default plot that needs some work.

John responds to perceived issues with using ggplot2 by walking through each issue and providing you with examples of how to solve it.
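
The spirit of the walk-through, reduced to a toy example of my own rather than John's code, is to start from the default plot and layer fixes on one at a time:

  library(ggplot2)

  p <- ggplot(mtcars, aes(x = wt, y = mpg))
  p <- p + geom_point()                             # 1. the default scatterplot
  p <- p + geom_smooth(method = "lm", se = FALSE)   # 2. add a trend line
  p <- p + labs(x = "Weight (1000 lbs)",            # 3. readable labels and title
                y = "Miles per gallon",
                title = "Fuel economy vs. weight")
  p <- p + theme_minimal(base_size = 14)            # 4. cleaner theme, bigger text
  p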

That doesn’t mean that you will switch to ggplot2, but it does mean you will be better informed of your options.

An example to be copied!

Rectal and Other Data

Wednesday, February 17th, 2016

Hadley Wickham has posted neiss:

The neiss package provides access to all data (2009-2014) from the National Electronic Injury Surveillance System, which is a sample of all accidents reported to emergency rooms in the US.
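
A quick way to poke at it, assuming the package still installs from Hadley's GitHub repository and exposes the case-level table under the name I believe its README uses (injuries; the column names below are also assumptions, so check names() first):

  # install.packages("devtools")
  devtools::install_github("hadley/neiss")
  library(neiss)

  data(package = "neiss")   # list the bundled tables
  str(injuries)             # assumed name of the case-level table
  table(injuries$sex)       # assumed column name; run names(injuries) to confirm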

You will recall this is the data set used by Nathan Yau in NSFW: Million to One Shot, Doc, an analysis of rectal injuries.

A lack of detail in the data rules out some types of analysis, such as plotting the type of object involved as a function of patient weight.

I'm sure there are other patterns (seasonal, perhaps?) that you can derive from the data.

Enjoy!

PS: neiss is an R package.

Katia – rape screening in R

Tuesday, February 16th, 2016

Katia – rape screening in R

From the webpage:

It’s Not Enough to Condemn Violence Against Women. We Need to End It.

All 12 innocent female victims above were atrociously killed, sexually assaulted, or registered missing after meeting strangers on mainstream dating, personals, classifieds, or social networking services.

INTRODUCTION TO THE KATIA RAPE SCREEN

Those 12 beautiful faces in the gallery above, are our sisters and daughters. Looking at their pictures is like looking through a tiny pinhole onto an unprecedented rape and domestic violence crisis that is destroying the American family unit.

Verified by science, the KATIA rape screen, coded in the computer programming language, R, can provably stop a woman from ever meeting her attacker.

The technology is named after a RAINN-counseled first degree aggravated rape survivor named Katia.

It is based on the work of a Google engineer from the Reverse Image Search project and a RAINN (Rape, Abuse & Incest National Network) counselor, with a clinical background in mathematical statistics, who has over a period of 15 years compiled a linguistic pattern analysis of the messages that rapists use to lure women online.

Learn more about the science behind Katia.

This project is taking concrete steps to reduce violence against women.

What more is there to say?

networkD3: D3 JavaScript Network Graphs from R

Monday, February 15th, 2016

networkD3: D3 JavaScript Network Graphs from R by Christopher Gandrud, JJ Allaire, & Kent Russell.

From the post:

This is a port of Christopher Gandrud's R package d3Network for creating D3 network graphs to the htmlwidgets framework. The htmlwidgets framework greatly simplifies the package's syntax for exporting the graphs, and improves integration with RStudio's Viewer Pane, RMarkdown, and Shiny web apps. See below for examples.

It currently supports three types of network graphs…

I haven’t compared this to GraphViz but the Sankey diagram option is impressive!
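
Here is roughly what a Sankey call looks like, with made-up flows of my own; note that the link indices are zero-based because they are handed straight to D3:

  library(networkD3)

  nodes <- data.frame(name = c("Coal", "Gas", "Electricity", "Homes", "Industry"))
  links <- data.frame(source = c(0, 1, 2, 2),   # zero-based row indices into nodes
                      target = c(2, 2, 3, 4),
                      value  = c(5, 3, 4, 4))

  sankeyNetwork(Links = links, Nodes = nodes,
                Source = "source", Target = "target",
                Value = "value", NodeID = "name", fontSize = 12)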

Tufte in R

Friday, February 12th, 2016

Tufte in R by Lukasz Piwek.

From the post:

The idea behind Tufte in R is to use R – the most powerful open-source statistical programming language – to replicate excellent visualisation practices developed by Edward Tufte. It's not a novel approach – there are plenty of excellent R functions and related packages written by people who have much more expertise in programming than myself. I simply collect those resources in one place in an accessible and replicable format, adding a few bits of my own coding discoveries.

Piwek says his idea isn’t novel but I am sure this will be of interest to both R and Tufte fans!
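
One common shortcut, not necessarily the route Piwek takes, is the ggthemes package, which ships a Tufte-flavoured theme and a range-frame geom:

  library(ggplot2)
  library(ggthemes)

  ggplot(mtcars, aes(wt, mpg)) +
    geom_point() +
    geom_rangeframe() +   # Tufte-style range frame instead of a full axis box
    theme_tufte()         # minimal-ink theme with a serif face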

Is anyone else working through the Tufte volumes in R or Processing?

Those would be great projects to have bookmarked.

Build your own neural network classifier in R

Wednesday, February 10th, 2016

Build your own neural network classifier in R by Jun Ma.

From the post:

Image classification is one important field in Computer Vision, not only because so many applications are associated with it, but also because a lot of Computer Vision problems can be effectively reduced to image classification. The state-of-the-art tool in image classification is the Convolutional Neural Network (CNN). In this article, I am going to write a simple Neural Network with 2 layers (fully connected). First, I will train it to classify a set of 4-class 2D data and visualize the decision boundary. Second, I am going to train my NN with the famous MNIST data (you can download it here: https://www.kaggle.com/c/digit-recognizer/download/train.csv) and see its performance. The first part is inspired by CS 231n course offered by Stanford: http://cs231n.github.io/, which is taught in Python.

One suggestion, based on some unrelated reading, don’t copy-n-paste the code.

Key in the code so you will get accustomed to your typical typing mistakes, which are no doubt different from mine!

Plus you will develop muscle memory in your fingers and code will either “look right” or not.
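
For a sense of the shape of the computation before you sit down with Jun's post, here is an independent, minimal two-layer network (my sketch, not his code) trained with softmax and cross-entropy on an easy two-class 2D problem:

  set.seed(42)
  n  <- 200                                        # points per class
  X  <- rbind(cbind(rnorm(n, -2), rnorm(n, -2)),   # class 1 cluster
              cbind(rnorm(n,  2), rnorm(n,  2)))   # class 2 cluster
  y  <- rep(1:2, each = n)
  Y  <- model.matrix(~ factor(y) - 1)              # one-hot targets, 400 x 2

  h  <- 10                                         # hidden units
  W1 <- matrix(rnorm(2 * h, sd = 0.1), 2, h); b1 <- rep(0, h)
  W2 <- matrix(rnorm(h * 2, sd = 0.1), h, 2); b2 <- rep(0, 2)
  lr <- 0.5

  for (i in 1:500) {
    # forward pass: ReLU hidden layer, softmax output
    H <- pmax(X %*% W1 + matrix(b1, nrow(X), h, byrow = TRUE), 0)
    S <- H %*% W2 + matrix(b2, nrow(X), 2, byrow = TRUE)
    P <- exp(S) / rowSums(exp(S))

    # backward pass: gradients of mean cross-entropy
    dS  <- (P - Y) / nrow(X)
    dW2 <- t(H) %*% dS; db2 <- colSums(dS)
    dH  <- dS %*% t(W2) * (H > 0)                  # ReLU derivative
    dW1 <- t(X) %*% dH; db1 <- colSums(dH)

    W2 <- W2 - lr * dW2; b2 <- b2 - lr * db2
    W1 <- W1 - lr * dW1; b1 <- b1 - lr * db1
  }

  mean(max.col(P) == y)   # training accuracy, close to 1 on this easy data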

Enjoy!

PS: For R, Jun’s blog looks like one you need to start following!

Data from the World Health Organization API

Monday, February 8th, 2016

Data from the World Health Organization API by Peter’s stats stuff – R.

From the post:

Eric Persson released yesterday a new WHO R package which allows easy access to the World Health Organization’s data API. He’s also done a nice vignette introducing its use.

I had a play and found it was easy access to some interesting data. Some time down the track I might do a comparison of this with other sources, the most obvious being the World Bank’s World Development Indicators, to identify relative advantages – there’s a lot of duplication of course. It’s a nice problem to have, too much data that’s too easy to get hold of. I wish we’d had that problem when I studied aid and development last century – I vividly remember re-keying numbers from almanac-like hard copy publications, and pleased we were to have them too!

Here’s a plot showing country-level relationships between the latest data of three indicators – access to contraception, adolescent fertility, and infant mortality – that help track the Millennium Development Goals.
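
The basic calls, as I read the package (treat the function names and the indicator code as assumptions to verify against the vignette):

  library(WHO)

  codes <- get_codes()                    # catalogue of available indicators
  head(codes)

  life_exp <- get_data("WHOSIS_000001")   # assumed code for life expectancy at birth
  head(life_exp)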

With visualizations and R code!

A nice way to start off your data mining week!

Enjoy!

I first saw this in a tweet by Christophe Lalanne.

Using ‘R’ for betting analysis [Data Science For The Rest Of Us]

Wednesday, January 13th, 2016

Using 'R' for betting analysis by Mirio Mella.

From the post:

Gaining an edge in betting often boils down to intelligent data analysis, but faced with daunting amounts of data it can be hard to know where to start. If this sounds familiar, R – an increasingly popular statistical programming language widely used for data analysis – could be just what you’re looking for.

What is R?

R is a statistical programming language that is used to visualize and analyse data. Okay, this sounds a little intimidating but actually it isn’t as scary as it may appear. Its creators – two professors from New Zealand – wanted an intuitive statistical platform that their students could use to slice and dice data and create interesting visual representation like 3D graphs.

Given its relative simplicity but endless scope for applications (packages) R has steadily gained momentum amongst the world’s brightest statisticians and data scientists. Facebook use R for statistical analysis of status updates and many of the complex word clouds you might see online are powered by R.

There are now thousands of user created libraries to enhance R functionality and given how much successful betting boils down to effective data analysis, packages are being created to perform betting related analysis and strategies.

On a day when the PowerBall lottery has a jackpot of $1.5 billion, a post on betting analysis is appropriate.

Especially since most data science articles are about sentiment analysis or recommendations, which is great if you are marketing videos in a streaming environment across multiple media channels.

At home? Not so much.

Mirio’s introduction to R walks you through getting R installed along with a library for Pinnacle Sports for odds conversion.
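
Even before installing that library, the underlying odds arithmetic is a few lines of base R (these helper functions are mine, not part of the Pinnacle package):

  decimal_to_prob     <- function(d) 1 / d   # implied probability from decimal odds
  american_to_decimal <- function(a) ifelse(a > 0, 1 + a / 100, 1 + 100 / abs(a))

  decimal_to_prob(2.50)               # 0.40
  american_to_decimal(c(150, -200))   # 2.5 and 1.5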

No guarantees on your betting performance, but having a subject you are interested in, like betting, makes it much more likely you will learn R.

Enjoy!

Ggplot2 Quickref

Wednesday, January 6th, 2016

Ggplot2 Quickref by Selva Prabhakaran.

If you use ggplot2, map this to a “hot” key on your keyboard.

Enjoy!

Jane, John … Leslie? A Historical Method for Algorithmic Gender Prediction [Gatekeeping]

Tuesday, January 5th, 2016

Jane, John … Leslie? A Historical Method for Algorithmic Gender Prediction by Cameron Blevins and Lincoln Mullen.

Abstract:

This article describes a new method for inferring the gender of personal names using large historical datasets. In contrast to existing methods of gender prediction that treat names as if they are timelessly associated with one gender, this method uses a historical approach that takes into account how naming practices change over time. It uses historical data to measure the likelihood that a name was associated with a particular gender based on the time or place under study. This approach generates more accurate results for sources that encompass changing periods of time, providing digital humanities scholars with a tool to estimate the gender of names across large textual collections. The article first describes the methodology as implemented in the gender package for the R programming language. It goes on to apply the method to a case study in which we examine gender and gatekeeping in the American historical profession over the past half-century. The gender package illustrates the importance of incorporating historical approaches into computer science and related fields.
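
In use, the historical flavour shows up as a year range passed alongside the name; something like the following, with the exact arguments worth checking against the package documentation:

  library(gender)

  # "Leslie" drifts from mostly male to mostly female over the 20th century
  gender("leslie", years = c(1930, 1940), method = "ssa")
  gender("leslie", years = c(1980, 1990), method = "ssa")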

An excellent introduction to the gender package for R, historical grounding of the detection of gender by name, with the highlight of the article being the application of this technique to professional literature in American history.

It isn’t uncommon to find statistical techniques applied to texts whose authors and editors are beyond the reach of any critic or criticism.

It is less than common to find statistical techniques applied to extant members of a profession.

Kudos to both Blevins and Mullen for refining the detection of gender and for applying that refinement to publishing in American history.

rOpenSci (updated tutorials) [Learn Something, Write Something]

Monday, January 4th, 2016

rOpenSci has updated 16 of its tutorials!

More are on the way!

Need a detailed walk through of what our packages allow you to do? Click on a package below, quickly install it and follow along. We’re in the process of updating existing package tutorials and adding several more in the coming weeks. If you find any bugs or have comments, drop a note in the comments section or send us an email. If a tutorial is available in multiple languages we indicate that with badges, e.g., (English) (Português).

  • alm    Article-level metrics
  • antweb    AntWeb data
  • aRxiv    Access to arXiv text
  • bold    Barcode data
  • ecoengine    Biodiversity data
  • ecoretriever    Retrieve ecological datasets
  • elastic    Elasticsearch R client
  • fulltext    Text mining client
  • geojsonio    GeoJSON/TopoJSON I/O
  • gistr    Work w/ GitHub Gists
  • internetarchive    Internet Archive client
  • lawn    Geospatial Analysis
  • musemeta    Scrape museum metadata
  • rAltmetric    Altmetric.com client
  • rbison    Biodiversity data from USGS
  • rcrossref    Crossref client
  • rebird    eBird client
  • rentrez    Entrez client
  • rerddap    ERDDAP client
  • rfisheries    OpenFisheries.org client
  • rgbif    GBIF biodiversity data
  • rinat    Inaturalist data
  • RNeXML    Create/consume NeXML
  • rnoaa    Client for many NOAA datasets
  • rplos    PLOS text mining
  • rsnps    SNP data access
  • rvertnet    VertNet.org biodiversity data
  • rWBclimate    World Bank Climate data
  • solr    SOLR database client
  • spocc    Biodiversity data one stop shop
  • taxize    Taxonomic toolbelt
  • traits    Trait data
  • treebase    Treebase data
  • wellknown    Well-known text <-> GeoJSON
  • More tutorials on the way.

Good documentation is hard to come by and good tutorials even more so.

Yet, here at rOpenSci you will find thirty-four (34) tutorials and more on the way.

Let’s answer that moronic security saying: See Something, Say Something, with:

Learn Something, Write Something.

Great R packages for data import, wrangling & visualization [+ XQuery]

Tuesday, December 29th, 2015

Great R packages for data import, wrangling & visualization by Sharon Machlis.

From the post:

One of the great things about R is the thousands of packages users have written to solve specific problems in various disciplines — analyzing everything from weather or financial data to the human genome — not to mention analyzing computer security-breach data.

Some tasks are common to almost all users, though, regardless of subject area: data import, data wrangling and data visualization. The table below shows my favorite go-to packages for these three tasks (plus a few miscellaneous ones tossed in). The package names in the table are clickable if you want more information. To find out more about a package once you've installed it, type help(package = "packagename") in your R console (of course substituting the actual package name).

Forty-seven (47) “favorites” sounds a bit on the high side but some people have more than one “favorite” ice cream, or obsession. 😉

You know how I feel about sort-order and I could not detect an obvious one in Sharon’s listing.

So, I extracted the package links/name plus the short description into a new table:

car data wrangling
choroplethr mapping
data.table data wrangling, data analysis
devtools package development, package installation
downloader data acquisition
dplyr data wrangling, data analysis
DT data display
dygraphs data visualization
editR data display
fitbitScraper misc
foreach data wrangling
ggplot2 data visualization
gmodels data wrangling, data analysis
googlesheets data import, data export
googleVis data visualization
installr misc
jsonlite data import, data wrangling
knitr data display
leaflet mapping
listviewer data display, data wrangling
lubridate data wrangling
metricsgraphics data visualization
openxlsx misc
plotly data visualization
plotly data visualization
plyr data wrangling
psych data analysis
quantmod data import, data visualization, data analysis
rcdimple data visualization
RColorBrewer data visualization
readr data import
readxl data import
reshape2 data wrangling
rga Web analytics
rio data import, data export
RMySQL data import
roxygen2 package development
RSiteCatalyst Web analytics
rvest data import, web scraping
scales data wrangling
shiny data visualization
sqldf data wrangling, data analysis
stringr data wrangling
tidyr data wrangling
tmap mapping
XML data import, data wrangling
zoo data wrangling, data analysis

Enjoy!


I want to use XQuery at least once a day in 2016 on my blog. To keep myself honest, I will be posting any XQuery I use.

To sort and extract two of the columns from Sharon's table, I copied the table to a separate file and ran this XQuery:

  1. xquery version "1.0";
  2. <html>
  3. <table>{
  4. for $row in doc("/home/patrick/working/favorite-R-packages.xml")/table/tr
  5. order by lower-case(string($row/td[1]/a))
  6. return <tr>{$row/td[1]} {$row/td[2]}</tr>
  7. }</table>
  8. </html>

One of the nifty aspects of XQuery is that you can sort, as on line 5, in all lower-case on the first <td> element, while returning the same element as written in the original table. Which gives better (IMHO) sort order than UPPERCASE followed by lowercase.

This same technique should make you the master of any simple tables you encounter on the web.

PS: You should always acknowledge the source of your data and the original author.

I first saw Sharon’s list in a tweet by Christophe Lalanne.

Fun with facets in ggplot2 2.0

Saturday, December 26th, 2015

Fun with facets in ggplot2 2.0 by Bob Rudis.

From the post:

ggplot2 2.0 provides new facet labeling options that make it possible to create beautiful small multiple plots or panel charts without resorting to icky grob manipulation.

Very appropriate for this year in Georgia (US) at any rate. Facets are used to display temperature by year and temperature versus kWh by year.
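
For a flavour of the 2.0-era labelling options, labeller arguments now plug straight into facet_wrap(); the data here are a built-in stand-in, not Bob's temperature and kWh series:

  library(ggplot2)

  ggplot(mpg, aes(displ, hwy)) +
    geom_point() +
    facet_wrap(~ year, labeller = label_both)   # strips read "year: 1999", "year: 2008"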

The high today, 26th of December, 2015, is projected to be 77°F.

Sigh, that’s just not December weather.