Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

November 17, 2015

Building Software, Building Community: Lessons from the rOpenSci Project

Filed under: Open Source,R,Science — Patrick Durusau @ 6:36 pm

Building Software, Building Community: Lessons from the rOpenSci Project by Carl Boettiger, Scott Chamberlain, Edmund Hart, Karthik Ram.

Abstract:

rOpenSci is a developer collective originally formed in 2011 by graduate students and post-docs from ecology and evolutionary biology to collaborate on building software tools to facilitate a more open and synthetic approach in the face of the transformative rise of large and heterogeneous data. Born on the internet (the collective only began through chance discussions over social media), we have grown into a widely recognized effort that supports an ecosystem of some 45 software packages, engages scores of collaborators, has taught dozens of workshops around the world, and has secured over $480,000 in grant support. As young scientists working in an academic context largely without direct support for our efforts, we have first-hand experience with most of the technical and social challenges WSSSPE seeks to address. In this paper we provide an experience report which describes our approach and success in building an effective and diverse community.

Given the state of world affairs, I can’t think of a better time for the publication of this article.

The key lesson that I urge you to draw from this paper is the proactive stance of the project in involving and reaching out to build a community around this project.

Too many projects (and academic organizations for that matter) take the approach that others know they exist and so they sit waiting for volunteers and members to queue up.

Very often they are surprised and bitter that the queue of volunteers and members is so sparse. If anyone dares to venture that more outreach might be helpful, the response is nearly always, sure, you go do that and let us know when it is successful.

How proactive are you in promoting your favorite project?

PS: The rOpenSci website.

November 14, 2015

4 Tips to Learn More About ACS Data [$400 Billion Market, 3X Big Data]

Filed under: BigData,Census Data,R — Patrick Durusau @ 9:59 pm

4 Tips to Learn More About ACS Data by Ari Lamstein.

From the post:

One of the highlights of my recent east coast trip was meeting Ezra Haber Glenn, the author of the acs package in R. The acs package is my primary tool for accessing census data in R, and I was grateful to spend time with its author. My goal was to learn how to “take the next step” in working with the census bureau’s American Community Survey (ACS) dataset. I learned quite a bit during our meeting, and I hope to share what I learned over the coming weeks on my blog.

Today I’ll share 4 tips to help you get started in learning more. Before doing that, though, here is some interesting trivia: did you know that the ACS impacts how over $400 billion is allocated each year?

If the $400 billion got your attention, follow the tips in Ari’s post first, look for more posts in that series second, then visit the American Community Survey (ACS) website.
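Not from Ari’s post, but if you want to poke at ACS data from R right away, here is a minimal sketch with the acs package. The table number and geography are my choices, and you will need your own (free) Census API key:

```r
library(acs)

# Assumptions: a Census API key you request yourself and table B01003
# (total population); neither comes from Ari's post.
api.key.install("YOUR_CENSUS_API_KEY")

geo <- geo.make(state = "NY", county = "*")  # every county in New York
pop <- acs.fetch(endyear = 2013, geography = geo, table.number = "B01003")
head(estimate(pop))  # point estimates; standard.error(pop) for the margins
```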

For comparison purposes, keep in mind that Forbes projects the Big Data Analytics market in 2015 to be a paltry $125 billion.

The ACS data market is over three times larger: $400 billion (ACS) versus $125 billion (Big Data) for 2015.

Suddenly, ACS data and R look quite attractive.

November 12, 2015

R for cats

Filed under: Programming,R — Patrick Durusau @ 5:36 pm

An intro to R for new programmers by Scott Chamberlain.

From the webpage:

This is an introduction to R. I promise this will be fun. Since you have never used a programming language before, or any language for that matter, you won’t be tainted by other programming languages with different ways of doing things. This is good – we can teach you the R way of doing things.

Scott says this site is a rip off of JSforcats.com and I suggest we take his word for it.

If being “for cats” interests people who would not otherwise study either language, great.

Enjoy!

November 10, 2015

Editors’ Choice: An Introduction to the Textreuse Package [+ A Counter Example]

Filed under: R,Similarity,Similarity Retrieval,Text Mining — Patrick Durusau @ 5:58 pm

Editors’ Choice: An Introduction to the Textreuse Package by Lincoln Mullen.

From the post:

A number of problems in digital history/humanities require one to calculate the similarity of documents or to identify how one text borrows from another. To give one example, the Viral Texts project, by Ryan Cordell, David Smith, et al., has been very successful at identifying reprinted articles in American newspapers. Kellen Funk and I have been working on a text reuse problem in nineteenth-century legal history, where we seek to track how codes of civil procedure were borrowed and modified in jurisdictions across the United States.

As part of that project, I have recently released the textreuse package for R to CRAN. (Thanks to Noam Ross for giving this package a very thorough open peer review for rOpenSci, to whom I’ve contributed the package.) This package is a general purpose implementation of several algorithms for detecting text reuse, as well as classes and functions for investigating a corpus of texts. Put most simply, full text goes in and measures of similarity come out. (emphasis added)

Kudos to Lincoln on this important contribution to the digital humanities! Not to mention the package will also be useful for researchers who want to compare the “similarity” of texts as “subjects” for purposes of elimination of duplication (called merging in some circles) for presentation to a reader.
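To see how little ceremony is involved, here is a minimal sketch (the directory of plain-text files is my assumption, not Lincoln’s example):

```r
library(textreuse)

# Hypothetical corpus: a directory of plain-text documents
corpus <- TextReuseCorpus(dir = "texts/",
                          tokenizer = tokenize_ngrams, n = 5)

# Full text goes in, measures of similarity come out
scores <- pairwise_compare(corpus, jaccard_similarity)
round(scores, 2)  # matrix of document-pair similarities
```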

I highlighted

Put most simply, full text goes in and measures of similarity come out.

to offer a cautionary tale about the assumption that a high measure of similarity is an indication of the “source” of a text.

Louisiana, my home state, is the only civilian jurisdiction in the United States. Louisiana law, more at one time than now, is based upon Roman law.

Roman law and laws based upon it have a very deep and rich history that I won’t even attempt to summarize.

It is sufficient for present purposes to say the Digest of the Civil Laws now in Force in the Territory of Orleans (online version, English/French) was enacted in 1808.

A scholarly dispute arose (1971-1972) between Professor Batiza (Tulane), who considered the Digest to reflect the French civil code, and Professor Pascal (LSU), who argued that, despite quoting the French civil code quite liberally, the redactors intended to codify the Spanish civil law in force at the time of the Louisiana Purchase.

The Batiza vs. Pascal debate was carried out at length and in public:

Batiza, The Louisiana Civil Code of 1808: Its Actual Sources and Present Relevance, 46 TUL. L. REV. 4 (1971); Pascal, Sources of the Digest of 1808: A Reply to Professor Batiza, 46 TUL.L.REV. 603 (1972); Sweeney, Tournament of Scholars over the Sources of the Civil Code of 1808, 46 TUL. L. REV. 585 (1972); Batiza, Sources of the Civil Code of 1808, Facts and Speculation: A Rejoinder, 46 TUL. L. REV. 628 (1972).

I could not find any freely available copies of those articles online. (Don’t encourage paywalls by paying to access such material. Find it at your local law library.)

There are a couple of secondary articles that discuss the dispute: A.N. Yiannopoulos, The Civil Codes of Louisiana, 1 CIV. L. COMMENT. 1, 1 (2008) at http://www.civil-law.org/v01i01-Yiannopoulos.pdf, and John W. Cairns, The de la Vergne Volume and the Digest of 1808, 24 Tulane European & Civil Law Forum 31 (2009), which are freely available online.

You won’t get the full details from the secondary articles but they do capture some of the flavor of the original dispute. I can report (happily) that over time, Pascal’s position has prevailed. Textual history is more complex than rote counting techniques can capture.

A far more complex case of “text similarity” than Lincoln addresses in the Textreuse package, but once you move beyond freshman/doctoral plagiarism, the “interesting cases” are all complicated.

November 8, 2015

600 websites about R [How to Avoid Duplicate Content?]

Filed under: Indexing,R,Searching — Patrick Durusau @ 9:44 pm

600 websites about R by Laetitia Van Cauwenberge.

From the post:

Anyone interested in categorizing them? It could be an interesting data science project, scraping these websites, extracting keywords, and categorizing them with a simple indexation or tagging algorithm. For instance, some of these blogs cater to stats, or Bayesian stats, or R libraries, or R training, or visualization, or anything else. This indexation technique was used here to classify 2,500 data science websites. For web crawling tutorials, click here or here.

BTW, Laetitia lists, with links, all 600 R sites.

How many of those R sites will you visit?

Or will you scan the list for your site or your favorite R site?

For that matter, how much duplicated content are you going to find at those R sites?

All have some unique content, but neither an index nor a classification will help you find it.

Thinking of this as a potential data science experiment, we have a list of 600 sites with content related to R.

What would be your next step towards avoiding duplicated content?

By what criteria would you judge “success” in avoiding duplicate content?
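As a first, crude step, here is a hedged sketch (not from Laetitia’s post): fingerprint each site’s text and flag exact duplicates. Anything subtler, near-duplicate detection for example, is where packages like textreuse come in.

```r
library(rvest)
library(digest)

# Hypothetical: `urls` holds the 600 links from the post
fingerprint <- function(url) {
  text <- html_text(read_html(url))  # page text only, no markup
  digest(text)                       # one hash per site
}

hashes <- vapply(urls, fingerprint, character(1))
urls[duplicated(hashes)]  # candidate duplicate-content sites
```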

November 6, 2015

Learn R From Scratch

Filed under: Programming,R — Patrick Durusau @ 11:52 am

Learn R From Scratch

From the description:

A Channel dedicated to R Programming – The language of Data Science. We notice people learning the language in parts, so the initial lectures are dedicated to teaching the language to aspiring Data Science Professionals, in a structured fashion so that you learn the language completely and are able to contribute back to the community. Upon taking the course, you will appreciate the inherent brilliance of R.

If I haven’t missed anything, thirty-seven (37) R videos await your viewing pleasure!

None of the videos are long, the vast majority being shorter than four (4) minutes, but a skilled instructor can put a lot into a four minute video.

The short length means you can catch a key concept and go on to practice it before it fades from memory. Plus you can find time for a short video when finding time for an hour lecture is almost impossible.

Enjoy!

November 2, 2015

Visualizing Chess Data With ggplot

Filed under: Games,Ggplot2,R,Visualization — Patrick Durusau @ 11:33 am

Visualizing Chess Data With ggplot by Joshua Kunst.

Sales of traditional chess sets peak during the holiday season. The following graphic does not include sales of chess gadgets, chess software, or chess computers:

[Graphic: weekly dollar sales of tabletop games on eBay, via Terapeak Trends]

(Source: Terapeak Trends: Which Tabletop Games Sell Best on eBay? by Aron Hsiao.)

Joshua’s post is a guide to using and visualizing chess data under the following topics:

  1. The Data
  2. Piece Movements
  3. Survival rates
  4. Square usage by player
  5. Distributions for the first movement
  6. Who captures whom
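Not Joshua’s code, but a minimal sketch of the square-usage idea with fake counts, to show how little ggplot2 it takes:

```r
library(ggplot2)

# Hypothetical data: visit counts per square; real counts would come
# from parsed game data (e.g., via the rchess package)
squares <- expand.grid(file = letters[1:8], rank = 1:8)
squares$n <- rpois(64, lambda = 5)

ggplot(squares, aes(x = file, y = factor(rank), fill = n)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "darkred") +
  labs(title = "Square usage", x = NULL, y = NULL)
```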

Joshua is using public chess data but it’s just a short step to using data from your own chess games or those of friends from your local chess club. 😉

Visualize the play of openings, defenses, players + openings/defenses; you are limited only by your imagination.

Give a chess friend a visualization they can’t buy in any store!

PS: Check out: rchess a Chess Package for R also by Joshua Kunst.

I first saw this in a tweet by Christophe Lalanne.

October 19, 2015

Introduction to Data Science (3rd Edition)

Filed under: Data Science,R — Patrick Durusau @ 9:05 pm

Introduction to Data Science, 3rd Edition by Jeffrey Stanton.

From the webpage:

In this Introduction to Data Science eBook, a series of data problems of increasing complexity is used to illustrate the skills and capabilities needed by data scientists. The open source data analysis program known as “R” and its graphical user interface companion “R-Studio” are used to work with real data examples to illustrate both the challenges of data science and some of the techniques used to address those challenges. To the greatest extent possible, real datasets reflecting important contemporary issues are used as the basis of the discussions.

A very good introductory text on data science.

I originally saw a tweet about the second edition but searching on the title and Stanton uncovered this later version.

In the timeless world of the WWW, the amount of out-dated information vastly exceeds the latest. Check for updates before broadcasting your latest “find.”

October 7, 2015

Treasure Trove of R Scripts…

Filed under: Open Source,R — Patrick Durusau @ 8:30 pm

Treasure Trove of R Scripts for Auto Classification, Chart Generation, Solr, Mongo, MySQL and Ton More by Jitender Aswani.

From the post:

In this repository hosted at github, the datadolph.in team is sharing all of the R codebase that it developed to analyze large quantities of data.

The datadolph.in team has benefited tremendously from fellow R bloggers and other open source communities and is proud to contribute all of its codebase to the community.

The codebase includes ETL and integration scripts on –

  • R-Solr Integration
  • R-Mongo Interaction
  • R-MySQL Interaction
  • Fetching, cleansing and transforming data
  • Classification (identify column types)
  • Default chart generation (based on simple heuristics and matching a dimension with a measure)

Github Source: https://github.com/datadolphyn/R

I count twenty-two (22) R scripts in this generous donation back to the R community!

Enjoy!

Some key Win-Vector serial data science articles

Filed under: Data Science,R,Statistics — Patrick Durusau @ 8:20 pm

Some key Win-Vector serial data science articles by John Mount.

From the post:

As readers have surely noticed the Win-Vector LLC blog isn’t a stream of short notes, but instead a collection of long technical articles. It is the only way we can properly treat topics of consequence.

  • Statistics to English translation.

    This series tries to find vibrant applications and explanations of standard good statistical practices, to make them more approachable to the non statistician.

  • Statistics as it should be.

    This series tries to cover cutting edge machine learning techniques, and then adapt and explain them in traditional statistical terms.

  • R as it is.

    This series tries to teach the statistical programming language R “warts and all” so we can see it as the versatile and powerful data science tool that it is.

More than enough reasons to start haunting the Win-Vector LLC blog on a regular basis.

Perhaps an inspiration to do more long-form posts as well.

October 2, 2015

Workflow for R & Shakespeare

Filed under: Literature,R,Text Corpus,Text Mining — Patrick Durusau @ 2:00 pm

A new data processing workflow for R: dplyr, magrittr, tidyr, ggplot2

From the post:

Over the last year I have changed my data processing and manipulation workflow in R dramatically. Thanks to some great new packages like dplyr, tidyr and magrittr (as well as the less-new ggplot2) I've been able to streamline code and speed up processing. Up until 2014, I had used essentially the same R workflow (aggregate, merge, apply/tapply, reshape etc) for more than 10 years. I have added a few improvements over the years in the form of functions in packages doBy, reshape2 and plyr and I also flirted with the package data.table (which I found to be much faster for big datasets but the syntax made it difficult to work with) — but the basic flow has remained remarkably similar. Until now…

Given how much I've enjoyed the speed and clarity of the new workflow, I thought I would share a quick demonstration.

In this example, I am going to grab data from a sample SQL database provided by Google via Google BigQuery and then give examples of manipulation using dplyr, magrittr and tidyr (and ggplot2 for visualization).

This is a great introduction to a work flow in R that you can generalize for your own purposes.

Word counts won’t impress your English professor but you will have a base for deeper analysis of Shakespeare.
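For the flavor of the workflow without the BigQuery setup, a minimal sketch, assuming a local plain-text copy of a play:

```r
library(dplyr)

# Hypothetical file; Project Gutenberg has plain-text Shakespeare
lines <- tolower(readLines("hamlet.txt"))

data_frame(word = unlist(strsplit(lines, "[^a-z']+"))) %>%
  filter(word != "") %>%
  count(word, sort = TRUE) %>%
  head(10)  # the word counts your English professor will scoff at
```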

I first saw this in a tweet by Christophe Lalanne.

September 21, 2015

Python & R codes for Machine Learning

Filed under: Machine Learning,Python,R — Patrick Durusau @ 7:54 pm

While I am thinking about machine learning, I wanted to mention: Cheatsheet – Python & R codes for common Machine Learning Algorithms by Manish Saraswat.

From the post:

In his famous book – Think and Grow Rich, Napoleon Hill narrates the story of Darby, who after digging for a gold vein for a few years walks away from it when he was three feet away from it!

Now, I don’t know whether the story is true or false. But, I surely know of a few Data Darby around me. These people understand the purpose of machine learning, its execution and use just a set of 2-3 algorithms on whatever problem they are working on. They don’t update themselves with better algorithms or techniques, because they are too tough or too time consuming.

Like Darby, they are surely missing from a lot of action after reaching this close! In the end, they give up on machine learning by saying it is very computation heavy or it is very difficult or I can’t improve my models above a threshold – what’s the point? Have you heard them?

Today’s cheat sheet aims to change a few Data Darby’s to machine learning advocates. Here’s a collection of 10 most commonly used machine learning algorithms with their codes in Python and R. Considering the rising usage of machine learning in building models, this cheat sheet is good to act as a code guide to help you bring these machine learning algorithms to use. Good Luck!

Here’s a very good idea! Whether you want to learn these algorithms or a new Emacs mode. 😉

Sure, you can always look up the answer but that breaks your chain of thought, over and over again.

Enjoy!

September 10, 2015

Spark Release 1.5.0

Filed under: Data Frames,GraphX,Machine Learning,R,Spark,Streams — Patrick Durusau @ 1:42 pm

Spark Release 1.5.0

From the post:

Spark 1.5.0 is the sixth release on the 1.x line. This release represents 1400+ patches from 230+ contributors and 80+ institutions. To download Spark 1.5.0 visit the downloads page.

You can consult JIRA for the detailed changes. We have curated a list of high level changes here.

Time for your Fall Spark Upgrade!

Enjoy!

September 1, 2015

Looking after Datasets

Filed under: Curation,Dataset,R — Patrick Durusau @ 9:06 pm

Looking after Datasets by Antony Unwin.

Some examples that Antony uses to illustrate the problems with datasets in R:

You might think that supplying a dataset in an R package would be a simple matter: You include the file, you write a short general description mentioning the background and giving the source, you define the variables. Perhaps you provide some sample analyses and discuss the results briefly. Kevin Wright's agridat package is exemplary in these respects.

As it happens, there are a couple of other issues that turn out to be important. Is the dataset or a version of it already in R and is the name you want to use for the dataset already taken? At this point the experienced R user will correctly guess that some datasets have the same name but are quite different (e.g., movies, melanoma) and that some datasets appear in many different versions under many different names. The best example I know is the Titanic dataset, which is available in the datasets package. You will also find titanic (COUNT, prLogistic, msme), titanic.dat (exactLoglinTest), titan.Dat (elrm), titgrp (COUNT), etitanic (earth), ptitanic (rpart.plot), Lifeboats (vcd), TitanicMat (RelativeRisk), Titanicp (vcdExtra), TitanicSurvival (effects), Whitestar (alr4), and one package, plotrix, includes a manually entered version of the dataset in one of its help examples. The datasets differ on whether the crew is included or not, on the number of cases, on information provided, on formatting, and on discussion, if any, of analyses. Versions with the same names in different packages are not identical. There may be others I have missed.

The issue came up because I was looking for a dataset of the month for the website of my book "Graphical Data Analysis with R". The plan is to choose a dataset from one of the recently released or revised R packages and publish a brief graphical analysis to illustrate and reinforce the ideas presented in the book while showing some interesting information about the data. The dataset finch in dynRB looked rather nice: five species of finch with nine continuous variables and just under 150 cases. It looked promising and what’s more it is related to Darwin’s work and there was what looked like an original reference from 1904.

As if Antony’s list of issues wasn’t enough, how do you capture your understanding of a problem with a dataset?

That is, you have discovered the meaning of a variable that isn’t recorded with the dataset. Where are you going to put that information?

You could modify the original dataset to capture that new information but then people will have to discover your version of the original dataset. Not to mention you need to avoid stepping on something else in the original dataset.
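Base R does offer two lightweight hooks for exactly this, though neither solves the discovery problem; a sketch:

```r
# Toy data frame standing in for a package dataset
finches <- data.frame(wing = c(67, 71), species = c("fortis", "fuliginosa"))

# comment() attaches free-text metadata that travels with the object
comment(finches) <- "wing is in mm; species coding differs from the 1904 reference"

# arbitrary attributes survive saveRDS()/readRDS() as well
attr(finches, "provenance") <- "hypothetical example, not the dynRB finch data"

comment(finches)
```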

Antony concludes:

…returning to Moore’s definition of data, wouldn’t it be a help to distinguish proper datasets from mere sets of numbers in R?

Most people have an intersecting idea of a “proper dataset” but I would spend less time trying to define that and more on capturing the context of whatever appears to me to be a “proper dataset.”

More data is never a bad thing.

August 29, 2015

DataPyR

Filed under: Data Science,Python,R — Patrick Durusau @ 3:19 pm

DataPyR by Kranthi Kumar.

Twenty (20) lists of programming resources on data science, Python and R.

A much easier collection of resources to scan than attempting to search for resources on any of these topics.

At the same time, you have to visit each resource and mine it for an answer to any particular problem.

For example, there is a list of Python Packages for Datamining, which is useful, but even more useful would be a list of common datamining tasks with pointers to particular data mining libraries. That would enable users to search across multiple libraries by task, as opposed to exploring each library.

Expand that across a set of resources on data science, Python and R and you’re talking about saving time and resources across the entire community.

I first saw this in a tweet by Kirk Borne.

August 28, 2015

Mass Shootings [Don’t] Fudg[e] The Numbers

Filed under: News,R,Reporting — Patrick Durusau @ 8:21 pm

Mass Shootings Are Horrifying Enough Without Fudging The Numbers by Bob Rudis (@hrbrmstr).

From the post:

Business Insider has a piece titled “We are now averaging more than one mass shooting per day in 2015”, with a lead paragraph of:

As of August 26th, the US has had 247 mass shootings in the 238 days of 2015.

They go on to say that the data they used in their analysis comes from the Mass Shootings Tracker. That site lists 249 incidents of mass shootings from January 1st to August 28th.

The problem is you can’t just use simple, inflammatory math to make the point about the shootings. A shooting did not occur every day. In fact, there were only 149 days with shootings. Let’s take a look at the data.

We’ll first verify that we are working with the same data that’s on the web site by actually grabbing the data from the web site:

Complete with R code and graphs to show days with multiple mass shootings on one day.
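The heart of the correction fits in a few lines; a sketch, assuming `shootings` is a data frame with one row per incident and a Date column `date`, as scraped in Bob’s post:

```r
library(dplyr)

shootings %>%
  summarise(incidents = n(),
            days_with_shootings = n_distinct(date))

# Days so far in 2015 with no mass shooting
all_days <- seq(as.Date("2015-01-01"), as.Date("2015-08-26"), by = "day")
sum(!all_days %in% shootings$date)
```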

Be mindful that the Mass Shooting Tracker counts four (4) or more people being shot as a mass shooting. Under earlier definitions four (4) or more people had to be murdered for it to be a mass shooting.

BTW, there were ninety-one (91) days with no mass shootings this year (so far).

August 14, 2015

Proofing R Functions Cheatsheet?

Filed under: R — Patrick Durusau @ 8:04 pm

Lillian Pierson has posted a cheatsheet of R functions.

If you want to do a good deed this weekend, lend a hand with the proofing.

What do you make of the ordering under each heading? I prefer alphabetical. You?

Enjoy!

July 29, 2015

Text Processing in R

Filed under: R,Text Mining — Patrick Durusau @ 1:08 pm

Text Processing in R by Matthew James Denny.

From the webpage:

This tutorial goes over some basic concepts and commands for text processing in R. R is not the only way to process text, nor is it really the best way. Python is the de-facto programming language for processing text, with a lot of builtin functionality that makes it easy to use, and pretty fast, as well as a number of very mature and full featured packages such as NLTK and textblob. Basic shell scripting can also be many orders of magnitude faster for processing extremely large text corpora — for a classic reference see Unix for Poets. Yet there are good reasons to want to use R for text processing, namely that we can do it, and that we can fit it in with the rest of our analyses. I primarily make use of the stringr package for the following tutorial, so you will want to install it:

Perhaps not the best tool for text processing but if you are inside R and have text processing needs, this will get you started.
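A taste of the basics on toy data (my examples, not Matthew’s):

```r
library(stringr)

s <- c("The cat sat.", "A dog!  A DOG!")

str_detect(s, "cat")                   # TRUE FALSE
str_replace_all(s, "[[:punct:]]", "")  # strip punctuation
str_split(tolower(s), "\\s+")          # crude whitespace tokenizer
```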

June 30, 2015

RStudio Cheatsheets

Filed under: R — Patrick Durusau @ 7:29 pm

RStudio Cheatsheets

RStudio, from whence so many good things for R come, has cheatsheets on:

  • Shiny Cheat Sheet (interactive web apps)
  • Data Visualization Cheat Sheet (ggplot2)
  • Package Development Cheat Sheet (devtools)
  • Data Wrangling Cheat Sheet (dplyr and tidyr)
  • R Markdown Cheat Sheet
  • R Markdown Reference Guide

And, all of the above are offered in Chinese, Dutch, French, German, and Spanish translations.

Have an R related cheatsheet about to burn a hole in your pocket? Or a high quality translation? RStudio is ready with details and templates at How to Contribute a Cheatsheet.

Enjoy!

June 29, 2015

Streaming Data IO in R

Filed under: JSON,MongoDB,R,Streams — Patrick Durusau @ 2:59 pm

Streaming Data IO in R – curl, jsonlite, mongolite by Jeroen Ooms.

Abstract:

The jsonlite package provides a powerful JSON parser and generator that has become one of the standard methods for getting data in and out of R. We discuss some recent additions to the package, in particular support for streaming (large) data over http(s) connections. We then introduce the new mongolite package: a high-performance MongoDB client based on jsonlite. MongoDB (from “humongous”) is a popular open-source document database for storing and manipulating very big JSON structures. It includes a JSON query language and an embedded V8 engine for in-database aggregation and map-reduce. We show how mongolite makes inserting and retrieving R data to/from a database as easy as converting it to/from JSON, without the bureaucracy that comes with traditional databases. Users that are already familiar with the JSON format might find MongoDB a great companion to the R language and will enjoy the benefits of using a single format for both serialization and persistency of data.

R, JSON, MongoDB, what’s there not to like? 😉
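A minimal sketch of the round trip; the local MongoDB instance and collection name are my assumptions:

```r
library(jsonlite)
library(mongolite)

m <- mongo(collection = "diamonds", db = "test")
m$insert(ggplot2::diamonds)               # data frame in...
premium <- m$find('{"cut" : "Premium"}')  # ...JSON query out

# jsonlite's streaming: one JSON record per line
tmp <- tempfile()
stream_out(head(premium), file(tmp))
stream_in(file(tmp))
```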

From UseR! 2015.

Enjoy!

June 26, 2015

Top 10 data mining algorithms in plain R

Filed under: Data Mining,R — Patrick Durusau @ 3:40 pm

Top 10 data mining algorithms in plain R by Raymond Li.

From the post:

Knowing the top 10 most influential data mining algorithms is awesome.

Knowing how to USE the top 10 data mining algorithms in R is even more awesome.

That’s when you can slap a big ol’ “S” on your chest…

…because you’ll be unstoppable!

Today, I’m going to take you step-by-step through how to use each of the top 10 most influential data mining algorithms as voted on by 3 separate panels in this survey paper.

By the end of this post…

You’ll have 10 insanely actionable data mining superpowers that you’ll be able to use right away.

The table of contents follows his Top 10 data mining algorithms in plain English, with additions for R.

I would not be at all surprised to see these top ten (10) algorithms show up in other popular data mining languages.
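For a taste of how plain “plain R” can be, one of the ten (k-means) needs nothing beyond base R:

```r
data(iris)
fit <- kmeans(iris[, 1:4], centers = 3, nstart = 20)
table(cluster = fit$cluster, species = iris$Species)  # clusters vs. labels
```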

Enjoy!

June 20, 2015

Creating-maps-in-R

Filed under: Mapping,Maps,R — Patrick Durusau @ 2:37 pm

Creating-maps-in-R by Robin Lovelace.

From the webpage:

Introductory tutorial on graphical display of geographical information in R, to contribute to teaching material. For the context of this tutorial and a video introduction, please see here: http://robinlovelace.net/r/2014/01/30/spatial-data-with-R-tutorial.html

All of the information needed to run the tutorial is contained in a single pdf document that is kept updated: see github.com/Robinlovelace/Creating-maps-in-R/raw/master/intro-spatial-rl.pdf.

By the end of the tutorial you should have the confidence and skills needed to convert a diverse range of geographical and non-geographical datasets into meaningful analyses and visualisations. Using data and code provided in this repository all of the results are reproducible, culminating in publication-quality maps such as the faceted map of London’s population below:

Quite a treat in thirty (30) pages! You will have R and some basic spatial data packages installed and be well on your way to creating maps in R. From a topic map perspective, the joining of attributes to polygons is quite similar to adding properties to topics, assuming you want to treat each polygon as a subject to be represented by a topic.
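The attribute-joining step looks roughly like this hedged sketch (not the tutorial’s exact code; `boroughs` and `stats` are hypothetical):

```r
library(sp)

# boroughs: a SpatialPolygonsDataFrame; stats: a plain data frame
# keyed by borough name
boroughs <- merge(boroughs, stats, by.x = "NAME", by.y = "borough")
spplot(boroughs, "population")  # quick choropleth of the joined column
```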

Enjoy!

PS:

You will also enjoy:

Cheshire, J. & Lovelace, R. (2014). Spatial data visualisation with R. In Geocomputation, a Practical Primer. In Press with Sage. Preprint available online

and other publications by Robin.

June 18, 2015

stationaRy (R package)

Filed under: R,Weather Data — Patrick Durusau @ 5:20 pm

stationaRy by Richard Iannone.

From the webpage:

Get hourly meteorological data from one of thousands of global stations.

Want some tools to acquire and process meteorological and air quality monitoring station data? Well, you’ve come to the right repo. So far, because this is merely the beginning, there’s only a few functions that get you data. These are:

  • get_ncdc_station_info
  • select_ncdc_station
  • get_ncdc_station_data

They will help you get the hourly met data you need from a met station located somewhere on Earth.
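Strung together they look something like this sketch; the argument names are my reading of the README at the time and may differ in later versions:

```r
library(stationaRy)
library(magrittr)

met <- get_ncdc_station_info(startyear = 2013, endyear = 2015) %>%
  select_ncdc_station(name = "heathrow") %>%
  get_ncdc_station_data(startyear = 2013, endyear = 2015)
```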

I’m old school about the weather. I go outside to check on it. 😉

But, my beloved is interested in earthquakes, volcanoes, hurricanes, weather, etc. so I track resources for those.

Some weather conditions lend themselves more to some activities than others, as Hitler discovered in the winter of 1943-44. Weather can help or hinder your plans, whatever those may be.

You may like the Farmer’s Almanac, but it isn’t a good source for strategic weather data. Try stationaRy.

If you know of any unclassified military strategy guides that cover collection and analysis of weather data, give me a shout.

June 15, 2015

htmlwidgets for Rich Data Visualization in R

Filed under: Graphics,R,Visualization — Patrick Durusau @ 6:04 pm

htmlwidgets for Rich Data Visualization in R

From the webpage:

With the booming popularity of big data and data science, nice visualizations are getting a lot of attention. Sure, R and Python have built-in support for basic graphs and charts, but what if you want more? What if you want interaction, so you can mouse over or rotate a visualization? What if you want to explore more than a static image? Enter Rich Visualizations.

And, creating them is not as hard as you might think!

Four compelling examples of interactive graphics using htmlwidgets to bring interactivity to R code.

At first I thought this might be useful for an interactive map of cybersecurity incompetence inside the DC beltway but quickly realized that a map with only one uniform feature isn’t all that useful.

I am sure htmlwidgets will be useful for many other visualizations!
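How little it takes, sketched with the leaflet htmlwidget (the coordinates are mine):

```r
library(leaflet)

leaflet() %>%
  addTiles() %>%
  addMarkers(lng = -77.0365, lat = 38.8977,
             popup = "Inside the beltway")  # interactive in any browser
```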

Enjoy!

15 Easy Solutions To Your Data Frame Problems In R

Filed under: Data Frames,R,Spark — Patrick Durusau @ 3:40 pm

15 Easy Solutions To Your Data Frame Problems In R.

From the post:

R’s data frames regularly create somewhat of a furor on public forums like Stack Overflow and Reddit. Starting R users often experience problems with the data frame in R and it doesn’t always seem to be straightforward. But does it really need to be so?

Well, not necessarily.

With today’s post, DataCamp wants to show you that data frames don’t need to be hard: we offer you 15 easy, straightforward solutions to the most frequently occurring problems with data.frame. These issues have been selected from the most recent and sticky or upvoted Stack Overflow posts. If, however, you are more interested in getting an elaborate introduction to data frames, you might consider taking a look at our Introduction to R course.

If you are having trouble with frames in R, you are going to have trouble with frames in Spark.

Questions and solutions you will see here:

  • How To Create A Simple Data Frame in R
  • How To Change A Data Frame’s Row And Column Names
  • How To Check A Data Frame’s Dimensions
  • How To Access And Change A Data Frame’s Values …. Through The Variable Names
  • … Through The [,] and $ Notations
  • Why And How To Attach Data Frames
  • How To Apply Functions To Data Frames
  • How To Create An Empty Data Frame
  • How To Extract Rows And Columns, Subsetting Your Data Frame
  • How To Remove Columns And Rows From A Data Frame
  • How To Add Rows And Columns To A Data Frame
  • Why And How To Reshape A Data Frame From Wide To Long Format And Vice Versa
  • Using stack() For Simply Structured Data Frames
  • Using reshape() For Complex Data Frames
  • Reshaping Data Frames With tidyr
  • Reshaping Data Frames With reshape2
  • How To Sort A Data Frame
  • How To Merge Data Frames
  • Merging Data Frames On Row Names
  • How To Remove Data Frames’ Rows And Columns With NA-Values
  • How To Convert Lists Or Matrices To Data Frames And Back
  • Changing A Data Frame To A Matrix Or List

Rather than looking for a “cheatsheet” on data frames, I suggest you work your way through these solutions, more than once. Over time you will learn the ones relevant to your particular domain.
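For a taste of the first few solutions, all in base R:

```r
df <- data.frame(id = 1:3, score = c(7, 9, 8))      # create
names(df) <- c("student", "grade")                  # rename columns
df[df$grade > 7, ]                                  # subset rows
df$pass <- df$grade >= 8                            # add a column
merge(df, data.frame(student = 1:3, age = 20:22),
      by = "student")                               # merge on a key
```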

Enjoy!

June 10, 2015

Announcing SparkR: R on Spark [Spark Summit next week – free live streaming]

Filed under: Conferences,R,Spark — Patrick Durusau @ 7:37 pm

Announcing SparkR: R on Spark by Shivaram Venkataraman.

From the post:

I am excited to announce that the upcoming Apache Spark 1.4 release will include SparkR, an R package that allows data scientists to analyze large datasets and interactively run jobs on them from the R shell.

R is a popular statistical programming language with a number of extensions that support data processing and machine learning tasks. However, interactive data analysis in R is usually limited as the runtime is single-threaded and can only process data sets that fit in a single machine’s memory. SparkR, an R package initially developed at the AMPLab, provides an R frontend to Apache Spark and using Spark’s distributed computation engine allows us to run large scale data analysis from the R shell.
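The announcement’s examples boil down to something like this sketch; it assumes a local Spark 1.4 install with SparkR on the library path:

```r
library(SparkR)

sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)

df <- createDataFrame(sqlContext, faithful)  # distribute an R data frame
head(filter(df, df$waiting < 50))            # Spark executes the filter
```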

That’s the short news; for the full story, go to the Spark Summit (code Databricks20 gets a 20% discount). That’s next week, June 15-17, in San Francisco, so you need to act quickly.

BTW, you can register for free live streaming!

Looking forward to this!

May 22, 2015

Harvesting Listicles

Filed under: R,Web Scrapers — Patrick Durusau @ 7:51 pm

Scrape website data with the new R package rvest by hkitson@zevross.com.

From the post:

Copying tables or lists from a website is not only a painful and dull activity but it’s error prone and not easily reproducible. Thankfully there are packages in Python and R to automate the process. In a previous post we described using Python’s Beautiful Soup to extract information from web pages. In this post we take advantage of a new R package called rvest to extract addresses from an online list. We then use ggmap to geocode those addresses and create a Leaflet map with the leaflet package. In the interest of coding local, we opted to use, as the example, data on wineries and breweries here in the Finger Lakes region of New York.

Lists and listicles are a common form of web content. Unfortunately, both are difficult to improve without harvesting the content and recasting it.

This post will put you on the right track to harvesting with rvest!
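A generic sketch, not the post’s winery example; the URL and CSS selector are placeholders you would adapt per site:

```r
library(rvest)

page <- read_html("http://example.com/listicle")
items <- page %>%
  html_nodes("li.entry") %>%   # hypothetical selector
  html_text(trim = TRUE)
head(items)
```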

BTW, as a benefit to others, post data that you clean/harvest in a clean format. Yes?

May 19, 2015

Fast parallel computing with Intel Phi coprocessors

Filed under: Parallel Programming,R — Patrick Durusau @ 1:41 pm

Fast parallel computing with Intel Phi coprocessors by Andrew Ekstrom.

Andrew tells a tale of going from more than a week to raise a 10,000×10,000 matrix to the 10^17th power, to 6-8 hours, and then to substantially shorter times. Sigh, using Windows, but still an impressive feat! As you might expect, using Revolution Analytics RRO, Intel’s Math Kernel Library (MKL), Intel Phi coprocessors, etc.

There’s enough detail (I suspect) for you to duplicate this feat on your own Windows box, or perhaps more easily on Linux.
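Andrew’s gains come from hardware and the MKL, but it is worth remembering the algorithmic half of such problems: exponentiation by squaring turns 10^17 naive multiplications into fewer than 60. A base R sketch:

```r
# Matrix power by repeated squaring (exponent must be a whole number)
mat_pow <- function(A, n) {
  result <- diag(nrow(A))
  while (n > 0) {
    if (n %% 2 == 1) result <- result %*% A
    A <- A %*% A
    n <- n %/% 2
  }
  result
}

P <- matrix(c(0.5, 0.5, 0.2, 0.8), 2, 2)  # toy column-stochastic matrix
mat_pow(P, 1e17)
```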

Enjoy!

I first saw this in a tweet by David Smith.

April 14, 2015

Hash Table Performance in R: Part I + Part 2

Filed under: Hashing,R — Patrick Durusau @ 10:53 am

Hash Table Performance in R: Part I + Part 2 by Jeffrey Horner.

From part 1:

A hash table, or associative array, is a well known key-value data structure. In R there is no equivalent, but you do have some options. You can use a vector of any type, a list, or an environment.

But as you’ll see with all of these options their performance is compromised in some way. In the average case a lookup in a hash table for a key should perform in constant time, or O(1), while in the worst case it will perform in O(n) time, n being the number of elements in the hash table.

For the tests below, we’ll implement a hash table with a few R data structures and make some comparisons. We’ll create hash tables with only unique keys and then perform a search for every key in the table.

This rocks! Talk about performance increases!
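The environment option from the post, sketched, since environments are the closest thing R has to a true hash table:

```r
h <- new.env(hash = TRUE, size = 1000L)

assign("user/123", 42L, envir = h)
get("user/123", envir = h)                       # 42
exists("user/999", envir = h, inherits = FALSE)  # FALSE
```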

My current Twitter client doesn’t dedupe my home feed and certainly doesn’t dedupe it against search based feeds. I’m not so concerned with retweets as with authors that repeat the same tweet several times in a row. What I don’t know is what period of uniqueness would be best. I will have to experiment with that.

I originally saw this series at Hash Table Performance in R: Part II on R-Bloggers, the source of so much excellent R related content.

April 6, 2015

Combining the power of R and D3.js

Filed under: D3,R,Visualization — Patrick Durusau @ 6:10 pm

Combining the power of R and D3.js by Andries Van Humbeeck.

From the post:

According to wikipedia, the amount of unstructured data might account for more than 70%-80% of all data in organisations. Because everyone wants to find hidden treasures in these mountains of information, new tools for processing, analyzing and visualizing data are being developed continually. This post focuses on data processing with R and visualization with the D3 JavaScript library.

Great post with fully worked examples of using R with D3.js to create interactive graphics.
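The usual bridge between the two: R does the crunching, then hands D3 a JSON file. A sketch with jsonlite (the file name is arbitrary):

```r
library(jsonlite)

df <- data.frame(x = 1:10, y = cumsum(rnorm(10)))
writeLines(toJSON(df, dataframe = "rows"), "data_for_d3.json")
# In D3 (v3): d3.json("data_for_d3.json", function(data) { ... });
```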

Unfortunate that it uses the phrase “immutable images.” A more useful dichotomy is static versus interactive, which would also lower the number of false positives for anyone searching on “immutable.”

Enjoy!

I first saw this in a tweet by Christophe Lalanne.
