Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

November 25, 2014

Documents Released in the Ferguson Case

Filed under: Data Mining,Ferguson,Text Mining — Patrick Durusau @ 4:15 pm

Documents Released in the Ferguson Case (New York Times)

The New York Times has posted the following documents from the Ferguson case:

  • 24 Volumes of Grand Jury Testimony
  • 30 Interviews of Witnesses by Law Enforcement Officials
  • 23 Forensic and Other Reports
  • 254 Photographs

Assume you are interested in organizing these materials for rapid access and cross-linking between them.

What are your requirements?

  1. Accessing Grand Jury Testimony by volume and page number?
  2. Accessing Interviews of Witnesses by report and page number?
  3. Linking people to reports, testimony and statements?
  4. Linking comments to particular photographs?
  5. Linking comments to a timeline?
  6. Linking Forensic reports to witness statements and/or testimony?
  7. Linking physical evidence into witness statements and/or testimony?
  8. Others?

It’s a lot of material, so which requirements, these or others, would be your first priority?

It’s not a death-march project, but on the other hand you need to get the most valuable tasks done first.

Suggestions?
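To make requirements 3 through 7 concrete, here is a minimal Python sketch of the kind of cross-reference index they imply; the document identifiers and witness labels are hypothetical, and a real project would need page-level provenance for every link.

```python
# A minimal sketch of a cross-reference index for the Ferguson materials.
# All names and structures here are hypothetical; the point is that linking
# people, evidence, and comments reduces to (document, page) citations.
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Citation:
    document: str   # e.g. "grand-jury-vol-05" or "interview-12" (hypothetical IDs)
    page: int

class CrossIndex:
    def __init__(self):
        self._links = defaultdict(set)  # subject -> set of Citations

    def link(self, subject: str, citation: Citation) -> None:
        self._links[subject].add(citation)

    def citations(self, subject: str) -> set:
        return self._links.get(subject, set())

index = CrossIndex()
index.link("witness-10", Citation("grand-jury-vol-05", 212))
index.link("witness-10", Citation("interview-12", 3))
print(index.citations("witness-10"))
```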

October 3, 2014

Open Challenges for Data Stream Mining Research

Filed under: BigData,Data Mining,Data Streams,Text Mining — Patrick Durusau @ 4:58 pm

Open Challenges for Data Stream Mining Research, SIGKDD Explorations, Volume 16, Number 1, June 2014.

Abstract:

Every day, huge volumes of sensory, transactional, and web data are continuously generated as streams, which need to be analyzed online as they arrive. Streaming data can be considered as one of the main sources of what is called big data. While predictive modeling for data streams and big data have received a lot of attention over the last decade, many research approaches are typically designed for well-behaved controlled problem settings, overlooking important challenges imposed by real-world applications. This article presents a discussion on eight open challenges for data stream mining. Our goal is to identify gaps between current research and meaningful applications, highlight open problems, and define new application-relevant research directions for data stream mining. The identified challenges cover the full cycle of knowledge discovery and involve such problems as: protecting data privacy, dealing with legacy systems, handling incomplete and delayed information, analysis of complex data, and evaluation of stream algorithms. The resulting analysis is illustrated by practical applications and provides general suggestions concerning lines of future research in data stream mining.

Under entity stream mining, the authors describe the challenge of aggregation:

The first challenge of entity stream mining task concerns information summarization: how to aggregate into each entity e at each time point t the information available on it from the other streams? What information should be stored for each entity? How to deal with differences in the speeds of individual streams? How to learn over the streams efficiently? Answering those questions in a seamless way would allow us to deploy conventional stream mining methods for entity stream mining after aggregation.

Sounds remarkably like an issue for topic maps, doesn’t it? Well, not topic maps in the sense that every entity has an IRI subjectIdentifier, but in the sense that merging rules define the basis on which two or more entities are considered to represent the same subject.
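For a concrete sense of what rule-based merging (as opposed to identifier-only merging) looks like, here is a hedged Python sketch; the identity keys and sample records are invented, and real stream settings would need windowing and conflict handling on top of this.

```python
# A rough sketch of rule-based merging, assuming entities arrive from multiple
# streams as plain dicts. Two records are treated as the same subject if they
# share any value for a designated identifying key (a stand-in for merging
# rules, not an implementation of any particular standard).
def merge_entities(records, identity_keys=("email", "account_id")):
    merged = []
    for rec in records:
        target = None
        for existing in merged:
            if any(rec.get(k) is not None and rec.get(k) == existing.get(k)
                   for k in identity_keys):
                target = existing
                break
        if target is None:
            merged.append(dict(rec))
        else:
            for k, v in rec.items():
                target.setdefault(k, v)   # keep first-seen values, fill in gaps
    return merged

streams = [
    {"email": "a@example.com", "name": "A. Smith"},
    {"account_id": 42, "email": "a@example.com", "city": "Oslo"},
    {"account_id": 7, "name": "B. Jones"},
]
print(merge_entities(streams))
```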

The entire issue is on “big data” and if you are looking for research “gaps,” it is a great starting point. Table of Contents: SIGKDD explorations, Volume 16, Number 1, June 2014.

I included the TOC link because, for reasons known only to ACM staff, the articles in this issue don’t show up in the library index. One of the many “features” of the ACM Digital Library.

That is in addition to the committee that oversees the Digital Library being undisclosed to members and reachable only through staff.

September 15, 2014

Norwegian Ethnological Research [The Early Years]

Filed under: Data,Data Mining,Ethnological — Patrick Durusau @ 10:49 am

Norwegian Ethnological Research [The Early Years] by Lars Marius Garshol.

From the post:

The definitive book on Norwegian farmhouse ale is Odd Nordland’s “Brewing and beer traditions in Norway,” published in 1969. That book is now sadly totally unavailable, except from libraries. In the foreword Nordland writes that the book is based on a questionnaire issued by Norwegian Ethnological Research in 1952 and 1957. After digging a little I discovered that this material is actually still available at the institute. The questionnaire is number 35, running to 103 questions.

Because the questionnaire responses in general often contain descriptions of quite personal matters, access to the answers is restricted. However, by paying a quite stiff fee, describing the research I wanted to use the material for, and signing a legal agreement, I was sent a CD with all the answers to questionnaire 35. The contents are quite daunting: 1264 numbered JPEG files, with no metadata of any kind. The files are scans of individual pages of responses, plus one cover page for each Norwegian province. Most of the responses are handwritten, and legibility varies dramatically. Some, happily, are typewritten.

I appended “[The Early Years]” to the title because Lars has embarked on an adventure that can last as long as he remains interested.

Sixty-two-year-old survey results leave Lars wondering exactly what was meant in some cases. Keep that in mind the next time you search for word usage across centuries. Matching exact strings isn’t the same thing as matching the meanings attached to those strings.

You can imagine what gaps and ambiguities might exist when the time period stretches to centuries, if not millennia, and our knowledge of the languages is learned in a modern context.

The understanding we capture is our own, which hopefully has some connection to earlier witnesses. Recording that process is a uniquely human activity and one that I am glad Lars is sharing with a larger audience.

Looking forward to hearing about more results!

PS: Do you have a similar “data mining” story to share? Command line tool stories are welcome, but so are stories about working with non-electronic resources.
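For what it is worth, a first step with a pile of scans like the one Lars describes might be a skeleton manifest to hang metadata on as you read. A minimal sketch, assuming the JPEGs sit in a local directory (the directory and column names below are my own, not from the questionnaire):

```python
# Write a skeleton manifest CSV (file name plus empty province/respondent/notes
# columns) for a directory of numbered scans, to be filled in by hand or by
# later processing. Paths and columns are hypothetical.
import csv
from pathlib import Path

scan_dir = Path("questionnaire_35")          # hypothetical location of the scans
with open("manifest.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["file", "province", "respondent", "pages", "notes"])
    for jpeg in sorted(scan_dir.glob("*.jpg")):
        writer.writerow([jpeg.name, "", "", "", ""])
```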

September 3, 2014

Data Sciencing by Numbers:…

Filed under: Data Mining,Text Analytics,Text Mining — Patrick Durusau @ 3:28 pm

Data Sciencing by Numbers: A Walk-through for Basic Text Analysis by Jason Baldridge.

From the post:

My previous post “Titillating Titles: Uncoding SXSW Proposal Titles with Text Analytics” discusses a simple exploration I did into algorithmically rating SXSW titles, most of which I did while on a plane trip last week. What I did was pretty basic, and to demonstrate that, I’m following up that post with one that explicitly shows you how you can do it yourself, provided you have access to a Mac or Unix machine.

There are three main components to doing what I did for the blog post:

  • Topic modeling code: the Mallet toolkit’s implementation of Latent Dirichlet Allocation
  • Language modeling code: the BerkeleyLM Java package for training and using n-gram language models
  • Unix command line tools for processing raw text files with standard tools and the topic modeling and language modeling code
I’ll assume you can use the Unix command line at at least a basic level, and I’ve packaged up the topic modeling and language modeling code in the Github repository maul to make it easy to try them out. To keep it really simple: you can download the Maul code and then follow the instructions in the Maul README. (By the way, by giving it the name “maul” I don’t want to convey that it is important or anything — it is just a name I gave the repository, which is just a wrapper around other people’s code.)

Jason’s post should help get you started doing data exercises. It is up to you whether you continue those exercises and branch out to other data and new tools.
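If you want a taste of the topic-modeling half without installing Mallet, here is a hedged scikit-learn sketch of the same kind of exercise; it is not Jason’s Maul wrapper, and the sample titles are invented.

```python
# Fit a small LDA topic model over a handful of titles and print the top words
# per topic. Only a sketch: real runs need far more documents and tuning.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

titles = [
    "Scaling data pipelines with open source tools",
    "Text analytics for social media titles",
    "Open data and the future of civic apps",
    "Deep dives into text mining pipelines",
]
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(titles)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[-5:][::-1]]
    print(f"topic {i}: {', '.join(top)}")
```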

    Like everything else, data exploration proficiency requires regular exercise.

    Are you keeping a data exercise calendar?

    I first saw this in a post by Jason Baldridge.

    Titillating Titles:…

    Filed under: Data Mining,Text Analytics — Patrick Durusau @ 3:13 pm

    Titillating Titles: Uncoding SXSW Proposal Titles with Text Analytics by Jason Baldridge.

    From the post:

    The proposals for SXSW 2015 have been posted for several weeks now, and the community portion of the process ends this week on Friday, September 5. As a proposer myself for Are You In A Social Media Experiment?, I’ve been meaning to find a chance to look into the titles and see whether some straight-forward Unix commands, text analytics and natural language processing can reveal anything interesting about them.

    People reportedly put a lot of thought into their titles since that is a big part of getting your proposal noticed in the community part of the voting process for panels. The creators of proposals for SXSW are given lots of feedback, including things like on their titles.

    “Vague, non-descriptive language is a common mistake on titles — but if readers can’t comprehend the basic focus of your proposal without also reading the description, then you probably need to re-think your approach. If you can make the title witty and attention-getting, then wonderful. But please don’t let wit sidetrack you from the more significant goals of simple, accurate and succinct.”

    In short, a title should stand out while remaining informative. It turns out that there has been research in computational linguistics into how to craft memorable quotes that is interesting with respect to standing out. Danescu-Niculescu-Mizil, Cheng, Kleinberg, and Lee’s (2012) “You had me at hello: How phrasing affects memorability” found that memorable movie quotes use less common words built on a scaffold of common syntactic patterns (BTW, the paper itself has great section titles). Chan, Lee and Pang (2014) go to the next step of building a model that predicts which of two versions of a tweet will have a better response (in terms of obtaining retweets) (see the demo).

Are you ready to take your titles beyond spell-check and grammar correction?

    What if you could check your titles at least to make them more memorable? Would you do it?

Jason provides an example of how checking your title for “impact” may not be all that far-fetched.
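As a toy illustration of the “distinctive words on a common scaffold” idea from the memorability research, you could score a title by how rare its content words are against a background frequency table. The table below is made up; a real check would use a large corpus.

```python
# Score a title by the average negative log frequency of its words relative to
# a background word-frequency table. The frequencies here are invented.
import math

background_freq = {        # hypothetical per-million word frequencies
    "the": 60000, "of": 30000, "data": 900, "social": 700,
    "titillating": 2, "uncoding": 1, "experiment": 80, "media": 600,
}

def rarity_score(title, freqs, default=1):
    words = [w.strip("?,:!").lower() for w in title.split()]
    scores = [-math.log(freqs.get(w, default) / 1_000_000) for w in words]
    return sum(scores) / len(scores)

for t in ["Titillating Titles: Uncoding SXSW Proposal Titles",
          "A Panel About Social Media Data"]:
    print(f"{rarity_score(t, background_freq):5.2f}  {t}")
```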

    PS: Be sure to try the demo for “better” tweets.

    FCC Net Neutrality Plan – 800,000 Comments

    Filed under: Data Analysis,Data Mining,Government — Patrick Durusau @ 1:40 pm

    What can we learn from 800,000 public comments on the FCC’s net neutrality plan? by Bob Lannon and Andrew Pendleton.

    From the post:

    On Aug. 5, the Federal Communications Commission announced the bulk release of the comments from its largest-ever public comment collection. We’ve spent the last three weeks cleaning and preparing the data and leveraging our experience in machine learning and natural language processing to try and make sense of the hundreds-of-thousands of comments in the docket. Here is a high-level overview, as well as our cleaned version of the full corpus which is available for download in the hopes of making further research easier.

    A great story of cleaning dirty data. Beyond eliminating both Les Misérables and War and Peace as comments, the authors detected statements by experts, form letters, etc.

    If you’re interested in doing your own analysis with this data, you can download our cleaned-up versions below. We’ve taken the six XML files released by the FCC and split them out into individual files in JSON format, one per comment, then compressed them into archives, one for each of XML file. Additionally, we’ve taken several individual records from the FCC data that represented multiple submissions grouped together, and split them out into individual files (these JSON files will have hyphens in their filenames, where the value before the hyphen represents the original record ID). This includes email messages to openinternet@fcc.gov, which had been aggregated into bulk submissions, as well as mass submissions from CREDO Mobile, Sen. Bernie Sanders’ office and others. We would be happy to answer any questions you may have about how these files were generated, or how to use them.

All the code used in the project is available at: https://github.com/sunlightlabs/fcc-net-neutrality-comments
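The split-to-JSON step the authors describe is straightforward to sketch in Python; the element names below are placeholders, since the FCC export’s actual schema differs.

```python
# A hedged sketch of splitting a large XML export into one JSON file per record.
# <comment> and its child tags stand in for whatever the real schema uses.
import json
import xml.etree.ElementTree as ET
from pathlib import Path

def split_comments(xml_path, out_dir="comments"):
    Path(out_dir).mkdir(exist_ok=True)
    tree = ET.parse(xml_path)
    for i, node in enumerate(tree.getroot().iter("comment")):
        record = {child.tag: (child.text or "").strip() for child in node}
        with open(Path(out_dir) / f"{i}.json", "w") as f:
            json.dump(record, f, indent=2)

# split_comments("fcc_filings_1.xml")   # hypothetical file name
```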

    I first saw this in a tweet by Scott Chamberlain.

    September 1, 2014

    Extracting images from scanned book pages

    Filed under: Data Mining,Image Processing,Image Recognition — Patrick Durusau @ 7:14 pm

    Extracting images from scanned book pages by Chris Adams.

    From the post:

    I work on a project which has placed a number of books online. Over the years we’ve improved server performance and worked on a fast, responsive viewer for scanned books to make our books as accessible as possible but it’s still challenging to help visitors find something of interest out of hundreds of thousands of scanned pages.

    Trevor and I have discussed various ways to improve the situation and one idea which seemed promising was seeing how hard it would be to extract the images from digitized pages so we could present a visual index of an item. Trevor’s THATCamp CHNM post on Freeing Images from Inside Digitized Books and Newspapers got a favorable reception and since it kept coming up at work I decided to see how far I could get using OpenCV.

    Everything you see below is open-source and comments are highly welcome. I created a book-illustration-detection branch in my image mining project (see my previous experiment reconstructing higher-resolution thumbnails from the masters) so feel free to fork it or open issues.

    Just in case you are looking for a Fall project. 😉

Consider capturing the images and their contents in associations with authors, publishers, etc., to enable mining those associations for patterns.
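If you do pick this up as a project, the core OpenCV move is modest: threshold a page scan and keep large connected regions as candidate illustrations. A rough sketch, with a hypothetical input file and thresholds that are guesses rather than tuned values:

```python
# Threshold a scanned page and save large connected regions as candidate images.
import cv2

page = cv2.imread("page.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical scan
_, binary = cv2.threshold(page, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

candidates = []
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    if w * h > 0.05 * page.shape[0] * page.shape[1]:   # ignore small text blobs
        candidates.append((x, y, w, h))
        cv2.imwrite(f"region_{x}_{y}.png", page[y:y + h, x:x + w])

print(f"{len(candidates)} candidate image regions")
```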

    August 24, 2014

    Who Dat?

    Filed under: Annotation,Data Mining — Patrick Durusau @ 6:50 pm

    Dat

    From the about page:

Dat is a grant-funded, open source project housed in the US Open Data Institute. While dat is a general purpose tool, we have a focus on open science use cases.

    The high level goal of the dat project is to build a streaming interface between every database and file storage backend in the world. By building tools to build and share data pipelines we aim to bring to data a style of collaboration similar to what git brings to source code.

    The first alpha release is now out!

    More on this project later this coming week.

    I first saw this in Nat Torkington’s Four short links: 21 August 2014.

    August 21, 2014

    Data Carpentry (+ Sorted Nordic Scores)

    Filed under: Data Mining,Data Science — Patrick Durusau @ 7:03 pm

    Data Carpentry by David Mimno.

    From the post:

    The New York Times has an article titled For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights. Mostly I really like it. The fact that raw data is rarely usable for analysis without significant work is a point I try hard to make with my students. I told them “do not underestimate the difficulty of data preparation”. When they turned in their projects, many of them reported that they had underestimated the difficulty of data preparation. Recognizing this as a hard problem is great.

    What I’m less thrilled about is calling this “janitor work”. For one thing, it’s not particularly respectful of custodians, whose work I really appreciate. But it also mischaracterizes what this type of work is about. I’d like to propose a different analogy that I think fits a lot better: data carpentry.

    Note: data carpentry seems to already be a thing

    I’m not convinced that “carpentry” is the best prestige target.

    The first mention of carpenters on a sorted version of the Nordic Scores (Colorado Adoption Project: Resources for Researchers. Institute for Behavioral Genetics, University of Colorado Boulder) is at 147.*

    I would go for data scientist since mercenary isn’t listed as an occupation. 😉

The usual cautions apply. Prestige is as difficult to measure as any other social construct, perhaps more so. The data is from 1989 and so may not reflect “current” prestige rankings.

    *(I have removed the classes and sorted by prestige score, to create Sorted Nordic Scores.)

    August 8, 2014

    ContentMine

    Filed under: Artificial Intelligence,Data Mining,Machine Learning — Patrick Durusau @ 6:45 pm

    ContentMine

    From the webpage:

    The ContentMine uses machines to liberate 100,000,000 facts from the scientific literature.

    We believe that Content Mining has huge potential to make knowledge available to everyone (including machines). This can enable new and exciting research, technology developments such as in Artificial Intelligence, and opportunities for wealth creation.

    Manual content-mining has been routine for 150 years, but many publishers feel threatened by machine-content-mining. It’s certainly disruptive technology but we argue that if embraced wholeheartedly it will take science forward massively and create completely new opportunities. Nevertheless many mainstream publishers have actively campaigned against it.

    Although content mining can be done without breaking current laws, the borderline between legal and illegal is usually unclear. So we campaign for reform, and we work on the basis that anything that is legal for a human should also be legal for a machine.

    * The right to read is the right to mine *

    Well, when I went to see what facts had been discovered:

    We don’t have any facts yet – there should be some here very soon!

    Well, at least now you have the URL and the pitch. Curious when facts are going to start to appear?

I’m not entirely comfortable with the term “facts” because it is usually used to put some particular “fact” off-limits for discussion or debate: “It’s a fact that…” (you fill in the blank). To disagree with such a statement makes the questioner appear stupid, obstinate or even rude.

    Which is, of course, the purpose of any statement “It’s a fact that….” It is intended to end debate on that “fact” and to exclude anyone who continues to disagree.

    While we wait for “facts” to appear at ContentMine, research the history of claims of various “facts” in history. You can start with some “facts” about beavers.

    July 25, 2014

    Pussy Stalking [Geo-Location as merge point]

    Filed under: Data Mining,Merging,Privacy — Patrick Durusau @ 12:45 pm

    Cat stalker knows where your kitty lives (and it’s your fault) by Lisa Vaas.

    From the post:

    Ever posted a picture of your cat online?

    Unless your privacy settings avoid making APIs publicly available on sites like Flickr, Twitpic, Instagram or the like, there’s a cat stalker who knows where your liddl’ puddin’ lives, and he’s totally pwned your pussy by geolocating it.

    That’s right, fat-faced grey one from Al Khobar in Saudi Arabia, Owen Mundy knows you live on Tabari Street.

[image: cat stalker]

    Mundy, a data analyst, artist, and Associate Professor in the Department of Art at Florida State University, has been working on the data visualisation project, which is called I Know Where Your Cat Lives.
    ….

    See Lisa’s post for the details about the “I Know Where Your Cat Lives” project.

    The same data leakage is found in other types of photographs as well. Such as photographs by military personnel.

An enterprising collector could use geolocation as a merge point to collect all the photos taken at a particular location, or use geolocation to ask “who?” for some location X.

Or picture a city map that uses geolocated images to ask “who?” Not everyone may know your name, but with a large enough base of users, someone will.
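The leakage itself is easy to demonstrate: a photo’s EXIF block often carries GPS coordinates that can be rounded into exactly the kind of merge point described above. A hedged Pillow sketch (tag handling simplified, and many photos carry no GPS data at all):

```python
# Pull GPS coordinates out of a photo's EXIF block and round them into a
# coarse "merge point". The photo path is hypothetical.
from PIL import Image
from PIL.ExifTags import TAGS, GPSTAGS

def gps_from_photo(path):
    exif = Image.open(path)._getexif() or {}
    gps_raw = exif.get(next((k for k, v in TAGS.items() if v == "GPSInfo"), -1))
    if not gps_raw:
        return None
    gps = {GPSTAGS.get(k, k): v for k, v in gps_raw.items()}

    def to_degrees(values, ref):
        d, m, s = (float(v) for v in values)
        deg = d + m / 60 + s / 3600
        return -deg if ref in ("S", "W") else deg

    lat = to_degrees(gps["GPSLatitude"], gps["GPSLatitudeRef"])
    lon = to_degrees(gps["GPSLongitude"], gps["GPSLongitudeRef"])
    return round(lat, 3), round(lon, 3)   # coarse grid: the merge point

# print(gps_from_photo("cat.jpg"))   # hypothetical photo
```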

PS: There is at least one app for facial recognition, NameTag. I don’t have a cellphone, so you will have to comment on how well it works. I still like the idea of a “who?” site, perhaps because I prefer human intel over data vacuuming.

    July 19, 2014

    First complex, then simple

    Filed under: Bioinformatics,Data Analysis,Data Mining,Data Models — Patrick Durusau @ 4:18 pm

    First complex, then simple by James D Malley and Jason H Moore. (BioData Mining 2014, 7:13)

    Abstract:

    At the start of a data analysis project it is often suggested that the researcher look first at multiple simple models. That is, always begin with simple, one variable at a time analyses, such as multiple single-variable tests for association or significance. Then, later, somehow (how?) pull all the separate pieces together into a single comprehensive framework, an inclusive data narrative. For detecting true compound effects with more than just marginal associations, this is easily defeated with simple examples. But more critically, it is looking through the data telescope from wrong end.

    I would have titled this article: “Data First, Models Later.”

That is, the authors start with no formal theories about what the data will prove and, upon finding signals in the data, generate simple models to explain those signals.

    I am sure their questions of the data are driven by a suspicion of what the data may prove, but that isn’t the same thing as asking questions designed to prove a model generated before the data is queried.

    July 8, 2014

    Setting up your own Data Refinery

    Filed under: Data Mining — Patrick Durusau @ 4:00 pm

    Setting up your own Data Refinery by Shawn Graham.

    From the post:

    I’ve been playing with a Mac. I’ve been a windows person for a long time, so bear with me.

    I’m setting up a number of platforms locally for data mining. But since what I’m *really* doing is smelting the ore of data scraped using things like Outwit Hub or Import.io (the ‘mining operation’, in this tortured analogy), what I’m setting up is a data refinery. Web based services are awesome, but if you’re dealing with sensitive data (like oral history interviews, for example) you need something local – this will also help with your ethics board review too. Onwards!

Shawn provides basic Mac setup instructions for the tools he uses.

    The same software is available for Windows and *nix platforms.

    Enjoy!

    July 2, 2014

    Verticalize

    Filed under: Bioinformatics,Data Mining,Text Mining — Patrick Durusau @ 3:05 pm

Verticalize by Pierre Lindenbaum.

    From the webpage:

    Simple tool to verticalize text delimited files.

    Pierre works in bioinformatics and is the author of many useful tools.

    Definitely one for the *nix toolbox.
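For readers who have not seen the idea before, “verticalizing” generally means printing each record as one column/value pair per line so wide rows become readable at a terminal. A small Python sketch of that general idea, not Pierre’s exact output format:

```python
# Read a delimited file and print each record as "column<TAB>value" lines.
import csv
import sys

def verticalize(path, delimiter="\t"):
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter=delimiter)
        header = next(reader)
        for n, row in enumerate(reader, 1):
            print(f">>> record {n}")
            for col, value in zip(header, row):
                print(f"{col}\t{value}")

if __name__ == "__main__":
    verticalize(sys.argv[1])   # usage: python verticalize.py data.tsv
```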

    June 24, 2014

    DAMEWARE:…

    Filed under: Astroinformatics,BigData,Data Mining — Patrick Durusau @ 6:16 pm

    DAMEWARE: A web cyberinfrastructure for astrophysical data mining by Massimo Brescia, et al.

    Abstract:

    Astronomy is undergoing through a methodological revolution triggered by an unprecedented wealth of complex and accurate data. The new panchromatic, synoptic sky surveys require advanced tools for discovering patterns and trends hidden behind data which are both complex and of high dimensionality. We present DAMEWARE (DAta Mining & Exploration Web Application REsource): a general purpose, web-based, distributed data mining environment developed for the exploration of large datasets, and finely tuned for astronomical applications. By means of graphical user interfaces, it allows the user to perform classification, regression or clustering tasks with machine learning methods. Salient features of DAMEWARE include its capability to work on large datasets with minimal human intervention, and to deal with a wide variety of real problems such as the classification of globular clusters in the galaxy NGC1399, the evaluation of photometric redshifts and, finally, the identification of candidate Active Galactic Nuclei in multiband photometric surveys. In all these applications, DAMEWARE allowed to achieve better results than those attained with more traditional methods. With the aim of providing potential users with all needed information, in this paper we briefly describe the technological background of DAMEWARE, give a short introduction to some relevant aspects of data mining, followed by a summary of some science cases and, finally, we provide a detailed description of a template use case.

    Despite the progress made in the creation of DAMEWARE, the authors conclude in part:

    The harder problem for the future will be heterogeneity of platforms, data and applications, rather than simply the scale of the deployed resources. The goal should be to allow scientists to explore the data easily, with sufficient processing power for any desired algorithm to efficiently process it. Most existing ML methods scale badly with both increasing number of records and/or of dimensionality (i.e., input variables or features). In other words, the very richness of astronomical data sets makes them difficult to analyze….

    The size of data sets is an issue, but heterogeneity issues with platforms, data and applications are several orders of magnitude more complex.

I remain curious when that is going to dawn on the average “big data” advocate.

    June 21, 2014

    What’s On Your Desktop?

    Filed under: Data Analysis,Data Mining — Patrick Durusau @ 7:55 pm

    The Analyst’s Toolbox by Simon Raper.

    From the post:

    There are hundreds, maybe thousands, of open source/free/online tools out there that form part of the analyst’s toolbox. Here’s what I have on my mac for day to day work. Click on the leaf node labels to be redirected to the relevant sites. Visualisation in D3.

    Tools in day to day use by a live data analyst. Nice presentation as well.

    What’s on your desktop?

    June 12, 2014

    Condensing News

    Filed under: Data Mining,Information Overload,News,Reporting,Summarization — Patrick Durusau @ 7:27 pm

    Information Overload: Can algorithms help us navigate the untamed landscape of online news? by Jason Cohn.

    From the post:

    Digital journalism has evolved to a point of paradox: we now have access to such an overwhelming amount of news that it’s actually become more difficult to understand current events. IDEO New York developer Francis Tseng is—in his spare time—searching for a solution to the problem by exploring its root: the relationship between content and code. Tseng received a grant from the Knight Foundation to develop Argos*, an online news aggregation app that intelligently collects, summarizes and provides contextual information for news stories. Having recently finished version 0.1.0, which he calls the first “complete-ish” release of Argos, Tseng spoke with veteran journalist and documentary filmmaker Jason Cohn about the role technology can play in our consumption—and comprehension—of the news.

    Great story and very interesting software. And as Alyona notes in her tweet, it’s open source!

    Any number of applications, particularly for bloggers who are scanning lots of source material everyday.

Intended for online news, but a similar application would be useful for TV news as well. In the Atlanta, Georgia area, a broadcast could be prefaced by:

    • Accidents (gristly ones) 25%
    • Crimes (various) 30%
    • News previously reported but it’s a slow day today 15%
    • News to be reported on a later broadcast 10%
    • Politics (non-contextualized posturing) 10%
    • Sports (excluding molesting stories reported under crimes) 5%
    • Weather 5%

    I haven’t timed the news and some channels are worse than others but take that as a recurrent, public domain summary of Atlanta news. 😉

    For digital news feeds, check out the Argos software!
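Argos does far more (clustering, context, entity resolution), but the basic idea of algorithmic condensation can be made concrete with a toy frequency-based extractive summarizer; the sample article below is invented.

```python
# Score sentences by how many of the article's frequent words they contain and
# keep the top few, preserving original order. A toy, not Argos.
import re
from collections import Counter

def summarize(text, n_sentences=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(words)
    scored = sorted(
        sentences,
        key=lambda s: sum(freq[w] for w in re.findall(r"[a-z']+", s.lower())),
        reverse=True,
    )
    top = set(scored[:n_sentences])
    return " ".join(s for s in sentences if s in top)

article = ("The council voted on the budget. The budget cuts funding for parks. "
           "Residents protested the cuts. The weather was mild.")
print(summarize(article))
```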

    I first saw this in a tweet by Alyona Medelyan.

    June 11, 2014

    Exploring FBI Crime Statistics…

    Filed under: Data Mining,FBI,Python,Statistics — Patrick Durusau @ 2:30 pm

    Exploring FBI Crime Statistics with Glue and plotly by Chris Beaumont.

    From the post:

    Glue is a project I’ve been working on to interactively visualize multidimensional datasets in Python. The goal of Glue is to make trivially easy to identify features and trends in data, to inform followup analysis.

    This notebook shows an example of using Glue to explore crime statistics collected by the FBI (see this notebook for the scraping code). Because Glue is an interactive tool, I’ve included a screencast showing the analysis in action. All of the plots in this notebook were made with Glue, and then exported to plotly (see the bottom of this page for details).
    ….

    FBI crime statistics are used for demonstration purposes but Glue should be generally useful for exploring multidimensional datasets.

    It isn’t possible to tell how “clean” or “consistent” the FBI reported crime data may or may not be. And as the FBI itself points out, comparison between locales is fraught with peril.

    June 4, 2014

    Health Intelligence

    Filed under: Data Mining,Intelligence,Visualization — Patrick Durusau @ 4:55 pm

    Health Intelligence: Analyzing health data, generating and communicating evidence to improve population health. by Ramon Martinez.

    I was following a link to Ramon’s Data Sources page when I discovered his site. The list of data resources is long and impressive.

    But there is so much more under Resources!

    • Data Tools
    • Database (DB) Blogs
    • Data Visualization Tools
    • Data Viz Blogs
    • Reading for Data Visualizations
    • Best of the Web…
    • Tableau Training
    • Going to School
    • Reading for Health Analysis

    You will probably like the rest of the site as well!

    Data tools/visualization are very ecumenical.

    May 30, 2014

    Tablib: Pythonic Tabular Datasets

    Filed under: Data Mining,String Matching — Patrick Durusau @ 2:40 pm

    Tablib: Pythonic Tabular Datasets by Kenneth Reitz and Bessie Monke.

    From the post:

Tablib is an MIT Licensed format-agnostic tabular dataset library, written in Python. It allows you to import, export, and manipulate tabular data sets. Advanced features include segregation, dynamic columns, tags & filtering, and seamless format import & export.

Definitely an add to your Python keychain USB drive.
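The core of the library fits in a few lines: build a Dataset, then export it to whatever format you need (some exports require optional extras to be installed).

```python
# Build a small tablib Dataset and export it to CSV and JSON.
import tablib

data = tablib.Dataset()
data.headers = ["label", "date", "count"]
data.append(["alpha", "2014-06-01", 3])
data.append(["beta", "2014-06-02", 5])

print(data.export("csv"))
print(data.export("json"))
```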

    I first saw this in a tweet by Gregory Piatetsky.

    May 18, 2014

    12 Free (as in beer) Data Mining Books

    12 Free (as in beer) Data Mining Books by Chris Leonard.

    While all of these volumes could be shelved under “data mining” in a bookstore, I would break them out into smaller categories:

    • Bayesian Analysis/Methods
    • Data Mining
    • Data Science
    • Machine Learning
    • R
    • Statistical Learning

    Didn’t want you to skip over Chris’ post because it was “just about data mining.” 😉

    Check your hard drive to see what you are missing.

    I first saw this in a tweet by Carl Anderson.

    May 9, 2014

    Large Scale Web Scraping

    Filed under: Data Mining,Web Scrapers — Patrick Durusau @ 7:03 pm

    We Just Ran Twenty-Three Million Queries of the World Bank’s Website – Working Paper 362 by Sarah Dykstra, Benjamin Dykstra, and Justin Sandefur.

    Abstract:

    Much of the data underlying global poverty and inequality estimates is not in the public domain, but can be accessed in small pieces using the World Bank’s PovcalNet online tool. To overcome these limitations and reproduce this database in a format more useful to researchers, we ran approximately 23 million queries of the World Bank’s web site, accessing only information that was already in the public domain. This web scraping exercise produced 10,000 points on the cumulative distribution of income or consumption from each of 942 surveys spanning 127 countries over the period 1977 to 2012. This short note describes our methodology, briefly discusses some of the relevant intellectual property issues, and illustrates the kind of calculations that are facilitated by this data set, including growth incidence curves and poverty rates using alternative PPP indices. The full data can be downloaded at www.cgdev.org/povcalnet.

    That’s what I would call large scale web scraping!

    Useful model to follow for many sources, such as the U.S. Department of Agriculture. A gold mine of reports, data, statistics, but all broken up for the manual act of reading. Or at least that is a charitable explanation for their current data organization.
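The mechanics behind a scrape like this are mostly patience and bookkeeping: iterate over a parameter grid, pause between requests, and cache every response so the job can resume. A rough sketch with a placeholder endpoint, not PovcalNet’s actual interface:

```python
# Polite, resumable querying: cache each response to disk and sleep between calls.
import json
import time
from pathlib import Path
import requests

BASE = "https://example.org/api/poverty"      # placeholder URL
cache = Path("cache")
cache.mkdir(exist_ok=True)

def fetch(country, year, percentile):
    out = cache / f"{country}_{year}_{percentile}.json"
    if out.exists():                           # resume-friendly: skip completed queries
        return json.loads(out.read_text())
    resp = requests.get(BASE, params={"c": country, "y": year, "p": percentile},
                        timeout=30)
    resp.raise_for_status()
    out.write_text(resp.text)
    time.sleep(0.5)                            # be polite to the server
    return resp.json()

# for p in range(1, 101): fetch("KEN", 2010, p)   # hypothetical loop
```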

    May 7, 2014

    Data Manipulation with Pig

    Filed under: Data Mining,Pig — Patrick Durusau @ 7:13 pm

    Data Manipulation with Pig by Wes Floyd.

    A great slide deck on Pig! BTW, there is a transcript of the presentation available just under the slides.

    I first saw this at: The essence of Pig by Alex Popescu.

    April 26, 2014

    Social Media Mining: An Introduction

    Filed under: Data Mining,Social Media — Patrick Durusau @ 6:38 pm

    Social Media Mining: An Introduction by Reza Zafarani, Mohammad Ali Abbasi, and Huan Liu.

    From the webpage:

    The growth of social media over the last decade has revolutionized the way individuals interact and industries conduct business. Individuals produce data at an unprecedented rate by interacting, sharing, and consuming content through social media. Understanding and processing this new type of data to glean actionable patterns presents challenges and opportunities for interdisciplinary research, novel algorithms, and tool development. Social Media Mining integrates social media, social network analysis, and data mining to provide a convenient and coherent platform for students, practitioners, researchers, and project managers to understand the basics and potentials of social media mining. It introduces the unique problems arising from social media data and presents fundamental concepts, emerging issues, and effective algorithms for network analysis and data mining. Suitable for use in advanced undergraduate and beginning graduate courses as well as professional short courses, the text contains exercises of different degrees of difficulty that improve understanding and help apply concepts, principles, and methods in various scenarios of social media mining.

    Another Cambridge University Press title that is available in pre-publication PDF format.

    If you are contemplating writing a textbook, Cambridge University Press access policies should be one of your considerations in seeking a publisher.

You can download the entire book, individual chapters, and slides from Social Media Mining: An Introduction.

    Do remember that only 14% of the U.S. adult population uses Twitter. Whatever “trends” you extract from Twitter may or may not reflect “trends” in the larger population.

    I first saw this in a tweet by Stat Fact.

    April 25, 2014

    PourOver

    Filed under: Data Mining,News,Processing,Programming — Patrick Durusau @ 7:11 pm

    PourOver

    From the webpage:

PourOver is a library for simple, fast filtering and sorting of large collections – think 100,000s of items – in the browser. It allows you to build data-exploration apps and archives that run at 60fps, that don’t have to wait for a database call to render query results.

    PourOver is built around the ideal of simple queries that can be arbitrarily composed with each other, without having to recalculate their results. You can union, intersect, and difference queries. PourOver will remember how your queries were constructed and can smartly update them when items are added or modified. You also get useful features like collections that buffer their information periodically, views that page and cache, fast sorting, and much, much more.

    If you just want to get started using PourOver, I would skip to “Preface – The Best Way to Learn PourOver”. There you will find extensive examples. If you are curious about why we made PourOver or what it might offer to you, I encourage you to skip down to “Chp 1. – The Philosophy of PourOver”.

    This looks very cool!

    Imagine doing client side merging of content from multiple topic map servers.

    This type of software development and open release is making me consider a subscription to the New York Times.

    You?

    I first saw this at Nathan Yau’s PourOver Allows Filtering of Large Datasets In Your Browser. If you are interested in data visualization and aren’t following Nathan’s blog, you should be.

    March 22, 2014

    Opening data: Have you checked your pipes?

    Filed under: Data Mining,ETL,Open Access,Open Data — Patrick Durusau @ 7:44 pm

    Opening data: Have you checked your pipes? by Bob Lannon.

    From the post:

    Code for America alum Dave Guarino had a post recently entitled “ETL for America”. In it, he highlights something that open data practitioners face with every new project: the problem of Extracting data from old databases, Transforming it to suit a new application or analysis and Loading it into the new datastore that will support that new application or analysis. Almost every technical project (and every idea for one) has this process as an initial cost. This cost is so pervasive that it’s rarely discussed by anyone except for the wretched “data plumber” (Dave’s term) who has no choice but to figure out how to move the important resources from one place to another.

    Why aren’t we talking about it?

    The up-front costs of ETL don’t come up very often in the open data and civic hacking community. At hackathons, in funding pitches, and in our definitions of success, we tend to focus on outputs (apps, APIs, visualizations) and treat the data preparation as a collateral task, unavoidable and necessary but not worth “getting into the weeds” about. Quoting Dave:

The fact that I can go months hearing about “open data” without a single mention of ETL is a problem. ETL is the pipes of your house: it’s how you open data.

    It’s difficult to point to evidence that this is really the case, but I personally share Dave’s experience. To me, it’s still the elephant in the room during the proceedings of any given hackathon or open data challenge. I worry that the open data community is somehow under the false impression that, eventually in the sunny future, data will be released in a more clean way and that this cost will decrease over time.

    It won’t. Open data might get cleaner, but no data source can evolve to the point where it serves all possible needs. Regardless of how easy it is to read, the data published by government probably wasn’t prepared with your new app idea in mind.

    Data transformation will always be necessary, and it’s worth considering apart from the development of the next cool interface. It’s a permanent cost of developing new things in our space, so why aren’t we putting more resources toward addressing it as a problem in its own right? Why not spend some quality time (and money) focused on data preparation itself, and then let a thousand apps bloom?

    If you only take away this line:

    Open data might get cleaner, but no data source can evolve to the point where it serves all possible needs. (emphasis added)

    From Bob’s entire post, reading it has been time well spent.

    Your “clean data” will at times be my “dirty data” and vice versa.

Documenting the semantics we “see” in data, the semantics that drive our transformations into “clean” data, stands a chance of helping the next person in line to use that data.

Think of it as an accumulation of experience with a data set and the results obtained from it.

Or you can just “wing it” with every data set you encounter, and so shall we all.

    Your call.
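One lightweight way to do that documenting is to make every cleaning step carry its own description, so the provenance travels with the output. A minimal sketch, with invented rules rather than anything from Bob’s post:

```python
# Apply a list of (description, function) transforms and keep the descriptions
# as a provenance log alongside the cleaned data.
import json

def apply_logged(data, transforms):
    log = []
    for description, fn in transforms:
        data = fn(data)
        log.append(description)
    return data, log

rows = [{"state": "Ga.", "amount": "1,200"}, {"state": "GA", "amount": "300"}]
cleaned, log = apply_logged(rows, [
    ("normalize state codes to two-letter upper case",
     lambda rs: [{**r, "state": r["state"].strip(". ").upper()} for r in rs]),
    ("strip thousands separators and cast amount to int",
     lambda rs: [{**r, "amount": int(r["amount"].replace(",", ""))} for r in rs]),
])
print(json.dumps({"data": cleaned, "provenance": log}, indent=2))
```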

    I first saw this in a tweet by Dave Guarino.

    March 18, 2014

    Web Scraping: working with APIs

    Filed under: Data Mining,Humanities,R,Social Sciences,Web Scrapers — Patrick Durusau @ 7:16 pm

    Web Scraping: working with APIs by Rolf Fredheim.

    From the post:

    APIs present researchers with a diverse set of data sources through a standardised access mechanism: send a pasted together HTTP request, receive JSON or XML in return. Today we tap into a range of APIs to get comfortable sending queries and processing responses.

    These are the slides from the final class in Web Scraping through R: Web scraping for the humanities and social sciences

    This week we explore how to use APIs in R, focusing on the Google Maps API. We then attempt to transfer this approach to query the Yandex Maps API. Finally, the practice section includes examples of working with the YouTube V2 API, a few ‘social’ APIs such as LinkedIn and Twitter, as well as APIs less off the beaten track (Cricket scores, anyone?).

    The final installment of Rolf’s course for humanists. He promises to repeat it next year. Should be interesting to see how techniques and resources evolve over the next year.

    Forward the course link to humanities and social science majors.

    March 12, 2014

    Data Mining the Internet Archive Collection [Librarians Take Note]

    Filed under: Archives,Data Mining,Librarian/Expert Searchers,MARC,MARCXML,Python — Patrick Durusau @ 4:48 pm

    Data Mining the Internet Archive Collection by Caleb McDaniel.

    From the “Lesson Goals:”

    The collections of the Internet Archive (IA) include many digitized sources of interest to historians, including early JSTOR journal content, John Adams’s personal library, and the Haiti collection at the John Carter Brown Library. In short, to quote Programming Historian Ian Milligan, “The Internet Archive rocks.”

    In this lesson, you’ll learn how to download files from such collections using a Python module specifically designed for the Internet Archive. You will also learn how to use another Python module designed for parsing MARC XML records, a widely used standard for formatting bibliographic metadata.

    For demonstration purposes, this lesson will focus on working with the digitized version of the Anti-Slavery Collection at the Boston Public Library in Copley Square. We will first download a large collection of MARC records from this collection, and then use Python to retrieve and analyze bibliographic information about items in the collection. For example, by the end of this lesson, you will be able to create a list of every named place from which a letter in the antislavery collection was written, which you could then use for a mapping project or some other kind of analysis.

    This rocks!

    In particular for librarians and library students who will already be familiar with MARC records.

    Some 7,000 items from the Boston Public Library’s anti-slavery collection at Copley Square are the focus of this lesson.

    That means historians have access to rich metadata, full images, and partial descriptions for thousands of antislavery letters, manuscripts, and publications.

    Would original anti-slavery materials, written by actual participants, have interested you as a student? Do you think such materials would interest students now?
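As a taste of where the lesson ends up, here is a hedged sketch that tallies places from MARC XML records already downloaded to a local directory; it assumes the pymarc package, and field 260 subfield a is only a common home for place of publication, so the antislavery records may put the relevant place elsewhere.

```python
# Tally place names from a directory of MARC XML files using pymarc.
from collections import Counter
from pathlib import Path
import pymarc

places = Counter()
for xml_file in Path("marc_records").glob("*.xml"):    # hypothetical local directory
    for record in pymarc.parse_xml_to_array(str(xml_file)):
        field = record["260"]
        if field and field["a"]:
            places[field["a"].strip(" :[]")] += 1

for place, count in places.most_common(10):
    print(f"{count:4d}  {place}")
```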

    I first saw this in a tweet by Gregory Piatetsky.

    March 3, 2014

    Data Science – Chicago

    Filed under: Challenges,Data Mining,Government Data,Visualization — Patrick Durusau @ 8:19 pm

    OK, I shortened the headline.

    The full headline reads: Accenture and MIT Alliance in Business Analytics launches data science challenge in collaboration with Chicago: New annual contest for MIT students to recognize best data analytics and visualization ideas.: The Accenture and MIT Alliance in Business Analytics

    Don’t try that without coffee in the morning.

    From the post:

    The Accenture and MIT Alliance in Business Analytics have launched an annual data science challenge for 2014 that is being conducted in collaboration with the city of Chicago.

    The challenge invites MIT students to analyze Chicago’s publicly available data sets and develop data visualizations that will provide the city with insights that can help it better serve residents, visitors, and businesses. Through data visualization, or visual renderings of data sets, people with no background in data analysis can more easily understand insights from complex data sets.

    The headline is longer than the first paragraph of the story.

I didn’t see an explanation for why the challenge is limited to MIT students:

    The challenge is now open and ends April 30. Registration is free and open to active MIT students 18 and over (19 in Alabama and Nebraska). Register and see the full rule here: http://aba.mit.edu/challenge.

    Find a sponsor and setup an annual data mining challenge for your school or organization.

Although I would suggest you take a pass on Bogotá, Mexico City, Rio de Janeiro, Moscow, Washington, D.C. and similar places where truthful auditing could be hazardous to your health.

    Or as one of my favorite Dilbert cartoons had the pointy-haired boss observing:

    When you find a big pot of crazy it’s best not to stir it.

    March 2, 2014

    Data Mining with Weka (2014)

    Filed under: CS Lectures,Data Mining,Weka — Patrick Durusau @ 9:17 pm

    Data Mining with Weka

    From the course description:

    Everybody talks about Data Mining and Big Data nowadays. Weka is a powerful, yet easy to use tool for machine learning and data mining. This course introduces you to practical data mining.

    The 5-week course starts on 3rd March 2014.

    Apologies, somehow I missed the notice on this class.

    This will be followed by More Data Mining with Weka in late April of 2014.

    Based on my experience with the Weka Machine Learning course, also with Professor Witten, I recommend either one or both of these courses without reservation.

