Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 30, 2013

Using R For Statistical Analysis – Two Useful Videos

Filed under: Data Mining,R,Statistics — Patrick Durusau @ 6:29 pm

Using R For Statistical Analysis – Two Useful Videos by Bruce Berriman.

Bruce has uncovered two interesting videos on using R:

Introduction to R – A Brief Tutorial for R (Software for Statistical Analysis), and,

An Introduction to R for Data Mining by Joseph Rickert. (Recording of the webinar by the same name.)

Bruce has additional links that will be useful with the videos.

Enjoy!

March 29, 2013

The Artful Business of Data Mining…

Filed under: Data Mining,Software,Statistics — Patrick Durusau @ 8:25 am

David Coallier has two presentations under that general title:

Distributed Schema-less Document-Based Databases

and,

Computational Statistics with Open Source Tools

Neither is a “…death by powerpoint…” presentation where the speaker reads text you could read for yourself.

Which is good, except that with minimal slides you get only an occasional example and the names of software or techniques; you have to fill in a lot of the context yourself.

A pointer to videos of either of these presentations would be greatly appreciated!

March 26, 2013

Tensor Decompositions and Applications

Filed under: Data Mining,Tensors — Patrick Durusau @ 4:44 pm

Tensor Decompositions and Applications by Tamara G. Kolda and Brett W. Bader.

Abstract:

This survey provides an overview of higher-order tensor decompositions, their applications, and available software. A tensor is a multidimensional or N-way array. Decompositions of higher-order tensors (i.e., N-way arrays with N ≥ 3) have applications in psychometrics, chemometrics, signal processing, numerical linear algebra, computer vision, numerical analysis, data mining, neuroscience, graph analysis, and elsewhere. Two particular tensor decompositions can be considered to be higher-order extensions of the matrix singular value decomposition: CANDECOMP/PARAFAC (CP) decomposes a tensor as a sum of rank-one tensors, and the Tucker decomposition is a higher-order form of principal component analysis. There are many other tensor decompositions, including INDSCAL, PARAFAC2, CANDELINC, DEDICOM, and PARATUCK2 as well as nonnegative variants of all of the above. The N-way Toolbox, Tensor Toolbox, and Multilinear Engine are examples of software packages for working with tensors.

At forty-five pages and two hundred and forty-five (245) references, this is a broad survey of tensor decomposition with numerous pointers to other surveys and more specialized works.
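To make the CP idea from the abstract concrete, here is a minimal base-R sketch (mine, not the survey’s): a 3-way tensor built as the sum of two rank-one tensors, each an outer product of one vector per mode.

  # CP intuition: a rank-R tensor is a sum of R rank-one tensors,
  # each the outer product of one vector per mode.
  rank_one <- function(a, b, c) outer(outer(a, b), c)

  a1 <- c(1, 2); b1 <- c(0, 1, 1); c1 <- c(2, 1)
  a2 <- c(1, 0); b2 <- c(1, 0, 1); c2 <- c(0, 3)

  X <- rank_one(a1, b1, c1) + rank_one(a2, b2, c2)
  dim(X)  # 2 3 2 -- a 3-way array of rank (at most) 2

CP decomposition runs this construction in reverse: given X, recover the vectors.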

I found this shortly after discovering the post I cover in: Tensors and Their Applications…

As I said in the earlier post, this has a lot of promise.

Although it isn’t yet clear to me how you would compare or contrast tensors with different dimensions, and perhaps even a different number of dimensions.

Still, there is a lot of reading to do, so perhaps I just haven’t reached that point yet.

Massive online data stream mining with R

Filed under: Data Mining,Data Streams,R — Patrick Durusau @ 5:31 am

Massive online data stream mining with R

From the post:

A few weeks ago, the stream package was released on CRAN. It allows you to do real-time analytics on data streams. This can be very useful if you are working with large datasets which are already hard to fit in RAM completely, let alone build some statistical model on without getting into RAM problems.

The stream package is currently focused on clustering algorithms available in MOA (http://moa.cms.waikato.ac.nz/details/stream-clustering/) and also eases interfacing with some clustering algorithms already available in R which are suited for data stream clustering. Classification algorithms based on MOA are on the todo list. Currently available clustering algorithms are BIRCH, CluStream, ClusTree, DBSCAN, DenStream, Hierarchical, Kmeans and Threshold Nearest Neighbor.
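Before the questions below, a sketch of what using the package looks like. This assumes the DSD (data stream data) / DSC (data stream clustering) interface as documented for the package at the time; check the current documentation before relying on it.

  # Minimal stream-package sketch.
  library(stream)

  # A simulated stream: points drawn from 3 Gaussian clusters in 2-D.
  dsd <- DSD_Gaussians(k = 3, d = 2)

  # An online clustering algorithm; D-Stream is one of those listed above.
  dsc <- DSC_DStream(gridsize = 0.1)

  # Consume 1,000 points from the stream, updating clusters incrementally.
  update(dsc, dsd, n = 1000)

  # Compare the micro-clusters against a fresh sample from the stream.
  plot(dsc, dsd)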

What if data were always encountered as a stream?

You could request a “re-streaming” of the data, but it is best to do your analysis in one pass.

How would that impact your notion of subject identity?

How would you compensate for information learned later in the stream?

March 23, 2013

Data Mining and Visualization: Bed Bug Edition

Filed under: Data Mining,Graphics,Visualization — Patrick Durusau @ 12:49 pm

Data Mining and Visualization: Bed Bug Edition by Brooke Borel.

A very good example of data mining and visualization making a compelling case for conventional wisdom being wrong!

What I wonder about, and what isn’t shown by the graphics, is what relationships, if any, existed between the authors of papers on bed bugs.

Were there communities, so to speak, of bed bug authors who cited each other, but not authors from parallel bed bug communities?

Not to mention the usual semantic gaps between authors from different traditions.

It sounds like Brooke is going to make a compelling read about all things bed bug!

The power of data mining!

March 20, 2013

Scenes from a Dive

Filed under: BigData,Data Mining,Open Data,Public Data — Patrick Durusau @ 10:27 am

Scenes from a Dive – what’s big data got to do with fighting poverty and fraud? by Prasanna Lal Das.

From the post:

A more detailed recap will follow soon but here’s a very quick hats off to the about 150 data scientists, civic hackers, visual analytics savants, poverty specialists, and fraud/anti-corruption experts that made the Big Data Exploration at Washington DC over the weekend such an eye-opener. We invite you to explore the work that the volunteers did (these are rough documents and will likely change as you read them so it’s okay to hold off if you would rather wait for a ‘final’ consolidated document). The projects that the volunteers worked on include:

Here are some visualizations that some project teams built. A few photos from the event are here (thanks @neilfantom). More coming soon (and yes, videos too!). Thanks @francisgagnon for the first blog about the event. The event hashtag was #data4good (follow @datakind and @WBopenfinances for more updates on Twitter).

Great meeting and projects, but I would suggest a different sort of “big data.”

A better start: require recipients to grant reporting access to all bank accounts receiving transferred funds, and require the same of any entity paid out of those accounts, down to the point where transfers to any entity (or related entity) total less than $1,000 over 90 days.

With the exception of the “related entity” information, banks already keep transfer of funds information as a matter of routine business. It would be “big data” that is rich in potential for spotting fraud and waste.

The reporting banks should also be required to deliver any other banking records they hold on the accounts receiving funds, along with records of other activity in those accounts.

Before crying “invasion of privacy,” remember World Bank funding is voluntary.

As is acceptance of payment from World Bank funded projects. Anyone and everyone is free to decline such funding and avoid the proposed reporting requirements.

“Big data” to track fraud and waste is already collected by the banking industry.

The question is whether we will use that “big data” to effectively track fraud and waste, or wait for particularly egregious cases to come to light.

March 19, 2013

Knowledge Discovery from Mining Big Data [Astronomy]

Filed under: Astroinformatics,BigData,Data Mining,Knowledge Discovery — Patrick Durusau @ 10:17 am

Knowledge Discovery from Mining Big Data – Presentation by Kirk Borne by Bruce Berriman.

From the post:

My friend and colleague Kirk Borne, of George Mason University, is a specialist in the modern field of data mining and astroinformatics. I was delighted to learn that he was giving a talk on an introduction to this topic as part of the Space Telescope Engineering and Technology Colloquia, and so I watched on the webcast. You can watch the presentation on-line, and you can download the slides from the same page. The presentation is a comprehensive introduction to data mining in astronomy, and I recommend it if you want to grasp the essentials of the field.

Kirk began by reminding us that responding to the data tsunami is a national priority in essentially all fields of science – a number of nationally commissioned working groups have been unanimous in reaching this conclusion and in emphasizing the need for scientific and educational programs in data mining. The slides give a list of publications in this area.

Deeply entertaining presentation on big data.

The first thirty minutes or so are good for “big data” quotes and hype, but the real meat comes at about slide 22.

Kirk extends the 3 V’s (Volume, Variety, Velocity) to include Veracity, Variability, Venue, Vocabulary, and Value.

And outlines classes of discovery:

  • Class Discovery
    • Finding new classes of objects and behaviors
    • Learning the rules that constrain the class boundaries
  • Novelty Discovery
    • Finding new, rare, one-in-a-million(billion)(trillion) objects and events
  • Correlation Discovery
    • Finding new patterns and dependencies, which reveal new natural laws or new scientific principles
  • Association Discovery
    • Finding unusual (improbable) co-occurring associations

A great presentation with references and other names you will want to follow on big data and astroinformatics.

March 16, 2013

Finding Shakespeare’s Favourite Words With Data Explorer

Filed under: Data Explorer,Data Mining,Excel,Microsoft,Text Mining — Patrick Durusau @ 2:07 pm

Finding Shakespeare’s Favourite Words With Data Explorer by Chris Webb.

From the post:

The more I play with Data Explorer, the more I think my initial assessment of it as a self-service ETL tool was wrong. As Jamie pointed out recently, it’s really the M language with a GUI on top of it and the GUI itself, while good, doesn’t begin to expose the power of the underlying language: I’d urge you to take a look at the Formula Language Specification and Library Specification documents which can be downloaded from here to see for yourself. So while it can certainly be used for self-service ETL it can do much, much more than that…

In this post I’ll show you an example of what Data Explorer can do once you go beyond the UI. Starting off with a text file containing the complete works of William Shakespeare (which can be downloaded from here – it’s strange to think that it’s just a 5.3 MB text file) I’m going to find the top 100 most frequently used words and display them in a table in Excel.
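For comparison, the same top-100 word count takes only a few lines of base R (a sketch of mine, not Chris’s Data Explorer code; the file name stands in for wherever you saved the download):

  # Top 100 most frequent words in the complete works, base-R version.
  text  <- readLines("shakespeare.txt", warn = FALSE)
  words <- tolower(unlist(strsplit(text, "[^A-Za-z']+")))
  words <- words[nchar(words) > 0]  # drop empty splits
  head(sort(table(words), decreasing = TRUE), 100)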

If Data Explorer is a GUI on top of M (the linked specification is outdated, but it is a point of origin), it goes up in importance.

From the M link:

The Microsoft code name “M” Modeling Language, hereinafter referred to as M, is a language for modeling domains using text. A domain is any collection of related concepts or objects. Modeling a domain consists of selecting certain characteristics to include in the model and implicitly excluding others deemed irrelevant. Modeling using text has some advantages and disadvantages over modeling using other media such as diagrams or clay. A goal of the M language is to exploit these advantages and mitigate the disadvantages.

A key advantage of modeling in text is the ease with which both computers and humans can store and process text. Text is often the most natural way to represent information for presentation and editing by people. However, the ability to extract that information for use by software has been an arcane art practiced only by the most advanced developers. The language features of M enable information to be represented in a textual form that is tuned for both the problem domain and the target audience. The M language provides simple constructs for describing the shape of a textual language – that shape includes the input syntax as well as the structure and contents of the underlying information. To that end, M acts as both a schema language that can validate that textual input conforms to a given language as well as a transformation language that projects textual input into data structures that are amenable to further processing or storage.

I try to not run examples using Shakespeare. I get distracted by the elegance of the text, which isn’t the point of the exercise. 😉

March 10, 2013

SPMF

Filed under: Algorithms,Data Mining — Patrick Durusau @ 3:14 pm

SPMF: A Sequential Pattern Mining Framework

From the webpage:

SPMF is an open-source data mining platform written in Java.

It is distributed under the GPL v3 license.

It offers implementations of 52 data mining algorithms for:

  • sequential pattern mining,
  • association rule mining,
  • frequent itemset mining,
  • sequential rule mining,
  • clustering

It can be used as a standalone program with a user interface or from the command line. Moreover, the source code of each algorithm can be integrated in other Java software.

The documentation consists entirely of examples of using SPMF for data mining tasks.

The algorithms page details the fifty-two (52) algorithms of SPMF with references to the literature.
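If you want a feel for the standalone route: the project documentation shows command-line invocations of the shape below (algorithm name, input file, output file, parameters). Treat the file names and threshold here as illustrative rather than definitive.

  java -jar spmf.jar run PrefixSpan contextPrefixSpan.txt output.txt 50%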

I first saw this at: SPMF: Sequential Pattern Mining Framework.

March 8, 2013

Crossfilter

Filed under: Data Mining,Dataset,Filters,Javascript,Top-k Query Processing — Patrick Durusau @ 4:34 pm

Crossfilter: Fast Multidimensional Filtering for Coordinated Views

From the webpage:

Crossfilter is a JavaScript library for exploring large multivariate datasets in the browser. Crossfilter supports extremely fast (<30ms) interaction with coordinated views, even with datasets containing a million or more records; we built it to power analytics for Square Register, allowing merchants to slice and dice their payment history fluidly.

Since most interactions only involve a single dimension, and then only small adjustments are made to the filter values, incremental filtering and reducing is significantly faster than starting from scratch. Crossfilter uses sorted indexes (and a few bit-twiddling hacks) to make this possible, dramatically increasing the performance of live histograms and top-K lists. For more details on how Crossfilter works, see the API reference.
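The sorted-index idea is easy to illustrate outside JavaScript. A toy sketch of the concept in R (the concept only, not Crossfilter’s implementation): once a dimension is sorted, a range filter is two binary searches rather than a scan of every record.

  # Range-filter a million values via a sorted index.
  x  <- sort(runif(1e6))

  # findInterval() is a binary search: count of values <= the cutpoint.
  lo <- findInterval(0.25, x) + 1  # first index with x > 0.25
  hi <- findInterval(0.75, x)      # last index with x <= 0.75

  sel <- x[lo:hi]                  # records passing the range filter
  range(sel)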

See the webpage for an impressive demonstration with a 5.3 MB dataset.

Is there a trend towards “big data” manipulation on clusters and “less big data” in browsers?

It will be interesting to see how the benchmarks for “big” and “less big” move over time.

I first saw this in Nat Torkington’s Four Short links: 4 March 2013.

March 5, 2013

A nice collaborative filtering tutorial “for dummies”

Filed under: Data Mining,Filters — Patrick Durusau @ 2:12 pm

A nice collaborative filtering tutorial “for dummies”

Danny Bickson writes:

I got from M. Burhan, one of our GraphChi users from Germany, the following link to an online book called A Programmer’s Guide to Data Mining.

There are two relevant chapters that may help beginners understand the basic concepts.

They are Chapter 2: Collaborative Filtering and Chapter 3: Implicit Ratings and Item Based Filtering.
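The heart of the collaborative filtering chapter is similarity between users or items. Here is the core computation as a minimal R sketch (mine; the book’s examples are in Python):

  # Item-based collaborative filtering core: cosine similarity between
  # two item columns of a toy user x item ratings matrix (NA = unrated).
  ratings <- matrix(c(5, 3, NA, 4,
                      4, NA, 4, 3,
                      1, 1, 5, NA),
                    nrow = 3, byrow = TRUE)

  cosine <- function(a, b) {
    ok <- !is.na(a) & !is.na(b)  # use only co-rated entries
    sum(a[ok] * b[ok]) / (sqrt(sum(a[ok]^2)) * sqrt(sum(b[ok]^2)))
  }

  cosine(ratings[, 1], ratings[, 4])  # similarity of items 1 and 4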

February 28, 2013

Public Preview of Data Explorer

Filed under: Data Explorer,Data Mining,Microsoft — Patrick Durusau @ 5:26 pm

Public Preview of Data Explorer by Chris Webb.

From the post:

In a nutshell, Data Explorer is self-service ETL for the Excel power user – it is to SSIS what PowerPivot is to SSAS. In my opinion it is just as important as PowerPivot for Microsoft’s self-service BI strategy.

I’ll be blogging about it in detail over the coming days (and also giving a quick demo in my PASS Business Analytics Virtual Chapter session tomorrow), but for now here’s a brief list of things it gives you over Excel’s native functionality for importing data:

  • It supports a much wider range of data sources, including Active Directory, Facebook, Wikipedia, Hive, and tables already in Excel
  • It has better functionality for data sources that are currently supported, such as the Azure Marketplace and web pages
  • It can merge data from multiple files that have the same structure in the same folder
  • It supports different types of authentication and the storing of credentials
  • It has a user-friendly, step-by-step approach to transforming, aggregating and filtering data until it’s in the form you want
  • It can load data into the worksheet or direct into the Excel model

There’s a lot to it, so download it and have a play! It’s supported on Excel 2013 and Excel 2010 SP1.

Download: Microsoft “Data Explorer” Preview for Excel

Chris has collected a number of links to Data Explorer resources so look to his post for more details.

It looks like a local install is required for the preview. I have been meaning to add Windows 7, plus MS Office, to a VM.

Guess it may be time to take the plunge. 😉 (I have XP/Office on a separate box that uses the same monitors/keyboard, but sharing data is problematic.)

ICDM 2013: IEEE International Conference on Data Mining

Filed under: Conferences,Data Mining — Patrick Durusau @ 1:31 pm

ICDM 2013: IEEE International Conference on Data Mining December 8-11, 2013, Dallas, Texas.

Dates:

  • Workshop proposals: April 2
  • Workshop notification: April 30
  • ICDM contest proposals: April 30
  • Full paper submissions: June 21
  • Demo and tutorial proposals: August 3
  • Workshop paper submissions: August 3
  • Conference paper, tutorial, demo notifications: September 20
  • Workshop paper notifications: September 24
  • Conference dates: December 8-11 (Sunday-Wednesday)

From the call for papers:

The IEEE International Conference on Data Mining (ICDM) has established itself as the world's premier research conference in data mining. The 13th ICDM conference (ICDM '13) provides a premier forum for the dissemination of innovative, practical development experiences as well as original research results in data mining, spanning applications, algorithms, software and systems. The conference draws researchers and application developers from a wide range of data mining related areas such as statistics, machine learning, pattern recognition, databases and data warehousing, data visualization, knowledge-based systems and high performance computing. By promoting high quality and novel research findings, and innovative solutions to challenging data mining problems, the conference seeks to continuously advance the state of the art in data mining. As an important part of the conference, the workshops program will focus on new research challenges and initiatives, and the tutorials program will cover emerging data mining technologies and the latest developments in data mining.

Topics of Interest

Topics related to the design, analysis and implementation of data mining theory, systems and applications are of interest. These include, but are not limited to the following areas:

  • Foundations of data mining
  • Data mining and machine learning algorithms and methods in traditional areas (such as classification, regression, clustering, probabilistic modeling, and association analysis), and in new areas
  • Mining text and semi-structured data, and mining temporal, spatial and multimedia data
  • Mining data streams
  • Mining spatio-temporal data
  • Mining with data clouds and Big Data
  • Link and graph mining
  • Pattern recognition and trend analysis
  • Collaborative filtering/personalization
  • Data and knowledge representation for data mining
  • Query languages and user interfaces for mining
  • Complexity, efficiency, and scalability issues in data mining
  • Data pre-processing, data reduction, feature selection and feature transformation
  • Post-processing of data mining results
  • Statistics and probability in large-scale data mining
  • Soft computing (including neural networks, fuzzy logic, evolutionary computation, and rough sets) and uncertainty management for data mining
  • Integration of data warehousing, OLAP and data mining
  • Human-machine interaction and visual data mining
  • High performance and parallel/distributed data mining
  • Quality assessment and interestingness metrics of data mining results
  • Visual Analytics
  • Security, privacy and social impact of data mining
  • Data mining applications in bioinformatics, electronic commerce, Web, intrusion detection, finance, marketing, healthcare, telecommunications and other fields

I saw a post recently that made the case for data mining being the next “hot” topic in cybersecurity.

As in data mining that can track you across multiple social media sites, old email posts, etc.

Curious that it is always phrased in terms of government or big corporations spying on little people.

Since there are a lot more “little people,” shouldn’t crowd-sourced data mining of governments and big corporations work the other way too?

And for that matter, take the BLM (Bureau of Land Management): there really isn’t any “government” or “government agency” that is responsible for harm to the public’s welfare.

There are specific people with relationships to the oil and gas industry, meetings, etc.

Let’s use data mining to pierce the government veil!

February 26, 2013

AstroML: data mining and machine learning for Astronomy

Filed under: Astroinformatics,Data Mining,Machine Learning — Patrick Durusau @ 1:53 pm

AstroML: data mining and machine learning for Astronomy by Jake Vanderplas, Alex Gray, Andrew Connolly and Zeljko Ivezic.

Description:

Python is currently being adopted as the language of choice by many astronomical researchers. A prominent example is in the Large Synoptic Survey Telescope (LSST), a project which will repeatedly observe the southern sky 1000 times over the course of 10 years. The 30,000 GB of raw data created each night will pass through a processing pipeline consisting of C++ and legacy code, stitched together with a python interface. This example underscores the need for astronomers to be well-versed in large-scale statistical analysis techniques in python. We seek to address this need with the AstroML package, which is designed to be a repository for well-tested data mining and machine learning routines, with a focus on applications in astronomy and astrophysics. It will be released in late 2012 with an associated graduate-level textbook, ‘Statistics, Data Mining and Machine Learning in Astronomy’ (Princeton University Press). AstroML leverages many computational tools already available in the python universe, including numpy, scipy, scikit-learn, pymc, healpy, and others, and adds efficient implementations of several routines more specific to astronomy. A main feature of the package is the extensive set of practical examples of astronomical data analysis, all written in python. In this talk, we will explore the statistical analysis of several interesting astrophysical datasets using python and astroML.

AstroML at Github:

AstroML is a Python module for machine learning and data mining built on numpy, scipy, scikit-learn, and matplotlib, and distributed under the 3-clause BSD license. It contains a growing library of statistical and machine learning routines for analyzing astronomical data in python, loaders for several open astronomical datasets, and a large suite of examples of analyzing and visualizing astronomical datasets.

The goal of astroML is to provide a community repository for fast Python implementations of common tools and routines used for statistical data analysis in astronomy and astrophysics, to provide a uniform and easy-to-use interface to freely available astronomical datasets. We hope this package will be useful to researchers and students of astronomy. The astroML project was started in 2012 to accompany the book Statistics, Data Mining, and Machine Learning in Astronomy by Zeljko Ivezic, Andrew Connolly, Jacob VanderPlas, and Alex Gray, to be published in early 2013.

The book, Statistics, Data Mining, and Machine Learning in Astronomy by Zeljko Ivezic, Andrew Connolly, Jacob VanderPlas, and Alex Gray, is not yet listed by Princeton University Press. 🙁

I have subscribed to their notice service and will post a note when it appears.

February 22, 2013

…Obtaining Original Data from Published Graphs and Plots

Filed under: Data,Data Mining,Graphs — Patrick Durusau @ 2:13 pm

A Simple Method for Obtaining Original Data from Published Graphs and Plots

From the post:

Was thinking of how to extract data points for infant age and weight distribution from a printed graph and I landed at this old paper http://www.ajronline.org/content/174/5/1241.full. It pointed me to NIH Image, which reminds me of an old software I used to use for lab practicals as an undergrad … and upon reaching the NIH Image site, indeed! ImageJ is an ‘update’ of sorts to the NIH Image software …

The “old paper?” “A Simple Method for Obtaining Original Data from Published Graphs and Plots,” by Chris L. Sistrom and Patricia J. Mergo, American Journal of Roentgenology, May 2000 vol. 174 no. 5 1241-1244.

Update to the URI in the article: http://rsb.info.nih.gov/nih-image/ is correct. (The original URI is missing a hyphen.)

The mailing list archives don’t show much traffic for the last several years.

When you need to harvest data from published graphs/plots, what do you use?

February 17, 2013

Video: Data Mining with R

Filed under: Data Mining,R — Patrick Durusau @ 8:17 pm

Video: Data Mining with R by David Smith.

From the post:

Yesterday's Introduction to R for Data Mining webinar was a record setter, with more than 2000 registrants and more than 700 attending the live session presented by Joe Rickert. If you missed it, I've embedded the video replay below, and Joe's slides (with links to many useful resources) are also available.

During the webinar, Joe demoed several examples of data mining with R packages, including rattle, caret, and RevoScaleR from Revolution R Enterprise. If you want to adapt Joe's demos for your own data mining ends, Joe has made his scripts and data files available for download on GitHub.

Glad this showed up! I accidentally missed the webinar.

Enjoy!

February 16, 2013

Deep Inside: A Study of 10,000 Porn Stars and Their Careers

Filed under: Data,Data Mining,Porn — Patrick Durusau @ 4:49 pm

Deep Inside: A Study of 10,000 Porn Stars and Their Careers by Jon Millward.

From the post:

For the first time, a massive data set of 10,000 porn stars has been extracted from the world’s largest database of adult films and performers. I’ve spent the last six months analyzing it to discover the truth about what the average performer looks like, what they do on film, and how their role has evolved over the last forty years.

I can now name the day when I became aware of the Internet Adult Film Database: today!

When you get through grinning, go take a look at the post. This is serious data analysis.

Complete with an idealized porn star face composite from the most popular porn stars.

Improve your trivia skills: What two states in the United States have one porn star each in the Internet Adult Film Database? (Jon has a map of the U.S. with distribution of porn stars.)

A full report with more details about the analysis is forthcoming.

I first saw this at Porn star demographics by Nathan Yau.

February 15, 2013

DataDive to Fight Poverty and Corruption with the World Bank!

Filed under: Data Mining,Government,Transparency — Patrick Durusau @ 5:44 am

DataDive to Fight Poverty and Corruption with the World Bank!

From the post:

We’re thrilled to announce a huge new DataKind DataDive coming to DC the weekend of 3/15! We’re teaming up with the World Bank to put a dent in some of the most serious problems in poverty and corruption through the use of data. Low bar, right?

We’re calling on all socially conscious analysts, statisticians, data scientists, coders, hackers, designers, or eager-to-learn do-gooders to come out with us on the weekend of 3/15 to work with data to improve the world. You’ll be working alongside experts in the field to analyze, visualize, and mashup the most cutting-edge data from the World Bank, UN, and other sources to improve poverty monitoring and root out corruption. We’ve started digging into the data a little ourselves and we’re already so excited for how cool this event is going to be. “Oh, what’d you do this weekend? I reduced global poverty and rooted out corruption. No big deal.”

BTW, there is an Open Data Day on 2/23 to prepare for the DataDive on 3/15.

What isn’t clear from the announcement(s) is what data is to be mined to fight poverty and corruption.

Or what is meant by “corruption?”

Graph solutions, for example, would be better at tracking American-style corruption that shuns quid pro quo in favor of a community of interest of the wealthy and well-connected.

Such communities aren’t any less corrupt than members of government with cash in their freezers, just less obvious.

February 10, 2013

Call for KDD Cup Competition Proposals

Filed under: Contest,Data Mining,Dataset,Knowledge Discovery — Patrick Durusau @ 1:17 pm

Call for KDD Cup Competition Proposals

From the post:

Please let us know if you are interested in being considered for the 2013 KDD Cup Competition by filling out the form below.

This is the official call for proposals for the KDD Cup 2013 competition. The KDD Cup is the well-known data mining competition of the annual ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. The KDD-2013 conference will be held in Chicago from August 11 – 14, 2013. The competition will last between 6 and 8 weeks and the winners should be notified by the end of June. The winners will be announced at the KDD-2013 conference and we are planning to run a workshop as well.

A good competition task is one that is practically useful, scientifically or technically challenging, can be done without extensive application domain knowledge, and can be evaluated objectively. Of particular interest are non-traditional tasks/data that require novel techniques and/or thoughtful feature construction.

Proposals should involve data and a problem whose successful completion will result in a contribution of some lasting value to a field or discipline. You may assume that Kaggle will provide the technical support for running the contest. The data needs to be available no later than mid-March.

If you have initial questions about the suitability of your data/problem feel free to reach out to claudia.perlich [at] gmail.com.

Do you have:

non-traditional tasks/data that require[s] novel techniques and/or thoughtful feature construction?

Is collocation of information on the basis of multi-dimensional subject identity a non-traditional task?

Does extraction of multiple dimensions of a subject identity from users require novel techniques?

If so, what data sets would you suggest using in this challenge?

I first saw this at: 19th ACM SIGKDD Knowledge Discovery and Data Mining Conference.

February 6, 2013

Simon Rogers

Filed under: Data Mining,Journalism,News — Patrick Durusau @ 2:56 pm

Simon Rogers

From the “about” page:

Simon Rogers is editor of guardian.co.uk/data, an online data resource which publishes hundreds of raw datasets and encourages its users to visualise and analyse them – and probably the world’s most popular data journalism website.

He is also a news editor on the Guardian, working with the graphics team to visualise and interpret huge datasets.

He was closely involved in the Guardian’s exercise to crowdsource 450,000 MP expenses records and the organisation’s coverage of the Afghanistan and Iraq Wikileaks war logs. He was also a key part of the Reading the Riots team which investigated the causes of the 2011 England disturbances.

Previously he was the launch editor of the Guardian’s online news service and has edited the paper’s science section. He has edited three Guardian books, including How Slow Can You Waterski and The Hutton Inquiry and its impact.

If you are interested in “data journalism,” data mining or visualization, Simon’s site is one of the first to bookmark.

Introduction To R For Data Mining

Filed under: Data Mining,R — Patrick Durusau @ 1:42 pm

Introduction To R For Data Mining

Date: Thursday, February 14, 2013
Time: 10:00am – 11:00am Pacific Time
Presenter: Joseph Rickert, Technical Marketing Manager, Revolution Analytics

From the post:

We at Revolution Analytics are often asked “What is the best way to learn R?” While acknowledging that there may be as many effective learning styles as there are people, we have identified three factors that greatly facilitate learning R. For a quick start:

  • Find a way of orienting yourself in the open source R world
  • Have a definite application area in mind
  • Set an initial goal of doing something useful and then build on it

In this webinar, we focus on data mining as the application area and show how anyone with just a basic knowledge of elementary data mining techniques can become immediately productive in R. We will:

  • Provide an orientation to R’s data mining resources
  • Show how to use the "point and click" open source data mining GUI, rattle, to perform the basic data mining functions of exploring and visualizing data, building classification models on training data sets, and using these models to classify new data.
  • Show the simple R commands to accomplish these same tasks without the GUI
  • Demonstrate how to build on these fundamental skills to gain further competence in R
  • Move away from using small test data sets and show how, with the same level of skill, one could analyze some fairly large data sets with RevoScaleR

Data scientists and analysts using other statistical software as well as students who are new to data mining should come away with a plan for getting started with R.

You have to do something while waiting for your significant other to get off work on Valentine’s Day. 😉

So long as you don’t try to watch the webinar on a smart phone at the restaurant, you should be ok.
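If you can’t wait for the 14th, the “simple R commands” route in the outline above boils down to something like this (a minimal sketch of mine using rpart; the webinar’s own demos use rattle, caret and RevoScaleR):

  # Build a classification model on training data, then classify new data.
  library(rpart)

  set.seed(42)
  idx   <- sample(nrow(iris), 100)
  train <- iris[idx, ]   # training set
  test  <- iris[-idx, ]  # "new" data to classify

  fit  <- rpart(Species ~ ., data = train)
  pred <- predict(fit, newdata = test, type = "class")
  table(pred, test$Species)  # predictions vs. actual species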


Update: Video of the webinar: An Introduction to R for Data Mining.

February 3, 2013

Making Sense of Others’ Data Structures

Filed under: Data Mining,Data Structures,Identity,Subject Identity — Patrick Durusau @ 6:58 pm

Making Sense of Others’ Data Structures by Eruditio Loginquitas.

From the post:

Coming in as an outsider to others’ research always requires an investment of time and patience. After all, how others conceptualize their fields, and how they structure their questions and their probes, and how they collect information, and then how they represent their data all reflect their understandings, their theoretical and analytical approaches, their professional training, and their interests. When professionals collaborate, they will approach a confluence of understandings and move together in a semi-united way. Individual researchers—not so much. But either way, for an outsider, there will have to be some adjustment to understand the research and data. Professional researchers strive to control for error and noise at every stage of the research: the hypothesis, literature review, design, execution, publishing, and presentation.

Coming into a project after the data has been collected and stored in Excel spreadsheets means that the learning curve is high in yet another way: data structures. While the spreadsheet itself seems pretty constrained and defined, there is no foregone conclusion that people will necessarily represent their data a particular way.

Data structures as subjects. What a concept! 😉

Data structures, contrary to some, are not self-evident or self-documenting.

Not to mention that, like ourselves, they are in a constant state of evolution as our understanding or perception of data changes.

Mine is not the counsel of despair, but of encouragement: consider the costs/benefits of capturing data structure subject identities just as you would for more traditional subjects.

It may be that costs or other constraints prevent such capture, but you may also miss benefits if you don’t ask.

How much did it cost, at each transition between episodic data governance efforts, to re-establish data structure subject identities?

It could be that more money spent now would get an enterprise off the perpetual cycle of data governance.

Need to discover, access, analyze and visualize big and broad data? Try F#.

Filed under: Data Analysis,Data Mining,F#,Microsoft — Patrick Durusau @ 6:58 pm

Need to discover, access, analyze and visualize big and broad data? Try F#. by Oliver Bloch.

From the post:

Microsoft Research just released a new iteration of Try F#, a set of tools designed to make it easy for anyone – not just developers – to learn F# and take advantage of its big data, cross-platform capabilities.

F# is the open-source, cross-platform programming language invented by Don Syme and his team at Microsoft Research to help reduce the time-to-deployment for analytical software components in the modern enterprise.

Big data definitely is big these days and we are excited about this new iteration of Try F#. Regardless of your favorite language, or if you’re on a Mac, a Windows PC, Linux or Android, if you need to deal with complex problems, you will want to take a look at F#!

Kerry Godes from Microsoft’s Openness Initiative connected with Evelyne Viegas, Director of Semantic Computing at Microsoft Research, to find out more about how you can use “Try F# to seamlessly discover, access, analyze and visualize big and broad data.” For the complete interview, go to the Openness blog or check out www.tryfsharp.org to get started “writing simple code for complex problems”.

Are you an F# user?

Curious how F# compares to other languages for “complexity?”

Visualization gurus: Does the complexity of languages go up or down with the complexity of licensing terms?

Inquiring minds want to know. 😉

January 24, 2013

What tools do you use for information gathering and publishing?

Filed under: Data Mining,Publishing,Text Mining — Patrick Durusau @ 8:07 pm

What tools do you use for information gathering and publishing? by Mac Slocum.

From the post:

Many apps claim to be the pinnacle of content consumption and distribution. Most are a tangle of silly names and bad interfaces, but some of these tools are useful. A few are downright empowering.

Finding those good ones is the tricky part. I queried O’Reilly colleagues to find out what they use and why, and that process offered a decent starting point. We put all our notes together into this public Hackpad — feel free to add to it. I also went through and plucked out some of the top choices. Those are posted below.

Information gathering, however humble it may be, is the start of any topic map authoring project.

Mac asks for the tools you use every week.

Let’s not disappoint him!

January 23, 2013

Top 10 Formulas for Aspiring Analysts

Filed under: Data Mining,Excel — Patrick Durusau @ 7:41 pm

Top 10 Formulas for Aspiring Analysts by Purna “Chandoo” Duggirala.

From the post:

A few weeks ago, someone asked me “What are the top 10 formulas?” That got me thinking.

While each of us has our own list of favorite, most frequently used formulas, there is no standard list of top 10 formulas for everyone. So, today let me attempt that.

If you want to become a data or business analyst then you must develop a good understanding of Excel formulas & become fluent in them.

A good analyst should be familiar with the 10 formulas below, to begin with.

A reminder that not all data analysis starts with the most complex chain of transformations you can imagine.

Sometimes you need to explore and then roll out the heavy weapons.

January 19, 2013

The Pacific Symposium on Biocomputing 2013 [Proceedings]

Filed under: Bioinformatics,Biomedical,Data Mining,Text Mining — Patrick Durusau @ 7:09 pm

The Pacific Symposium on Biocomputing 2013 by Will Bush.

From the post:

For 18 years now, computational biologists have convened on the beautiful islands of Hawaii to present and discuss research emerging from new areas of biomedicine. PSB Conference Chairs Teri Klein (@teriklein), Keith Dunker, Russ Altman (@Rbaltman) and Larry Hunter (@ProfLHunter) organize innovative sessions and tutorials that are always interactive and thought-provoking. This year, sessions included Computational Drug Repositioning, Epigenomics, Aberrant Pathway and Network Activity, Personalized Medicine, Phylogenomics and Population Genomics, Post-Next Generation Sequencing, and Text and Data Mining. The Proceedings are available online here, and a few of the highlights are:

See Will’s post for the highlights. Or browse the proceedings. You are almost certainly going to find something relevant to you.

Do note Will’s use of Twitter IDs as identifiers. Unique, persistent (I assume Twitter doesn’t re-assign them), easy to access.

It wasn’t clear from Will’s post if the following image was from Biocomputing 2013 or if he stopped by a markup conference. Hard to tell. 😉

Biocomputing 2013

January 17, 2013

Machine Learning and Data Mining – Association Analysis with Python

Machine Learning and Data Mining – Association Analysis with Python by Marcel Caraciolo.

From the post:

Recently I’ve been working with recommender systems and association analysis. The latter, especially, is one of the most used machine learning techniques for extracting hidden relationships from large datasets.

The famous example related to the study of association analysis is the story of the baby diapers and beers. This story reports that a certain grocery store in the Midwest of the United States increased their beer sales by putting beer near where the diapers were placed. In fact, what happened is that the association rules pointed out that men bought diapers and beers on Thursdays. So the store could have profited by placing those products together, which would increase the sales.

Association analysis is the task of finding interesting relationships in large data sets. These hidden relationships are then expressed as a collection of association rules and frequent item sets. Frequent item sets are simply a collection of items that frequently occur together. And association rules suggest a strong relationship that exists between two items.
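Marcel’s examples are in Python. For the R-inclined, the same Apriori-style mining is a few lines with the arules package (a parallel sketch, not the post’s code; the Groceries dataset ships with the package):

  # Association rule mining with arules.
  library(arules)
  data("Groceries")

  # Mine rules meeting minimum support and confidence thresholds.
  rules <- apriori(Groceries,
                   parameter = list(support = 0.01, confidence = 0.5))

  # The strongest "if X then Y" relationships, by lift.
  inspect(head(sort(rules, by = "lift"), 5))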

When I think of associations in a topic map, I assume I am at least starting with the roles and the players of those roles.

As this post demonstrates, that may be overly optimistic on my part.

What if I discover an association but not its type or the roles in it? And yet I still want to preserve the discovery for later use?

An incomplete association, as it were.

Suggestions?

January 16, 2013

Free Datascience books

Filed under: Data,Data Mining,Data Science — Patrick Durusau @ 7:55 pm

Free Datascience books by Carl Anderson.

From the post:

I’ve been impressed in recent months by the number and quality of free datascience/machine learning books available online. I don’t mean free as in some guy paid for a PDF version of an O’Reilly book and then posted it online for others to use/steal, but I mean genuine published books with a free online version sanctioned by the publisher. That is, “the publisher has graciously agreed to allow a full, free version of my book to be available on this site.”
Here are a few in my collection:

Any you would like to add to the list?

I first saw this in Four short links: 1 January 2013 by Nat Torkington.

January 13, 2013

Foundations of Rule Learning [A Topic Map Parable]

Filed under: Data Mining,Machine Learning,Topic Maps — Patrick Durusau @ 8:14 pm

Foundations of Rule Learning by Johannes Fürnkranz, Dragan Gamberger, and Nada Lavrač. ISBN: 978-3-540-75196-0 (print), 978-3-540-75197-7 (online).

From the Introduction:

Rule learning is not only one of the oldest but also one of the most intensively investigated, most frequently used, and best developed fields of machine learning. In more than 30 years of intensive research, many rule learning systems have been developed for propositional and relational learning, and have been successfully used in numerous applications. Rule learning is particularly useful in intelligent data analysis and knowledge discovery tasks, where the compactness of the representation of the discovered knowledge, its interpretability, and the actionability of the learned rules are of utmost importance for successful data analysis.

The aim of this book is to give a comprehensive overview of modern rule learning techniques in a unifying framework which can serve as a basis for future research and development. The book provides an introduction to rule learning in the context of other machine learning and data mining approaches, describes all the essential steps of the rule induction process, and provides an overview of practical systems and their applications. It also introduces a feature-based framework for rule learning algorithms which enables the integration of propositional and relational rule learning concepts.

The topic map parable comes near the end of the introduction where the authors note:

The book is written by authors who have been working in the field of rule learning for many years and who themselves developed several of the algorithms and approaches presented in the book. Although rule learning is assumed to be a well-established field with clearly defined concepts, it turned out that finding a unifying approach to present and integrate these concepts was a surprisingly difficult task. This is one of the reasons why the preparation of this book took more than 5 years of joint work.

A good deal of discussion went into the notation to use. The main challenge was to define a consistent notational convention to be used throughout the book because there is no generally accepted notation in the literature. The used notation is gently introduced throughout the book, and is summarized in Table I in a section on notational conventions immediately following this preface (pp. xi–xiii). We strongly believe that the proposed notation is intuitive. Its use enabled us to present different rule learning approaches in a unifying notation and terminology, hence advancing the theory and understanding of the area of rule learning.

Semantic diversity in rule learning was discovered and took five years to resolve.

Where n = all prior notations/terminologies, the solution was to create the n + 1 notation/terminology.

Understandable and certainly a major service to the rule learning community. The problem remains, how does one use the n + 1 notation/terminology to access prior (and forthcoming) literature in rule learning?

In its present form, the resolution of the prior notations and terminologies into the n + 1 terminology isn’t accessible to search, data, or bibliographic engines.

Not to mention that on the next survey of rule learning, its authors will have to duplicate the work already accomplished by these authors.

Something about the inability to re-use the valuable work done by these authors, either to improve current information systems or to avoid duplication of effort in the future, seems wrong.

Particularly since it is avoidable through the use of topic maps.


The link at the top of this post is the “new and improved site,” which has less sample content than Foundations for Rule Learning, apparently an old and not improved site.

I first saw this in a post by Gregory Piatetsky.

January 11, 2013

Starting Data Analysis with Assumptions

Filed under: Data Analysis,Data Mining,Data Models — Patrick Durusau @ 7:33 pm

Why you don’t get taxis in Singapore when it rains? by Zafar Anjum.

From the post:

It is common experience that when it rains, it is difficult to get a cab in Singapore – even when you try to call one in or use your smartphone app to book one.

Why does it happen? What could be the reason behind it?

Most people would think that this unavailability of taxis during rain is because of high demand for cab services.

Well, Big Data has a very surprising answer for you, as astonishing as it was for researcher Oliver Senn.

When Senn was first given his assignment to compare two months of weather satellite data with 830 million GPS records of 80 million taxi trips, he was a little disappointed. “Everyone in Singapore knows it’s impossible to get a taxi in a rainstorm,” says Senn, “so I expected the data to basically confirm that assumption.” As he sifted through the data related to a vast fleet of more than 16,000 taxicabs, a strange pattern emerged: it appeared that many taxis weren’t moving during rainstorms. In fact, the GPS records showed that when it rained (a frequent occurrence in this tropical island state), many drivers pulled over and didn’t pick up passengers at all.

Senn did discover the reason for the patterns in the data, which is being addressed.

The first question should have been: Is this a big data problem?

True, Senn had lots of data to crunch, but that isn’t necessarily an indicator of a big data problem.

Interviews with a few taxi drivers would have dispelled the original assumption of high demand for taxis. They would also have led to the cause of the patterns Senn recognized.

That is, the patterns were a symptom, not a cause.

I first saw this in So you want to be a (big) data hero? by Vinnie Mirchandani.
