Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

January 3, 2013

R and Data Mining: Examples and Case Studies (Update)

Filed under: Data Mining,R — Patrick Durusau @ 7:25 pm

R and Data Mining: Examples and Case Studies by Yanchang Zhao.

The PDF version now includes chapters 7 and 9 (on which see: Book “R and Data Mining: Examples and Case Studies” on CRAN [blank chapters]); now only the case study chapters are omitted.

You will also find the R code for the book and an “R Reference Card for Data Mining.”

Enjoy!

January 2, 2013

100 most read R posts for 2012 [No Data = No Topic Maps]

Filed under: Data Mining,R,Topic Maps — Patrick Durusau @ 11:42 am

Tal Galili writes in 100 most read R posts for 2012 (stats from R-bloggers) – big data, visualization, data manipulation, and other languages:

R-bloggers.com is now three years young. The site is an (unofficial) online journal of the R statistical programming environment, written by bloggers who agreed to contribute their R articles to the site.

Last year, I posted on the top 24 R posts of 2011. In this post I wish to celebrate R-bloggers’ third birthmonth by sharing with you:

  1. Links to the top 100 most read R posts of 2012
  2. Statistics on “how well” R-bloggers did this year
  3. My wishlist for the R community for 2013 (blogging about R, guest posts, and sponsors)

A number of posts on R that may prove useful for data mining to create topic maps.

I retain my interest in the theory/cutting-edge side of things. But discovering more than half a trillion dollars in untraceable payments in a government report is a thrill. See: The 560+ $Billion Shell Game.

It’s untraceable for members of the public. I am certain insiders at the OMB can trace it quite easily.

Which makes you wonder: why are they hoarding that information?

I will try to season the blog with more data-into-topic-maps posts in 2013.

Suggestions and comments on potential data sets for topic maps most welcome!

December 7, 2012

DMR – Data Mining and Reporting (blog)

Filed under: Data Mining,Knime — Patrick Durusau @ 7:26 pm

DMR – Data Mining and Reporting by Rosaria Silipo.

Data mining focused on KNIME.

I found it by following a reference from Sandro Saitta.

KNIME website (in case it is unfamiliar).

December 2, 2012

A Rickety Stairway to SQL Server Data Mining, Part 0.1: Data In, Data Out

Filed under: Data Mining,SQL,SQL Server — Patrick Durusau @ 7:46 pm

A Rickety Stairway to SQL Server Data Mining, Part 0.1: Data In, Data Out

A rather refreshing if anonymous take on statistics and data mining.

Since I can access SQL Servers in the cloud (without the necessity of maintaining a local Windows Server box), I thought I should look at data mining for SQL Servers.

This was one of the first posts I encountered.

In the first of a series of amateur tutorials on SQL Server Data Mining (SSDM), I promised to pull off an impossible stunt: explaining the broad field of statistics in a few paragraphs without the use of equations. What other SQL Server blog ends with a cliffhanger like that? Anyone who aims at incorporating data mining into their IT infrastructure or skill set in any substantial way is going to have to learn to interpret equations, but it is possible to condense a few key statistical concepts in a way that will help those who aren’t statisticians – like me – to make productive use of SSDM without them. These crude Cliff’s Notes can at least familiarize DBAs, programmers and other readers of these tutorials with the minimal bare bones concepts they will need to know in order to interpret the data output by SSDM’s nine algorithms, as well as to illuminate the inner workings of the algorithms themselves. Without that minimal foundation, it will be more difficult to extract useful meaning from your data mining efforts.

The first principle to keep in mind is so absurdly obvious that it is often half-consciously forgotten – perhaps because it is right before our noses – but it is indispensable to understanding both the field of statistics and the stats output by SSDM. To wit, the numbers signify something. Some intelligence assigned meaning to them. One of the biggest hurdles when interpreting statistical data, reading equations or learning a foreign language is the subtle, almost subconscious error of forgetting that these symbols reflect ideas in the head of another conscious human being, which probably correspond to ideas that you also have in your head, but simply lack the symbols to express. An Englishman learning to read or write Spanish, Portuguese, Russian or Polish may often forget that the native speakers of these languages are trying to express the exact same concepts that an English speaker would; they have the exact same ideas in their heads as we do, but communicate them quite differently. Quite often, the seemingly incoherent quirks and rules of a particular foreign language may actually be part of a complex structure designed to convey identical, ordinary ideas in a dissimilar, extraordinary way. It is the same way with mathematical equations: the scientists and mathematicians who use them are trying to convey ideas in the most succinct way they know. It is often easier for laymen to understand the ideas and supporting evidence that those equations are supposed to express, when they’re not particularly well-versed in the detailed language that equations represent. I’m a layman, like some of my readers probably are. My only claim to expertise in this area is that when I was in fourth grade, I learned enough about equations to solve the ones my father, a college physics teacher, taught every week – but then I forgot it all, so I found myself back at Square One when I took up data mining a few years back.

On a side note, it would be wise for anyone who works with equations regularly to consciously remind themselves that they are merely symbols representing ideas, rather than the other way around; a common pitfall among physicists and other scientists who work with equations regularly seems to be the Pythagorean heresy, i.e. the quasi-religious belief that reality actually consists of mathematical equations. It doesn’t. If we add two apples to two apples, we end up with four apples; the equation 2 + 2 = 4 expresses the nature and reality of several apples, rather than the apples merely being a stand-in for the equation. Reality is not a phantom that obscures some deep, dark equation underlying all we know; math is simply a shortcut to expressing certain truths about the external world. This danger is magnified when we pile abstraction on top of abstraction, which may lead to the construction of ivory towers that eventually fall, often spectacularly. This is a common hazard in the field of finance, where our economists often forget that money is just an abstraction based on agreements among large numbers of people to assign certain meanings to it that correspond to tangible, physical goods; all of the periodic financial crashes that have plagued Western civilization since Tulipmania have been accompanied by a distinct forgetfulness of this fact, which automatically produces the scourge of speculation. I’ve often wondered if this subtle mistake has also contributed to the rash of severe mental illness among mathematicians and physicists, with John Nash (of the film A Beautiful Mind), Nikola Tesla and Georg Cantor being among the most recognized names in a long list of victims. It may also be linked to the uncanny ineptitude of our most brilliant physicists and mathematicians when it comes to philosophy, such as Rene Descartes, Albert Einstein, Stephen Hawking and Alan Turing. In his most famous work, Orthodoxy, 20th Century British journalist G.K. Chesterton noticed the same pattern, which he summed up thus: “Poets do not go mad; but chess-players do. Mathematicians go mad, and cashiers; but creative artists very seldom. I am not, as will be seen, in any sense attacking logic: I only say that this danger does lie in logic, not in imagination.”[1] At a deeper level, some of the risk to mental health from excessive math may pertain to seeking patterns that aren’t really there, which may be closely linked to the madness underlying ancient “arts” of divination like haruspicy and alectromancy.

November 28, 2012

Bash One-Liners Explained (series)

Filed under: Bash,Data Mining,String Matching,Text Mining — Patrick Durusau @ 10:26 am

Bash One-Liners Explained by Peteris Krumins.

The series page for posts by Peteris Krumins on Bash one-liners.

One real advantage to Bash scripts is the lack of a graphical interface to get in the way.

A real advantage with “data” files, and many times with “text” files as well.

November 25, 2012

Fast rule-based bioactivity prediction using associative classification mining

Filed under: Associations,Associative Classification Mining,Classification,Data Mining — Patrick Durusau @ 1:24 pm

Fast rule-based bioactivity prediction using associative classification mining by Pulan Yu and David J Wild. (Journal of Cheminformatics 2012, 4:29)

“Who moved my acronym?” continues: ACM = Association for Computing Machinery, or associative classification mining.

Abstract:

Relating chemical features to bioactivities is critical in molecular design and is used extensively in lead discovery and optimization process. A variety of techniques from statistics, data mining and machine learning have been applied to this process. In this study, we utilize a collection of methods, called associative classification mining (ACM), which are popular in the data mining community, but so far have not been applied widely in cheminformatics. More specifically, the classification based on predictive association rules (CPAR), classification based on multiple association rules (CMAR) and classification based on association rules (CBA) are employed on three datasets using various descriptor sets. Experimental evaluations on anti-tuberculosis (antiTB), mutagenicity and hERG (the human Ether-a-go-go-Related Gene) blocker datasets show that these three methods are computationally scalable and appropriate for high speed mining. Additionally, they provide comparable accuracy and efficiency to the commonly used Bayesian and support vector machines (SVM) method, and produce highly interpretable models.

An interesting lead for investigating associations in large data sets. Pass those meeting a threshold on for further evaluation?
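
For a feel of how these classifiers work, here is a minimal, hypothetical sketch of the CBA idea in Python: mine class association rules that clear support and confidence thresholds, then predict with the highest-confidence matching rule. The paper’s methods (CPAR, CMAR, CBA) mine richer rule sets and prune far more aggressively; the data, names and thresholds here are purely illustrative.

    from collections import Counter
    from itertools import combinations

    def mine_rules(records, labels, min_support=0.25, min_confidence=0.9, max_len=2):
        """Mine class association rules (itemset -> label) by support/confidence."""
        n = len(records)
        itemset_counts = Counter()
        rule_counts = Counter()
        for features, label in zip(records, labels):
            for size in range(1, max_len + 1):
                for itemset in combinations(sorted(features), size):
                    itemset_counts[itemset] += 1
                    rule_counts[(itemset, label)] += 1
        rules = [(itemset, label, count / itemset_counts[itemset])
                 for (itemset, label), count in rule_counts.items()
                 if count / n >= min_support
                 and count / itemset_counts[itemset] >= min_confidence]
        rules.sort(key=lambda r: -r[2])  # CBA orders rules by confidence
        return rules

    def classify(rules, features, default):
        """Predict with the first (highest-confidence) rule that matches."""
        fs = set(features)
        for itemset, label, _ in rules:
            if fs.issuperset(itemset):
                return label
        return default

    # Toy "molecules" described by categorical descriptors.
    records = [{"ring", "polar"}, {"ring", "halogen"}, {"polar"}, {"halogen"}]
    labels = ["active", "active", "inactive", "inactive"]
    rules = mine_rules(records, labels)
    print(classify(rules, {"ring", "polar"}, default="inactive"))  # -> active

The appeal over black-box methods is visible even at this scale: each prediction traces back to a human-readable rule.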

November 24, 2012

BIOKDD 2013 :…Biological Knowledge Discovery and Data Mining

Filed under: Bioinformatics,Biomedical,Conferences,Data Mining,Knowledge Discovery — Patrick Durusau @ 11:24 am

BIOKDD 2013 : 4th International Workshop on Biological Knowledge Discovery and Data Mining

When Aug 26, 2013 – Aug 30, 2013
Where Prague, Czech Republic
Abstract Registration Due Apr 3, 2013
Submission Deadline Apr 10, 2013
Notification Due May 10, 2013
Final Version Due May 20, 2013

From the call for papers:

With the development of Molecular Biology during the last decades, we are witnessing an exponential growth of both the volume and the complexity of biological data. For example, the Human Genome Project provided the sequence of the 3 billion DNA bases that constitute the human genome. And, consequently, we are provided too with the sequences of about 100,000 proteins. Therefore, we are entering the post-genomic era: after having focused so many efforts on the accumulation of data, we have now to focus as much effort, and even more, on the analysis of these data. Analyzing this huge volume of data is a challenging task because, not only, of its complexity and its multiple and numerous correlated factors, but also, because of the continuous evolution of our understanding of the biological mechanisms. Classical approaches of biological data analysis are no longer efficient and produce only a very limited amount of information, compared to the numerous and complex biological mechanisms under study. From here comes the necessity to use computer tools and develop new in silico high performance approaches to support us in the analysis of biological data and, hence, to help us in our understanding of the correlations that exist between, on one hand, structures and functional patterns of biological sequences and, on the other hand, genetic and biochemical mechanisms. Knowledge Discovery and Data Mining (KDD) are a response to these new trends.

Topics of the BIOKDD’13 workshop include, but are not limited to:

Data Preprocessing: Biological Data Storage, Representation and Management (data warehouses, databases, sequences, trees, graphs, biological networks and pathways, …), Biological Data Cleaning (errors removal, redundant data removal, completion of missing data, …), Feature Extraction (motifs, subgraphs, …), Feature Selection (filter approaches, wrapper approaches, hybrid approaches, embedded approaches, …)

Data Mining: Biological Data Regression (regression of biological sequences…), Biological data clustering/biclustering (microarray data biclustering, clustering/biclustering of biological sequences, …), Biological Data Classification (classification of biological sequences…), Association Rules Learning from Biological Data, Text mining and Application to Biological Sequences, Web mining and Application to Biological Data, Parallel, Cloud and Grid Computing for Biological Data Mining

Data Postprocessing: Biological Nuggets of Knowledge Filtering, Biological Nuggets of Knowledge Representation and Visualization, Biological Nuggets of Knowledge Evaluation (calculation of the classification error rate, evaluation of the association rules via numerical indicators, e.g. measurements of interest, … ), Biological Nuggets of Knowledge Integration

Being held in conjunction with 24th International Conference on Database and Expert Systems Applications – DEXA 2013.

In case you are wondering about BIOKDD, consider the BIOKDD Programme for 2012.

Or the DEXA program for 2012.

Looks like a very strong set of conferences and workshops.

The Ironies of MDM [Master Data Management/Multi-Database Mining]

Filed under: Data Mining,Master Data Management,Multi-Database Mining — Patrick Durusau @ 11:06 am

A survey on mining multiple data sources by T. Ramkumar, S. Hariharan and S. Selvamuthukumaran.

Abstract:

Advancements in computer and communication technologies demand new perceptions of distributed computing environments and development of distributed data sources for storing voluminous amount of data. In such circumstances, mining multiple data sources for extracting useful patterns of significance is being considered as a challenging task within the data mining community. The domain, multi-database mining (MDM) is regarded as a promising research area as evidenced by numerous research attempts in the recent past. The methods that exist for discovering knowledge from multiple data sources fall into two broad categories, namely (1) mono-database mining and (2) local pattern analysis. The main intent of the survey is to explain the idea behind those approaches and consolidate the research contributions along with their significance and limitations.

I can’t reach the full article yet, but it sounds like one that merits attention.

I was struck by the irony of MDM: what some data types would expand as “Master Data Management” is read here to mean “Multi-Database Mining.”

To be sure, “Master Data Management” can be useful, but be mindful that non-managed data lurks just outside your door.

November 23, 2012

Data Mining and Machine Learning in Astronomy

Filed under: Astroinformatics,Data Mining,Machine Learning — Patrick Durusau @ 11:30 am

Data Mining and Machine Learning in Astronomy by Nicholas M. Ball and Robert J. Brunner. (International Journal of Modern Physics D, Volume 19, Issue 07, pp. 1049-1106 (2010).)

Abstract:

We review the current state of data mining and machine learning in astronomy. Data Mining can have a somewhat mixed connotation from the point of view of a researcher in this field. If used correctly, it can be a powerful approach, holding the potential to fully exploit the exponentially increasing amount of available data, promising great scientific advance. However, if misused, it can be little more than the black box application of complex computing algorithms that may give little physical insight, and provide questionable results. Here, we give an overview of the entire data mining process, from data collection through to the interpretation of results. We cover common machine learning algorithms, such as artificial neural networks and support vector machines, applications from a broad range of astronomy, emphasizing those in which data mining techniques directly contributed to improving science, and important current and future directions, including probability density functions, parallel algorithms, Peta-Scale computing, and the time domain. We conclude that, so long as one carefully selects an appropriate algorithm and is guided by the astronomical problem at hand, data mining can be very much the powerful tool, and not the questionable black box.

At fifty-eight (58) pages and three hundred and seventy-five references, this is a great starting place to learn about data mining and machine learning from an astronomy perspective!

And it should yield new techniques or new ways to apply old ones to your data, with a little imagination.

It dates from 2010, so word of more recent surveys is welcome!

…Knowledge Extraction From Complex Astronomical Data Sets

Filed under: Astroinformatics,BigData,Data Mining,Knowledge Discovery — Patrick Durusau @ 11:29 am

CLaSPS: A New Methodology For Knowledge Extraction From Complex Astronomical Data Sets by R. D’Abrusco, G. Fabbiano, G. Djorgovski, C. Donalek, O. Laurino and G. Longo. (R. D’Abrusco et al. 2012 ApJ 755 92 doi:10.1088/0004-637X/755/2/92)

Abstract:

In this paper, we present the Clustering-Labels-Score Patterns Spotter (CLaSPS), a new methodology for the determination of correlations among astronomical observables in complex data sets, based on the application of distinct unsupervised clustering techniques. The novelty in CLaSPS is the criterion used for the selection of the optimal clusterings, based on a quantitative measure of the degree of correlation between the cluster memberships and the distribution of a set of observables, the labels, not employed for the clustering. CLaSPS has been primarily developed as a tool to tackle the challenging complexity of the multi-wavelength complex and massive astronomical data sets produced by the federation of the data from modern automated astronomical facilities. In this paper, we discuss the applications of CLaSPS to two simple astronomical data sets, both composed of extragalactic sources with photometric observations at different wavelengths from large area surveys. The first data set, CSC+, is composed of optical quasars spectroscopically selected in the Sloan Digital Sky Survey data, observed in the x-rays by Chandra and with multi-wavelength observations in the near-infrared, optical, and ultraviolet spectral intervals. One of the results of the application of CLaSPS to the CSC+ is the re-identification of a well-known correlation between the αOX parameter and the near-ultraviolet color, in a subset of CSC+ sources with relatively small values of the near-ultraviolet colors. The other data set consists of a sample of blazars for which photometric observations in the optical, mid-, and near-infrared are available, complemented for a subset of the sources, by Fermi γ-ray data. The main results of the application of CLaSPS to such data sets have been the discovery of a strong correlation between the multi-wavelength color distribution of blazars and their optical spectral classification in BL Lac objects and flat-spectrum radio quasars, and a peculiar pattern followed by blazars in the WISE mid-infrared colors space. This pattern and its physical interpretation have been discussed in detail in other papers by one of the authors.

A new approach for mining “…correlations in complex and massive astronomical data sets produced by the federation of the data from modern automated astronomical facilities.”

Mining complex and massive data sets. I have heard that somewhere recently. I’m sure it will come back to me.
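
The core CLaSPS criterion can be sketched apart from the astronomy: generate candidate clusterings, then score each by how strongly cluster membership tracks labels that were withheld from the clustering. A rough Python sketch with scikit-learn on synthetic data (not the authors’ code; mutual information stands in for their correlation score):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_mutual_info_score

    rng = np.random.default_rng(0)
    # Synthetic stand-in for multi-wavelength photometry: 300 sources, 5 colors.
    X = np.vstack([rng.normal(loc=m, scale=0.5, size=(100, 5)) for m in (0, 2, 4)])
    labels = np.repeat(["bllac", "fsrq", "other"], 100)  # withheld from clustering

    best = None
    for k in range(2, 8):
        members = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        # CLaSPS-like criterion: prefer the clustering whose memberships
        # best track the distribution of the withheld labels.
        score = adjusted_mutual_info_score(labels, members)
        if best is None or score > best[1]:
            best = (k, score)

    print("best k = %d, AMI = %.3f" % best)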

First Light for the Millennium Run Observatory

Filed under: Astroinformatics,Data Mining,Simulations — Patrick Durusau @ 11:29 am

First Light for the Millennium Run Observatory by Cmarchesin.

From the post:

The famous Millennium Run (MR) simulations now appear in a completely new light – literally. The project, led by Gerard Lemson of the MPA and Roderik Overzier of the University of Texas, combines detailed predictions from cosmological simulations with a virtual observatory in order to produce synthetic astronomical observations. In analogy to the moment when newly constructed astronomical observatories receive their “first light”, the Millennium Run Observatory (MRObs) has produced its first images of the simulated universe. These virtual observations allow theorists and observers to analyse the purely theoretical data in exactly the same way as they would purely observational data. Building on the success of the Millennium Run Database, the simulated observations are now being made available to the wider astronomical community for further study. The MRObs browser – a new online tool – allows users to explore the simulated images and interact with the underlying physical universe as stored in the database. The team expects that the advantages offered by this approach will lead to a richer collaboration between theoretical and observational astronomers.

At least with simulated observations, there is no need to worry about cloudy nights. 😉

Interesting in its own right but also as an example of yet another tool for data mining, that of simulation.

Not in the sense of generating “test” data but of deliberately altering data and then measuring the impact of the alterations on data mining tools.

Quite possibly in a double blind context where only some third party knows which data sets were “altered” until all tests have been performed.

Millennium Run Observatory Web Portal and access to the MRObs browser

November 11, 2012

Analysis of the statistics blogosphere

Filed under: Blogs,Data Mining,Python,Social Networks — Patrick Durusau @ 8:11 pm

Analysis of the statistics blogosphere by John Johnson.

From the post:

My analysis of the statistics blogosphere for the Coursera Social Networking Analysis class is up. The Python code and the data are up at my github repository. Enjoy!

Included are most of the Python code I used to obtain blog content, some of my attempts to automate the building of the network (I ended up using a manual process in the end), and my analysis. I also included the data. (You can probably see some of your own content.)

Excellent post on mining blog content.

A rich source of data for a topic map on the subject of your dreams.

October 14, 2012

Tech That Protects The President, Part 1: Data Mining

Filed under: Data Mining,Natural Language Processing,Semantics — Patrick Durusau @ 3:41 pm

Tech That Protects The President, Part 1: Data Mining by Alex Popescu.

From the post:

President Obama’s appearance at the Democratic National Convention in September took place amid a rat’s nest of perils. But the local Charlotte, North Carolina, police weren’t entirely on their own. They were aided by a sophisticated data mining system that helped them identify threats and react to them quickly. (Part 1 of a 3-part series about the technology behind presidential security.)

The Charlotte-Mecklenburg police used software from IxReveal to monitor the Internet for associations between Obama, the DNC, and potential threats. The company’s program, known as uReveal, combs news articles, status updates, blog posts, and discussion forum comments. But it doesn’t simply search for keywords. It works on concepts defined by the user and uses natural language processing to analyze plain English based on meaning and context, taking into account slang and sentiment. If it detects something amiss, the system sends real-time alerts.

“We are able to read and alert almost as fast as [information] comes on the Web, as opposed to other systems where it takes hours,” said Bickford, vice president of operations of IxReveal.

In the past, this kind of task would have required large numbers of people searching and then reading huge volumes of information and manually highlighting relevant references. “Normally you have to take information like an email and shove it into a database,” Bickford explained. “Someone has to physically read it or do a keyword search.”

uReveal, on the other hand, lets machines do the reading, tracking, and analysis. “If you apply our patented technology and natural language processing capability, you can actually monitor that information for specific keywords and phrases based on meaning and context,” he says. The software can differentiate between a Volkswagen bug, a computer bug and an insect bug, Bickford explained – or, more to the point, between a reference to fire from a gun barrel and one to fire in a fireplace.

Bickford says the days of people slaving over sifting through piles of data, or ETL (extract, transform and load) data processing capabilities are over. “It’s just not supportable.”

I understand product promotion but do you think potential assassins are publishing letters to the editor, blogging or tweeting about their plans or operational details?

Granted, contract killers in Georgia are caught when someone tries to hire an undercover police officer as a “hit man.”

Does that expectation of dumbness apply in other cases as well?

Or, is searching large amounts of data like the drunk looking for his keys under the street light?

A case of “the light is better here?”
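
Skepticism aside, the word-sense problem Bickford describes is real enough. A toy Python sketch of context-profile disambiguation (nothing like uReveal’s patented approach; just the general idea that surrounding words, not keywords, pick the sense):

    # Hand-built context profiles for three senses of "bug" (illustrative only).
    SENSES = {
        "insect": {"garden", "crawl", "wings", "leaf", "ant"},
        "software defect": {"crash", "code", "patch", "report", "compiler"},
        "vw beetle": {"car", "engine", "drive", "vintage", "paint"},
    }

    def disambiguate(sentence):
        """Pick the sense whose context profile best overlaps the sentence."""
        words = set(sentence.lower().split())
        return max(SENSES, key=lambda sense: len(words & SENSES[sense]))

    print(disambiguate("the compiler crash traced back to a bug in the patch"))
    # -> software defect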

October 10, 2012

Interesting large scale dataset: D4D mobile data [Deadline: October 31, 2012]

Filed under: Data,Data Mining,Dataset,Graphs,Networks — Patrick Durusau @ 4:19 pm

Interesting large scale dataset: D4D mobile data by Danny Bickson.

From the post:

I got the following from Prof. Scott Kirkpatrick.

Write a 250-word research project and get access within a week to the largest ever released mobile phone datasets: datasets based on 2.5 billion records, calls and text messages exchanged between 5 million anonymous users over 5 months.

Participation rules: http://www.d4d.orange.com/

Description of the datasets: http://arxiv.org/abs/1210.0137

The “Terms and Conditions” by Orange allow the publication of results obtained from the datasets even if they do not directly relate to the challenge.

Cash prizes for winning participants and an invitation to present the results at the NetMob conference to be held May 2-3, 2013 at the Medialab at MIT (www.netmob.org).

Deadline: October 31, 2012

Looking to exercise your graph software? Compare to other graph software? Do interesting things with cell phone data?

This could be your chance!
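
If you do get access, a first pass might build a call graph and look at its structure. A sketch with networkx, using a made-up record layout (the real datasets define their own schemas; see the arXiv description):

    import networkx as nx

    # Hypothetical records: (caller_id, callee_id, number_of_calls).
    records = [
        ("u1", "u2", 14),
        ("u1", "u3", 2),
        ("u2", "u3", 7),
        ("u4", "u1", 1),
    ]

    G = nx.Graph()
    for caller, callee, n_calls in records:
        # Accumulate call volume as the edge weight.
        if G.has_edge(caller, callee):
            G[caller][callee]["weight"] += n_calls
        else:
            G.add_edge(caller, callee, weight=n_calls)

    # Basic structure: who are the hubs, how connected is the network?
    print(sorted(G.degree, key=lambda pair: -pair[1]))
    print(nx.density(G), nx.number_connected_components(G))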

October 9, 2012

Animating Random Projections of High Dimensional Data [“just looking around a bit”]

Filed under: Data Mining,Graphics,High Dimensionality,Visualization — Patrick Durusau @ 4:02 pm

Animating Random Projections of High Dimensional Data by Andreas Mueller.

From the post:

Recently Jake showed some pretty cool videos in his blog.

This inspired me to go back to an idea I had some time ago, about visualizing high-dimensional data via random projections.

I love to do exploratory data analysis with scikit-learn, using the manifold, decomposition and clustering module. But in the end, I can only look at two (or three) dimensions. And I really like to see what I am doing.

So I go and look at the first two PCA directions, then at the first and third, then at the second and third… and so on. That is a bit tedious and looking at more would be great. For example using time.

There is software out there, called ggobi, which does a pretty good job at visualizing high dimensional data sets. It is possible to take interactive tours of your high dimensions, set projection angles and whatnot. It has a UI and tons of settings.

I used it a couple of times and I really like it. But it doesn’t really fit into my usual work flow. It has good R integration, but not Python integration that I know of. And it also seems a bit overkill for “just looking around a bit”.

It’s hard to overestimate the value of “just looking around a bit.”

As opposed to defending a fixed opinion about data, data structures, or processing.

Who knows?

Practice at “just looking around a bit,” may make your opinions less fixed.

Chance you will have to take.
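
For those who want the gist without ggobi, the heart of the idea fits in a few lines of numpy: draw random 2D projection planes and interpolate between them for smooth animation frames, close in spirit to a grand tour. A sketch (the function is mine, not from Andreas’ post):

    import numpy as np

    def random_projection_frames(X, n_planes=5, steps=10, seed=0):
        """Yield 2D views of X, interpolating between random projection planes."""
        rng = np.random.default_rng(seed)
        d = X.shape[1]
        A = np.linalg.qr(rng.normal(size=(d, 2)))[0]  # orthonormal 2D basis
        for _ in range(n_planes):
            B = np.linalg.qr(rng.normal(size=(d, 2)))[0]
            for t in np.linspace(0, 1, steps):
                # Re-orthonormalize the interpolated basis for a smooth "tour".
                M = np.linalg.qr((1 - t) * A + t * B)[0]
                yield X @ M
            A = B

    # 500 points in 10 dimensions, viewed through a sequence of random planes;
    # each frame is a (500, 2) array ready for a scatter plot.
    X = np.random.default_rng(1).normal(size=(500, 10))
    frames = list(random_projection_frames(X))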

October 8, 2012

Wolfram Data Summit 2012 Presentations [Elves and Hypergraphs = Topic Maps?]

Filed under: Combinatorics,Conferences,Data,Data Mining — Patrick Durusau @ 1:39 pm

Wolfram Data Summit 2012 Presentations

Presentations have been posted from the Wolfram Data Summit 2012:

I looked at:

“The Trouble with House Elves: Computational Folkloristics, Classification, and Hypergraphs” Timothy Tangherlini, Professor, UCLA; James Abello, Research Professor, DIMACS – Rutgers University

first. 😉

I would like to see a video of the presentation. Pointers, anyone?

As close as I can imagine to being a topic map without using the phrase “topic map.”

Others?

Thursday, September 8

  • Presentation “Who’s Bigger? A Quantitative Analysis of Historical Fame” Steven Skiena, Professor, Stony Brook University
  • Presentation “Academic Data: A Funder’s Perspective” Myron Gutmann, Assistant Director, Social, Behavioral & Economic Sciences, National Science Foundation (NSF)
  • Presentation “Who Owns the Law?” Ed Walters, CEO, Fastcase, Inc.
  • Presentation “An Initiative to Improve Academic and Commercial Data Sharing in Cancer Research” Charles Hugh-Jones, Vice President, Head, Medical Affairs North America, Sanofi
  • Presentation “The Trouble with House Elves: Computational Folkloristics, Classification, and Hypergraphs” Timothy Tangherlini, Professor, UCLA; James Abello, Research Professor, DIMACS – Rutgers University
  • Presentation “Rethinking Digital Research” Kaitlin Thaney, Manager, External Partnerships, Digital Science
  • Presentation “Building and Learning from Social Networks” Chris McConnell, Principal Software Development Lead, Microsoft Research FUSE Labs
  • Presentation “Keeping Repositories in Synchronization: NISO/OAI ResourceSync Project” Todd Carpenter, Executive Director, NISO
  • Presentation “A New, Searchable SDMX Registry of Country-Level Health, Education, and Financial Data” Chris Dickey, Director, Research and Innovations, DevInfo Support Group
  • Presentation “Dryad’s Evolving Proof of Concept and the Metadata Hook” Jane Greenberg, Professor, School of Information and Library Science (SILS), University of North Carolina at Chapel Hill
  • Presentation “How the Associated Press Tabulates and Distributes Votes in US Elections” Brian Scanlon, Director of Election Services, The Associated Press
  • Presentation “How Open Is Open Data?” Ian White, President, Urban Mapping, Inc.
  • Presentation “No More Tablets of Stone: Enabling the User to Weight Our Data and Shape Our Research” Toby Green, Head of Publishing, Organisation for Economic Co-operation and Development (OECD)
  • Presentation “Sharing and Protecting Confidential Data: Real-World Examples” Timothy Mulcahy, Principal Research Scientist, NORC at the University of Chicago
  • Presentation “Language Models That Stimulate Creativity” Matthew Huebert, Programmer/Designer, BrainTripping
  • Presentation “The Analytic Potential of Long-Tail Data: Sharable Data and Reuse Value” Carole Palmer, Center for Informatics Research in Science & Scholarship, University of Illinois at Urbana-Champaign
  • Presentation “Evolution of the Storage Brain—Using History to Predict the Future” Larry Freeman, Senior Technologist, NetApp, Inc.

Friday, September 9

  • Presentation “Devices, Data, and Dollars” John Burbank, President, Strategic Initiatives, The Nielsen Company
  • Presentation “Pulling Structured Data Out of Unstructured” Greg Lindahl, CTO, blekko
  • Presentation “Mining Consumer Data for Insights and Trends” Rohit Chauhan, Group Executive, MasterCard Worldwide
  • Presentation “Data Quality and Customer Behavioral Modeling” Daniel Krasner, Chief Data Scientist, Sailthru/KFit Solutions
  • No presentation available. “Human-Powered Analysis with Crowdsourcing and Visualization” Edwin Chen, Data Scientist, Twitter
  • Presentation “Leveraging Social Media Data as Real-Time Indicators of X” Maria Singson, Vice President, Country and Industry Research & Forecasting, IHS; Chris Hansen, Director, IHS; Dan Bergstresser, Chief Economist, Janys Analytics
  • No presentation available. “Visualizations in Yelp” Jim Blomo, Engineering Manager, Data-Mining, Yelp
  • Presentation “The Digital Footprints of Human Activity” Stanislav Sobolevsky, MIT SENSEable City Lab
  • Presentation “Unleash Your Research: The Wolfram Data Repository” Matthew Day, Manager, Data Repository, Wolfram Alpha LLC
  • Presentation “Quantifying Online Discussion: Unexpected Conclusions from Mass Participation” Sascha Mombartz, Creative Director, Urtak
  • Presentation “Statistical Physics for Non-physicists: Obesity Spreading and Information Flow in Society” Hernán Makse, Professor, City College of New York
  • Presentation “Neuroscience Data: Past, Present, and Future” Chinh Dang, CTO, Allen Institute for Brain Science
  • Presentation “Finding Hidden Structure in Complex Networks” Yong-Yeol Ahn, Assistant Professor, Indiana University Bloomington
  • Presentation “Data Challenges in Health Monitoring and Diagnostics” Anthony Smart, Chief Science Officer, Scanadu
  • No presentation available. “Datascience Automation with Wolfram|Alpha Pro” Taliesin Beynon, Manager and Development Lead, Wolfram Alpha LLC
  • Presentation “How Data Science, the Web, and Linked Data Are Changing Medicine” Joanne Luciano, Research Associate Professor, Rensselaer Polytechnic Institute
  • Presentation “Unstructured Data and the Role of Natural Language Processing” Philip Resnik, Professor, University of Maryland
  • Presentation “A Framework for Measuring Social Quality of Content Based on User Behavior” Nanda Kishore, CTO, ShareThis, Inc.
  • Presentation “The Science of Social Data” Hilary Mason, Chief Scientist, bitly
  • Presentation “Big Data for Small Languages” Laura Welcher, Director of Operations, The Rosetta Project
  • Presentation “Moving from Information to Insight” Anthony Scriffignano, Senior Vice President, Worldwide Data & Insight, Dun and Bradstreet

PS: I saw this in Christophe Lalanne’s A bag of tweets / September 2012 and reformatted the page to make it easier to consult.

October 7, 2012

Revisiting “Ranking the popularity of programming languages”: creating tiers

Filed under: Data Mining,Graphics,Statistics,Visualization — Patrick Durusau @ 4:05 pm

Revisiting “Ranking the popularity of programming languages”: creating tiers by Drew Conway.

From the post:

In a post on dataists almost two years ago, John Myles White and I posed the question: “How would you rank the popularity of a programming language?”.

From the original post:

One way to do so is to count the number of projects using each language, and rank those with the most projects as being the most popular. Another might be to measure the size of a language’s “community,” and use that as a proxy for its popularity. Each has their advantages and disadvantages. Counting the number of projects is perhaps the “purest” measure of a language’s popularity, but it may overweight languages based on their legacy or use in production systems. Likewise, measuring community size can provide insight into the breadth of applications for a language, but it can be difficult to distinguish among languages with a vocal minority versus those that actually have large communities.

So, we spent an evening at Princeton hacking around on Github and StackOverflow to get data on the number of projects and questions tagged, per programming language, respectively. The result was a scatter plot showing the linear relationship between these two measures. As with any post comparing programming languages, it was great bait for the Internet masses to poke holes in, and since then Stephen O’Grady at Redmonk has been re-running the analysis to show changes in the relative position of languages over time.

Today I am giving a talk at Monktoberfest on the importance of pursuing good questions in data science. As an example, I wanted to revisit the problem of ranking programming languages. For a long time I have been unsatisfied with the outcome of the original post, because the chart does not really address the original question about ranking.

I would not downplay the importance of Drew’s descriptive analysis.

Until you can describe something, it is really difficult to explain it. 😉
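
If you want to play with the tiering idea yourself, the mechanic can be sketched in a few lines (the counts below are made up; Drew fits tiers with far more care):

    import math

    # Hypothetical (GitHub projects, StackOverflow questions) per language.
    counts = {
        "JavaScript": (320000, 410000),
        "Python": (250000, 380000),
        "R": (40000, 45000),
        "Fortran": (3000, 4000),
        "Brainfuck": (400, 150),
    }

    # One score per language: the mean of the two log-scaled measures.
    scores = {lang: (math.log10(p) + math.log10(q)) / 2
              for lang, (p, q) in counts.items()}

    # Cut the score range into three equal-width tiers.
    lo, hi = min(scores.values()), max(scores.values())
    width = (hi - lo) / 3
    for lang, s in sorted(scores.items(), key=lambda kv: -kv[1]):
        tier = 1 + min(2, int((hi - s) / width))
        print("tier %d: %s (%.2f)" % (tier, lang, s))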

October 2, 2012

Tapping the Data Deluge with R

Filed under: Data Mining,R — Patrick Durusau @ 4:31 pm

Tapping the Data Deluge with R by Jeffrey Breen.

Jeffrey points to slides and other resources for a presentation he made on accessing standard and not so standard sources of data with R.

September 28, 2012

Representing Solutions with PMML (ACM Data Mining Talk)

Filed under: Data Mining,Predictive Model Markup Language (PMML) — Patrick Durusau @ 1:58 pm

Representing Solutions with PMML (ACM Data Mining Talk)

Dr. Alex Guazzelli’s talk on PMML and Predictive Analytics to the ACM Data Mining Bay Area/SF group at the LinkedIn auditorium in Sunnyvale, CA.

Abstract:

Data mining scientists work hard to analyze historical data and to build the best predictive solutions out of it. IT engineers, on the other hand, are usually responsible for bringing these solutions to life, by recoding them into a format suitable for operational deployment. Given that data mining scientists and engineers tend to inhabit different information worlds, the process of moving a predictive solution from the scientist’s desktop to the operational environment can get lost in translation and take months. The advent of data mining specific open standards such as the Predictive Model Markup Language (PMML) has turned this view upside down: the deployment of models can now be achieved by the same team who builds them, in a matter of minutes.

In this talk, Dr. Alex Guazzelli not only provides the business rationale behind PMML, but also describes its main components. Besides being able to describe the most common modeling techniques, as of version 4.0, released in 2009, PMML is also capable of handling complex pre-processing tasks. As of version 4.1, released in December 2011, PMML has also incorporated complex post-processing to its structure as well as the ability to represent model ensemble, segmentation, chaining, and composition within a single language element. This combined representation power, in which an entire predictive solution (from pre-processing to model(s) to post-processing) can be represented in a single PMML file, attests to the language’s refinement and maturity.

I hesitated at the story of replacing IT engineers with data scientists. Didn’t we try that one before?

But then it was programmers with business managers. And it was called COBOL. 😉

Nothing against COBOL; it is still in use today. Widespread use, as a matter of fact.

But all tasks, including IT engineering, look easy from a distance. Only after getting poor results is that lesson learned. Again.

What have your experiences been with PMML?
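
If you just want to peek inside a PMML file before committing to a toolchain, the Python standard library is enough. A sketch, assuming a PMML 4.x file whose element names follow the DMG schema (the file name is hypothetical):

    import xml.etree.ElementTree as ET

    def localname(tag):
        """Strip the XML namespace so PMML elements match by local name."""
        return tag.rsplit("}", 1)[-1]

    def summarize_pmml(path):
        """List the declared data fields and the model elements in a PMML file."""
        root = ET.parse(path).getroot()
        fields, models = [], []
        for elem in root.iter():
            name = localname(elem.tag)
            if name == "DataField":
                fields.append((elem.get("name"), elem.get("dataType")))
            elif name.endswith("Model"):  # e.g. TreeModel, RegressionModel
                models.append((name, elem.get("functionName")))
        return fields, models

    # fields, models = summarize_pmml("churn_model.pmml")  # hypothetical file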

September 17, 2012

Statistical Data Mining Tutorials

Filed under: Data Mining,Statistics — Patrick Durusau @ 6:25 pm

Statistical Data Mining Tutorials by Andrew Moore.

From the post:

The following links point to a set of tutorials on many aspects of statistical data mining, including the foundations of probability, the foundations of statistical data analysis, and most of the classic machine learning and data mining algorithms.

These include classification algorithms such as decision trees, neural nets, Bayesian classifiers, Support Vector Machines and case-based (aka non-parametric) learning. They include regression algorithms such as multivariate polynomial regression, MARS, Locally Weighted Regression, GMDH and neural nets. And they include other data mining operations such as clustering (mixture models, k-means and hierarchical), Bayesian networks and Reinforcement Learning.

Perhaps a bit dated but not seriously so.

And one never knows when a slightly different explanation will make something obscure suddenly clear.

September 11, 2012

Web Data Extraction, Applications and Techniques: A Survey

Filed under: Data Mining,Extraction,Machine Learning,Text Extraction,Text Mining — Patrick Durusau @ 5:05 am

Web Data Extraction, Applications and Techniques: A Survey by Emilio Ferrara, Pasquale De Meo, Giacomo Fiumara, Robert Baumgartner.

Abstract:

Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of application domains. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc application domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction.

This survey aims at providing a structured and comprehensive overview of the research efforts made in the field of Web Data Extraction. The fil rouge of our work is to provide a classification of existing approaches in terms of the applications for which they have been employed. This differentiates our work from other surveys devoted to classify existing approaches on the basis of the algorithms, techniques and tools they use.

We classified Web Data Extraction approaches into categories and, for each category, we illustrated the basic techniques along with their main variants.

We grouped existing applications in two main areas: applications at the Enterprise level and at the Social Web level. Such a classification relies on a twofold reason: on one hand, Web Data Extraction techniques emerged as a key tool to perform data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. On the other hand, Web Data Extraction techniques allow for gathering a large amount of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users and this offers unprecedented opportunities of analyzing human behaviors on a large scale.

We discussed also about the potential of cross-fertilization, i.e., on the possibility of re-using Web Data Extraction techniques originally designed to work in a given domain, in other domains.

Comprehensive (> 50 pages) survey of web data extraction. Supplements and updates existing work by its focus on classifying web data extraction approaches by field of use.

Very likely to lead to adaptation of techniques from one field to another.

September 4, 2012

Getting data on your government

Filed under: Data Mining,Government Data,R — Patrick Durusau @ 6:52 pm

Getting data on your government

From the post:

I created an R package a while back to interact with some APIs that serve up data on what our elected representatives are up to, including the New York Times Congress API, and the Sunlight Labs API.

What kinds of things can you do with govdat? Here are a few examples.

How do the two major parties differ in the use of certain words (searches the congressional record using the Sunlight Labs Capitol Words API)?

[text and code omitted]

Let’s get some data on donations to individual elected representatives.

[text and code omitted]

Or we may want to get a bio of a congressperson. Here we get Todd Akin of MO. And some twitter searching too? Indeed.

[text and code omitted]

I waver between thinking mining government data is a good thing and remembering that the government voluntarily released it. In the latter case, it may be nothing more than a distraction.

Call for contribution: the RDataMining package…

Filed under: Data Mining,R — Patrick Durusau @ 6:35 pm

Call for contribution: the RDataMining package – an R package for data mining by Yanchang Zhao.

Join the RDataMining project to build a comprehensive R package for data mining http://www.rdatamining.com/package

We have started the RDataMining project on R-Forge to build an R package for data mining. The package will provide various functionalities for data mining, with contributions from many R users. If you have developed or will implement any data mining algorithms in R, please participate in the project to make your work available to R users worldwide.

Background

Although there are many R packages for various data mining functionalities, there are many more new algorithms designed and published every year, without any R implementations for them. It is far beyond the capability of a single team, even several teams, to build packages for oncoming new data mining algorithms. On the other hand, many R users developed their own implementations of new data mining algorithms, but unfortunately, used them for their own work only, without sharing with other R users. The reason could be that they do not know or do not have time to build packages to share their code, or they might think that it is not worth building a package with only one or two functions.

Objective

To foster the development of data mining capability in R and facilitate sharing of data mining codes/functions/algorithms among R users, we started this project on R-Forge to collaboratively build an R package for data mining, with contributions from many R users, including ourselves.

Definitely worth considering if you are using R for data mining.

It also makes me think of the various public data dumps. I assume someone has mined some (most?) of those and has gained insights into their quirks.

Are there any projects gathering data mining tips or experiences with public data sets? Or are those buried in footnotes or asides, when they are recorded at all?

August 31, 2012

Data mining local radio with Node.js

Filed under: Data Mining,node-js — Patrick Durusau @ 3:19 pm

Data mining local radio with Node.js by Evan Muehlhausen.

From the post:

More harpsichord?!

Seattle is lucky to have KINGFM, a local radio station dedicated to 100% classical music. As one of the few existent classical music fans in my twenties, I listen often enough. Over the past few years, I’ve noticed that when I tune to the station, I always seem to hear the plinky sound of a harpsichord.

Before I sent KINGFM an email, admonishing them for playing so much of an instrument I dislike, I wanted to investigate whether my ears were deceiving me. Perhaps my own distaste for the harpsichord increased its impact in my memory.

This article outlines the details of this investigation and especially the process of collecting the data.
….

Another data collecting/mining post.

If you were collecting this data, how would you reliably share it with others?

In that regard, you might want to consider distinguishing members of the Bach family as a practice run.

I first saw this at DZone.

August 25, 2012

Data Mining Blogs

Filed under: Data Mining — Patrick Durusau @ 3:58 pm

Data Mining Blogs by Sandro Saitta.

From the post:

I posted an earlier version of this data mining blog list previously on DMR. Here is an updated version (blogs recently added to the list have the logo “new”). I will keep this version up-to-date. You can access it at any time from the DMR top bar. Here is a link to the OPML version. If you know a data mining blog that is not in this list, feel free to post a comment so I can add the link. Also, if you see any broken link, please mention it.

Consider this a starter set of locators for a custom web crawl on data mining.

RSS feeds are great, but only for current content.

August 24, 2012

Fall Lineup: Protest Monitoring, Bin Laden Letters Analysis, … [Defensive Big Data (DBD)]

Filed under: Data Mining,Predictive Analytics — Patrick Durusau @ 4:33 pm

Protest Monitoring, Bin Laden Letters Analysis, and Building Custom Applications

OK, not “Fall Lineup” in the TV sense. 😉

Webinars from Recorded Future in September, 2012.

All start at 11 AM EST.

These webinars should help you learn how data mining looks for clues or how to not leave clues.

Is the term Defensive Big Data (DBD) in common usage?

Think of using Mahout to analyze email traffic to support reshaping your emails to resemble messages that are routinely ignored.

Process a Million Songs with Apache Pig

Filed under: Amazon Web Services AWS,Cloudera,Data Mining,Hadoop,Pig — Patrick Durusau @ 3:22 pm

Process a Million Songs with Apache Pig by Justin Kestelyn.

From the post:

The following is a guest post kindly offered by Adam Kawa, a 26-year old Hadoop developer from Warsaw, Poland. This post was originally published in a slightly different form at his blog, Hakuna MapData!

Recently I have found an interesting dataset, called Million Song Dataset (MSD), which contains detailed acoustic and contextual data about a million songs. For each song we can find information like title, hotness, tempo, duration, danceability, and loudness as well as artist name, popularity, localization (latitude and longitude pair), and many other things. There are no music files included here, but the links to MP3 song previews at 7digital.com can be easily constructed from the data.

The dataset consists of 339 tab-separated text files. Each file contains about 3,000 songs and each song is represented as one separate line of text. The dataset is publicly available and you can find it at Infochimps or Amazon S3. Since the total size of this data sums up to around 218GB, processing it using one machine may take a very long time.

Definitely, a much more interesting and efficient approach is to use multiple machines and process the songs in parallel by taking advantage of open-source tools from the Apache Hadoop ecosystem (e.g. Apache Pig). If you have your own machines, you can simply use CDH (Cloudera’s Distribution including Apache Hadoop), which includes the complete Apache Hadoop stack. CDH can be installed manually (quickly and easily by typing a couple of simple commands) or automatically using Cloudera Manager Free Edition (which is Cloudera’s recommended approach). Both CDH and Cloudera Manager are freely downloadable here. Alternatively, you may rent some machines from Amazon with Hadoop already installed and process the data using Amazon’s Elastic MapReduce (here is a cool description written by Paul Lemere on how to use it and pay as low as $1, and here is my presentation about Elastic MapReduce given at the second meeting of Warsaw Hadoop User Group).

An example of offering the reader their choice of implementation detail, on or off a cloud. 😉

I suspect that is going to become increasingly common.
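
Pig aside, the one-song-per-line, tab-separated layout makes local prototyping easy. A Python sketch of the same LOAD/GROUP/FOREACH logic a Pig script would express, here averaging tempo per artist (the column positions are hypothetical; check the MSD field list):

    import csv
    import glob
    from collections import defaultdict

    ARTIST, TEMPO = 2, 7  # hypothetical column positions

    totals = defaultdict(lambda: [0.0, 0])  # artist -> [tempo sum, song count]

    for path in glob.glob("msd/*.tsv"):
        with open(path, newline="") as f:
            for row in csv.reader(f, delimiter="\t"):
                try:
                    totals[row[ARTIST]][0] += float(row[TEMPO])
                    totals[row[ARTIST]][1] += 1
                except (IndexError, ValueError):
                    continue  # skip malformed lines

    for artist, (tempo_sum, n) in sorted(totals.items()):
        print("%s\t%.1f" % (artist, tempo_sum / n))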

August 21, 2012

Streaming Data Mining Tutorial slides (and more)

Filed under: Data Mining,Stream Analytics — Patrick Durusau @ 1:02 pm

Streaming Data Mining Tutorial slides (and more) by Igor Carron.

From the post:

Jelani Nelson and Edo Liberty just released an important tutorial they gave at KDD 12 on the state of the art and practical algorithms used in mining streaming data, entitled Streaming Data Mining. I personally marvel at the development of these deep algorithms which, because of the large data stream constraints, get to redefine what it means to do seemingly simple functions such as counting in the Big Data world. Here are some slides that got my interest, but the 111 pages are worth the read:

Pointers to more slides and videos follow.
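
To give a flavor of what rethinking “counting” means under streaming constraints: a Count-Min sketch estimates item frequencies in fixed memory, never undercounting and overcounting only slightly with high probability. A minimal Python sketch:

    import random

    class CountMin:
        """Approximate stream counts in width * depth counters of fixed memory."""

        def __init__(self, width=2048, depth=5, seed=0):
            rng = random.Random(seed)
            self.width = width
            self.tables = [[0] * width for _ in range(depth)]
            self.salts = [rng.getrandbits(64) for _ in range(depth)]

        def _cells(self, item):
            # One hashed cell per row; the salts make the rows behave
            # like independent hash functions.
            return [hash((salt, item)) % self.width for salt in self.salts]

        def add(self, item, count=1):
            for row, cell in zip(self.tables, self._cells(item)):
                row[cell] += count

        def estimate(self, item):
            # The minimum over rows is the least-contaminated counter.
            return min(row[cell] for row, cell in zip(self.tables, self._cells(item)))

    cm = CountMin()
    for word in ["the", "cat", "the", "the", "mat"]:
        cm.add(word)
    print(cm.estimate("the"))  # 3 (never less, possibly slightly more)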

August 16, 2012

Mining of Massive Datasets [Revised – Mining Large Graphs Added]

Filed under: BigData,Data Analysis,Data Mining — Patrick Durusau @ 7:04 pm

Mining of Massive Datasets by Anand Rajaraman and Jeff Ullman.

Version 1.0 errata frozen as of June 4, 2012.

Version 1.1 adds Jure Leskovec as a co-author and adds a chapter on mining large graphs.

Both versions can be downloaded as chapters or as entire text.

August 15, 2012

Mining the astronomical literature

Filed under: Astroinformatics,Data Mining — Patrick Durusau @ 1:58 pm

Mining the astronomical literature (A clever data project shows the promise of open and freely accessible academic literature) by Alasdair Allan.

From the post:

There is a huge debate right now about making academic literature freely accessible and moving toward open access. But what would be possible if people stopped talking about it and just dug in and got on with it?

NASA’s Astrophysics Data System (ADS), hosted by the Smithsonian Astrophysical Observatory (SAO), has quietly been working away since the mid-’90s. Without much, if any, fanfare amongst the other disciplines, it has moved astronomers into a world where access to the literature is just a given. It’s something they don’t have to think about all that much.

The ADS service provides access to abstracts for virtually all of the astronomical literature. But it also provides access to the full text of more than half a million papers, going right back to the start of peer-reviewed journals in the 1800s. The service has links to online data archives, along with reference and citation information for each of the papers, and it’s all searchable and downloadable.

(graphic omitted)

The existence of the ADS, along with the arXiv pre-print server, has meant that most astronomers haven’t seen the inside of a brick-built library since the late 1990s.

It also makes astronomy almost uniquely well placed for interesting data mining experiments, experiments that hint at what the rest of academia could do if they followed astronomy’s lead. The fact that the discipline’s literature has been scanned, archived, indexed and catalogued, and placed behind a RESTful API makes it a treasure trove, both for hypothesis generation and sociological research.

That’s the trick, isn’t it? “…if they followed astronomy’s lead.”

The technology used by the astronomical community has been equally available to other scientific, technical, medical and humanities disciplines.

Instead of ADS, for example, the humanities have JSTOR. JSTOR is supported by funds that originate with the public but the public has no access.

An example of how a data project reflects the character of the community that gave rise to it.

Astronomers value sharing of information and data, therefore their projects reflect those values.

Other projects reflect other values.

Not a question of technology but one of fundamental values.
