Archive for the ‘Data Mining’ Category
Monday, May 13th, 2013
WSDM 2014 : Seventh ACM International Conference on Web Search and Data Mining
Abstract submission deadline: August 19, 2013
Paper submission deadline: August 26, 2013
Tutorial proposals due: September 9, 2013
Tutorial and paper acceptance notifications: November 25, 2013
Tutorials: February 24, 2014
Main Conference: February 25-28, 2014
From the call for papers:
WSDM (pronounced “wisdom”) is one of the premier conferences covering research in the areas of search and data mining on the Web. The Seventh ACM WSDM Conference will take place in New York City, USA during February 25-28, 2014.
WSDM publishes original, high-quality papers related to search and data mining on the Web and the Social Web, with an emphasis on practical but principled novel models of search, retrieval and data mining, algorithm design and analysis, economic implications, and in-depth experimental analysis of accuracy and performance.
WSDM 2014 is a highly selective, single track meeting that includes invited talks as well as refereed full papers. Topics covered include but are not limited to:
(…)
Papers emphasizing novel algorithmic approaches are particularly encouraged, as are empirical/analytical studies of specific data mining problems in other scientific disciplines, in business, engineering, or other application domains. Application-oriented papers that make innovative technical contributions to research are welcome. Visionary papers on new and emerging topics are also welcome.
Authors are explicitly discouraged from submitting papers that do not present clearly their contribution with respect to previous works, that contain only incremental results, and that do not provide significant advances over existing approaches.
Sets a high bar but one that can be met.
Would be very nice PR to have a topic map paper among those accepted.
Posted in Conferences, Data Mining, Searching, WWW | No Comments »
Saturday, May 11th, 2013
Data-rich astronomy: mining synoptic sky surveys by Stefano Cavuoti.
Abstract:
In the last decade a new generation of telescopes and sensors has allowed the production of a very large amount of data and astronomy has become, a data-rich science; this transition is often labeled as: “data revolution” and “data tsunami”. The first locution puts emphasis on the expectations of the astronomers while the second stresses, instead, the dramatic problem arising from this large amount of data: which is no longer computable with traditional approaches to data storage, data reduction and data analysis. In a new, age new instruments are necessary, as it happened in the Bronze age when mankind left the old instruments made out of stone to adopt the new, better ones made with bronze. Everything changed, even the social structure. In a similar way, this new age of Astronomy calls for a new generation of tools and, for a new methodological approach to many problems, and for the acquisition of new skills. The attempts to find a solution to this problems falls under the umbrella of a new discipline which originated by the intersection of astronomy, statistics and computer science: Astroinformatics, (Borne, 2009; Djorgovski et al., 2006).
A dissertation by the same Stefano Cavuoti who co-authored Astrophysical data mining with GPU….
Along with every new discipline come semantics that are transparent to insiders and opaque to others.
Not out of malice but economy. Why explain a term if all those attending the discussion understand what it means?
But that lack of explanation, like our current ignorance about the means used to construct the pyramids, can come back to bite you.
In some cases far more quickly than intellectual curiosity about ancient monuments by the tin hat crowd.
Take the continuing failure of data integration by the U.S. intelligence services, for example.
Rather than the current mule-like resistance to sharing, I would data bomb the other intelligence services with incompatible data exports every week.
Full sharing, for all they would be able to do with it.
Unless they had a topic map.
Posted in Astroinformatics, BigData, Data Mining, GPU | No Comments »
Thursday, May 2nd, 2013
HyperLogLog — Cornerstone of a Big Data Infrastructure
From the introduction:
In the Zipfian world of AK, the HyperLogLog distinct value (DV) sketch reigns supreme. This DV sketch is the workhorse behind the majority of our DV counters (and we’re not alone) and enables us to have a real time, in memory data store with incredibly high throughput. HLL was conceived of by Flajolet et. al. in the phenomenal paper HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. This sketch extends upon the earlier Loglog Counting of Large Cardinalities (Durand et. al.) which in turn is based on the seminal AMS work FM-85, Flajolet and Martin’s original work on probabilistic counting. (Many thanks to Jérémie Lumbroso for the correction of the history here. I am very much looking forward to his upcoming introduction to probabilistic counting in Flajolet’s complete works.) UPDATE – Rob has recently published a blog about PCSA, a direct precursor to LogLog counting which is filled with interesting thoughts. There have been a few posts on HLL recently so I thought I would dive into the intuition behind the sketch and into some of the details.
After seeing the HyperLogLog references in Approximate Methods for Scalable Data Mining I started looking for a fuller explanation/illustration of HyperLogLog.
Stumbled on this posting.
Includes a great HyperLogLog (HLL) simulation written in JavaScript.
Enjoy!
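If you would rather poke at the intuition in Python than in JavaScript, here is a minimal sketch of an HLL counter. It is not the code from the post: the class name, the SHA-1 based 64-bit hash and the default of 2^11 registers are my assumptions. The bias constant and the small-range (linear counting) correction follow the Flajolet et al. paper; the large-range correction is omitted because a 64-bit hash is assumed.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch with 2**p registers (illustrative only)."""

    def __init__(self, p=11):
        self.p = p                      # number of hash bits used as register index
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant, valid for m >= 128 (Flajolet et al.).
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # 64-bit hash taken from SHA-1; the hash choice is an assumption.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        width = 64 - self.p
        idx = h >> width                # first p bits pick a register
        rest = h & ((1 << width) - 1)   # remaining bits
        # Rank = position of the leftmost 1-bit in the remaining bits
        # (width + 1 if those bits are all zero).
        rank = width - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        # Normalised harmonic mean of 2**register across all registers.
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate
```

Feed it `hll.add(x)` for every value you see and call `hll.count()` when you want the estimate. Adding the same value twice leaves the registers unchanged, which is the whole point of a distinct value sketch.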
Posted in Algorithms, BigData, Data Mining, Scalability | No Comments »
Thursday, May 2nd, 2013
Approximate Methods for Scalable Data Mining by Andrew Clegg.
Slides from a presentation at: Data Science London 24/04/13.
To get your interest, a nice illustration of the HyperLogLog algorithm, “Billions of distinct values in 1.5KB of RAM with 2% relative error.”
Has a number of other useful illustrations and great references.
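A quick back-of-the-envelope check on that quote. The standard error of HyperLogLog is roughly 1.04 divided by the square root of the number of registers m (from the Flajolet et al. analysis). The register count and the 6 bits per register below are my assumptions, chosen as one combination that lands close to the quoted figures; the slides may assume something different.

```latex
\[
\sigma \approx \frac{1.04}{\sqrt{m}}, \qquad
m = 2048 \;\Rightarrow\; \sigma \approx \frac{1.04}{45.25} \approx 0.023
\]
\[
2048 \text{ registers} \times 6 \text{ bits}
  = 12288 \text{ bits} = 1536 \text{ bytes} = 1.5\ \text{KB}
\]
```

So about 2.3% relative error in 1.5 KB, independent of how many distinct values go in (up to the limits of the hash).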
Posted in BigData, Data Mining, Scalability | 1 Comment »
Saturday, April 27th, 2013
Extracting and connecting chemical structures from text sources using chemicalize.org by Christopher Southan and Andras Stracz.
Abstract:
Background
Exploring bioactive chemistry requires navigating between structures and data from a variety of text-based sources. While PubChem currently includes approximately 16 million document-extracted structures (15 million from patents) the extent of public inter-document and document-to-database links is still well below any estimated total, especially for journal articles. A major expansion in access to text-entombed chemistry is enabled by chemicalize.org. This on-line resource can process IUPAC names, SMILES, InChI strings, CAS numbers and drug names from pasted text, PDFs or URLs to generate structures, calculate properties and launch searches. Here, we explore its utility for answering questions related to chemical structures in documents and where these overlap with database records. These aspects are illustrated using a common theme of Dipeptidyl Peptidase 4 (DPPIV) inhibitors.
Results
Full-text open URL sources facilitated the download of over 1400 structures from a DPPIV patent and the alignment of specific examples with IC50 data. Uploading the SMILES to PubChem revealed extensive linking to patents and papers, including prior submissions from chemicalize.org as submitting source. A DPPIV medicinal chemistry paper was completely extracted and structures were aligned to the activity results table, as well as linked to other documents via PubChem. In both cases, key structures with data were partitioned from common chemistry by dividing them into individual new PDFs for conversion. Over 500 structures were also extracted from a batch of PubMed abstracts related to DPPIV inhibition. The drug structures could be stepped through each text occurrence and included some converted MeSH-only IUPAC names not linked in PubChem. Performing set intersections proved effective for detecting compounds-in-common between documents and/or merged extractions.
Conclusion
This work demonstrates the utility of chemicalize.org for the exploration of chemical structure connectivity between documents and databases, including structure searches in PubChem, InChIKey searches in Google and the chemicalize.org archive. It has the flexibility to extract text from any internal, external or Web source. It synergizes with other open tools and the application is undergoing continued development. It should thus facilitate progress in medicinal chemistry, chemical biology and other bioactive chemistry domains.
A great example of building a resource to address identity issues in a specific domain.
The result speaks for itself.
PS: The results were not delayed awaiting a reformation of chemistry to use a common identifier.
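The “set intersections” step from the abstract is simple enough to sketch in Python. This is not how chemicalize.org does it; the function name and the identifiers below are made-up placeholders standing in for whatever structure identifier (an InChIKey, say) each extraction yields.

```python
def compounds_in_common(*extractions):
    """Return the identifiers that appear in every extraction."""
    sets = [set(e) for e in extractions]
    return set.intersection(*sets) if sets else set()


# Hypothetical identifiers extracted from three sources.
patent = {"KEY-A", "KEY-B", "KEY-C"}
paper = {"KEY-B", "KEY-C", "KEY-D"}
abstracts = {"KEY-C", "KEY-E"}

shared_patent_paper = compounds_in_common(patent, paper)
shared_all = compounds_in_common(patent, paper, abstracts)
```

The interesting work, of course, is in getting every source to yield the same identifier for the same structure. Once that is done, the intersection is one line.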
Posted in Cheminformatics, Data Mining, Text Mining | No Comments »
Tuesday, April 23rd, 2013
Resources and Readings for Big Data Week DC Events
This is Big Data week in DC and Data Community DC has put together a list of books, articles and posts to keep you busy all week.
Very cool!
Posted in BigData, Data, Data Mining, Natural Language Processing | No Comments »
Sunday, April 21st, 2013
Collaborative annotation for scientific data discovery and reuse by Kirk Borne. (Borne, K. (2013), Collaborative annotation for scientific data discovery and reuse. Bul. Am. Soc. Info. Sci. Tech., 39: 44–45. doi: 10.1002/bult.2013.1720390414)
Abstract:
Human classification alone, unable to handle the enormous quantity of project data, requires the support of automated machine-based strategies. In collaborative annotation, humans and machines work together, merging editorial strengths in semantics and pattern recognition with the machine strengths of scale and algorithmic power. Discovery informatics can be used to generate common data models, taxonomies and ontologies. A proposed project of massive scale, the Large Synoptic Survey Telescope (LSST) project, will systematically observe the southern sky over 10 years, collecting petabytes of data for analysis. The combined work of professional and citizen scientists will be needed to tag the discovered astronomical objects. The tag set will be generated through informatics and the collaborative annotation efforts of humans and machines. The LSST project will demonstrate the development and application of a classification scheme that supports search, curation and reuse of a digital repository.
A persuasive call to arms to develop “collaborative annotation:”
Humans and machines working together to produce the best possible classification label(s) is collaborative annotation. Collaborative annotation is a form of human computation [1]. Humans can see patterns and semantics (context, content and relationships) more quickly, accurately and meaningfully than machines. Human computation therefore applies to the problem of annotating, labeling and classifying voluminous data streams.
And more specifically for the Large Synoptic Survey Telescope (LSST):
The discovery potential of this data collection would be enormous, and its long-term value (through careful data management and curation) would thus require (for maximum scientific return) the participation of scientists and citizen scientists as well as science educators and their students in a collaborative knowledge mark-up (annotation and tagging) data environment. To meet this need, we envision a collaborative tagging system called AstroDAS (Astronomy Distributed Annotation System). AstroDAS is similar to existing science knowledge bases, such as BioDAS (Biology Distributed Annotation System, www.biodas.org).
As you might expect, semantic diversity is going to be present with “collaborative annotation.”
Semantic Monotony (aka Semantic Web) has failed for machines alone.
No question it will fail for humans + machines.
Are you ready to step up to the semantic diversity of collaborative annotation (humans + machines)?
Posted in Annotation, BigData, Collaborative Annotation, Data Mining, Semantic Diversity, Semantic Web | No Comments »
Sunday, April 21st, 2013
The PLOS Text Mining Collection has launched!
From the webpage:
Across all realms of the sciences and beyond, the rapid growth in the number of works published digitally presents new challenges and opportunities for making sense of this wealth of textual information. The maturing field of Text Mining aims to solve problems concerning the retrieval, extraction and analysis of unstructured information in digital text, and to revolutionize how scientists access and interpret data that might otherwise remain buried in the literature.
Here PLOS acknowledges the growing body of work in the area of Text Mining by bringing together major reviews and new research studies published in PLOS journals to create the PLOS Text Mining Collection. It is no coincidence that research in Text Mining in PLOS journals is burgeoning: the widespread uptake of the Open Access publishing model developed by PLOS and other publishers now makes it easier than ever to obtain, mine and redistribute data from published texts. The launch of the PLOS Text Mining Collection complements related PLOS Collections on Open Access and Altmetrics, and further underscores the importance of the PLOS Application Programming Interface, which provides an open source interface with which to mine PLOS journal content.
The Collection is now open across the PLOS journals to all authors who wish to submit research or reviews in this area. Articles are presented below in order of publication date and new articles will be added to the Collection as they are published.
An impressive start to what promises to be a very rich resource!
I first saw this at: New: PLOS Text Mining.
Posted in Data Mining, Text Mining | No Comments »
Wednesday, April 17th, 2013
Practical tools for exploring data and models by Hadley Alexander Wickham. (PDF)
From the introduction:
This thesis describes three families of tools for exploring data and models. It is organised in roughly the same way that you perform a data analysis. First, you get the data in a form that you can work with; Section 1.1 introduces the reshape framework for restructuring data, described fully in Chapter 2. Second, you plot the data to get a feel for what is going on; Section 1.2 introduces the layered grammar of graphics, described in Chapter 3. Third, you iterate between graphics and models to build a succinct quantitative summary of the data; Section 1.3 introduces strategies for visualising models, discussed in Chapter 4. Finally, you look back at what you have done, and contemplate what tools you need to do better in the future; Chapter 5 summarises the impact of my work and my plans for the future.
The tools developed in this thesis are firmly based in the philosophy of exploratory data analysis (Tukey, 1977). With every view of the data, we strive to be both curious and sceptical. We keep an open mind towards alternative explanations, never believing we have found the best model. Due to space limitations, the following papers only give a glimpse at this philosophy of data analysis, but it underlies all of the tools and strategies that are developed. A fuller data analysis, using many of the tools developed in this thesis, is available in Hobbs et al. (To appear).
Has a focus on R tools, including ggplot2 and Wilkinson’s The Grammar of Graphics.
The “…never believing we have found the best model” approach works for me!
You?
I first saw this at Data Scholars.
Posted in Data, Data Mining, Data Models, Exploratory Data Analysis | No Comments »
Tuesday, April 16th, 2013
The non-negative matrix factorization toolbox for biological data mining by Yifeng Li and Alioune Ngom. (Source Code for Biology and Medicine 2013, 8:10 doi:10.1186/1751-0473-8-10)
From the post:
Background: Non-negative matrix factorization (NMF) has been introduced as an important method for mining biological data. Though there currently exists packages implemented in R and other programming languages, they either provide only a few optimization algorithms or focus on a specific application field. There does not exist a complete NMF package for the bioinformatics community, and in order to perform various data mining tasks on biological data.
Results: We provide a convenient MATLAB toolbox containing both the implementations of various NMF techniques and a variety of NMF-based data mining approaches for analyzing biological data. Data mining approaches implemented within the toolbox include data clustering and bi-clustering, feature extraction and selection, sample classification, missing values imputation, data visualization, and statistical comparison.
Conclusions: A series of analysis such as molecular pattern discovery, biological process identification, dimension reduction, disease prediction, visualization, and statistical comparison can be performed using this toolbox.
Written in a bioinformatics context, but NMF is also used in text data mining (Enron emails), spectral analysis and other data mining fields. (See Non-negative matrix factorization)
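For anyone who has not met NMF before, a minimal Python/NumPy sketch of the basic idea: factor a non-negative matrix V into non-negative W and H so that W times H approximates V. This uses the classic Lee and Seung multiplicative updates. It is not the MATLAB toolbox's API, and the function name and parameters are my own.

```python
import numpy as np


def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Approximate non-negative V (n x m) as W (n x rank) @ H (rank x m)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    # Random non-negative starting point.
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def reconstruction_error(V, W, H):
    """Frobenius norm of the residual V - W @ H."""
    return float(np.linalg.norm(V - W @ H))
```

With rows as samples and columns as features, the rows of H act as “parts” and W says how much of each part a sample contains, which is why the same method turns up in gene expression, email topics and spectra.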
Posted in Bioinformatics, Data Mining, Matrix | No Comments »
Monday, April 15th, 2013
2nd International Workshop on Mining Scientific Publications
May 26, 2013 – Submission deadline
June 23, 2013 – Notification of acceptance
July 7, 2013 – Camera-ready
July 26, 2013 – Workshop
From the CFP:
Digital libraries that store scientific publications are becoming increasingly important in research. They are used not only for traditional tasks such as finding and storing research outputs, but also as sources for mining this information, discovering new research trends and evaluating research excellence. The rapid growth in the number of scientific publications being deposited in digital libraries makes it no longer sufficient to provide access to content to human readers only. It is equally important to allow machines analyse this information and by doing so facilitate the processes by which research is being accomplished. Recent developments in natural language processing, information retrieval, the semantic web and other disciplines make it possible to transform the way we work with scientific publications. However, in order to make this happen, researchers first need to be able to easily access and use large databases of scientific publications and research data, to carry out experiments.
This workshop aims to bring together people from different backgrounds who:
(a) are interested in analysing and mining databases of scientific publications,
(b) develop systems, infrastructures or datasets that enable such analysis and mining,
(c) design novel technologies that improve the way research is being accomplished or
(d) support the openness and free availability of publications and research data.
2. TOPICS
The topics of the workshop will be organised around the following three themes:
- Infrastructures, systems, open datasets or APIs that enable analysis of large volumes of scientific publications.
- Semantic enrichment of scientific publications by means of text-mining, crowdsourcing or other methods.
- Analysis of large databases of scientific publications to identify research trends, high impact, cross-fertilisation between disciplines, research excellence and to aid content exploration.
Of particular interest for topic mappers:
Topics of interest relevant to theme 2 include, but are not limited to:
- Novel information extraction and text-mining approaches to semantic enrichment of publications. This might range from mining publication structure, such as title, abstract, authors, citation information etc. to more challenging tasks, such as extracting names of applied methods, research questions (or scientific gaps), identifying parts of the scholarly discourse structure etc.
- Automatic categorization and clustering of scientific publications. Methods that can automatically categorize publications according to an established subject-based classification/taxonomy (such as Library of Congress classification, UNESCO thesaurus, DOAJ subject classification, Library of Congress Subject Headings) are of particular interest. Other approaches might involve automatic clustering or classification of research publications according to various criteria.
- New methods and models for connecting and interlinking scientific publications. Scientific publications in digital libraries are not isolated islands. Connecting publications using explicitly defined citations is very restrictive and has many disadvantages. We are interested in innovative technologies that can automatically connect and interlink publications or parts of publications, according to various criteria, such as semantic similarity, contradiction, argument support or other relationship types.
- Models for semantically representing and annotating publications. This topic is related to aspects of semantically modeling publications and scholarly discourse. Models that are practical with respect to the state-of-the-art in Natural Language Processing (NLP) technologies are of special interest.
- Semantically enriching/annotating publications by crowdsourcing. Crowdsourcing can be used in innovative ways to annotate publications with richer metadata or to approve/disapprove annotations created using text-mining or other approaches. We welcome papers that address the following questions: (a) what incentives should be provided to motivate users in contributing metadata, (b) how to apply crowdsourcing in the specialized domains of scientific publications, (c) what tasks in the domain of organising scientific publications is crowdsourcing suitable for and where it might fail, (d) other relevant crowdsourcing topics relevant to the domain of scientific publications.
The other themes could be viewed through a topic map lens but semantic enrichment seems like a natural.
Posted in Conferences, Data Mining, Searching, Semantic Search, Semantics | No Comments »
Saturday, April 13th, 2013
Office of the Director of National Intelligence: Data Mining 2012
Office of the Director of National Intelligence = ODNI
To cut directly to the chase:
II. ODNI Data Mining Activities
The ODNI did not engage in any activities to use or develop data mining functionality during the reporting period.
My source, KDNuggets, provides the legal loophole analysis.
Who watches the watchers?
Looks like it’s going to be you and me.
Every citizen who recognizes a government employee, agent, or official: tweet the name you know them by, along with your location.
Just that.
If enough of us do that, patterns will begin to appear in the data stream.
If enough patterns appear in the data stream, the identities of government employees, agents, officials, will slowly become known.
Transparency won’t happen overnight or easily.
But if you are waiting for the watchers to watch themselves, you are going to be severely disappointed.
Posted in Data Mining, Intelligence | No Comments »
Wednesday, April 10th, 2013
Can Big Data From Cellphones Help Prevent Conflict? by Emmanuel Letouzé.
From the post:
Data from social media and Ushahidi-style crowdsourcing platforms have emerged as possible ways to leverage cellphones to prevent conflict. But in the world of Big Data, the amount of information generated from these is too small to use in advanced data-mining techniques and “machine-learning” techniques (where algorithms adjust themselves based on the data they receive).
But there is another way cellphones could be leveraged in conflict settings: through the various types of data passively generated every time a device is used. “Phones can know,” said Professor Alex “Sandy” Pentland, head of the Human Dynamics Laboratory and a prominent computational social scientist at MIT, in a Wall Street Journal article. He says data trails left behind by cellphone and credit card users—“digital breadcrumbs”—reflect actual behavior and can tell objective life stories, as opposed to what is found in social media data, where intents or feelings are obscured because they are “edited according to the standards of the day.”
The findings and implications of this, documented in several studies and press articles, are nothing short of mind-blowing. Take a few examples. It has been shown that it was possible to infer whether two people were talking about politics using cellphone data, with no knowledge of the actual content of their conversation. Changes in movement and communication patterns revealed in cellphone data were also found to be good predictors of getting the flu days before it was actually diagnosed, according to MIT research featured in the Wall Street Journal. Cellphone data were also used to reproduce census data, study human dynamics in slums, and for community-wide financial coping strategies in the aftermath of an earthquake or crisis.
Very interesting post on the potential uses for cell phone data.
You can imagine what I think could be correlated with cellphone data using a topic map so I won’t bother to enumerate those possibilities.
I did want to comment on the concern about privacy, or re-identification from cellphone data, as Emmanuel calls it in his post.
Governments, who have declared they can execute any of us without notice or a hearing, are the guardians of that privacy.
That causes me to lack confidence in their guarantees.
Discussions of privacy should assume governments already have unfettered access to all data.
The useful questions become: How do we detect their misuse of such data? and How do we make them heartily sorry for that misuse?
For cell phone data, open access will give government officials more reason for pause than the ordinary citizen.
Less privacy for individuals but also less privacy for access, bribery, contract padding, influence peddling, and other normal functions of government.
In the U.S.A., we have given up our rights to public trial, probable cause, habeas corpus, protections against unreasonable search and seizure, to be free from touching by strangers, and several others.
What’s the loss of the right to privacy for cellphone data compared to catching government officials abusing their offices?
Posted in BigData, Data Mining, Privacy | No Comments »
Wednesday, April 10th, 2013
R Cheatsheets
I ran across this collection of cheatsheets for R today.
The R Reference Card for Data Mining is interesting to me but you may want to look at some of the others.
Enjoy!
Posted in Data Mining, R | No Comments »
Wednesday, April 10th, 2013
The Best Data Mining Tools You Can Use for Free in Your Company by Mawuna Remarque KOUTONIN.
Short descriptions of the usual suspects, plus a couple (jHepWork and PSPP) that were new to me.
- RapidMiner
- RapidAnalytics
- Weka
- PSPP
- KNIME
- Orange
- Apache Mahout
- jHepWork
- Rattle
An interesting site in general.
Consider the following pitch for business success in Africa:
Africa: Your Business Should be Profitable in 45 days or Die
And the reasons for that claim:
1. “It’s almost virgin here. There are lot of opportunities, but you have to fight!”
2. “Target the vanity class with vanity products. The “new rich” have lot of money. They are though on everything except their big ego and social reputation”
3. “Target the lazy executives and middle managers. Do the job they are paid for as a consultant. Be good, and politically savvy, and the money is yours”
4. “You’ll make more money in selling food or opening a restaurant than working for the Bank”
5. “You can’t avoid politics, but learn to think like the people your are talking with. Always finish your sentence with something like “the most important is the country’s development, not power. We all have to work in that direction”
6. “It’s about hard work and passion, but you should first forget about managing time like in Europe. Take time to visit people, go to the vanity parties, have the patience to let stupid people finish their long empty sentences, and make the politicians understand that your project could make them win elections and strengthen their positions”
7. “Speed is everything. Think fast, Act fast, Be everywhere through friends, family and informants”
With the exception of #1, all of these points are advice I would give to someone marketing topic maps on any continent.
It may be easier to market topic maps where there are few legacy IT systems that might feel threatened by a new technology.
Posted in Data Mining, Knime, Mahout, Marketing, Orange, PSPP, RapidMiner, Rattle, Weka, jHepWork | No Comments »
Tuesday, April 9th, 2013
Astrophysical data mining with GPU. A case study: genetic classification of globular clusters by Stefano Cavuoti, Mauro Garofalo, Massimo Brescia, Maurizio Paolillo, Antonio Pescape’, Giuseppe Longo, Giorgio Ventre.
Abstract:
We present a multi-purpose genetic algorithm, designed and implemented with GPGPU / CUDA parallel computing technology. The model was derived from our CPU serial implementation, named GAME (Genetic Algorithm Model Experiment). It was successfully tested and validated on the detection of candidate Globular Clusters in deep, wide-field, single band HST images. The GPU version of GAME will be made available to the community by integrating it into the web application DAMEWARE (DAta Mining Web Application REsource), a public data mining service specialized on massive astrophysical data. Since genetic algorithms are inherently parallel, the GPGPU computing paradigm leads to a speedup of a factor of 200x in the training phase with respect to the CPU based version.
BTW, DAMEWARE (DAta Mining Web Application REsource) is at http://dame.dsf.unina.it/beta_info.html.
In case you are curious about the application of genetic algorithms in a low signal/noise situation with really “big” data, this is a good starting point.
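If genetic algorithms are new to you, here is a toy Python sketch of the loop: score a population, keep the fitter half as parents, breed children by crossover and mutation, repeat. It is a plain CPU illustration, not GAME or its GPU port; the function name, the bit-string encoding and every parameter are my assumptions.

```python
import random


def genetic_algorithm(fitness, n_bits=20, pop_size=30, generations=40,
                      mutation_rate=0.02, seed=0):
    """Maximise `fitness` over bit lists; returns (best, best-fitness history)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    history = [fitness(best)]
    for _ in range(generations):
        # Scoring each individual is independent of the others, which is
        # the step that lends itself to parallel evaluation.
        scored = sorted(pop, key=fitness, reverse=True)
        parents = scored[: pop_size // 2]
        children = [scored[0][:]]  # elitism: carry the best over unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_bits)          # one-point crossover
            child = a[:cut] + b[cut:]
            # Flip each bit with probability mutation_rate.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        pop = children
        best = max(pop, key=fitness)
        history.append(fitness(best))
    return best, history


# Toy problem: maximise the number of 1 bits.
best, history = genetic_algorithm(sum)
```

Because the best individual is always carried forward, the best score never goes down from one generation to the next. A real classifier would replace `sum` with something far more expensive, which is where the GPU earns its keep.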
Makes me curious about the “noise” in other communications.
The “signal” is fairly easy to identify in astronomy, but what about in text or speech?
I suppose “background noise, music, automobiles” would count as “noise” on a tape recording of a conversation, but is there “noise” in a written text?
Or noise in a conversation that is clearly audible?
If we have 100% signal, how do we explain failing to understand a message in speech or writing?
If it is not “noise,” then what is the problem?
Posted in Astroinformatics, BigData, Data Mining, GPU, Genetic Algorithms | 1 Comment »
Tuesday, April 9th, 2013
Scrapely
From the webpage:
Scrapely is a library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely constructs a parser for all similar pages.
A tool for data mining similar HTML pages.
Supports a command line interface.
Posted in Data Mining, Python | No Comments »
Saturday, April 6th, 2013
K-Nearest Neighbors: dangerously simple by Cathy O’Neil.
From the post:
I spend my time at work nowadays thinking about how to start a company in data science. Since there are tons of companies now collecting tons of data, and they don’t know what do to do with it, nor who to ask, part of me wants to design (yet another) dumbed-down “analytics platform” so that business people can import their data onto the platform, and then perform simple algorithms themselves, without even having a data scientist to supervise.
After all, a good data scientist is hard to find. Sometimes you don’t even know if you want to invest in this whole big data thing, you’re not sure the data you’re collecting is all that great or whether the whole thing is just a bunch of hype. It’s tempting to bypass professional data scientists altogether and try to replace them with software.
I’m here to say, it’s not clear that’s possible. Even the simplest algorithm, like k-Nearest Neighbor (k-NN), can be naively misused by someone who doesn’t understand it well. Let me explain.
…
The devil is all in the detail of what you mean by close. And to make things trickier, as in easier to be deceptively easy, there are default choices you could make (and which you would make) which would probably be totally stupid. Namely, the raw numbers, and Euclidean distance.
Read and think about Cathy’s post.
All those nice, clean, clear number values and a simple math equation, muddied by meaning.
Undocumented meaning.
And undocumented relationships between the variables the number values represent.
You could document your meaning and the relationships between variables and still make dumb decisions.
The hope is you or your successor will use documented meaning and relationships to make better decisions.
For documentation you can:
- Try to remember the meaning of “close” and the relationships for all uses of K-Nearest Neighbors where you work.
- Write meaning and relationships down on sticky notes collected in your desk drawer.
- Write meaning and relationships on paper or in electronic files, the latter somewhere on the server.
- Document meaning and relationships with a topic map, so you can leverage information already known, including identifiers for the VP who ordered you to use particular values, for example. (Along with digitally signed copies of the email(s) in question.)
Which one are you using?
PS: This link was forwarded to me by Sam Hunting.
Posted in Data Mining, K-Nearest-Neighbors, Marketing, Topic Maps | 2 Comments »
Saturday, April 6th, 2013
A Programmer’s Guide to Data Mining – The Ancient Art of the Numerati by Ron Zacharski.
From the webpage:
Before you is a tool for learning basic data mining techniques. Most data mining textbooks focus on providing a theoretical foundation for data mining, and as a result, may seem notoriously difficult to understand. Don’t get me wrong, the information in those books is extremely important. However, if you are a programmer interested in learning a bit about data mining you might be interested in a beginner’s hands-on guide as a first step. That’s what this book provides.
This guide follows a learn-by-doing approach. Instead of passively reading the book, I encourage you to work through the exercises and experiment with the Python code I provide. I hope you will be actively involved in trying out and programming data mining techniques. The textbook is laid out as a series of small steps that build on each other until, by the time you complete the book, you have laid the foundation for understanding data mining techniques. This book is available for download for free under a Creative Commons license (see link in footer). You are free to share the book, and remix it. Someday I may offer a paper copy, but the online version will always be free.
If you are looking for explanations of data mining that fall between the “dummies” variety and arXiv.org papers, you are at the right place!
Not new information but well presented information, always a rare thing.
Take the time to read this book.
If not for the content, to get some ideas on how to improve your next book.
Posted in Data Mining, Python | No Comments »
Friday, April 5th, 2013
A Newspaper Clipping Service with Cascading by Sujit Pal.
From the post:
This post describes a possible implementation for an automated Newspaper Clipping Service. The end-user is a researcher (or team of researchers) in a particular discipline who registers an interest in a set of topics (or web-pages). An assistant (or team of assistants) then scour information sources to find more documents of interest to the researcher based on these topics identified. In this particular case, the information sources were limited to a set of “approved” newspapers, hence the name “Newspaper Clipping Service”. The goal is to replace the assistants with an automated system.
The solution I came up with was to analyze the original web pages and treat keywords extracted out of these pages as topics, then for each keyword, query a popular search engine and gather the top 10 results from each query. The search engine can be customized so the sites it looks at is restricted by the list of approved newspapers. Finally the URLs of the results are aggregated together, and only URLs which were returned by more than 1 keyword topic are given back to the user.
The entire flow can be thought of as a series of Hadoop Map-Reduce jobs, to first download, extract and count keywords from (web pages corresponding to) URLs, and then to extract and count search result URLs from the keywords. I’ve been wanting to play with Cascading for a while, and this seemed like a good candidate, so the solution is implemented with Cascading.
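The final aggregation step Sujit describes (keep only URLs returned for more than one keyword) can be sketched in a few lines of plain Python. This is not his Cascading implementation; the `clip` function, its parameters, and the example URLs are assumptions for illustration, and the search results are hard-coded rather than fetched.

```python
from collections import Counter

def clip(results_by_keyword, top_n=10, min_keywords=2):
    """Return URLs found in the top results of at least min_keywords keywords."""
    counts = Counter()
    for keyword, urls in results_by_keyword.items():
        # set() so a URL repeated within one keyword's results counts once
        counts.update(set(urls[:top_n]))
    return sorted(url for url, n in counts.items() if n >= min_keywords)

# Hypothetical search results, keyed by extracted keyword
results = {
    "tensor": ["http://example.com/a", "http://example.com/b"],
    "stream": ["http://example.com/b", "http://example.com/c"],
    "mining": ["http://example.com/c", "http://example.com/b"],
}
clipped = clip(results)
print(clipped)
```

Note what the output is: a bare list of URLs. Any associations between the clippings and other materials still have to be supplied by someone.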
Hmmm, but an “automated system” leaves the user to sort, create associations, etc., for themselves.
Assistants with such a “clipping service” could curate the clippings by creating associations with other materials and adding non-obvious but useful connections.
Think of the front page of the New York Times as an interface to curated content behind the stories that appear on it.
Where “home” is the article on the front page.
Not only more prose but a web of connections to material you might not even know existed.
For example, in Beijing Flaunts Cross-Border Clout in Search for Drug Lord by Jane Perlez and Bree Feng (NYT) we learn that:
Under Lao norms, law enforcement activity is not done after dark, (Liu Yuejin, leader of the antinarcotics bureau of the Ministry of Public Security)
Could be important information, depending upon your reasons for being in Laos.
Posted in Authoring Topic Maps, Cascading, Data Mining, News | No Comments »
Saturday, March 30th, 2013
Using R For Statistical Analysis – Two Useful Videos by Bruce Berriman.
Bruce has uncovered two interesting videos on using R:
Introduction to R – A Brief Tutorial for R (Software for Statistical Analysis), and,
An Introduction to R for Data Mining by Joseph Rickert. (Recording of the webinar by the same name.)
Bruce has additional links that will be useful with the videos.
Enjoy!
Posted in Data Mining, R, Statistics | No Comments »
Friday, March 29th, 2013
David Coallier has two presentations under that general title:
Distributed Schema-less Document-Based Databases
and,
Computational Statistics with Open Source Tools
Neither one of which is a “…death by powerpoint…” type presentation where the speaker reads text you can read for yourself.
Which is good, except that with minimal slides, you get an occasional example, names of software/techniques, but you have to fill in a lot of context.
A pointer to videos of either of these presentations would be greatly appreciated!
Posted in Data Mining, Software, Statistics | No Comments »
Tuesday, March 26th, 2013
Tensor Decompositions and Applications by Tamara G. Kolda and Brett W. Bader.
Abstract:
This survey provides an overview of higher-order tensor decompositions, their applications, and available software. A tensor is a multidimensional or N-way array. Decompositions of higher-order tensors (i.e., N-way arrays with N ≥ 3) have applications in psychometrics, chemometrics, signal processing, numerical linear algebra, computer vision, numerical analysis, data mining, neuroscience, graph analysis, and elsewhere. Two particular tensor decompositions can be considered to be higher-order extensions of the matrix singular value decomposition: CANDECOMP/PARAFAC (CP) decomposes a tensor as a sum of rank-one tensors, and the Tucker decomposition is a higher-order form of principal component analysis. There are many other tensor decompositions, including INDSCAL, PARAFAC2, CANDELINC, DEDICOM, and PARATUCK2 as well as nonnegative variants of all of the above. The N-way Toolbox, Tensor Toolbox, and Multilinear Engine are examples of software packages for working with tensors.
At forty-five pages and two hundred and forty-five (245) references, this is a broad survey of tensor decomposition with numerous pointers to other surveys and more specialized works.
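To make "a sum of rank-one tensors" concrete, here is a small pure-Python sketch that goes in the easy direction only: it rebuilds a 3-way array from given factor vectors. It does not compute a CP decomposition, and the function names and factor values are invented for illustration.

```python
def outer3(a, b, c):
    """Rank-one 3-way tensor: entry [i][j][k] is a[i] * b[j] * c[k]."""
    return [[[x * y * z for z in c] for y in b] for x in a]

def cp_reconstruct(factors):
    """Sum rank-one tensors, one per (a, b, c) triple of factor vectors."""
    terms = [outer3(a, b, c) for a, b, c in factors]
    n_i, n_j, n_k = len(terms[0]), len(terms[0][0]), len(terms[0][0][0])
    return [[[sum(t[i][j][k] for t in terms) for k in range(n_k)]
             for j in range(n_j)] for i in range(n_i)]

# Two hypothetical rank-one components of a 2 x 2 x 2 tensor
factors = [
    ([1, 2], [1, 0], [1, 1]),
    ([0, 1], [0, 1], [2, 0]),
]
tensor = cp_reconstruct(factors)
print(tensor)
```

Finding the factors from the tensor is the hard part, and that is what the software packages named in the abstract are for.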
I found this shortly after discovering the post I cover in: Tensors and Their Applications…
As I said in the earlier post, this has a lot of promise.
Although it isn’t yet clear to me how you would compare/contrast tensors with different dimensions and perhaps even a different number of dimensions.
Still, a lot of reading to do so perhaps I haven’t reached that point yet.
Posted in Data Mining, Tensors | No Comments »
Tuesday, March 26th, 2013
Massive online data stream mining with R
From the post:
A few weeks ago, the stream package has been released on CRAN. It allows to do real time analytics on data streams. This can be very useful if you are working with large datasets which are already hard to put in RAM completely, let alone to build some statistical model on it without getting into RAM problems.
…
The stream package is currently focussed on clustering algorithms available in MOA (http://moa.cms.waikato.ac.nz/details/stream-clustering/) and also eases interfacing with some clustering already available in R which are suited for data stream clustering. Classification algorithms based on MOA are on the todo list. Current available clustering algorithms are BIRCH, CluStream, ClusTree, DBSCAN, DenStream, Hierarchical, Kmeans and Threshold Nearest Neighbor.
What if data were always encountered as a stream?
You could request a “re-streaming” of the data, but it is best to do the analysis in a single pass.
How would that impact your notion of subject identity?
How would you compensate for information learned later in the stream?
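To make the single-pass constraint concrete, here is a rough sketch of threshold-based clustering over a stream of numbers. It is a simplified leader-style algorithm written for illustration, not the stream package's or MOA's implementation; the function name, the threshold, and the readings are assumptions.

```python
def stream_cluster(stream, threshold):
    """One pass over 1-D values; each cluster keeps only a running mean and a count."""
    clusters = []  # each entry is [mean, count]
    for x in stream:
        best = min(clusters, key=lambda c: abs(c[0] - x), default=None)
        if best is not None and abs(best[0] - x) <= threshold:
            best[1] += 1
            best[0] += (x - best[0]) / best[1]  # incremental mean update
        else:
            clusters.append([x, 1])  # too far from everything seen so far
    return clusters

# An iterator, so each value really is seen exactly once
readings = iter([1.0, 1.2, 9.0, 0.8, 9.4, 5.0])
clusters = stream_cluster(readings, threshold=1.0)
print(clusters)
```

Each value is assigned when it arrives and then discarded. If a later value shows that an earlier assignment was wrong, there is nothing left to reassign, which is the compensation problem in miniature.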
Posted in Data Mining, Data Streams, R | No Comments »
Saturday, March 23rd, 2013
Data Mining and Visualization: Bed Bug Edition by Brooke Borel.
A very good example of data mining and visualization making a compelling case for conventional wisdom being wrong!
What I wonder about, and what the graphics don’t show, is what relationships, if any, existed between the authors of papers on bed bugs.
Were there communities, so to speak, of bed bug authors who cited each other? But not authors from parallel bed bug communities?
Not to mention the usual semantic gaps between authors from different traditions.
It sounds like Brooke is going to make a compelling read about all things bed bugs!
The power of data mining!
Posted in Data Mining, Graphics, Visualization | No Comments »
Wednesday, March 20th, 2013
Scenes from a Dive – what’s big data got to do with fighting poverty and fraud? by Prasanna Lal Das.
From the post:
A more detailed recap will follow soon but here’s a very quick hats off to the about 150 data scientists, civic hackers, visual analytics savants, poverty specialists, and fraud/anti-corruption experts that made the Big Data Exploration at Washington DC over the weekend such an eye-opener. We invite you to explore the work that the volunteers did (these are rough documents and will likely change as you read them so it’s okay to hold off if you would rather wait for a ‘final’ consolidated document). The projects that the volunteers worked on include:
Here are some visualizations that some project teams built. A few photos from the event are here (thanks @neilfantom). More coming soon (and yes, videos too!). Thanks @francisgagnon for the first blog about the event. The event hashtag was #data4good (follow @datakind and @WBopenfinances for more updates on Twitter).
Great meeting and projects, but I would suggest a different sort of “big data.”
Requiring recipients to grant reporting access to all bank accounts where funds will be transferred and requiring the same for any entity paid out of those accounts to the point where transfers over 90 days are less than $1,000 for any entity (or related entity), would be a better start.
With the exception of the “related entity” information, banks already keep transfer of funds information as a matter of routine business. It would be “big data” that is rich in potential for spotting fraud and waste.
The reporting banks should also be required to deliver other banking records they have on the accounts where funds are transferred and other activity in those accounts.
Before crying “invasion of privacy,” remember World Bank funding is voluntary.
As is acceptance of payment from World Bank funded projects. Anyone and everyone is free to decline such funding and avoid the proposed reporting requirements.
“Big data” to track fraud and waste is already collected by the banking industry.
The question is whether we will use that “big data” to effectively track fraud and waste or wait for particularly egregious cases to come to light?
Posted in BigData, Data Mining, Open Data, Public Data | No Comments »
Tuesday, March 19th, 2013
Knowledge Discovery from Mining Big Data – Presentation by Kirk Borne by Bruce Berriman.
From the post:
My friend and colleague Kirk Borne, of George Mason University, is a specialist in the modern field of data mining and astroinformatics. I was delighted to learn that he was giving a talk on an introduction to this topic as part of the Space Telescope Engineering and Technology Colloquia, and so I watched on the webcast. You can watch the presentation on-line, and you can download the slides from the same page. The presentation is a comprehensive introduction to data mining in astronomy, and I recommend it if you want to grasp the essentials of the field.
Kirk began by reminding us that responding to the data tsunami is a national priority in essentially all fields of science – a number of nationally commissioned working groups have been unanimous in reaching this conclusion and in emphasizing the need for scientific and educational programs in data mining. The slides give a list of publications in this area.
Deeply entertaining presentation on big data.
The first thirty minutes or so are good for “big data” quotes and hype but the real meat comes at about slide 22.
Extends the 3 V’s (Volume, Variety, Velocity) to include Veracity, Variability, Venue, Vocabulary, Value.
And outlines classes of discovery:
- Class Discovery
- Finding new classes of objects and behaviors
- Learning the rules that constrain the class boundaries
- Novelty Discovery
- Finding new, rare, one-in-a-million(billion)(trillion) objects and events
- Correlation Discovery
- Finding new patterns and dependencies, which reveal new natural laws or new scientific principles
- Association Discovery
- Finding unusual (improbable) co-occurring associations
A great presentation with references and other names you will want to follow on big data and astroinformatics.
Posted in Astroinformatics, BigData, Data Mining, Knowledge Discovery | No Comments »
Saturday, March 16th, 2013
Finding Shakespeare’s Favourite Words With Data Explorer by Chris Webb.
From the post:
The more I play with Data Explorer, the more I think my initial assessment of it as a self-service ETL tool was wrong. As Jamie pointed out recently, it’s really the M language with a GUI on top of it and the GUI itself, while good, doesn’t begin to expose the power of the underlying language: I’d urge you to take a look at the Formula Language Specification and Library Specification documents which can be downloaded from here to see for yourself. So while it can certainly be used for self-service ETL it can do much, much more than that…
In this post I’ll show you an example of what Data Explorer can do once you go beyond the UI. Starting off with a text file containing the complete works of William Shakespeare (which can be downloaded from here – it’s strange to think that it’s just a 5.3 MB text file) I’m going to find the top 100 most frequently used words and display them in a table in Excel.
If Data Explorer is a GUI on top of M (outdated but a point of origin), it goes up in importance.
From the M link:
The Microsoft code name “M” Modeling Language, hereinafter referred to as M, is a language for modeling domains using text. A domain is any collection of related concepts or objects. Modeling domain consists of selecting certain characteristics to include in the model and implicitly excluding others deemed irrelevant. Modeling using text has some advantages and disadvantages over modeling using other media such as diagrams or clay. A goal of the M language is to exploit these advantages and mitigate the disadvantages.
A key advantage of modeling in text is ease with which both computers and humans can store and process text. Text is often the most natural way to represent information for presentation and editing by people. However, the ability to extract that information for use by software has been an arcane art practiced only by the most advanced developers. The language feature of M enables information to be represented in a textual form that is tuned for both the problem domain and the target audience. The M language provides simple constructs for describing the shape of a textual language – that shape includes the input syntax as well as the structure and contents of the underlying information. To that end, M acts as both a schema language that can validate that textual input conforms to a given language as well as a transformation language that projects textual input into data structures that are amenable to further processing or storage.
I try to not run examples using Shakespeare. I get distracted by the elegance of the text, which isn’t the point of the exercise.
Posted in Data Explorer, Data Mining, Excel, Microsoft, Text Mining | No Comments »
Sunday, March 10th, 2013
SPMF: A Sequential Pattern Mining Framework
From the webpage:
SPMF is an open-source data mining platform written in Java.
It is distributed under the GPL v3 license.
It offers implementations of 52 data mining algorithms for:
- sequential pattern mining,
- association rule mining,
- frequent itemset mining,
- sequential rule mining,
- clustering
It can be used as a standalone program with a user interface or from the command line. Moreover, the source code of each algorithm can be integrated in other Java software.
The documentation consists entirely of examples of using SPMF for data mining tasks.
The algorithms page details the fifty-two (52) algorithms of SPMF by references to the literature.
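As a taste of one task on that list, frequent itemset mining, here is a brute-force Python sketch. It is not SPMF code (SPMF is Java) and not one of its algorithms; it enumerates every itemset up to a small size, which only works for toy data. The function name, the baskets, and the support threshold are invented for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=2):
    """Count itemsets up to max_size; keep those in at least min_support transactions."""
    counts = Counter()
    for items in transactions:
        unique = sorted(set(items))  # sorted so each itemset has one canonical tuple
        for size in range(1, max_size + 1):
            counts.update(combinations(unique, size))
    return {itemset: n for itemset, n in counts.items() if n >= min_support}

# Hypothetical market baskets
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]
frequent = frequent_itemsets(baskets, min_support=2)
print(frequent)
```

Real algorithms avoid counting itemsets that cannot possibly be frequent, and the references on the algorithms page are the place to see how.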
I first saw this at: SPMF: Sequential Pattern Mining Framework.
Posted in Algorithms, Data Mining | No Comments »
Friday, March 8th, 2013
Crossfilter: Fast Multidimensional Filtering for Coordinated Views
From the webpage:
Crossfilter is a JavaScript library for exploring large multivariate datasets in the browser. Crossfilter supports extremely fast (<30ms) interaction with coordinated views, even with datasets containing a million or more records; we built it to power analytics for Square Register, allowing merchants to slice and dice their payment history fluidly.
Since most interactions only involve a single dimension, and then only small adjustments are made to the filter values, incremental filtering and reducing is significantly faster than starting from scratch. Crossfilter uses sorted indexes (and a few bit-twiddling hacks) to make this possible, dramatically increasing the performance of live histograms and top-K lists. For more details on how Crossfilter works, see the API reference.
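The sorted-index idea can be sketched in Python: sort record positions by one field once, then answer a range filter with two binary searches instead of a full scan. This illustrates the general technique only. It is not Crossfilter's API or implementation, it leaves out the incremental updating described above, and the class, method, and field names are invented.

```python
from bisect import bisect_left

class Dimension:
    """A sorted index over one numeric field of a list of records."""

    def __init__(self, records, field):
        self.records = records
        # record positions ordered by the field's value
        self.order = sorted(range(len(records)), key=lambda i: records[i][field])
        self.values = [records[i][field] for i in self.order]

    def filter_range(self, low, high):
        """Return records with low <= value < high using two binary searches."""
        start = bisect_left(self.values, low)
        end = bisect_left(self.values, high)
        return [self.records[i] for i in self.order[start:end]]

# Hypothetical payment records
payments = [
    {"id": 1, "total": 40},
    {"id": 2, "total": 15},
    {"id": 3, "total": 90},
    {"id": 4, "total": 25},
]
by_total = Dimension(payments, "total")
selected = by_total.filter_range(20, 50)
print([r["id"] for r in selected])
```

When a filter bound moves slightly, only the records between the old and new index positions change membership, which is the intuition behind the incremental speedup the webpage describes.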
See the webpage for an impressive demonstration with a 5.3 MB dataset.
Is there a trend towards “big data” manipulation on clusters and “less big data” in browsers?
Will be interesting to see how the benchmarks for “big” and “less big” move over time.
I first saw this in Nat Torkington’s Four Short links: 4 March 2013.
Posted in Data Mining, Dataset, Filters, Javascript, Top-k Query Processing | No Comments »