Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

August 12, 2012

DATA MINING: Accelerating Drug Discovery by Text Mining of Patents

Filed under: Contest,Data Mining,Drug Discovery,Patents,Text Mining — Patrick Durusau @ 1:34 pm

DATA MINING: Accelerating Drug Discovery by Text Mining of Patents

From the contest page:

Patent documents contain important research that is valuable to the industry, business, law, and policy-making communities. Take the patent documents from the United States Patent and Trademark Office (USPTO) as examples. The structured data include: filing date, application date, assignees, UPC (US Patent Classification) codes, IPC codes, and others, while the unstructured segments include: title, abstract, claims, and description of the invention. The description of the invention can be further segmented into field of the invention, background, summary, and detailed description.

Given a set of “Source” patents or documents, we can use text mining to identify patents that are “similar” and “relevant” for the purpose of discovery of drug variants. These relevant patents could further be clustered and visualized appropriately to reveal implicit, previously unknown, and potentially useful patterns.

The eventual goal is to obtain a focused and relevant subset of patents, relationships and patterns to accelerate discovery of variations or evolutions of the drugs represented by the “source” patents.
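To make the "similar and relevant" idea concrete, here is a minimal sketch (not the contest's actual method; the documents and terms are invented) of scoring patent text similarity with TF-IDF weighting and cosine similarity:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute sparse TF-IDF vectors (dicts) for a list of tokenized documents."""
    n = len(docs)
    df = Counter()  # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Invented mini-corpus: a "source" patent and two candidates.
source = "statin compound for treating hyperlipidemia".split()
candidate = "statin derivative for treating hyperlipidemia".split()
unrelated = "hinge assembly for laptop computer".split()

vecs = tfidf_vectors([source, candidate, unrelated])
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # True
```

A real submission would, of course, work over full claims and descriptions and feed the similarity scores into clustering, but the ranking step reduces to exactly this kind of comparison.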

Timeline:

  • July 19, 2012 – Start of the Contest Part 1
  • August 23, 2012 – Deadline for Submission of Ontology deliverables
  • August 24 to August 29, 2012 – Crowdsourced And Expert Evaluation for Part 1. NO SUBMISSIONS ACCEPTED for contest during this week.
  • Milestone 1: August 30, 2012 – Winner for Part 1 contest announced and Ontology release to the community for Contest Part 2
  • Aug. 31 to Sept. 21, 2012 – Contest Part 2 Begins – Data Exploration / Text Mining of Patent Data
  • Milestone 2: Sept. 21, 2012 – Deadline for Submission Contest Part 2. FULL CONTEST CLOSING.
  • Sept. 22 to Oct. 5, 2012 – Crowdsourced and Expert Evaluation for contest Part 2
  • Milestone 3: Oct. 5, 2012 – Conditional Winners Announcement 

Possibly fertile ground for demonstrating the value of topic maps.

Particularly if you think of topic maps as curating search strategies and results.

Think about that for a moment: curating search strategies and results.

We have all asked reference librarians or other power searchers for assistance and watched while they discovered resources we didn’t imagine existed.

What if for medical expert searchers, we curate the “search request” along with the “search strategy” and the “result” of that search?

Such that we can match future search requests up with likely search strategies?

What we are capturing is the expert's understanding and recognition of subjects not apparent to the average user, preserved in a way that lets us make use of it again in the future.
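As a toy illustration of that curation idea (the entries and matching rule are entirely my own invention, not a topic map implementation), pair each past request with its strategy and result, then match a new request against the curated requests:

```python
# Each curated entry pairs a past search request with the expert's
# strategy and a note on the result. (Illustrative data only.)
curated = [
    {"request": {"statin", "liver", "toxicity"},
     "strategy": "search FDA adverse-event reports, then cited patents",
     "result": "12 relevant case series"},
    {"request": {"patent", "troll", "assignee"},
     "strategy": "trace USPTO assignment records through shell companies",
     "result": "assignee network map"},
]

def best_strategy(new_request, entries):
    """Return the curated entry whose past request best overlaps the
    new request, using Jaccard similarity over the request terms."""
    def jaccard(a, b):
        return len(a & b) / len(a | b)
    return max(entries, key=lambda e: jaccard(new_request, e["request"]))

hit = best_strategy({"statin", "toxicity"}, curated)
print(hit["strategy"])  # search FDA adverse-event reports, then cited patents
```

The interesting part in practice is not the lookup but the curation: deciding which features of a request identify the same underlying subject the expert recognized.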

If you aren’t interested in medical research, how about: Accelerating Discovery of Trolls by Text Mining of Patents? 😉

I first saw this at KDNuggets.


Update: 13 August 2012

Tweet by Lars Marius Garshol points to: Patent troll Intellectual Ventures is more like a HYDRA.

Even a low-end estimate – the patents actually recorded in the USPTO as being assigned to one of those shells – identifies around 10,000 patents held by the firm.

At the upper end of the researchers’ estimates, Intellectual Ventures would rank as the fifth-largest patent holder in the United States and among the top fifteen patent holders worldwide.

As sad as that sounds, remember this is one (1) troll. There are others.

August 10, 2012

[C]rowdsourcing … knowledge base construction

Filed under: Biomedical,Crowd Sourcing,Data Mining,Medical Informatics — Patrick Durusau @ 1:48 pm

Development and evaluation of a crowdsourcing methodology for knowledge base construction: identifying relationships between clinical problems and medications by Allison B McCoy, Adam Wright, Archana Laxmisan, Madelene J Ottosen, Jacob A McCoy, David Butten, and Dean F Sittig. (J Am Med Inform Assoc 2012; 19:713-718 doi:10.1136/amiajnl-2012-000852)

Abstract:

Objective We describe a novel crowdsourcing method for generating a knowledge base of problem–medication pairs that takes advantage of manually asserted links between medications and problems.

Methods Through iterative review, we developed metrics to estimate the appropriateness of manually entered problem–medication links for inclusion in a knowledge base that can be used to infer previously unasserted links between problems and medications.

Results Clinicians manually linked 231,223 medications (55.30% of prescribed medications) to problems within the electronic health record, generating 41,203 distinct problem–medication pairs, although not all were accurate. We developed methods to evaluate the accuracy of the pairs, and after limiting the pairs to those meeting an estimated 95% appropriateness threshold, 11,166 pairs remained. The pairs in the knowledge base accounted for 183,127 total links asserted (76.47% of all links). Retrospective application of the knowledge base linked 68,316 medications not previously linked by a clinician to an indicated problem (36.53% of unlinked medications). Expert review of the combined knowledge base, including inferred and manually linked problem–medication pairs, found a sensitivity of 65.8% and a specificity of 97.9%.

Conclusion Crowdsourcing is an effective, inexpensive method for generating a knowledge base of problem–medication pairs that is automatically mapped to local terminologies, up-to-date, and reflective of local prescribing practices and trends.
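The core move in the paper – estimate appropriateness from manually asserted links, then keep only pairs above a threshold – can be sketched in a few lines. (The numbers and the appropriateness metric below are invented for illustration; the paper's actual metrics were developed through iterative review.)

```python
# Hypothetical link records: (problem, medication, times_linked, times_prescribed)
links = [
    ("hypertension", "lisinopril", 970, 1000),
    ("hypertension", "ibuprofen",   30, 1000),
    ("diabetes",     "metformin",  480,  500),
]

def appropriateness(times_linked, times_prescribed):
    """Crude stand-in metric: the fraction of prescriptions of this
    medication that clinicians manually linked to this problem."""
    return times_linked / times_prescribed

# Keep only pairs meeting the 95% appropriateness threshold.
knowledge_base = [
    (problem, med)
    for problem, med, linked, total in links
    if appropriateness(linked, total) >= 0.95
]
print(knowledge_base)  # [('hypertension', 'lisinopril'), ('diabetes', 'metformin')]
```

The surviving pairs can then be applied retrospectively to suggest links for medications no clinician has linked yet, which is where the sensitivity/specificity numbers above come from.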

I would not apply the term “crowdsourcing” here, in part because the “crowd” is hardly unknown. It is not a crowd at all, but an identifiable group of clinicians.

That doesn’t invalidate the results, which show the utility of data mining for creating knowledge bases.

As a matter of usage, let’s not confuse anonymous “crowds” with specific groups of people.

August 7, 2012

Labcoat by Precog

Filed under: BigData,Data Mining,Labcoat — Patrick Durusau @ 1:32 pm

Labcoat by Precog

From the webpage:

Labcoat is an interactive data analysis tool.

With Labcoat, developers and data scientists can integrate, analyze, and visualize massive volumes of semi-structured data.

It’s now incredibly easy to understand your data. Simply visit Labcoat to explore sample data sets and run some queries – all right in your browser. Unleash your inner data scientist!

I just encountered this so don’t have any details to report.

But, if they put as much effort into the technical side as they did the marketing images (Your App versus Your App on Precog is particularly funny), then this could prove to be quite useful.

It is in private beta so not much to report. Will update when it comes out of beta or I can point to more resources.

I first saw this at KDNuggets.

August 5, 2012

The R-Podcast Episode 9: Adventures in Data Munging Part 1

Filed under: Data Mining,R — Patrick Durusau @ 4:11 pm

The R-Podcast Episode 9: Adventures in Data Munging Part 1

From the post:

It’s great to be back with a new episode after an eventful break! This episode begins a series on my adventures in data munging, a.k.a. data processing. I discuss three issues that demonstrate the flexibility and versatility R brings to recoding messy values, importing inconsistent data files, and pinpointing problematic observations and variables. We also have an extended listener feedback segment with an audio installment of the “pitfalls” of R contributed by listener Frans. I hope you enjoy this episode and keep passing along your feedback to theRcast(at)gmail.com and stop by the forums as well!
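The "recoding messy values" problem the episode covers is language-agnostic. A minimal sketch (the column values and mapping are invented; in R you would do the same with a named vector or `dplyr::recode`):

```python
# Hypothetical survey column with inconsistent codings for the same value.
raw = ["Y", "yes", "YES ", "n", "No", "unknown", ""]

# One canonical mapping; anything it doesn't cover becomes missing.
RECODE = {"y": True, "yes": True, "n": False, "no": False}

def recode(value):
    """Normalize whitespace and case, then map; unrecognized -> None."""
    return RECODE.get(value.strip().lower())

print([recode(v) for v in raw])
# [True, True, True, False, False, None, None]
```

The key design choice is normalizing before mapping, so the mapping table stays small, and letting unrecognized values fall through to an explicit missing marker rather than silently guessing.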

What do you think about the format?

Other than for Atlanta-area commuters, it seems a bit overlong to me.

And for some topics, such as teaching syntax, best to be able to “see” the examples.

July 31, 2012

WDM 2012 : Special Session on Web Data Matching

Filed under: Conferences,Data Mining — Patrick Durusau @ 2:58 pm

Call for Papers of the Special Session on Web Data Matching – WDM 2012

When Nov 21, 2012 – Nov 23, 2012
Where SĆ£o Carlos, Brazil
Submission Deadline Aug 15, 2012
Notification Due Sep 15, 2012
Final Version Due Sep 30, 2012

From the call for papers:

Under the framework of the 8th International Conference on Next Generation Web Services Practices (NWeSP 2012), 21-23 November 2012 in SĆ£o Carlos, Brazil

Objectives

In recent years, research in the area of web mining and web searching has grown rapidly, mainly thanks to the growing complexity of digital data and the huge quantity of new data available every day. A Web user wishing to find information on a particular subject must usually guess the keywords under which that information might be classified by a standard search engine. There are also new approaches, such as various methods for classifying web data based on analysis of unstructured and structured web data and on the use of human and social factors. The WDM workshop focuses mainly (but not only) on methods for analyzing web data that lead to its classification and improve user orientation on the Web.

Specific topics of interest

To address the aforementioned aspects of evolution of social networks, the preferred topics for this special session are (but not limited to):

  • Web pattern recognition and matching
  • Web information extraction
  • Web content mining
  • Web genre detection
  • Deep web analysis
  • Relevance and ranking of web data
  • Web search systems and applications
  • Mapping structured and unstructured web data

I realize it is fashionable to sprinkle “web” or “web scale” in papers and calls for papers but is the object of our study really any different?

Does it matter for authorship, genre, entity extraction, data mining, whether the complete texts of Shakespeare are on your local hard drive or some website?

Or to put it another way, should the default starting point be to consider all the data on the Web?

How would you create a lens or filter to enable a user to start with “relevant” resources for a query?

July 27, 2012

Probabilistic Data Structures for Web Analytics and Data Mining

Filed under: Data Mining,Probabilistic Data Structures,Web Analytics — Patrick Durusau @ 7:34 pm

Probabilistic Data Structures for Web Analytics and Data Mining by Ilya Katsov.

Speaking of scalability, consider:

Statistical analysis and mining of huge multi-terabyte data sets is a common task nowadays, especially in areas like web analytics and Internet advertising. Analysis of such large data sets often requires powerful distributed data stores like Hadoop and heavy data processing with techniques like MapReduce. This approach often leads to heavyweight high-latency analytical processes and poor applicability to realtime use cases. On the other hand, when one is interested only in simple additive metrics like total page views or average price of conversion, it is obvious that raw data can be efficiently summarized, for example, on a daily basis or using simple in-stream counters. Computation of more advanced metrics like the number of unique visitors or the most frequent items is more challenging and requires a lot of resources if implemented straightforwardly. In this article, I provide an overview of probabilistic data structures that allow one to estimate these and many other metrics and trade precision of the estimations for memory consumption. These data structures can be used both as temporary data accumulators in query processing procedures and, perhaps more importantly, as a compact – sometimes astonishingly compact – replacement of raw data in stream-based computing.
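To give a flavor of the trade-off, here is the classic Flajolet–Martin idea for the "unique visitors" problem in a few lines of Python (a deliberately crude single-sketch version; production estimators like HyperLogLog average many such sketches to control the variance):

```python
import hashlib

def trailing_zeros(x):
    """Number of trailing zero bits in a 32-bit value."""
    if x == 0:
        return 32
    n = 0
    while x & 1 == 0:
        x >>= 1
        n += 1
    return n

def fm_estimate(items):
    """Flajolet–Martin sketch: remember only the maximum number of
    trailing zero bits seen in item hashes; 2**max is a rough estimate
    of the distinct count, using constant memory."""
    r = 0
    for item in items:
        h = int(hashlib.md5(item.encode()).hexdigest(), 16) & 0xFFFFFFFF
        r = max(r, trailing_zeros(h))
    return 2 ** r

# 5000 events but only 1000 distinct visitors; duplicates cost nothing,
# because repeated items hash to the same value.
visitors = [f"user{i}" for i in range(1000)] * 5
print(fm_estimate(visitors))  # order-of-magnitude estimate of 1000
```

The whole "data structure" is a single integer, which is exactly the "astonishingly compact replacement of raw data" the article is talking about.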

For some subjects, we have probabilistic identifications, based upon data that is too voluminous or rapid to allow for a “definitive” identification.

The techniques introduced here will give you a grounding in data structures to deal with those situations. Interesting reading.

I saw this in Christophe Lalanne’s Bag of Tweets for July 2012.

July 23, 2012

XLConnect 0.2-0

Filed under: Data Mining,Excel,R — Patrick Durusau @ 5:59 pm

XLConnect 0.2-0

From the post:

Mirai Solutions GmbH (http://www.mirai-solutions.com) is very pleased to announce the release of XLConnect 0.2-0, which can be found at CRAN.

As one of the updates, XLConnect has moved to the newest release of Apache POI: 3.8. Also, the lazy evaluation issues with S4 generics are now fixed: generic methods now fully expand the argument list in order to have the arguments immediately evaluated.

Furthermore, we have added an XLConnect.R script file to the top level library directory, which contains all code examples presented in the vignette, so that it’s easier to reuse the code.

From an earlier description of XLConnect:

XLConnect is a comprehensive and platform-independent R package for manipulating Microsoft Excel files from within R. XLConnect differs from other related R packages in that it is completely cross-platform and as such runs under Windows, Unix/Linux and Mac (32- and 64-bit). Moreover, it does not require any installation of Microsoft Excel or any other special drivers to be able to read & write Excel files. The only requirement is a recent version of a Java Runtime Environment (JRE). Also, XLConnect can deal with the old *.xls (BIFF) and the new *.xlsx (Office Open XML) file formats. Under the hood, XLConnect uses Apache POI (http://poi.apache.org) – a Java API to manipulate Microsoft Office documents. (From XLConnect – A platform-independent interface to Excel.)

If you work with data in a business environment, you are going to encounter Excel files. (Assuming you are not in a barter economy counting animal skins and dried fish.)

And customers are going to want you to return Excel files to them. (Yes, yes, topic maps would be a much better delivery format. But if the choice is Excel files, you get paid, topic maps files, you don’t get paid, which one would you do? That’s what I thought.)

A package to consider if you need to manipulate Excel files from within R.

July 19, 2012

Analyzing 20,000 Comments

Filed under: Analytics,Data Mining — Patrick Durusau @ 7:34 am

Analyzing 20,000 Comments

First, congratulations on Chandoo.org reaching its 20,000th comment!

Second, the post does not release the data (email addresses, etc.) so it also doesn’t include the code.

Thinking of this as an exercise in analytics, which of the measures applied should lead to changes in behavior?

After all, we don’t mine data simply because we can.

What goals would you suggest and how would we measure meeting them based on the analysis described here?

Biological Dark Matter [Intellectual Dark Matter?]

Filed under: Bioinformatics,Data Mining — Patrick Durusau @ 6:05 am

Biological Dark Matter

Nathan Wolfe answers a child’s question of “what is left to explore?” with an exposition on how little we know about the most abundant life form of all, the virus.

Opportunities abound for data mining and mapping the results of data mining on viruses.

Protection against the next pandemic is vitally important but I would have answered differently.

In addition to viruses, advances have been made in data structures, graph algorithms, materials science, digital chip design, programming languages, astronomy, just to name a few areas where substantial progress has been made and more is anticipated.

Those just happen to be areas of interest to me. I am sure you could create even longer lists of areas of interest to you where substantial progress has been made.

We need to convey a sense of excitement and discovery in all areas of the sciences and humanities.

Perhaps we should call it: Intellectual Dark Matter? (another name for the unknown?)

World Leaders Comment on Attack in Bulgaria

Filed under: Data Mining,Intelligence,Social Media — Patrick Durusau @ 4:53 am

World Leaders Comment on Attack in Bulgaria

From the post:

Following the terror attack in Bulgaria killing a number of Israeli tourists on an airport bus, we can see the statements from world leaders around the globe including Israel Prime Minister Benjamin Netanyahu openly pinning the blame on Iran and threatening retaliation

If you haven’t seen one of the visualizations by Recorded Future you will be impressed by this one. Mousing over people and locations invokes what we would call scoping in a topic map context and limits the number of connections you see. And each node can lead to additional information.

While this works like a topic map, I can’t say it is a topic map application because how it works isn’t disclosed. You can read How Recorded Future Works, but you won’t be any better informed than before you read it.

Impressive work, but it isn’t clear how I would integrate their matching of sources with, say, an internal mapping of sources. Or how I would augment their mapping with additional mappings by internal subject experts?

Or how I would map this incident to prior incidents which lead to disproportionate responses?

Or map “terrorist” attacks by the world leaders now decrying other “terrorist” attacks?

That last mapping could be an interesting one for the application of the term “terrorist.” My anecdotal experience is that it depends on the sponsor.

Would be interesting to know if systematic analysis supports that observation.

Perhaps the news media could then evenly identify the probable sponsors of “terrorists” attacks.

July 18, 2012

Data Mining Projects (Ping Chen)

Filed under: Data,Data Mining — Patrick Durusau @ 6:59 pm

Data Mining Projects

From the webpage:

This is the website for the Data Mining CS 4319 class projects. Here you will find all of the information and data files you will need to begin working on the project you have selected for this semester. Please click on the link on the left hand side corresponding to your project to begin. Development of the projects hosted in this website is funded by NSF Award DUE 0737408.

Projects with resources and files are:

  • Netflix
  • Word Relevance Measures
  • Identify Time
  • Orbital Debris Analysis
  • Oil Exploration
  • Environmental Data Analysis
  • Association Rule Pre-Processing
  • Neural Network-Based Financial Market Forecasting
  • Identify Locations From a Webpage
  • Co-reference Resolution
  • Email Visualization

Now there is a broad selection of data mining projects!

BTW, be careful of the general Netflix file. It is 665 MB so don’t attempt it on airport WiFi.

I first saw this at KDNuggets.

PS: I can’t swear to the dates of the class but the grant ran from 2008 to 2010.

July 16, 2012

Data Mining In Excel: Lecture Notes and Cases (2005)

Filed under: Data Mining,Excel,Microsoft — Patrick Durusau @ 3:03 pm

Data Mining In Excel: Lecture Notes and Cases (2005) by Galit Shmueli, Nitin R. Patel, and Peter C. Bruce.

From the introduction:

This book arose out of a data mining course at MIT’s Sloan School of Management. Preparation for the course revealed that there are a number of excellent books on the business context of data mining, but their coverage of the statistical and machine-learning algorithms that underlie data mining is not sufficiently detailed to provide a practical guide if the instructor’s goal is to equip students with the skills and tools to implement those algorithms. On the other hand, there are also a number of more technical books about data mining algorithms, but these are aimed at the statistical researcher, or more advanced graduate student, and do not provide the case-oriented business focus that is successful in teaching business students.

Hence, this book is intended for the business student (and practitioner) of data mining techniques, and its goal is threefold:

  1. To provide both a theoretical and practical understanding of the key methods of classification, prediction, reduction and exploration that are at the heart of data mining;
  2. To provide a business decision-making context for these methods;
  3. Using real business cases, to illustrate the application and interpretation of these methods.

An important feature of this book is the use of Excel, an environment familiar to business analysts. All required data mining algorithms (plus illustrative datasets) are provided in an Excel add-in, XLMiner. XLMiner offers a variety of data mining tools: neural nets, classification and regression trees, k-nearest neighbor classification, naive Bayes, logistic regression, multiple linear regression, and discriminant analysis, all for predictive modeling. It provides for automatic partitioning of data into training, validation and test samples, and for the deployment of the model to new data. It also offers association rules, principal components analysis, k-means clustering and hierarchical clustering, as well as visualization tools, and data handling utilities. With its short learning curve, affordable price, and reliance on the familiar Excel platform, it is an ideal companion to a book on data mining for the business student.

Somewhat dated, but remember there are lots of older copies of MS Office around. Not an inconsiderable market if you start to write something on using Excel to produce topic maps. Write for the latest version, but I would have a version keyed to earlier versions of Excel as well.

I first saw this at KDNuggets.

UH Data Mining Hypertextbook

Filed under: Data Mining — Patrick Durusau @ 2:24 pm

UH Data Mining Hypertextbook by Professor Rakesh Verma and his students at U. of Houston.

From the contents:

Chapter 1. Decision Trees

This chapter provides an introduction to one of the major fields of data mining called classification. It also outlines some of the real world applications of classification tools and introduces the decision tree classifier that is widely used. What is a classifier? What is a decision tree? How does one construct a decision tree? These are just some of the questions answered in this chapter. Currently the novice (green) and intermediate (blue) tracks are active. More content will be added to this chapter over time.

Chapter 2. Association Analysis

In this chapter we explore another major field of data mining called association rules. Association Analysis focuses on discovering association rules which are interesting and useful hidden relationships that can be found in large data sets. This chapter is divided into various sections that explain the key concepts in Association Analysis and introduce you, the reader, to the basic algorithms used in generating Association Rules. Currently the novice (green) and intermediate (blue) tracks are active. More content will be added to this chapter over time.
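The support and confidence measures behind association rules are simple enough to compute by hand. A minimal sketch (the market-basket transactions are invented; real algorithms like Apriori add clever pruning of candidate itemsets on top of exactly these measures):

```python
# Invented market-basket data: each transaction is a set of items.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "butter"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"milk", "bread"}))                 # 0.5
print(round(confidence({"milk"}, {"bread"}), 3))  # 0.667
```

A rule like {milk} → {bread} is "interesting" when both its support and its confidence clear user-chosen thresholds; the algorithms in the chapter are about finding all such rules without enumerating every itemset.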

Chapter 3. Visualization

In this chapter we take a step back from data mining algorithms and techniques and focus on the visualization of data. This step is crucial and normally takes place before any data mining algorithms or pre-processing techniques have been applied; it is useful because in some situations it helps us pinpoint which algorithms should be used in later analysis. This chapter is divided into three main sections: the first section introduces you, the reader, to visualization, the second defines general concepts that are pertinent, and the third section explores a couple of visualization techniques. This chapter also includes a brief introduction to OLAP. More content will be added to this chapter over time.

Chapter 4. Cluster Analysis

In this chapter we pick up where classification left off and delve a little deeper into the world of grouping data objects. Cluster analysis aims to group data objects based on the available information that describes the objects and their relationships. This chapter first introduces the concept of cluster analysis and its applications in the real world, and then explores some popular clustering techniques such as the k-means clustering algorithm and agglomerative hierarchical clustering. More content will be added to this chapter over time.
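The k-means loop mentioned above fits in a dozen lines. A minimal 1-D sketch (the data points are invented; real implementations handle multiple dimensions, smarter initialization, and convergence checks):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal 1-D k-means: repeatedly assign each point to its nearest
    centroid, then move each centroid to the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Empty clusters keep their old centroid.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

data = [1.0, 1.2, 0.8, 10.0, 10.4, 9.6]
print(kmeans(data, 2))  # ≈ [1.0, 10.0]
```

With two well-separated groups the centroids settle on the group means after a couple of iterations, which is the behavior the chapter builds its intuition on.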

Appendix 1. Includes direct links to Java Applets, online links to additional resources and a list of references

As citations are made to literature, the corresponding references are kept in this appendix. Links to this appendix accompany the citations.

Note that navigation is by drop-down menus at the top of pages, for the book and chapters. Pages have “next” links at the bottom. Not a problem, just something to get used to.

First saw this at KDNuggets.

July 14, 2012

Journal of Data Mining in Genomics and Proteomics

Filed under: Bioinformatics,Biomedical,Data Mining,Genome,Medical Informatics,Proteomics — Patrick Durusau @ 12:20 pm

Journal of Data Mining in Genomics and Proteomics

From the Aims and Scope page:

Journal of Data Mining in Genomics & Proteomics (JDMGP), a broad-based journal, was founded on two key tenets: first, to publish the most exciting research with respect to the subjects of Proteomics & Genomics; second, to provide the most rapid turn-around time possible for reviewing and publishing, and to disseminate the articles freely for research, teaching and reference purposes.

In today’s wired world, information is available at the click of a button, courtesy of the Internet. JDMGP-Open Access gives a worldwide audience larger than that of any subscription-based journal in the OMICS field, no matter how prestigious or popular, and probably increases the visibility and impact of published work. JDMGP-Open Access gives barrier-free access to the literature for research. It increases convenience, reach, and retrieval power. Free online literature is available for software that facilitates full-text searching, indexing, mining, summarizing, translating, querying, linking, recommending, alerting, “mash-ups” and other forms of processing and analysis. JDMGP-Open Access puts rich and poor on an equal footing for these key resources and eliminates the need for permissions to reproduce and distribute content.

A publication (among many) from the OMICS Publishing Group, which sponsors a large number of online publications.

Has the potential to be an interesting source of information. Not much in the way of back files but then it is a very young journal.

July 10, 2012

Data Mining In Excel: Lecture Notes and Cases

Filed under: Data Mining,Excel — Patrick Durusau @ 7:51 am

Data Mining In Excel: Lecture Notes and Cases by Yanchang Zhao.

Table of contents (270 page book)

  • Overview of the Data Mining Process
  • Data Exploration and Dimension Reduction
  • Evaluating Classification and Predictive Performance
  • Multiple Linear Regression
  • Three Simple Classification Methods
  • Classification and Regression Trees
  • Logistic Regression
  • Neural Nets
  • Discriminant Analysis
  • Association Rules
  • Cluster Analysis

You knew that someday all those Excel files would be useful! 😉 Well, today may be the day!

A bit dated, 2005, but should be a good starting place.

If you are interested in learning data mining in Excel cold, try comparing the capabilities of Excel then to the current version of Excel and updating the text/examples.

Best way to learn it is to update and then teach it to others.

July 9, 2012

Using Palantir to Explore Prescription Drug Safety

Filed under: Data Mining,Palantir — Patrick Durusau @ 2:58 pm

Using Palantir to Explore Prescription Drug Safety

From the post:

Drug safety is a serious concern in the United States with adverse drug events contributing to over 770,000 injuries and deaths per year. Cost estimates range from $1.5 to $5.6 billion annually. The FDA closely monitors these adverse events and releases communications and advisories depending on the severity and frequency of the events. The FDA released such a communication regarding the drug Simvastatin in June 2011. Simvastatin, which is used to treat hyperlipidemia, is one of the most heavily prescribed medications in the world, and nearly 100 million prescriptions were written for patients in 2010.

A canned demo, but impressive nonetheless.

I have written asking for a link to the “community” version of the software. It is mentioned several times on the site but I have been unable to find the URL.

July 8, 2012

R Integration in Weka

Filed under: Data Mining,Machine Learning,R,Weka — Patrick Durusau @ 9:57 am

R Integration in Weka by Mark Hall.

From the post:

These days it seems like every man and his proverbial dog is integrating the open-source R statistical language with his/her analytic tool. R users have long had access to Weka via the RWeka package, which allows R scripts to call out to Weka schemes and get the results back into R. Not to be left out in the cold, Weka now has a brand new package that brings the power of R into the Weka framework.

Weka

In this section I briefly cover what the new RPlugin package for Weka >= 3.7.6 offers. This package can be installed via Weka’s built-in package manager.

Here is a list of the functionality implemented:

  • Execution of arbitrary R scripts in Weka’s Knowledge Flow engine
  • Datasets into and out of the R environment
  • Textual results out of the R environment
  • Graphics out of R in png format for viewing inside of Weka and saving to files via the JavaGD graphics device for R
  • A perspective for the Knowledge Flow and a plugin tab for the Explorer that provides visualization of R graphics and an interactive R console
  • A wrapper classifier that invokes learning and prediction of R machine learning schemes via the MLR (Machine Learning in R) library

The use of R appears to be spreading! (Oracle, SAP, Hadoop, just to name a few that come readily to mind.)

Where is it on your list of data mining tools?

I first saw this at DZone.

July 5, 2012

Olympic medal winners: every one since 1896 as open data

Filed under: Data,Data Mining,Data Source — Patrick Durusau @ 5:21 am

Olympic medal winners: every one since 1896 as open data

The Guardian Datablog has posted Olympic medal winner data for download.

Admitting to some preference, I was pleased to see that OpenDocument Format was one of the download choices. 😉

It may just be my ignorance of Olympic events but it seems odd for the gender of competitors to be listed along with the gender of the event?

A brief history of Olympic Sports (from Wikipedia). Military patrol was a demonstration sport in 1928, 1936 and 1948. Is that likely to make a return in 2016? Or would terrorist spotting be more appropriate?

July 3, 2012

Awesome website for #rstats Mining Twitter using R

Filed under: Data Mining,Graphics,R,Tweets,Visualization — Patrick Durusau @ 7:33 pm

Awesome website for #rstats Mining Twitter using R by Ajay Ohri

From the post:

Just came across this very awesome website.

Did you know there were six kinds of wordclouds in R.

(giggles like a little boy)

https://sites.google.com/site/miningtwitter/questions/talking-about

No, I can honestly say I was unaware “…there were six kinds of wordclouds in R.” 😉

Still, it might be a useful thing to know at some point in the future.

The Science Network of Medical Data Mining

Filed under: Biomedical,Data Mining — Patrick Durusau @ 4:14 pm

The Science Network of Medical Data Mining

From the description of Unit 1:

Bar-Ilan University & The Chaim Sheba Medical Center – The Biomedical Informatics Program – The Science Network of Medical Data Mining

Course 80-665 – Medical Data Mining Spring, 2012

Lecturer: Dr. Ronen Tal-Botzer

Lectures as of today:

  • Unit 01 – Introduction & Scientific Background
  • Unit 02 – From Data to Information to Knowledge
  • Unit 03 – From Knowledge to Wisdom to Decision
  • Unit 04 – The Electronic Medical Record
  • Unit 05 – Artificial Intelligence in Medicine – Part A
  • Unit 06 – Science Network A: System Requirement Description

An enthusiastic lecturer, which counts for a lot!

Presenting medical information as intertwined with data mining strikes me as a sound approach. Assuming students are grounded in medical information (or some other field), adding data mining extends the familiar.

June 29, 2012

Detecting Emergent Conflicts with Recorded Future + Ushahidi

Filed under: Data Mining,Intelligence — Patrick Durusau @ 3:16 pm

Detecting Emergent Conflicts with Recorded Future + Ushahidi by Ninja Shoes. (?)

From the post:

An ocean of data is available on the web. From this ocean of data, information can in theory be extracted and used by analysts for detecting emergent trends (trend spotting). However, to do this manually is a daunting and nearly impossible task. In this study we describe a semi-automatic system in which data is automatically collected from selected sources, and to which linguistic analysis is applied to extract e.g., entities and events. After combining the extracted information with human intelligence reports, the results are visualized to the user of the system who can interact with it in order to obtain a better awareness of historic as well as emergent trends. A prototype of the proposed system has been implemented and some initial results are presented in the paper.

The paper in question.

A fairly remarkable bit of work that illustrates the current capabilities for mining the web and also its limitations.

The processing of news feeds for protest reports is interesting, but mistakes the result of years of activity as an “emergent” conflict.

If you were going to capture the data that would enable a human analyst to “predict” the Arab Spring, you would have to begin with union organizing activities. Not the sort of thing that is going to make news reports on the WWW.

For that you would need traditional human intelligence. From people who don’t spend their days debating traffic or reports with other non-native staffers. Or meeting with managers from Washington or Stockholm.

Or let me put it this way:

Mining the web doesn’t equal useful results. Just as mining for gold doesn’t mean you will find any.

June 28, 2012

R and Data Mining (RDataMining.com)

Filed under: Data Mining,R — Patrick Durusau @ 6:32 pm

R and Data Mining (RDataMining.com)

I have mentioned several resources from this site:

R Reference Card for Data Mining [Annotated TOC?]

An Example of Social Network Analysis with R using Package igraph

Book “R and Data Mining: Examples and Case Studies” on CRAN [blank chapters]

Online resources for handling big data and parallel computing in R

There are others I have yet to cover and new ones will be appearing. If you are using R for data mining, it is a good site to revisit on a regular basis.

R Reference Card for Data Mining [Annotated TOC?]

Filed under: Data Mining,R — Patrick Durusau @ 6:31 pm

R Reference Card for Data Mining

A good reference to have at hand.

For teaching/learning purposes, use this listing as an annotated table of contents and create an entry for each item demonstrating its use.

The result will be a broader and deeper survey of R data mining techniques than you are likely to encounter otherwise.

First seen in Christophe Lalanne’s A bag of tweets / June 2012.

June 17, 2012

Data Mining with Microsoft SQL Server 2008 [Book Review]

Filed under: Data Mining,Microsoft,SQL Server — Patrick Durusau @ 3:10 pm

Data Mining with Microsoft SQL Server 2008

Sandro Saitta writes:

If you are using Microsoft data mining tools, this book is a must have. Written by MacLennan, Tang and Crivat, it describes how to perform data mining using SQL Server 2008. The book is huge – more than 630 pages – but it is normal since authors give detailed explanation for each data mining function. The book covers topics such as general data mining concepts, DMX, Excel add-ins, OLAP cubes, data mining architecture and many more. The seven data mining algorithms included in the tool are described in separate chapters.

The book is well written, so it can be read from A to Z or by selecting specific chapters. Each theoretical concept is explained through examples. Using screenshots, each step of a given method is presented in details. It is thus more a user manual than a book explaining data mining concepts. Don’t expect to read any detailed algorithms or equations. A good surprise of the book are the case studies. They are present in most chapters and show real examples and how to solve them. It really shows the experience of the authors in the field.

I haven’t seen the book yet, but that can be corrected. 😉

June 15, 2012

Mozilla Ignite [Challenge – $15,000]

Filed under: Challenges,Data Integration,Data Mining,Filters,Topic Maps — Patrick Durusau @ 8:21 am

Mozilla Ignite

From the webpage:

Calling all developers, network engineers and community catalysts. Mozilla and the National Science Foundation (NSF) invite designers, developers and everyday people to brainstorm and build applications for the faster, smarter Internet of the future. The goal: create apps that take advantage of next-generation networks up to 250 times faster than today, in areas that benefit the public — like education, healthcare, transportation, manufacturing, public safety and clean energy.

Designing for the internet of the future

The challenge begins with a “Brainstorming Round” where anyone can submit and discuss ideas. The best ideas will receive funding and support to become a reality. Later rounds will focus specifically on application design and development. All are welcome to participate in the brainstorming round.

BRAINSTORM

What would you do with 1 Gbps? What apps would you create for deeply programmable networks 250x faster than today? Now through August 23rd, let’s brainstorm. $15,000 in prizes.

The challenge is focused specifically on creating public benefit in the U.S. The deadline for idea submissions is August 23, 2012.

Here is the entry website.

I assume the 1 Gbps is actual and not as measured by the marketing department of the local cable company. 😉

That speed would require a source that can push 1 Gbps to you and the capacity on your end to handle it. (Upstream limitations are what choke my local speed down.)

I went looking for an example of what that would mean and came up with: “…[you] can download 23 episodes of 30 Rock in less than two minutes.”

On the whole, I would rather not.
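For a rough sense of the arithmetic behind that claim, here is a back-of-envelope sketch; the per-episode size is my own guess, not a figure from the article:

```python
# Back-of-envelope transfer time on an ideal 1 Gbps link
# (no protocol overhead, no upstream bottleneck).
def transfer_seconds(size_bytes, link_bps=1_000_000_000):
    return size_bytes * 8 / link_bps

# 23 episodes at an assumed ~200 MiB each is roughly 4.6 GiB.
episodes = 23 * 200 * 1024**2
print(round(transfer_seconds(episodes), 1))  # about 38.6 seconds
```

So even with generous episode sizes, the “less than two minutes” figure holds only if the link actually delivers its rated speed end to end.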

What other uses would you suggest for 1Gbps network speeds?

Assuming you have the capacity to push back at the same speed, I wonder what that means in terms of querying/viewing data as a topic map?

Transformation to a topic map for only a subset of data?

Looking forward to seeing your entries!

June 12, 2012

Data distillation with Hadoop and R

Filed under: Data Mining,Data Reduction,Hadoop,R — Patrick Durusau @ 1:55 pm

Data distillation with Hadoop and R by David Smith.

From the post:

We’re definitely in the age of Big Data: today, there are many more sources of data readily available to us to analyze than there were even a couple of years ago. But what about extracting useful information from novel data streams that are often noisy and minutely transactional … aye, there’s the rub.

One of the great things about Hadoop is that it offers a reliable, inexpensive and relatively simple framework for capturing and storing data streams that just a few years ago we would have let slip through our grasp. It doesn’t matter what format the data comes in: without having to worry about schemas or tables, you can just dump unformatted text (chat logs, tweets, email), device “exhaust” (binary, text or XML packets), flat data files, network traffic packets … all can be stored in HDFS pretty easily. The tricky bit is making sense of all this unstructured data: the downside to not having a schema is that you can’t simply make an SQL-style query to extract a ready-to-analyze table. That’s where Map-Reduce comes in.

Think of unstructured data in Hadoop as being a bit like crude oil: it’s a valuable raw material, but before you can extract useful gasoline from Brent Sweet Light Crude or Dubai Sour Crude you have to put it through a distillation process in a refinery to remove impurities, and extract the useful hydrocarbons.

I may find this a useful metaphor because I grew up in Louisiana, where land-based oil wells were abundant and there was an oil refinery only a couple of miles from my home.

Not a metaphor that will work for everyone but one you should keep in mind.
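The Map-Reduce step Smith describes, turning schema-less text into something tabular, comes down to a mapper that emits key/value pairs from raw lines and a reducer that aggregates values per key. A minimal local simulation in Python (word counting stands in for whatever extraction a real job would do):

```python
# Minimal Hadoop-Streaming-style mapper/reducer, simulated in-process.
from itertools import groupby

def mapper(lines):
    """Emit (word, 1) pairs from raw text lines."""
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1

def reducer(pairs):
    """Sum counts per key. Hadoop delivers mapper output sorted by key;
    sorted() emulates that shuffle phase here."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

raw = ["big data is raw", "raw data needs refining"]
print(dict(reducer(mapper(raw))))
```

In a real Hadoop Streaming job the mapper and reducer would be separate scripts reading stdin and writing stdout, with the framework handling the sort and shuffle between them; the refinery metaphor maps neatly onto that pipeline.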

June 11, 2012

Announcing Revolution R Enterprise 6.0

Filed under: Data Mining,R — Patrick Durusau @ 4:22 pm

Announcing Revolution R Enterprise 6.0

Just in case you missed the announcement:

Revolution Analytics is proud to announce the latest update to our enhanced, production-grade distribution of R, Revolution R Enterprise. This update expands the range of supported computation platforms, adds new Big Data predictive models, and updates to the latest stable release of open source R (2.14.2), which improves performance of the R interpreter by about 30%.

This release expands the range of big-data statistical analysis with support for Generalized Linear Models (GLM). Logistic (Binomial), Poisson, Gamma and Tweedie models are all supported with a high-performance C++ implementation, and you can also model any distribution in the GLM family with a custom link function written in R. Big Data GLM has been a common request from many of our customers, and beta testers have been blown away by the speed of the implementation. For example, here's a Tweedie regression on 8.5 million insurance claims in less than two and a half minutes (skip ahead to 1:10 for the demo):

 

I included the video because it is about as impressive as demos get.

Details about Revolution R Enterprise 6.0 follow in the post.

June 7, 2012

Principles of Data Mining

Filed under: Data Mining — Patrick Durusau @ 2:19 pm

Principles of Data Mining by David J. Hand, Heikki Mannila and Padhraic Smyth.

Description:

The growing interest in data mining is motivated by a common problem across disciplines: how does one store, access, model, and ultimately describe and understand very large data sets? Historically, different aspects of data mining have been addressed independently by different disciplines. This is the first truly interdisciplinary text on data mining, blending the contributions of information science, computer science, and statistics.

The book consists of three sections. The first, foundations, provides a tutorial overview of the principles underlying data mining algorithms and their application. The presentation emphasizes intuition rather than rigor. The second section, data mining algorithms, shows how algorithms are constructed to solve specific problems in a principled manner. The algorithms covered include trees and rules for classification and regression, association rules, belief networks, classical statistical models, nonlinear models such as neural networks, and local “memory-based” models. The third section shows how all of the preceding analysis fits together when applied to real-world data mining problems. Topics include the role of metadata, how to handle missing data, and data preprocessing.

Another high quality resource if you are learning data mining in a classroom or just adding to your skill set.

The wealth of data, resources such as this book, and free tools have made ignorance of data modeling a “shame on me” proposition.

PDF slides and R code examples on Data Mining and Exploration

Filed under: Data Mining,R — Patrick Durusau @ 2:18 pm

PDF slides and R code examples on Data Mining and Exploration by Yanchang Zhao.

A sampling:

Overview of Data Mining http://www.inf.ed.ac.uk/teaching/courses/dme/2012/slides/datamining_intro4up.pdf

Visualizing Data http://www.inf.ed.ac.uk/teaching/courses/dme/2012/slides/visualisation4up.pdf

Decision trees http://www.inf.ed.ac.uk/teaching/courses/dme/2012/slides/classification4up.pdf

More await your review!

June 5, 2012

Dominic Widdows

Filed under: Data Mining,Natural Language Processing,Researchers,Visualization — Patrick Durusau @ 7:57 pm

While tracking references, I ran across the homepage of Dominic Widdows at Google.

Actually, I found the Papers and Publications page for Dominic Widdows and then found his homepage. 😉

There is much to be read here.

DBLP page for Dominic Widdows.

