Data Mining « Another Word For It

December 14, 2011

Stop Mining Data!

Filed under: Data Mining — Patrick Durusau @ 7:46 pm

The title caught my attention, particularly given that Matthew Hurst was saying it!

From the post:

In some recent planning and architectural discussions I’ve become aware of the significant difference between reasoning about data and reasoning about the world that the data represents.

Before reading his post, care to guess what entity is “…reasoning about the world that the data represents.”?

Go on! Take a chance! 😉

I don’t know if his use of “record linkage” was in the technical sense or not. Will have to ask.

Comments Off

December 11, 2011

The Coron System

Filed under: Associations,Data Mining,Software — Patrick Durusau @ 9:23 pm

The Coron System

From the overview:

Coron is a domain and platform independent, multi-purposed data mining toolkit, which incorporates not only a rich collection of data mining algorithms, but also allows a number of auxiliary operations. To the best of our knowledge, a data mining toolkit designed specifically for itemset extraction and association rule generation like Coron does not exist elsewhere. Coron also provides support for preparing and filtering data, and for interpreting the extracted units of knowledge.

In our case, the extracted knowledge units are mainly association rules. At the present time, finding association rules is one of the most important tasks in data mining. Association rules allow one to reveal “hidden” relationships in a dataset. Finding association rules requires first the extraction of frequent itemsets.

Currently, there exist several freely available data mining algorithms and tools. For instance, the goal of the FIMI workshops is to develop more and more efficient algorithms in three categories: (1) frequent itemsets (FI) extraction, (2) frequent closed itemsets (FCI) extraction, and (3) maximal frequent itemsets (MFI) extraction. However, they tend to overlook one thing: the motivation to look for these itemsets. After having found them, what can be done with them? Extracting FIs, FCIs, or MFIs only is not enough to generate really useful association rules. The FIMI algorithms may be very efficient, but they are not always suitable for our needs. Furthermore, these algorithms are independent, i.e. they are not grouped together in a unified software platform. We also did experiments with other toolkits, like Weka. Weka covers a wide range of machine learning tasks, but it is not really suitable for finding association rules. The reason is that it provides only one algorithm for this task, the Apriori algorithm. Apriori finds FIs only, and is not efficient for large, dense datasets.

Because of all these reasons, we decided to group the most important algorithms into a software toolkit that is aimed at data mining. We also decided to build a methodology and a platform that implements this methodology in its entirety. Another advantage of the platform is that it includes the auxiliary operations that are often missing in the implementations of single algorithms, like filtering and pre-processing the dataset, or post-processing the found association rules. Of course, the usage of the methodology and the platform is not narrowed to one kind of dataset only, i.e. they can be generalized to arbitrary datasets.

I found this too late in the weekend to do more than report it.

I have spent most of the weekend trying to avoid expanding a file to approximately 2 TB before parsing it. More on that saga later this week.

Anyway, Coron looks/sounds quite interesting.

Anyone using it that cares to comment on it?

Comments Off

December 7, 2011

Rattle: A Graphical User Interface for Data Mining using R

Filed under: Data Mining,R,Rattle — Patrick Durusau @ 8:18 pm

Rattle: A Graphical User Interface for Data Mining using R

From the webpage:

Rattle (the R Analytical Tool To Learn Easily) presents statistical and visual summaries of data, transforms data into forms that can be readily modelled, builds both unsupervised and supervised models from the data, presents the performance of models graphically, and scores new datasets.

I think I found Data Mining: Desktop Survival Guide before I located Rattle. Either way, both look like resources you will find useful.

Comments Off

DIM 2012 : IEEE International Workshop on Data Integration and Mining

Filed under: Conferences,Data Integration,Data Mining — Patrick Durusau @ 8:12 pm

DIM 2012 : IEEE International Workshop on Data Integration and Mining

Important Dates:

When Aug 8, 2012 – Aug 10, 2012
Where Las Vegas, Nevada, USA
Submission Deadline Mar 31, 2012
Notification Due Apr 30, 2012
Final Version Due May 14, 2012

From the website:

Given the emerging global Information-centric IT landscape that has tremendous social and economic implications, effectively processing and integrating humungous volumes of information from diverse sources to enable effective decision making and knowledge generation have become one of the most significant challenges of current times. Information Reuse and Integration (IRI) seeks to maximize the reuse of information by creating simple, rich, and reusable knowledge representations and consequently explores strategies for integrating this knowledge into systems and applications. IRI plays a pivotal role in the capture, representation, maintenance, integration, validation, and extrapolation of information; and applies both information and knowledge for enhancing decision-making in various application domains.

This conference explores three major tracks: information reuse, information integration, and reusable systems. Information reuse explores theory and practice of optimizing representation; information integration focuses on innovative strategies and algorithms for applying integration approaches in novel domains; and reusable systems focus on developing and deploying models and corresponding processes that enable Information Reuse and Integration to play a pivotal role in enhancing decision-making processes in various application domains.

The IEEE IRI conference serves as a forum for researchers and practitioners from academia, industry, and government to present, discuss, and exchange ideas that address real-world problems with real-world solutions. Theoretical and applied papers are both included. The conference program will include special sessions, open forum workshops, panels and keynote speeches.

Note the emphasis on integration. In topic maps we would call that merging.

I think that bodes well for the future of topic maps. Provided that we “steal the march” so to speak.

We have spent years, decades for some of us thinking about data integration issues. Let’s not hide our bright lights under a basket.

Comments Off

November 28, 2011

Parsing Wikipedia Articles: Wikipedia Extractor and Cloud9

Filed under: Data Mining,Dataset,Extraction — Patrick Durusau @ 7:05 pm

Parsing Wikipedia Articles: Wikipedia Extractor and Cloud9 by Ryan Rosario.

From the post:

Lately I have doing a lot of work with the Wikipedia XML dump as a corpus. Wikipedia provides a wealth information to researchers in easy to access formats including XML, SQL and HTML dumps for all language properties. Some of the data freely available from the Wikimedia Foundation include

article content and template pages

article content with revision history (huge files)

article content including user pages and talk pages

redirect graph

page-to-page link lists: redirects, categories, image links, page links, interwiki etc.

image metadata

site statistics

The above resources are available not only for Wikipedia, but for other Wikimedia Foundation projects such as Wiktionary, Wikibooks and Wikiquotes.

All of that is available but also lacking any consistent usage of syntax. Ryan stumbles upon Wikipedia Extractor, which has pluses and minuses, an example of that latter being really slow. Things look up for Ryan when he is reminded about Cloud9, which is designed for a MapReduce environment.

Read the post to see how things turned out for Ryan using Cloud9.

Depending on your needs, Wikipedia URLs are a start on subject identifiers, although you will probably need to create some for your particular domain.

Comments Off

November 27, 2011

Concord: A Tool That Automates the Construction of Record Linkage Systems

Filed under: Concord,Data Mining,Entity Resolution,Equivalence Class,Record Linkage,Record Resolution Systems (RSS),Surrogate Learning — Patrick Durusau @ 8:49 pm

Concord: A Tool That Automates the Construction of Record Linkage Systems by Christopher Dozier, Hugo Molina Salgado, Merine Thomas, Sriharsha Veeramachaneni, 2010.

From the webpage:

Concord is a system provided by Thomson Reuters R&D to enable the rapid creation of record resolution systems (RRS). Concord allows software developers to interactively configure a RRS by specifying match feature functions, master record retrieval blocking functions, and unsupervised machine learning methods tuned to a specific resolution problem. Based on a developer’s configuration process, the Concord system creates a Java based RRS that generates training data, learns a matching model and resolves record information contained in files of the same types used for training and configuration.

A nice way to start off the week! Deeply interesting paper and a new name for record linkage.

Several features of Concord that merit your attention (among many):

A choice of basic comparison operations with the ability to extend seems like a good design to me. No sense overwhelming users with all the general comparison operators, to say nothing of the domain specific ones.

The blocking functions, which operate just as you suspect, narrows the potential set of records for matching down, is also appealing. Sometimes you may be better at saying what doesn’t match than what does. This gives you two bites at a successful match.

Surrogate learning, although I have located the paper cited on this subject and will be covering it in another post.

I have written to ThomsonReuters inquiring about availability of Concord, its ability to interchange mapping settings between instances of Concord or beyond. Will update when I hear back from them.

Comments Off

November 21, 2011

TextMinr

Filed under: Data Mining,Language,Text Analytics — Patrick Durusau @ 7:31 pm

TextMinr

In pre-beta (can signal interest now) but:

Text Mining As A Service – Coming Soon!

What if you could incorporate state-of-the-art text mining, language processing & analytics into your apps and systems without having to learn the science or pay an arm and a leg for the software?

Soon you will be able to!

We aim to provide our text mining technology as a simple, affordable pay-as-you-go service, available through a web dashboard and a set of REST API’s.

If you already familiar with these tools and your data sets, this could be a useful convenience.

If you aren’t familiar with these tools and your data sets, this could be a recipe for disaster.

Like SurveyMonkey.

In the hands of a survey construction expert, with testing of the questions, etc., I am sure SurveyMonkey can be a very useful tool.

In the hands of management, who want to justify decisions where surveys can be used, SurveyMonkey is positively dangerous.

Ask yourself this: Why in an age of SurveyMonkey, do politicians pay pollsters big bucks?

Do you suspect there is something different from a professional pollster and SurveyMonkey?

Same distance between TextMinr and professional text analysis.

Or perhaps better, you get what you pay for.

Comments Off

November 14, 2011

November 13, 2011

Cross-Industry Standard Process for Data Mining (CRISP-DM 1.0)

Filed under: CRISP-DM,Data Mining — Patrick Durusau @ 10:00 pm

Cross-Industry Standard Process for Data Mining (CRISP-DM 1.0) (pdf file)

From the foreword:

CRISP-DM was conceived in late 1996 by three “veterans” of the young and immature data mining market. DaimlerChrysler (then Daimler-Benz) was already experienced, ahead of most industrial and commercial organizations, in applying data mining in its business operations. SPSS (then ISL) had been providing services based on data mining since 1990 and had launched the first commercial data mining workbench – Clementine – in 1994. NCR, as part of its aim to deliver added value to its Teradata data warehouse customers, had established teams of data mining consultants and technology specialists to service its clients’ requirements.

At that time, early market interest in data mining was showing signs of exploding into widespread uptake. This was both exciting and terrifying. All of us had developed our approaches to data mining as we went along. Were we doing it right? Was every new adopter of data mining going to have to learn, as we had initially, by trial and error? And from a
supplier’s perspective, how could we demonstrate to prospective customers that data mining was sufficiently mature to be adopted as a key part of their business processes? A standard process model, we reasoned, non-proprietary and freely available, would address these issues for us and for all practitioners.

…

CRISP-DM has not been built in a theoretical, academic manner working from technical principles, nor did elite committees of gurus create it behind closed doors. Both these approaches to developing methodologies have been tried in the past, but have seldom led to practical, successful and widely–adopted standards. CRISP-DM succeeds because it is soundly based on the practical, real-world experience of how people do data mining projects. And in that respect, we are overwhelmingly indebted to the many practitioners who contributed their efforts and their ideas throughout the project.

You might want to note that despite the issue date of 2000:

Eric King, founder and president of The Modeling Agency, a Pittsburgh-based consulting firm that focuses on analytics and data mining, [said]:

While King believes a guide in the form of a consultant is an invaluable resource for businesses in the planning phase, he noted that his firm follows the Cross Industry Standard Process for Data Mining, a public document he describes as “a cheat sheet,” when it’s working with clients. (emphasis added. Source: Developing a predictive analytics program doable on a limited budget

Comments Off

November 9, 2011

Multiperspective

Filed under: Associations,Data Mining,Graphs,Multiperspective,Neo4j,Visualization — Patrick Durusau @ 7:43 pm

Multiperspective

From the readme:

WHAT IS MULTISPECTIVE?

Multispective is an open source intelligence management system based on the neo4j graph database. By using a graph database to capture information, we can use its immensely flexible structure to store a rich relationship model and easily visualize the contents of the system as nodes with relationships to one another.

INTELLIGENCE MANAGEMENT FOR ACTIVISTS AND COLLECTIVES

The main purpose for creating this system is to provide socially motivated groups with an open source software product for managing their own intelligence relating to target networks, such as corporations, governments and other organizations. Multispective will provide these groups with a collective/social mechanism for obtaining and sharing insights into their target networks. My intention is that Multispective’s use of social media paradigms combined with visualisations will provide a well-articulated user interface into working with complex network data.

Inspired by the types of intelligence management systems used by law enforcement and national security agencies, Multispective will be great for showing things like corporate ownership and interest, events like purchases, payments (bribes), property transfers and criminal acts. The system will make it easier to look at how seemingly unrelated information is actually connected.

Multispective will also allow groups to overlap in areas of interest, discovering commonalities between discrete datasets, and being able to make use of data which has already been collected. (emphasis added)

The last two lines would not be out of place in any topic map presentation.

A project that is going to run into subject identity issues sooner rather than later. Experience and suggestions from the topic map camp would be welcome I suspect.

I don’t have a lot of extra time but I am going to toss my hat into the ring as at least interested in helping. How about you?

Comments Off

November 6, 2011

Rdatamarket Tutorial

Filed under: Data Mining,Government Data,R — Patrick Durusau @ 5:44 pm

From the Revolutions blog:

The good folks at DataMarket have posted a new tutorial on using the rdatamarket package (covered here in August) to easily download public data sets into R for analysis.

The tutorial describes how to install the rdatamarket package, how to extract metadata for data sets, and how to download the data themselves into R. The tutorial also illustrates a feature of the package I wasn’t previously aware of: you can use dimension filtering to extract just the portion of the dataset you need: for example, to read just the population data for specific countries from the entire UN World Population dataset.

DataMarket Blog: Using DataMarket from within R

Comments Off

November 3, 2011

Introducing DocDiver

Filed under: Authoring Topic Maps,Data Mining,Document Classification,Document Management,Interface Research/Design,Topic Maps — Patrick Durusau @ 7:22 pm

Introducing DocDiver by Al Shaw. The ProPublica Nerd Blog

From the post:

Today [4 Oct. 2011] we’re launching a new feature that lets readers work alongside ProPublica reporters—and each other—to identify key bits of information in documents, and to share what they’ve found. We call it DocDiver [1].

Here’s how it works:

DocDiver is built on top of DocumentViewer [2] from DocumentCloud [3]. It frames the DocumentViewer embed and adds a new right-hand sidebar with options for readers to browse findings and to add their own. The “overview” tab shows, at a glance, who is talking about this document and “key findings”—ones that our editors find especially illuminating or noteworthy. The “findings” tab shows all reader findings to the right of each page near where readers found interesting bits.

Graham Moore (Networkedplanet) mentioned early today that the topic map working group should look for technologies and projects where topic maps can make a real difference for a minimal amount of effort. (I’m paraphrasing so if I got it wrong, blame me, not Graham.)

This looks like a case where an application is very close to having topic map capabilities but not quite. The project already has users, developers and I suspect would be interested in anything that would improve their software, without starting over. That would be the critical part, to leverage existing software an imbue it with subject identity as we understand the concept, to the benefit of current users of the software.

Comments Off

November 1, 2011

aliquote

Filed under: Bioinformatics,Data Mining — Patrick Durusau @ 3:33 pm

aliquote

One of the odder blogs I have encountered, particularly the “bag of tweets” postings.

What appear to be fairly high grade posting on data and bioinformatics topics. It is one that I will be watching and thought I would pass it along.

Comments Off

October 28, 2011

Strata Conference: Making Data Work

Filed under: Conferences,Data,Data Mining,Data Science — Patrick Durusau @ 3:15 pm

Strata Conference: Making Data Work Proceedings from the New York Strata Conference, Sept. 22-23, 2011.

OK, so you missed the live video feeds. Don’t despair, videos are available for some and slides appear to be available for all. Not like being there or seeing the videos but better than missing it altogether!

A number of quite remarkable presentations.

Comments Off

tm – Text Mining Package

Filed under: Data Mining,R,Text Extraction — Patrick Durusau @ 3:12 pm

tm – Text Mining Package

From the webpage:

tm (shorthand for Text Mining Infrastructure in R) provides a framework for text mining applications within R.

The tm package offers functionality for managing text documents, abstracts the process of document manipulation and eases the usage of heterogeneous text formats in R. The package has integrated database backend support to minimize memory demands. An advanced meta data management is implemented for collections of text documents to alleviate the usage of large and with meta data enriched document sets.

With the package ships native support for handling the Reuters-21578 data set, Gmane RSS feeds, e-mails, and several classic file formats (e.g. plain text, CSV text, or PDFs).

Admittedly, the “tm” caught my attention but a quick review confirmed that the package could be useful to topic map authors.

Comments Off

Dealing with Data (Science 11 Feb. 2011)

Filed under: Data,Data Mining,Data Science — Patrick Durusau @ 3:11 pm

Dealing with Data (Science 11 Feb. 2011)

From the website:

In the 11 February 2011 issue, Science joins with colleagues from Science Signaling, Science Translational Medicine, and Science Careers to provide a broad look at the issues surrounding the increasingly huge influx of research data. This collection of articles highlights both the challenges posed by the data deluge and the opportunities that can be realized if we can better organize and access the data.

The Science cover (left) features a word cloud generated from all of the content from the magazine’s special section.

Science is making access to this entire collection FREE (simple registration is required for non-subscribers).

Better late than never!

This is a very good overview of the big data issue, from a science perspective.

Comments Off

October 27, 2011

Data.gov

Filed under: Data,Data Mining,Government Data — Patrick Durusau @ 4:46 pm

Data.gov

A truly remarkable range of resources from the U.S. Federal Government, that is made all the more interesting by Data.gov Next Generation:

Data.gov starts an exciting new chapter in its evolution to make government data more accessible and usable than ever before. The data catalog website that broke new grounds just two years ago, is once again redefining the Open Data experience. Learn more about Data.gov’s transformation into a cloud-based Open Data platform for citizens, developers and government agencies in this 4-minute introductory video.

Developers should take a look at: http://dev.socrata.com/.

Comments Off

Timetric

Filed under: Data Mining,Data Structures — Patrick Durusau @ 4:46 pm

Timetric: Everything you need to publish data and research online

Billed as having more than three (3) million public statistics.

Looks like an interesting data source.

Anyone have experience with this site in particular?

Comments Off

Unsolicited advice for large governmental data providers

Filed under: Data Mining,Metadata — Patrick Durusau @ 4:45 pm

Unsolicited advice for large governmental data providers

From the post:

We source data from a number of large national, and trans-national, statistical bodies, like the Office of National Statistics here in the UK, or Eurostat. Downloading useful data from organizations like this is sometimes a tricky job – although publishing data is usually part of their raison d’être, they’re not usually thinking of people like us – Big Data geeks – when making their data available. And often, their methods of making data available have been essentially unchanged for the past ten or fifteen years, and even then are probably based on processes predating the Internet.

One of the sources of value Timetric adds is simply making this data more widely available and accessible. But it’s also true that there’s so much more we could do if we could put our minds to using this data in new and exciting ways, rather than expending expertise on working out the best way to map old-fashioned data publication workflows to a web-centric way of working. So it’s an interesting question to ask – in an ideal world, how would a large statistical organization publish data for us?

There’s three aspects to this question:

Data transfer and formats

Metadata formats and reconciliation

Update frequency and notifications

The advice seems UK/Euro centric to me, which works given their audience.

My question: How would you change this for governments located in other countries? Pointers to data format documentation should be included. 2-3 pages.

Comments Off

October 23, 2011

The Simple Way to Scrape an HTML Table: Google Docs

Filed under: Data Mining,HTML — Patrick Durusau @ 7:22 pm

The Simple Way to Scrape an HTML Table: Google Docs

From the post:

Raw data is the best data, but a lot of public data can still only be found in tables rather than as directly machine-readable files. One example is the FDIC’s List of Failed Banks. Here is a simple trick to scrape such data from a website: Use Google Docs.

OK, not a great trick but if you are in a hurry it may be a useful one.

Of course, I get the excuse from local governments that their staff can’t export data in useful formats (I get images of budget documents in PDF files, how useful is that?).

Comments Off

October 22, 2011

A history of the world in 100 seconds

Filed under: Data Mining,Geographic Data,Visualization — Patrick Durusau @ 3:17 pm

A history of the world in 100 seconds by Gareth Lloyd.

From the post:

Many Wikipedia articles are tagged with geographic coordinates. Many have references to historic events. Cross referencing these two subsets and plotting them year on year adds up to a dynamic visualization of Wikipedia’s view of world history.

The ‘spotlight’ is an overlay on the video that tries to keep about 90% of the datapoints within the bright area. It takes a moving average of all the latitudes and longitudes over the past 50 or so years and centres on the mean coordinate. I love the way it opens up, first resembling medieval maps of “The World” which included only Europe and some of Asia, then encompassing “The New World” and finally resembling a modern map.

This is based on the thing that me and Tom Martin built at Matt Patterson’s History Hackday. To make it, I built a python SAX Parser that sliced and diced an xml dump of all wikipedia articles (30Gb) and pulled out 424,000 articles with coordinates and 35,000 references to events. We managed to pair up 14,238 events with locations, and Tom wrote some Java to fiddle about with the coordinates and output frames. I’ve hacked around some more to add animation, because, you know, why not?

I wanted to point this post out separately for several reasons.

First, it is a good example of re-use of existing data in a new and/or interesting way. That avoids you having to spend time collecting up the original data.

Second, Gareth provides both the source code and data so you can verify his results for yourself or decide that some other visualization suits your fancy.

Third, you should read some of the comments about this work. That sort of thing is going to occur no matter what resource or visualization you make available. If you had a super-Wiki with 10 million articles in the top ten languages of the world, some wag would complain that X language wasn’t represented. Not that they would contribute to making it available, but they have the time to complain that you didn’t.

Comments Off

Processing every Wikipedia article

Filed under: Data Mining,Data Source — Patrick Durusau @ 3:17 pm

Processing every Wikipedia article by Gareth Lloyd.

From the post:

I thought it might be worth writing a quick follow up to the Wikipedia Visualization piece. Being able to parse and process all of Wikipedia’s articles in a reasonable amount of time opens up fantastic opportunities for data mining and analysis. What’s more, it’s easy once you know how.

An alternative method for accessing and parsing Wikipedia data. I probably need to do a separate post on the Visualization post.

Enjoy!

Comments Off

Java Wikipedia Library (JWPL)

Filed under: Data Mining,Java,Software — Patrick Durusau @ 3:16 pm

Java Wikipedia Library (JWPL)

From the post:

Lately, Wikipedia has been recognized as a promising lexical semantic resource. If Wikipedia is to be used for large-scale NLP tasks, efficient programmatic access to the knowledge therein is required.

JWPL (Java Wikipedia Library) is a open-source, Java-based application programming interface that allows to access all information contained in Wikipedia. The high-performance Wikipedia API provides structured access to information nuggets like redirects, categories, articles and link structure. It is described in our LREC 2008 paper.

JWPL contains a Mediawiki Markup parser that can be used to further analyze the contents of a Wikipedia page. The parser can also be used stand-alone with other texts using MediaWiki markup.

Further, JWPL contains the tool JWPLDataMachine that can be used to create JWPL dumps from the publicly available dumps at download.wikimedia.org.

Wikipedia is a resource of growing interest. This toolkit may prove useful in mining it for topic map purposes.

Comments Off

October 20, 2011

The Habit of Change

Filed under: Data Mining — Patrick Durusau @ 4:18 am

The Habit of Change by David Alan Grier appears in the October 2011 issue of Computer.

An interesting account of the differences in how statisticians and computer science data miners view the processing of data. And how new techniques crowd out old ones and little heed is paid to prior results. Not all that undeserved, the lack of heed for prior results, because prior waves of change had the same disdain for their predecessors.

Not that I agree with that disdain, but it is good to be reminded that if we paid a bit more attention to the past, perhaps we could make new mistakes rather that repeating old ones.

Note that we also replace old terminologies with new ones, which makes matching up old mistakes with current ones more difficult.

Apologies for the link above that takes you to a pay-per-view option.

If you like podcasts, try: The Known World: The Habit of Change being read by David Alan Grier.

Comments Off

October 19, 2011

Knime4Bio:…Next Generation Sequencing data with KNIME

Filed under: Bioinformatics,Biomedical,Data Mining — Patrick Durusau @ 3:15 pm

Knime4Bio:…Next Generation Sequencing data with KNIME by # Pierre Lindenbaum, Solena Le Scouarnec, Vincent Portero and Richard Redon.

Abstract:

Analysing large amounts of data generated by next-generation sequencing (NGS) technologies is difficult for researchers or clinicians without computational skills. They are often compelled to delegate this task to computer biologists working with command line utilities. The availability of easy-to-use tools will become essential with the generalisation of NGS in research and diagnosis. It will enable investigators to handle much more of the analysis. Here, we describe Knime4Bio, a set of custom nodes for the KNIME (The Konstanz Information Miner) interactive graphical workbench, for the interpretation of large biological datasets. We demonstrate that this tool can be utilised to quickly retrieve previously published scientific findings.

Code: http://code.google.com/p/knime4bio/

While I applaud the trend towards “easy-to-use” software, I do worry about results that are returned by automated analysis, which of course “must be true.”

I am mindful of the four-year old whose name was on a terrorist watch list and so delayed the departure of a plane. The ground personnel lacked the moral courage or judgement to act on what was clearly a case of mistaken identity.

As “bigdata” grows ever larger, I wonder if “easy” interfaces will really be facile interfaces, that we lack the courage (skill?) to question?

Comments (2)

Rapid-I: Report the Future

Filed under: Analytics,Data Mining,Document Classification,Prediction — Patrick Durusau @ 3:15 pm

Rapid-I: Report the Future

Source of:

RapidMiner: Professional open source data mining made easy.

Analytical ETL, Data Mining, and Predictive Reporting with a single solution

RapidAnalytics: Collaborative data analysis power.

No 1 in open source business analytics

The key product for business critical predictive analysis

RapidDoc: Webbased solution for document retrieval and analysis.

Classify text, identify trends as well as emerging topics

Easy to use and configure

From About Rapid-I:

Rapid-I provides software, solutions, and services in the fields of predictive analytics, data mining, and text mining. The company concentrates on automatic intelligent analyses on a large-scale base, i.e. for large amounts of structured data like database systems and unstructured data like texts. The open-source data mining specialist Rapid-I enables other companies to use leading-edge technologies for data mining and business intelligence. The discovery and leverage of unused business intelligence from existing data enables better informed decisions and allows for process optimization.

The main product of Rapid-I, the data analysis solution RapidMiner is the world-leading open-source system for knowledge discovery and data mining. It is available as a stand-alone application for data analysis and as a data mining engine which can be integrated into own products. By now, thousands of applications of RapidMiner in more than 30 countries give their users a competitive edge. Among the users are well-known companies as Ford, Honda, Nokia, Miele, Philips, IBM, HP, Cisco, Merrill Lynch, BNP Paribas, Bank of America, mobilkom austria, Akzo Nobel, Aureus Pharma, PharmaDM, Cyprotex, Celera, Revere, LexisNexis, Mitre and many medium-sized businesses benefitting from the open-source business model of Rapid-I.

Data mining/analysis is the first part of any topic map project, however large or small. These tools, which I have not (yet) tried, are likely to prove useful in such projects. Comments welcome.

Comments Off

October 18, 2011

rOpenSci

Filed under: Data Mining,Data Source,R — Patrick Durusau @ 2:40 pm

rOpenSci

From the website:

Projects in rOpenSci fall into two categories: those for working with the scientific literature, and those for working directly with the databases. Visit the active development hub of each project on github, where you can see and download source-code, see updates, and follow or join the developer discussions of issues. Most of the packages work through an API provided by the resource (database, paper archive) to access data and bring it within reach of R’s powerful manipulation.

Project started this past summer but has already collected some tutorials and data.

Good opportunity to learn some R as well as talk up the notion of re-using scientific data in new ways. Don’t jump right into recursion of subject identity as it relates to data, data structures and the subject both represent. 😉 YOU MAY THINK THAT, but what you say is: How do you say when you are talking about the same subject across data sets? Would that be useful to you? (Note the strategy of asking the user, not explaining their problem first. The explaining of their problem for them in terms I understand is mostly my strategy so this is a reminder to me to not do that!)

Comments Off

October 15, 2011

RadioVision: FMA Melds w Echo Nest’s Musical Brain

Filed under: Data Mining,Machine Learning,Natural Language Processing — Patrick Durusau @ 4:28 pm

RadioVision: FMA Melds w Echo Nest’s Musical Brain

From the post:

The Echo Nest has indexed the Free Music Archive catalog, integrating the most incredible music intelligence platform with the finest collection of free music.

The Echo Nest has been called “the most important music company on Earth” for good reason: 12 years of research at UC Berkeley, Columbia and MIT factored into the development of their “musical brain.” The platform combines large-scale data mining, natural language processing, acoustic analysis and machine learning to automatically understand how the online world describes every artist, extract musical attributes like tempo and time signature, learn about music trends (see: “hotttnesss“), and a whole lot more. Echo Nest then shares all of this data through a free and open API. [read more here]

Add music to your topic map!

Comments Off

October 8, 2011

Data Mining Research Notes – Wiki

Filed under: Data Mining,Vocabularies — Patrick Durusau @ 8:15 pm

Data Mining Research Notes – Wiki

You can go to the parent resource but I am deliberately pointing to the “wiki” resource page.

It is a collection of terms from data mining with pointers to Wikipedia pages for each one.

While I may quibble with the readability of some of the work at Wikipedia, I must confess to having created no competing explanations for their consideration.

Perhaps that is something that I could use to fill the idle hours. 😉 Seriously, readable explanations of technical material is both an art form and quite welcome by most technical types. It saves them the time of explanations if anything and possibly helps others become interested.

Comments Off

« Newer Posts — Older Posts »

Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

December 14, 2011

December 11, 2011

December 7, 2011

November 28, 2011

November 27, 2011

November 21, 2011

November 14, 2011

November 13, 2011

November 9, 2011

November 6, 2011

November 3, 2011

November 1, 2011

October 28, 2011

October 27, 2011

October 23, 2011

October 22, 2011

October 20, 2011

October 19, 2011

October 18, 2011

October 15, 2011

October 8, 2011