Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

August 14, 2011

KDnuggets

Filed under: Analytics,Conferences,Data Mining,Humor — Patrick Durusau @ 7:13 pm

KDnuggets

A good site to follow for data mining and analytics resources, ranging from conference announcements, data mining sites and forums, and software to crossword puzzles.

See: Analytics Crossword Puzzle 2.

I like that: it has a timer. One that starts automatically.

Maybe topic maps need a crossword puzzle or two. Pointers? Suggestions for content/clues?

Mining Data in Motion

Filed under: Data Mining,Marketing,Topic Maps — Patrick Durusau @ 7:09 pm

Mining Data in Motion by Chris Nott says: “…The scope for business innovation is considerable.”

Or in context:

Mining data in motion. On the face of it, this seems to be a paradox: data in motion is transitory and so can’t be mined. However, this is one of the most powerful concepts for businesses to explore innovative opportunities if they can only release themselves from the constraints of today’s IT thinking.

Currently analytics are focused on data at rest. But exploiting information as it arrives into an organisation can open up new opportunities. This might include influencing customers as they interact based on analytics triggered by web log insight, social media analytics, a real-time view of business operations, or all three. The scope for business innovation is considerable.

The ability to mine this live information in real time is a new field of computer science. The objective is to process information as it arrives, using the knowledge of what has occurred in the past, but the challenge is in organising the data in a way that it is accessible to the analytics, processing a stream of data in motion.
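
The quoted challenge, processing each event once as it arrives while keeping usable knowledge of the past, can be illustrated with a small sketch (plain Python, no particular streaming product; the feed values are invented): a bounded window stands in for the organised history, and each new value is scored against it before the stream moves on.

```python
from collections import deque

class StreamMonitor:
    """Toy 'data in motion' analysis: keep a bounded window of history
    and score each new value against it as it arrives."""

    def __init__(self, window_size=1000):
        self.window = deque(maxlen=window_size)  # knowledge of the recent past

    def observe(self, value):
        window = self.window
        if len(window) >= 10:  # need some history before scoring
            mean = sum(window) / len(window)
            var = sum((x - mean) ** 2 for x in window) / len(window)
            std = var ** 0.5
            unusual = std > 0 and abs(value - mean) > 3 * std
        else:
            unusual = False
        window.append(value)  # the event is processed once and the stream moves on
        return unusual

if __name__ == "__main__":
    monitor = StreamMonitor(window_size=100)
    feed = [10, 11, 9, 10, 12, 11, 10, 9, 11, 10, 10, 95]  # hypothetical live feed
    for event in feed:
        if monitor.observe(event):
            print("unusual event:", event)
```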

Innovation in this context is going to require subject recognition, whether so phrased or not, and collation with other information, some of which may also be from live feeds.

Curious if standards for warranting the reliability of identifications or information in general are going to arise? Suspect there will be explicit liability limitations for information and the effort made to verify it. Free information will likely carry a disclaimer for any use for any purpose. Take your chances.

How reliable the information you are supplied is will depend upon the level of liability you have purchased.

I wonder how an “information warranty” economy would affect information suppliers who now disavow responsibility for their information content. Interesting because businesses would not hire lawyers or accountants who did not take some responsibility for their work. Perhaps there are more opportunities in data mining than just data stream mining.

Perhaps: Topic Maps – How Much Information Certainty Can You Afford?

Information could range from the fun house stuff you see on Fox to full traceability to sources that expands in real time. Depends on what you can afford.

August 8, 2011

Suicide Note Classification…ML Correct 78% of the time.

Filed under: Data Analysis,Data Mining,Machine Learning — Patrick Durusau @ 6:41 pm

Suicide Note Classification Using Natural Language Processing: A Content Analysis

Punch line (for the impatient):

…trainees accurately classified notes 49% of the time, mental health professionals accurately classified notes 63% of the time, and the best machine learning algorithm accurately classified the notes 78% of the time.

Abstract:

Suicide is the second leading cause of death among 25–34 year olds and the third leading cause of death among 15–25 year olds in the United States. In the Emergency Department, where suicidal patients often present, estimating the risk of repeated attempts is generally left to clinical judgment. This paper presents our second attempt to determine the role of computational algorithms in understanding a suicidal patient’s thoughts, as represented by suicide notes. We focus on developing methods of natural language processing that distinguish between genuine and elicited suicide notes. We hypothesize that machine learning algorithms can categorize suicide notes as well as mental health professionals and psychiatric physician trainees do. The data used are comprised of suicide notes from 33 suicide completers and matched to 33 elicited notes from healthy control group members. Eleven mental health professionals and 31 psychiatric trainees were asked to decide if a note was genuine or elicited. Their decisions were compared to nine different machine-learning algorithms. The results indicate that trainees accurately classified notes 49% of the time, mental health professionals accurately classified notes 63% of the time, and the best machine learning algorithm accurately classified the notes 78% of the time.

The researchers concede that the data set is small but apparently it is the only one of its kind.

I mention the study here as a reason to consider using ML techniques in your next topic map project.

Merging the results from different ML algorithms re-creates the original topic maps use case (how do you merge indexes made by different indexers?), but that can’t be helped. More patterns to discover to use as the basis for merging rules!*
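
A minimal sketch of that merging problem, assuming scikit-learn is available and using invented toy notes: three classifiers play the role of three indexers, and their per-document votes are merged by simple majority, which is exactly the point where a merging rule has to be chosen.

```python
from collections import Counter

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: 1 = genuine, 0 = elicited.
texts = ["i am so sorry goodbye", "tell them i loved them",
         "this is just an exercise", "writing what i imagine"]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Three independently trained "indexers".
classifiers = [MultinomialNB(), LogisticRegression(), DecisionTreeClassifier()]
for clf in classifiers:
    clf.fit(X, labels)

new_notes = ["i am sorry, tell them goodbye"]
X_new = vectorizer.transform(new_notes)

# Merging rule: simple majority vote over the classifiers' outputs.
for i, note in enumerate(new_notes):
    votes = Counter(int(clf.predict(X_new[i])[0]) for clf in classifiers)
    label, count = votes.most_common(1)[0]
    print(note, "->", "genuine" if label else "elicited", f"({count}/3 votes)")
```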

PS: I spotted this at Improbable Results: Machines vs. Professionals: Recognizing Suicide Notes.


* I wonder if we could apply the lessons from ensembles of classifiers to a situation where multiple classifiers are used by different projects? One part of me says that an ensemble is developed by a person or group that shares an implicit view of the data and so that makes the ensemble workable.

Another part wants to say that no, the results of classifiers, whether programmed by the same group or different groups, should not make a difference. Well, other than having to “merge” the results of the classifiers, which happens with an ensemble anyway. In that case you might have to think about it more.

Hard to say. Will have to investigate further.

Data Mining: Professor Pier Luca Lanzi, Politecnico di Milano

Filed under: Data Mining,Genetic Algorithms,Machine Learning,Visualization — Patrick Durusau @ 6:27 pm

This post started with my finding the data mining slides at Slideshare (about 4 years old) and, after organizing those, deciding to check Professor Pier Luca Lanzi’s homepage for more recent material. I think you will find the material useful.

Pier Luca Lanzi – homepage

The professor is obviously interested in video games, a rapidly growing area of development and research.

Combining video games with data mining, that would be a real coup.

Data Mining Course page

Data Mining

Includes prior exams, video (2009 course), transparencies from all lectures.

Lecture slides on Data Mining and Machine Learning at Slideshare.

Not being a lemming, I don’t find “most viewed” a helpful sorting criterion.

I organized the data mining slides in course order (as nearly as I could determine, there are two #6 presentations and no #7 or #17 presentations):

00 Course Introduction

01 Data Mining

02 Machine Learning

03 The representation of data

04 Association rule mining

05 Association rules: advanced topics

06 Clustering: Introduction

06 Clustering: Partitioning Methods

08 Clustering: Hierarchical

09 Density-based, Grid-based, and Model-based Clustering

10 Introduction to Classification

11 Decision Trees

12 Classification Rules

13 Nearest Neighbor and Bayesian Classifiers

14 Evaluation

15 Data Exploration and Preparation

16 Classifiers Ensembles

18 Mining Data Streams

19 Text and Web Mining

Genetic Algorithms

Genetic Algorithms Course Notes

August 3, 2011

UK Government Paves Way for Data-Mining

Filed under: Authoring Topic Maps,Data Mining,Marketing — Patrick Durusau @ 7:37 pm

UK Government Paves Way for Data-Mining

Blog report on interesting UK government policy report.

From the post:

The key recommendation is that the Government should press at EU level for the introduction of an exception to current copyright law, allowing “non-consumptive” use of a work (ie a use that doesn’t directly trade on the underlying creative and expressive purpose of the work). In the process of text-mining, copying is only carried out as part of the analysis process – it is a substitute for a human reading the work, and therefore does not compete with the normal exploitation of the work itself – in fact, as the paper says, these processes actually facilitate a work’s exploitation (ie by allowing search, or content recommendation). (emphasis in original)

If you think of topic maps as a value-add on top of information stores, allowing “non-consumptive” access would be a real boon for topic maps.

You could create a topic map into copyrighted material and the user of your topic map could access that material only if, say, they were a subscriber to that content.
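
A sketch of how that gating might look, with entirely invented names and identifiers: the topic map carries only references to the copyrighted content, and an occurrence resolves to the full text only for a subscriber.

```python
# Hypothetical, minimal model: the topic map holds only references
# ("non-consumptive" metadata), never the copyrighted text itself.

ARTICLE_STORE = {  # the publisher's store, behind a paywall
    "doi:10.1000/example-123": "Full text of the copyrighted article ...",
}

topic = {
    "subject": "data mining and copyright",
    "occurrences": [
        {"ref": "doi:10.1000/example-123", "publisher": "ExamplePress"},
    ],
}

def resolve_occurrence(occurrence, user_subscriptions):
    """Return the underlying content only if the user subscribes to its publisher."""
    if occurrence["publisher"] in user_subscriptions:
        return ARTICLE_STORE.get(occurrence["ref"])
    return f"[{occurrence['ref']} - available to {occurrence['publisher']} subscribers]"

print(resolve_occurrence(topic["occurrences"][0], user_subscriptions={"ExamplePress"}))
print(resolve_occurrence(topic["occurrences"][0], user_subscriptions=set()))
```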

As Steve Newcomb has argued on many occasions, topic maps can become economic artifacts in their own right.

July 24, 2011

KNIME Version 2.4.0 released

Filed under: Data Analysis,Data Integration,Data Mining — Patrick Durusau @ 6:45 pm

KNIME Version 2.4.0 released

From the release notice:

We have just released KNIME v2.4, a feature release with a lot of new functionality and some bug fixes. The highlights of this release are:

  • Enhancements around meta node handling (collapse/expand & custom dialogs)
  • Usability improvements (e.g. auto-layout, fast node insertion by double-click)
  • Polished loop execution (e.g. parallel loop execution available from labs)
  • Better PMML processing (added PMML preprocessing, which will also be presented at this year's KDD conference)
  • Many new nodes, including a whole suite of XML processing nodes, cross-tabulation and nodes for data preprocessing and data mining, including ensemble learning methods.

In case you aren’t familiar with KNIME, it is self-described as:

KNIME (Konstanz Information Miner) is a user-friendly and comprehensive open-source data integration, processing, analysis, and exploration platform. From day one, KNIME has been developed using rigorous software engineering practices and is currently being used actively by over 6,000 professionals all over the world, in both industry and academia.

What would you do the same/differently for a topic map interface?

July 21, 2011

Oracle, Sun Burned, and Solr Exposure

Filed under: Data Mining,Database,Facets,Lucene,SQL,Subject Identity — Patrick Durusau @ 6:27 pm

Oracle, Sun Burned, and Solr Exposure

From the post:

Frankly we wondered when Oracle would move off the dime in faceted search. “Faceted search”, in my lingo, is showing users categories. You can fancy up the explanation, but a person looking for a subject may hit a dead end. The “facet” angle displays links to possibly related content. If you want to educate me, use the comments section for this blog, please.

We are always looking for a solution to our clients’ Oracle “findability” woes. It’s not just relevance. Think performance. Query and snack is the operative mode for at least one of our technical baby geese. Well, Oracle is a bit of a red herring. The company is not looking for a solution to SES11g functionality. Lucid Imagination, a company offering enterprise grade enterprise search solutions, is.

If “findability” is an issue at Oracle, I would be willing to bet that subject identity is as well. Rumor has it that they have paying customers.
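
For anyone to whom “showing users categories” sounds abstract, the arithmetic behind a facet display is just grouped counts over the result set; a toy version (plain Python, not Oracle SES or Solr syntax, with invented records) follows.

```python
from collections import Counter

# Hypothetical search results, each with a few structured fields.
results = [
    {"title": "Q3 earnings report", "type": "report",  "year": 2011},
    {"title": "Sales dashboard",    "type": "report",  "year": 2010},
    {"title": "Board minutes",      "type": "minutes", "year": 2011},
    {"title": "Audit findings",     "type": "report",  "year": 2011},
]

def facet_counts(docs, field):
    """Count documents per value of a field -- the numbers behind the
    'related categories' links a faceted interface shows next to results."""
    return Counter(doc[field] for doc in docs)

for field in ("type", "year"):
    print(field, dict(facet_counts(results, field)))

# Clicking a facet just filters the result set on that value:
reports_2011 = [d for d in results if d["type"] == "report" and d["year"] == 2011]
print(len(reports_2011), "reports from 2011")
```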

July 13, 2011

RecordBreaker: Automatic structure for your text-formatted data

Filed under: Data Analysis,Data Mining — Patrick Durusau @ 7:30 pm

RecordBreaker: Automatic structure for your text-formatted data

From the post:

This post was contributed by Michael Cafarella, an assistant professor of computer science at the University of Michigan. Mike’s research interests focus on databases, in particular managing Web data. Before becoming a professor, he was one of the founders of the Nutch and Hadoop projects with Doug Cutting. This first version of RecordBreaker was developed by Mike in conjunction with Cloudera.

RecordBreaker is a project that automatically turns your text-formatted data (logs, sensor readings, etc) into structured data, without any need to write parsers or extractors. In particular, RecordBreaker targets Avro as its output format. The project’s goal is to dramatically reduce the time spent preparing data for analysis, enabling more time for the analysis itself.

Not quite “automatic” but a step in that direction, and a useful one.

Think of “automatic” identification of subjects and associations in such files.

Like the files from campaign financing authorities.
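
RecordBreaker learns formats automatically and targets Avro; as a much smaller illustration of the idea (and explicitly not RecordBreaker’s API), here is a sketch that guesses a column schema for delimiter-separated, campaign-finance-style lines. The sample records are invented.

```python
import re

def guess_type(token):
    """Very small stand-in for format inference: date, number, or string."""
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", token):
        return "date"
    if re.fullmatch(r"-?\d+(\.\d+)?", token):
        return "number"
    return "string"

def infer_schema(lines, delimiter="|"):
    """Split each line into columns and vote on a type per column."""
    columns = {}
    for line in lines:
        for i, token in enumerate(line.strip().split(delimiter)):
            columns.setdefault(i, []).append(guess_type(token))
    # majority type per column
    return [max(set(types), key=types.count) for i, types in sorted(columns.items())]

sample = [
    "2011-07-01|Jane Doe|250.00|Springfield",
    "2011-07-02|John Roe|1000|Capital City",
]
print(infer_schema(sample))   # ['date', 'string', 'number', 'string']
```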

Unstructured data ‘out of control’: survey

Filed under: Data,Data Mining — Patrick Durusau @ 7:28 pm

Unstructured data ‘out of control’: survey

Joe McKendrick writes:

Many organizations are becoming overwhelmed with the volumes of unstructured information — audio, video, graphics, social media messages — that falls outside the purview of their “traditional” databases. Organizations that do get their arms around this data will gain significant competitive edge.

As part of my work with Unisphere Research, a division of Information Today, Inc., I helped conduct a new survey that finds unstructured data is growing at a faster clip than relational data — driving the “Big Data” explosion. Thirty-five percent of respondents say unstructured information has already surpassed or will surpass the volume of traditional relational data in the next 36 months. Sixty-two percent say this is inevitable within the next decade. The survey gathered input from 446 data managers and professionals who are readers of Database Trends and Applications magazine, and was underwritten by MarkLogic.

A majority of survey respondents acknowledge that unstructured information is growing out of control and is driving the big data explosion – 91% say unstructured information already lives in their organizations, but many aren’t sure what to do about it.

I mention this survey because unstructured data has few contenders for the attribution, discovery, and extraction of semantics, so topic maps may find less competition from traditional solutions.

July 9, 2011

Data journalism, data tools, and the newsroom stack

Filed under: Data Analysis,Data Mining — Patrick Durusau @ 7:02 pm

Data journalism, data tools, and the newsroom stack by Alex Howard.

From the post:

MIT’s recent Civic Media Conference and the latest batch of Knight News Challenge winners made one reality crystal clear: as a new era of technology-fueled transparency, innovation and open government dawns, it won’t depend on any single CIO or federal program. It will be driven by a distributed community of media, nonprofits, academics and civic advocates focused on better outcomes, more informed communities and the new news, whatever form it is delivered in.

The themes that unite this class of Knight News Challenge winners were data journalism and platforms for civic connections. Each theme draws from central realities of the information ecosystems of today. Newsrooms and citizens are confronted by unprecedented amounts of data and an expanded number of news sources, including a social web populated by our friends, family and colleagues. Newsrooms, the traditional hosts for information gathering and dissemination, are now part of a flattened environment for news, where news breaks first on social networks, is curated by a combination of professionals and amateurs, and then analyzed and synthesized into contextualized journalism.

Pointers to the newest resources and analysis of the issues of “transparency, innovation and open government….”

Until government transparency becomes public and cumulative, it will be personal and transitory.

Topic maps have the capability to make it the former instead of the latter.

July 7, 2011

Boilerpipe

Filed under: Data Mining,Java — Patrick Durusau @ 4:15 pm

Boilerpipe

From the webpage:

The boilerpipe library provides algorithms to detect and remove the surplus “clutter” (boilerplate, templates) around the main textual content of a web page.

The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings.

Extracting content is very fast (milliseconds), just needs the input document (no global or site-level information required) and is usually quite accurate.

Boilerpipe is a Java library written by Christian Kohlschütter. It is released under the Apache License 2.0.

Should save you some time when harvesting data from webpages.
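
Boilerpipe’s own classifiers are considerably more refined, but the underlying intuition, keep the blocks where text is dense and links are sparse, can be sketched without the library (a toy heuristic in Python, not boilerpipe’s API).

```python
import re

def densest_blocks(html, min_words=20):
    """Toy boilerplate filter: keep blocks with many words and few links."""
    blocks = re.split(r"</?(?:div|p|td|li)[^>]*>", html, flags=re.I)
    kept = []
    for block in blocks:
        links = len(re.findall(r"<a\s", block, flags=re.I))
        text = re.sub(r"<[^>]+>", " ", block)   # strip any remaining tags
        words = text.split()
        if len(words) >= min_words and links <= len(words) / 10:
            kept.append(" ".join(words))
    return kept

html = ("<div><a href='/'>Home</a> <a href='/about'>About</a></div>"
        "<p>" + "Actual article text goes here. " * 10 + "</p>")
print(densest_blocks(html))   # only the article paragraph survives
```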

July 1, 2011

ScraperWiki

Filed under: Data,Data Mining,Data Source,Text Extraction — Patrick Durusau @ 2:49 pm

ScraperWiki

From the About page:

What is ScraperWiki?

There’s lots of useful data on the internet – crime statistics, government spending, missing kittens…

But getting at it isn’t always easy. There’s a table here, a report there, a few web pages, PDFs, spreadsheets… And it can be scattered over thousands of different places on the web, making it hard to see the whole picture and the story behind it. It’s like trying to build something from Lego when someone has hidden the bricks all over town and you have to find them before you can start building!

To get at data, programmers write bits of code called ‘screen scrapers’, which extract the useful bits so they can be reused in other apps, or rummaged through by journalists and researchers. But these bits of code tend to break, get thrown away or forgotten once they have been used, and so the data is lost again. Which is bad.

ScraperWiki is an online tool to make that process simpler and more collaborative. Anyone can write a screen scraper using the online editor. In the free version, the code and data are shared with the world. Because it’s a wiki, other programmers can contribute to and improve the code.

Something to keep an eye on and, whenever possible, to contribute to.

People make data difficult to access for a reason. Let’s disappoint them.
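
The “bits of code called ‘screen scrapers’” are usually nothing more than fetch, parse, and write somewhere reusable; a minimal sketch against a hypothetical statistics page (the URL and table layout are invented) looks like this.

```python
import csv
import urllib.request
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collect the text of every table cell, row by row."""
    def __init__(self):
        super().__init__()
        self.rows, self.row, self.in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self.row:
            self.rows.append(self.row)
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.row.append(data.strip())

url = "http://example.gov/crime-stats.html"   # hypothetical source page
html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
scraper = TableScraper()
scraper.feed(html)

# Write the extracted rows somewhere other apps (or journalists) can reuse them.
with open("crime_stats.csv", "w", newline="") as out:
    csv.writer(out).writerows(scraper.rows)
```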

June 26, 2011

21st-Century Data Miners Meet 19th-Century Electrical Cables

Filed under: Data Mining,Machine Learning — Patrick Durusau @ 4:09 pm

21st-Century Data Miners Meet 19th-Century Electrical Cables by Cynthia Rudin, Rebecca J. Passonneau, Axinia Radeva, Steve Ierome, and Delfina F. Isaac, Computer, June 2011 (vol. 44 no. 6).

As they say, the past is never far behind. In this case, about 5% of the low-voltage cables in Manhattan were installed before 1930. The records of Consolidated Edison (ConEd) on its cabling, and the manholes that access it, vary in form and content and originate in different departments, starting in the 1880’s. Yes, the 1880’s, for those of you who think the 1980’s are ancient history.

From the article:

The text in trouble tickets is very irregular and thus challenging to process in its raw form. There are many spellings of each word–for instance, the term “service box” has at least 38 variations, including SB, S, S/B, S.B, S?B, S/BX, SB/X, S/XB, /SBX, S.BX, S&BX, S?BX, S BX, S/B/X, S BOX, SVBX, SERV BX, SERV-BOX, SERV/BOX, and SERVICE BOX.

Similar difficulties plagued determining the type of event from trouble tickets, etc.

Read the article for the details on how the researchers were successful at showing legacy data can assist in the maintenance of a current electrical grid.

I suspect that “service box” is used by numerous utilities and with similar divergences in its recording. A more general application written as a topic map would preserve all those variations and use them in searching other data records. It is the reuse of user analysis and data that makes them so valuable.
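
A sketch of the difference being pointed at, with invented ticket text: normalization collapses the 38 spellings into one string and discards them, while a topic-map-style subject keeps every recorded name so other data records can still be searched with them.

```python
VARIANTS = {"SB", "S/B", "S.B", "S/BX", "SB/X", "S BX", "SERV BX",
            "SERV-BOX", "SERV/BOX", "SERVICE BOX"}   # a few of the 38 spellings

# Cleanup approach: the variants are gone after normalization.
def normalize(token):
    return "SERVICE BOX" if token.upper() in VARIANTS else token

# Topic-map-style approach: one subject, all recorded names preserved.
service_box_topic = {
    "subject": "service box",
    "names": sorted(VARIANTS),
}

def mentions_subject(ticket_text, topic):
    """True if any recorded name of the subject appears in the ticket."""
    text = ticket_text.upper()
    return any(name in text for name in topic["names"])

print(normalize("SERV BX"))                                              # 'SERVICE BOX'
print(mentions_subject("CUST REPORTS SMOKE FROM S/BX", service_box_topic))  # True
```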

June 16, 2011

Evaluating Text Extraction Algorithms

Filed under: Data Mining,Text Extraction — Patrick Durusau @ 3:42 pm

Evaluating Text Extraction Algorithms

From the post:

Lately I’ve been working on evaluating and comparing algorithms, capable of extracting useful content from arbitrary html documents. Before continuing I encourage you to pass trough some of my previous posts, just to get a better feel of what we’re dealing with; I’ve written a short overview, compiled a list of resources if you want to dig deeper and made a feature wise comparison of related software and APIs.

If you’re not simply creating topic map content, you are mining content from other sources, such as texts, to point to or include in a topic map. A good set of posts on tools and issues surrounding that task.
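
The evaluation side deserves a sketch of its own: bag-of-words precision and recall against a hand-cleaned reference text is the simplest of the measures such comparisons rest on (the strings below are made up).

```python
from collections import Counter

def precision_recall(extracted, reference):
    """Bag-of-words overlap between extracted text and a gold-standard text."""
    ext = Counter(extracted.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((ext & ref).values())
    precision = overlap / sum(ext.values()) if ext else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall

reference = "city council approves new budget after long debate"
extracted = "city council approves new budget after long debate share on facebook"
p, r = precision_recall(extracted, reference)
print(f"precision {p:.2f} recall {r:.2f}")   # boilerplate left in lowers precision
```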

June 12, 2011

U.S. DoD Is Buying. Are You Selling?

Filed under: BigData,Data Analysis,Data Integration,Data Mining — Patrick Durusau @ 4:14 pm

CTOVision.com reports: Big Data is Critical to the DoD Science and Technology Investment Agenda

Of the seven reported priorities:

(1) Data to Decisions – science and applications to reduce the cycle time and manpower requirements for analysis and use of large data sets.

(2) Engineered Resilient Systems – engineering concepts, science, and design tools to protect against malicious compromise of weapon systems and to develop agile manufacturing for trusted and assured defense systems.

(3) Cyber Science and Technology – science and technology for efficient, effective cyber capabilities across the spectrum of joint operations.

(4) Electronic Warfare / Electronic Protection – new concepts and technology to protect systems and extend capabilities across the electro-magnetic spectrum.

(5) Counter Weapons of Mass Destruction (WMD) – advances in DoD’s ability to locate, secure, monitor, tag, track, interdict, eliminate and attribute WMD weapons and materials.

(6) Autonomy – science and technology to achieve autonomous systems that reliably and safely accomplish complex tasks, in all environments.

(7) Human Systems – science and technology to enhance human-machine interfaces to increase productivity and effectiveness across a broad range of missions

I don’t see any where topic maps would be out of place.

Do you?

June 11, 2011

“Human Cognition is Limited”

Filed under: Data Mining — Patrick Durusau @ 12:43 pm

In Data Mining and Open APIs, Toby Segaran offers several reasons why data mining is important, including:

Human Cognition is Limited (slide 7)

We have all seen similar claims in data mining/processing presentations and for the most part they are just background noise until we get to the substance of the presentation.

The substance of this presentation is some useful Python code for several open interfaces and I commend it to your review. But I want to take issue with the notion that “human cognition is limited,” that we blow by so easily.

I suspect the real problem is:

Human Cognition is Unlimited

Any data mining task you can articulate could be performed by underpaid and overworked graduate assistants. The problem is that their minds wander from the task at hand, to preparation for lectures the next day, to reading assignments to that nice piece of work they saw on the way to the lab and other concerns. None of which are distractions that trouble data mining programs or the machines on which they run.

What is really needed is an assistant that acts like a checker with a counter that simply “clicks” as the next body in line passes. Just enough cognition to perform the task at hand.

Since it is difficult to find humans with such limited cognition, we turn to computers to take up the gauntlet.

For example, campaign contributions in the United States are too large a data set for manual processing. While automated processors can dutifully provide totals, etc., they won’t notice, on their own initiative, checks going to Illinois senators and presidential candidates from “Al Capone.” The cognition of data mining programs is bestowed by their creators. That would be us.
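
In that “Al Capone” spirit, the click-counter level of cognition is a few lines of code; the noticing has to be put there explicitly by us (the records and the watch list below are invented).

```python
contributions = [   # hypothetical campaign-finance records
    {"donor": "Jane Q. Public", "amount": 250,  "recipient": "Senator A"},
    {"donor": "Al Capone",      "amount": 5000, "recipient": "Senator B"},
]

# The counting part: no cognition required.
total = sum(c["amount"] for c in contributions)
print("total contributions:", total)

# The "noticing" part: only happens because a person thought to encode it.
watch_list = {"al capone"}
for c in contributions:
    if c["donor"].lower() in watch_list:
        print("worth a second look:", c)
```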

May 29, 2011

Exploring NYT news and its authors

Filed under: Data Analysis,Data Mining,Visualization — Patrick Durusau @ 7:05 pm

Exploring NYT news and its authors

To say this project/visualization is clever is an understatement!

That is a completely inadequate description, but the interface constructs a mythic “single” reporter on any topic you choose from stories in the New York Times. The interface also gives you reporters who wrote stories on that topic. You can then find what “other” stories the mythic one reporter wrote, as well as compare the stories written by actual NYT reporters.

A project of the IBM Center for Social Software, see: NYTWrites: Exploring The New York Times Authorship.

May 22, 2011

reclab

Filed under: Data Mining,Merging,Topic Map Software — Patrick Durusau @ 5:34 pm

reclab

From the website:

If you can’t bring the data to the code, bring the code to the data.

How do we do this? Simple. RecLab solves the intractable problem of supplying real data to researchers by turning it on its head. Rather than attempt the impossible task of bringing sensitive, proprietary retail data to innovative code, RecLab brings the code to the data on live retailing sites. This is done via the RichRelevance cloud environment, a large-scale, distributed environment that is the backbone of the leading dynamic personalization technology solution for the web’s top retailers.

Two things occurred to me while at this site:

1) Does this foreshadow enterprises being able to conduct competitions on analysis/mining/processing (BI) of their data? Rather than buying solutions and then learning the potential of an acquired solution?

2) For topic maps, is this a way to create competition between “merging” algorithms on “sensitive, proprietary” data? After all, it is users who decide whether appropriate “merging” has taken place.

BTW, this site has links to a contest with a $1 million prize. Just in case you are using topic maps to power recommender systems.

Data Science Toolkit

Filed under: Data Mining,Software — Patrick Durusau @ 5:34 pm

Data Science Toolkit by Peter Warden.

An interesting collection of data tools. You can use them online or download them to use locally.

Peter is the author of the Data Source Handbook from O’Reilly.

May 18, 2011

ICON Programming for Humanists, 2nd edition

Filed under: Data Mining,Indexing,Text Analytics,Text Extraction — Patrick Durusau @ 6:50 pm

ICON Programming for Humanists, 2nd edition

From the foreword to the first edition:

This book teaches the principles of Icon in a very task-oriented fashion. Someone commented that if you say “Pass the salt” in correct French in an American university you get an A. If you do the same thing in France you get the salt. There is an attempt to apply this thinking here. The emphasis is on projects which might interest the student of texts and language, and Icon features are instilled incidentally to this. Actual programs are exemplified and analyzed, since by imitation students can come to devise their own projects and programs to fulfill them. A number of the illustrations come naturally enough from the field of Stylistics which is particularly apt for computerized approaches.

I can’t say that the success of ICON is a recommendation for task-oriented teaching but as I recall the first edition, I thought it was effective.

Data mining of texts is an important skill in the construction of topic maps.

This is a very good introduction to that subject.

Balisage 2011 Preliminary Program

Filed under: Conferences,Data Mining,RDF,SPARQL,XPath,XQuery,XSLT — Patrick Durusau @ 6:40 pm

At-A-Glance

Program (in full)

From the announcement (Tommie Usdin):

Topics this year include:

  • multi-ended hypertext links
  • optimizing XSLT and XQuery processing
  • interchange, interoperability, and packaging of XML documents
  • eBooks and epub
  • overlapping markup and related topics
  • visualization
  • encryption
  • data mining

The acronyms this year include:

XML XSLT XQuery XDML REST XForms JSON OSIS XTemp RDF SPARQL XPath

New this year will be:

Lightning talks: an opportunity for participants to say what they think, simply, clearly, and persuasively.

As I have said before, simply the best conference of the year!

Conference site: http://www.balisage.net/

Registration: http://www.balisage.net/registration.html

May 17, 2011

TunedIT

Filed under: Algorithms,Data Mining,Machine Learning — Patrick Durusau @ 2:52 pm

TunedIT: Machine Learning & Data Mining Algorithms. Automated Tests, Repeatable Experiments, Meaningful Results.

There are two parts to the TunedIT site:

TunedIT Research

TunedIT Research is an open platform for reproducible evaluation of machine learning and data mining algorithms. Everyone may use TunedIT tools to launch reproducible experiments and share results with others. Reproducibility is achieved through automation. Datasets and algorithms, as well as experimental results, are collected in central databases: Repository and Knowledge Base, to enable comparison of wide range of algorithms, and to facilitate dissemination of research findings and cooperation between researchers. Everyone may access the contents of TunedIT and contribute new resources and results.

TunedIT Challenge

The TunedIT project was established in 2008 as a free and open experimentation platform for data mining scientists, specialists and programmers. It was extended in 2009 with a framework for online data mining competitions, used initially for laboratory classes at universities. Today, we provide a diverse range of competition types – for didactic, scientific and business purposes.

  • Student Challenge — For closed members groups. Perfectly suited to organize assignments for students attending laboratory classes. Restricted access and visibility, only for members of the group. FREE of charge
  • Scientific Challenge — Open contest for non-commercial purpose. Typically associated with a conference, journal or scientific organization. Concludes with public dissemination of results and winning algorithms. May feature prizes. Fee: FREE or 20%
  • Industrial Challenge — Open contest with commercial purpose. Intellectual Property can be transfered at the end. No requirement for dissemination of solutions. Fee: 30%

This looks like a possible way to generate some publicity about and interest in topic maps.

Suggestions of existing public data sets that would be of interest to a fairly broad audience?

Thinking we are likely to model some common things the same and other common things differently.

Would be interesting to compare results.

May 14, 2011

XMLSH

Filed under: Authoring Topic Maps,Data Mining — Patrick Durusau @ 6:25 pm

XMLSH – Command line shell for processing XML.

Another tool for your data mining/manipulation tool-kit!

May 11, 2011

Data Stream Mining Techniques

Filed under: Data Mining,Data Streams — Patrick Durusau @ 6:56 pm

An analytical framework for data stream mining techniques based on challenges and requirements by Mahnoosh Kholghi and Mohammadreza Keyvanpour.

Abstract:

A growing number of applications that generate massive streams of data need intelligent data processing and online analysis. Real-time surveillance systems, telecommunication systems, sensor networks and other dynamic environments are such examples. The imminent need for turning such data into useful information and knowledge augments the development of systems, algorithms and frameworks that address streaming challenges. The storage, querying and mining of such data sets are highly computationally challenging tasks. Mining data streams is concerned with extracting knowledge structures represented in models and patterns in non stopping streams of information. Generally, two main challenges are designing fast mining methods for data streams and need to promptly detect changing concepts and data distribution because of highly dynamic nature of data streams. The goal of this article is to analyze and classify the application of diverse data mining techniques in different challenges of data stream mining. In this paper, we present the theoretical foundations of data stream analysis and propose an analytical framework for data stream mining techniques.

The paper is an interesting collection of work on mining data streams and its authors should be encouraged to continue their research in this field.

However, the current version is in serious need of editing, both in terms of language usage and organization. For example, it is hard to relate table 2 (data stream mining techniques) to the analytical framework that was the focus of the article.

April 30, 2011

When Data Mining Goes Horribly Wrong

Filed under: Data Mining,Merging,Search Engines — Patrick Durusau @ 10:22 am

In When Data Mining Goes Horribly Wrong, Matthew Hurst brings us a cautionary tale about what can happen when “merging” decisions are made badly.

From the blog:

Consequently, when you see a details page – either on Google, Bing or some other search engine with a local search product – you are seeing information synthesized from multiple sources. Of course, these sources may differ in terms of their quality and, as a result, the values they provide for certain attributes.

When combining data from different sources, decisions have to be made as to firstly when to match (that is to say, assert that the data is about the same real world entity) and secondly how to merge (for example: should you take the phone number found in one source or another?).

This process – the conflation of data – is where you either succeed or fail.
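
The two decisions named in that passage can be made concrete in a few lines (invented records; real matchers are far fuzzier than exact name-and-city equality): first decide whether two records describe the same real-world entity, then let a source-precedence rule decide which value wins.

```python
SOURCE_RANK = {"owner_submitted": 0, "directory_a": 1, "web_crawl": 2}  # lower wins

def same_entity(a, b):
    """Match rule: same normalized name and city. Real systems are fuzzier."""
    key = lambda r: (r["name"].lower().strip(), r["city"].lower().strip())
    return key(a) == key(b)

def merge(records):
    """Merge rule: for each attribute, take the value from the most trusted source."""
    ordered = sorted(records, key=lambda r: SOURCE_RANK[r["source"]])
    merged = {}
    for rec in reversed(ordered):          # least trusted first...
        merged.update({k: v for k, v in rec.items() if k != "source" and v})
    return merged                          # ...so the most trusted overwrites last

a = {"source": "web_crawl", "name": "Joe's Diner", "city": "Springfield",
     "phone": "555-0100"}
b = {"source": "owner_submitted", "name": "Joe's Diner", "city": "Springfield",
     "phone": "555-0199"}
if same_entity(a, b):
    print(merge([a, b]))    # the phone number comes from the owner-submitted record
```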

Read Matthew’s post for encouraging signs that there is plenty of room for the use of topic maps.

What I find particularly amusing is that repair of the merging in this case doesn’t help prevent it from happening again and again.

Not much of a repair if the problem continues to happen elsewhere.

April 29, 2011

Duolingo: The Next Chapter in Human Communication

Duolingo: The Next Chapter in Human Communication

By one of the co-inventors of CAPTCHA and reCAPTCHA, Luis von Ahn, so his arguments should give us pause.

Luis wants to address the problem of translating the web into multiple languages.

Yes, you heard that right, translate the web into multiple languages.

Whatever you think now, watch the video and decide if you still feel the same way.

My question is how to adapt his techniques to subject identification?

April 25, 2011

Inside Horizon: interactive analysis at cloud scale

Filed under: Cloud Computing,Data Analysis,Data Mining — Patrick Durusau @ 3:36 pm

Inside Horizon: interactive analysis at cloud scale

From the website:

Late last year, we were honored to be invited to talk at Reflections|Projections, ACM@UIUC’s annual student-run computing conference. We decided to bring a talk about Horizon, our system for doing aggregate analysis and filtering across very large amounts of data. The video of the talk was posted a few weeks back on the conference website.

Horizon started as a research project / technology demonstrator built as part of Palantir’s Hack Week – a periodic innovation sprint that our engineering team uses to build brand new ideas from whole cloth. It was then used by the Center For Public Integrity in their Who’s Behind The Subprime Meltdown report. We produced a short video on the subject, Beyond the Cloud: Project Horizon, released on our analysis blog. Subsequently, it was folded into our product offering, under the name Object Explorer.

In this hour-long talk, two of the engineers that built this technology tell the story of how Horizon came to be, how it works, and show a live demo of doing analysis on hundreds of millions of records in interactive time.

From the presentation:

Mission statement: Organize the world’s information and make it universally accessible and useful. -> Google’s statement

Which should say:

Organize the world’s [public] information and make it universally accessible and useful.

Palantir’s mission:

Organize the world’s [private] information and make it universally accessible and useful.

Closes on human-driven analysis.

A couple of points:

The demo was of a pre-beta version even though the product version shipped several months prior to the presentation. What’s with that?

Long on general statements and short on any specifics.

Did mention this is a column-store solution. Appears to work well with very clean data, but then what solution doesn’t?

Good emphasis on user interface and interactive responses to queries.

I wonder if the emphasis on interactive responses creates unrealistic expectations among customers?

Or an emphasis on problems that can be solved or appear to be solvable, interactively?

My comments about intelligence community bias the other day for example. You can measure and visualize tweets that originate in Tahrir Square, but if they are mostly from Western media, how meaningful is that?

April 17, 2011

The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition

Filed under: Data Mining,Inference,Prediction,Statistical Learning — Patrick Durusau @ 5:24 pm

The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition

by Trevor Hastie, Robert Tibshirani and Jerome Friedman.

The full pdf of the latest printing is available at this site.

Strongly recommend that if you find the text useful, you ask your library to order the print version.

From the website:

During the past decade there has been an explosion in computation and information technology. With it have come vast amounts of data in a variety of fields such as medicine, biology, finance, and marketing. The challenge of understanding these data has led to the development of new tools in the field of statistics, and spawned new areas such as data mining, machine learning, and bioinformatics. Many of these tools have common underpinnings but are often expressed with different terminology. This book describes the important ideas in these areas in a common conceptual framework. While the approach is statistical, the emphasis is on concepts rather than mathematics. Many examples are given, with a liberal use of color graphics. It should be a valuable resource for statisticians and anyone interested in data mining in science or industry. The book’s coverage is broad, from supervised learning (prediction) to unsupervised learning. The many topics include neural networks, support vector machines, classification trees and boosting--the first comprehensive treatment of this topic in any book.

This major new edition features many topics not covered in the original, including graphical models, random forests, ensemble methods, least angle regression & path algorithms for the lasso, non-negative matrix factorization and spectral clustering. There is also a chapter on methods for “wide” data (p bigger than n), including multiple testing and false discovery rates.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman are professors of statistics at Stanford University. They are prominent researchers in this area: Hastie and Tibshirani developed generalized additive models and wrote a popular book of that title. Hastie wrote much of the statistical modeling software in S-PLUS and invented principal curves and surfaces. Tibshirani proposed the Lasso and is co-author of the very successful An Introduction to the Bootstrap. Friedman is the co-inventor of many data-mining tools including CART, MARS, and projection pursuit.

April 12, 2011

Spreadsheet Data Connector Released

Filed under: Data Mining,Software,Topic Map Software — Patrick Durusau @ 12:02 pm

Spreadsheet Data Connector Released

From the website:

This project contains an abstract layer on top of the Apache POI library. This abstraction layer provides the Spreadsheet Query Language – eXql – and additional methods to access spreadsheets. The current version is designed to support the XLS and XLSX format of Microsoft® Excel® files.

The Spreadsheet Data Connector is well suited for all use cases where you have to access data in Excel sheets and you need a sophisticated language to address and query the data.

Will have to ask when we will see a connector for ODF-based spreadsheets.

April 11, 2011

A Data Parallel toolkit for Information Retrieval

Filed under: Data Mining,Information Retrieval,Search Algorithms,Searching — Patrick Durusau @ 5:53 am

A Data Parallel toolkit for Information Retrieval

From the website:

Many modern information retrieval data analyses need to operate on web-scale data collections. These collections are sufficiently large as to make single-computer implementations impractical, apparently necessitating custom distributed implementations.

Instead, we have implemented a collection of Information Retrieval analyses atop DryadLINQ, a research LINQ provider layer over Dryad, a reliable and scalable computational middleware. Our implementations are relatively simple data parallel adaptations of traditional algorithms, and, due entirely to the scalability of Dryad and DryadLINQ, scale up to very large data sets. The current version of the toolkit, available for download below, has been successfully tested against the ClueWeb corpus.
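
A “relatively simple data parallel adaptation” of a traditional algorithm can be shown at toy scale with the standard library (this is the shape of the approach only, not DryadLINQ; the documents are stand-ins for a web-scale corpus).

```python
from collections import Counter
from multiprocessing import Pool

def count_terms(doc):
    """Map step: per-document term frequencies."""
    return Counter(doc.lower().split())

def merge_counts(partials):
    """Reduce step: combine per-document counts into collection statistics."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    documents = ["data in motion", "mining data streams",
                 "data parallel mining"]              # stand-in corpus
    with Pool() as pool:
        partials = pool.map(count_terms, documents)  # runs across processes
    print(merge_counts(partials).most_common(3))
```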

Are you using large data sets in the construction of your topic maps?

Where large is taken to mean data sets in the range of one billion documents. (http://boston.lti.cs.cmu.edu/Data/clueweb09/)

The authors of this work are attempting to extend access to large data sets to a larger audience.

Did they succeed?

Is their work useful for smaller data sets?

What tools would you add to assist more specifically with topic map construction?

