Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 15, 2012

Linguamatics Puts Big Data Mining on the Cloud

Filed under: Cloud Computing,Data Mining,Medical Informatics — Patrick Durusau @ 8:03 pm

Linguamatics Puts Big Data Mining on the Cloud

From the post:

In response to market demand, Linguamatics is pleased to announce the launch of the first NLP-based, scalable text mining platform on the cloud. Text mining allows users to extract more value from vast amounts of unstructured textual data. The new service builds on the successful launch by Linguamatics last year of I2E OnDemand, the Software-as-a-Service version of Linguamatics’ I2E text mining software. I2E OnDemand proved to be so popular with both small and large organizations that I2E is now fully available as a managed services offering, with the same flexibility in choice of data resources as with the in-house, Enterprise version of I2E. Customers are thus able to benefit from best-of-breed text mining with minimum setup and maintenance costs. Such is the strength of demand for this new service that Linguamatics believes that by 2015, well over 50% of its revenues could be earned from cloud and mobile-based products and services.

Linguamatics is responding to the established trend in industry to move software applications on to the cloud or to externally managed servers run by service providers. This allows a company to concentrate on its core competencies whilst reducing the overhead of managing an application in-house. The new service, called “I2E Managed Services”, is a hosted and managed cloud-based text mining service which includes: a dedicated, secure I2E server with full-time operational support; the MEDLINE document set, updated and indexed regularly; and access to features to enable the creation and tailoring of proprietary indexes. Upgrades to the latest version of I2E happen automatically, as soon as they become available. (emphasis added)

Interesting but not terribly so, until I saw the MEDLINE document set was part of the service.

I single that out as an example of creating a value-add for a service by including a data set of known interest.

You could do a serious value-add for MEDLINE or find a collection that hasn’t been made available to an interested audience. Perhaps one for which you could obtain an exclusive license for some period of time. State/local governments are hurting for money and they have lots of data. Can’t buy it but exclusive licensing isn’t the same as buying, in most jurisdictions. Check with local counsel to be sure.

March 14, 2012

No Honor Among Thieves

Filed under: Ad Targeting,Data Mining — Patrick Durusau @ 7:36 pm

Well, the original title is: “50% of the online ads are never seen,” by Panos Ipeirotis.

About my title: The purpose of ads is to sell you something. Whatever the consequences may be for you. A lesson well taught by US Tobacco, Big Pharma and the corn lobby (think of all the unnatural fructose products in your food).

That said, the post by Panos is a remarkable piece about investigation and data analysis.

From the post:

Almost a year back, I was involved in an advertising fraud case, as part of my involvement with AdSafe Media. (See the related Wall Street Journal story.) Long story short, it was a sophisticated scheme for generating user traffic to websites that were displaying ads to real users but these users could never see these ads, as they were never visible to the user. While we were able to uncover the scheme, what triggered our investigation was almost an accident: our adult-content classifier seemed to detect porn in websites that had absolutely nothing suspicious. While it was a great investigative success, we could not overlook the fact that this was not a systematic method for discovering such attempts for fraud. As part of the effort to make this more systematic, the following idea came up:

Let’s monitor the duration for which a user can actually see an ad?

After a few months of development to get this feature to work, it became possible to measure the exact amount of time an ad was visible to a user. While this feature could easily now detect any fraud attempt that delivers ads to users that never see them, this was now almost secondary. It was the first time that we could monitor the amount of time that users get exposed to ads.

50% of the Ads are (almost) Never Seen.

By measuring the statistics of more than 1.5 billion ad impressions per day, it was possible to understand deeply how different websites perform. Some of the high level results:

  • 38% of the ads are never in view to a user
  • 50% of the ads are in view for less than 0.5 seconds
  • 56% of the ads are in view for less than 5 seconds

Personally, I found these numbers impressive. 50% of the delivered ads are never seen for more than 0.5 seconds! I wanted to check myself whether 0.5 seconds is sufficient to understand the ad. Apparently, the guys at AdSafe thought about that as well, so here is their experiment:

A “pull” advertising model avoids this type of fraud because advertisers deliver directly to pre-qualified consumers. Funds go toward the psycho-sexual manipulation of pre-qualified consumers rather than being scatter-shot across demographics.

If you are tired of wasting money on “push” advertising (with the hazards and dangers of fraud), consider a different model. Consider topic maps.

March 11, 2012

Corpus-Wide Association Studies

Filed under: Corpora,Data Mining,Linguistics — Patrick Durusau @ 8:10 pm

Corpus-Wide Association Studies by Mark Liberman.

From the post:

I’ve spent the past couple of days at GURT 2012, and one of the interesting talks that I’ve heard was Julian Brooke and Sali Tagliamonte, “Hunting the linguistic variable: using computational techniques for data exploration and analysis”. Their abstract (all that’s available of the work so far) explains that:

The selection of an appropriate linguistic variable is typically the first step of a variationist analysis whose ultimate goal is to identify and explain social patterns. In this work, we invert the usual approach, starting with the sociolinguistic metadata associated with a large scale socially stratified corpus, and then testing the utility of computational tools for finding good variables to study. In particular, we use the ‘information gain’ metric included in data mining software to automatically filter a huge set of potential variables, and then apply our own corpus reader software to facilitate further human inspection. Finally, we subject a small set of particularly interesting features to a more traditional variationist analysis.

This type of data-mining for interesting patterns is likely to become a trend in sociolinguistics, as it is in other areas of the social and behavioral sciences, and so it’s worth giving some thought to potential problems as well as opportunities.

If you think about it, the social/behavioral sciences are being applied to the results of data mining of user behavior now. Perhaps you can “catch the wave” early on this cycle of research.
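If “information gain” filtering sounds abstract, here is a minimal sketch of the idea (my own toy code, not Brooke and Tagliamonte's): rank candidate linguistic variables by how much knowing each one reduces uncertainty about a piece of speaker metadata.

```python
# A minimal sketch (not Brooke & Tagliamonte's code) of ranking a candidate
# linguistic variable by information gain against a speaker-metadata label.
# The feature and the rows below are hypothetical.
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """H(label) - H(label | feature) for one candidate variable."""
    base = entropy(labels)
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    conditional = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return base - conditional

# Toy corpus rows: (uses_quotative_like, age_group)
rows = [("yes", "under30"), ("yes", "under30"), ("no", "over50"),
        ("no", "over50"), ("yes", "over50"), ("no", "under30")]
feature = [r[0] for r in rows]
label = [r[1] for r in rows]
print(round(information_gain(feature, label), 3))
```

Run that over a huge set of candidate variables and keep the top scorers for human inspection, and you have the general shape of the filtering step they describe.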

AlchemyAPI

Filed under: AlchemyAPI,Data Mining,Machine Learning — Patrick Durusau @ 8:10 pm

AlchemyAPI

From the documentation:

AlchemyAPI utilizes natural language processing technology and machine learning algorithms to analyze content, extracting semantic meta-data: information about people, places, companies, topics, facts & relationships, authors, languages, and more.

API endpoints are provided for performing content analysis on Internet-accessible web pages, posted HTML or text content.

To use AlchemyAPI, you need an access key. If you do not have an API key, you must first obtain one.

I haven’t used it but it looks like a useful service for information products meant for an end user.
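For the curious, a call against the entity-extraction endpoint looks roughly like the sketch below. Treat the endpoint path, parameter names, and response fields as assumptions drawn from the documentation of the day; check the current docs (and your key) before relying on any of it.

```python
# A hedged sketch of calling AlchemyAPI's entity-extraction endpoint over HTTP.
# Endpoint path, parameter names, and response fields are assumptions; verify
# against the current documentation before use.
import requests

API_KEY = "your-alchemyapi-key"  # obtain one from the AlchemyAPI site
ENDPOINT = "http://access.alchemyapi.com/calls/url/URLGetRankedNamedEntities"

def extract_entities(page_url):
    resp = requests.get(ENDPOINT, params={
        "apikey": API_KEY,
        "url": page_url,
        "outputMode": "json",
    })
    resp.raise_for_status()
    return resp.json().get("entities", [])

for entity in extract_entities("http://www.example.com/article.html"):
    print(entity.get("type"), entity.get("text"), entity.get("relevance"))
```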

Do you use such services? Any others you would suggest?

March 4, 2012

Why Are There So Few Efforts to Text Mine the Open Access Subset of PubMed Central?

Filed under: Data Mining,PubMed — Patrick Durusau @ 7:17 pm

Why Are There So Few Efforts to Text Mine the Open Access Subset of PubMed Central? by Casey Bergman.

From the post:

The open access movement in scientific publishing has two broad aims: (i) to make scientific articles more broadly accessible and (ii) to permit unrestricted re-use of published scientific content. From its humble beginnings in 2001 with only two journals, PubMed Central (PMC) has grown to become the world’s largest repository of full-text open-access biomedical articles, containing nearly 2.4 million biomedical articles that can be freely downloaded by anyone around the world. Thus, while holding only ~11% of the total published biomedical literature, PMC can be viewed clearly as a major success in terms of making the biomedical literature more broadly accessible.

However, I argue that PMC has yet to catalyze similar success on the second goal of the open-access movement — unrestricted re-use of published scientific content. This point became clear to me when writing the discussions for two papers that my lab published last year. In digging around for references to cite, I was struck by how difficult it was to find examples of projects that applied text-mining tools to the entire set of open-access articles from PubMed Central. Unsure if this was a reflection of my ignorance or the actual state of the art in the field, I canvassed the biological text mining community, the bioinformatics community and two major open-access publishers for additional examples of text-mining on the entire open-access subset of PMC.

Surprisingly, I found that after a decade of existence only ~15 articles* have ever been published that have used the entire open-access subset of PMC for text-mining research. In other words, less than 2 research articles per year are being published that actually use the open-access contents of PubMed Central for large-scale data mining or service provision. I find the lack of uptake of PMC by text-mining researchers to be rather astonishing, considering it is an incredibly rich archive of the combined output of thousands of scientists worldwide.

Good question.

Suggestions for answers? (post to the original posting)

BTW, Casey includes a listing of the articles based on mining of the open-access contents of PubMed Central.

What other open access data sets suffer from a lack of use? Comments on why?

March 3, 2012

Populating the Semantic Web…

Filed under: Data Mining,Entity Extraction,Entity Resolution,RDF,Semantic Web — Patrick Durusau @ 7:28 pm

Populating the Semantic Web – Combining Text and Relational Databases as RDF Graphs by Kate Byrne.

I ran across this while looking for RDF graph material today. Delighted to find someone interested in the problem of what we do with existing data, even if new data is in some Semantic Web format.

Abstract:

The Semantic Web promises a way of linking distributed information at a granular level by interconnecting compact data items instead of complete HTML pages. New data is gradually being added to the Semantic Web but there is a need to incorporate existing knowledge. This thesis explores ways to convert a coherent body of information from various structured and unstructured formats into the necessary graph form. The transformation work crosses several currently active disciplines, and there are further research questions that can be addressed once the graph has been built.

Hybrid databases, such as the cultural heritage one used here, consist of structured relational tables associated with free text documents. Access to the data is hampered by complex schemas, confusing terminology and difficulties in searching the text effectively. This thesis describes how hybrid data can be unified by assembly into a graph. A major component task is the conversion of relational database content to RDF. This is an active research field, to which this work contributes by examining weaknesses in some existing methods and proposing alternatives.

The next significant element of the work is an attempt to extract structure automatically from English text using natural language processing methods. The first claim made is that the semantic content of the text documents can be adequately captured as a set of binary relations forming a directed graph. It is shown that the data can then be grounded using existing domain thesauri, by building an upper ontology structure from these. A schema for cultural heritage data is proposed, intended to be generic for that domain and as compact as possible.

Another hypothesis is that use of a graph will assist retrieval. The structure is uniform and very simple, and the graph can be queried even if the predicates (or edge labels) are unknown. Additional benefits of the graph structure are examined, such as using path length between nodes as a measure of relatedness (unavailable in a relational database where there is no equivalent concept of locality), and building information summaries by grouping the attributes of nodes that share predicates.

These claims are tested by comparing queries across the original and the new data structures. The graph must be able to answer correctly queries that the original database dealt with, and should also demonstrate valid answers to queries that could not previously be answered or where the results were incomplete.

This will take some time to read but it looks quite enjoyable.
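As a very rough illustration of the relational-to-RDF step the abstract describes (a generic rdflib sketch, not the method developed in the thesis; the namespace and rows are invented):

```python
# A generic sketch (not the thesis's own method) of the relational-to-RDF step:
# each row becomes a subject URI, each column a predicate. Uses rdflib; the
# namespace, table, and rows here are hypothetical.
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/heritage/")
rows = [
    {"id": "site42", "name": "Skara Brae", "period": "Neolithic"},
    {"id": "site43", "name": "Mousa Broch", "period": "Iron Age"},
]

g = Graph()
g.bind("ex", EX)
for row in rows:
    subject = URIRef(EX[row["id"]])
    for column, value in row.items():
        if column == "id":
            continue
        g.add((subject, URIRef(EX[column]), Literal(value)))

print(g.serialize(format="turtle"))
```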

February 21, 2012

Magic Elephants, Data Psychics, and Invisible Gorillas

Filed under: BigData,Data Mining — Patrick Durusau @ 8:02 pm

Magic Elephants, Data Psychics, and Invisible Gorillas

Jim Harris writes:

A recent Forbes article predicts Big Data will be a $50 billion market by 2017, and Michael Friedenberg recently blogged how the rise of big data is generating buzz about Hadoop (which I call the Magic Elephant): “It certainly looks like the Holy Grail for organizing unstructured data, so it’s no wonder everyone is jumping on this bandwagon. So get ready for Hadoopalooza 2012.”

John Burke recently blogged about the role of big data helping CIOs “figure out how to handle the new, the unusual, and the unexpected as an opportunity to focus more clearly on how to bring new levels of order to their traditional structured data.”

As I have previously blogged, many big data proponents (especially the Big Data Lebowski vendors selling Hadoop solutions) extol its virtues as if big data provides clairvoyant business insight, as if big data was the Data Psychic of the Information Age.

But a recent New York Times article opened with the story of a statistician working for a large retail chain being asked by his marketing colleagues: “If we wanted to figure out if a customer is pregnant, even if she didn’t want us to know, can you do that?” As Eric Siegel of Predictive Analytics World is quoted in the article, “we’re living through a golden age of behavioral research. It’s amazing how much we can figure out about how people think now.”

There are funny moments in this post but the main lesson isn’t humorous.

When reading it, think about how much money your clients are leaving on the table by seeing what they expect to see from search analysis. It could be enough money to make the difference between success and failure, or, perhaps more importantly, in being able to continue with your services.

There is a lot of room (I think) for improvement on the technological side of things but there is just as much if not more to be improved on the human engineering side.

The book Jim mentions, Daniel Kahneman’s Thinking, Fast and Slow is just the emerging tip of the iceberg in terms of research that is directly relevant to both marketing and interfaces.

Suggest you get a copy. Not to read and accept uncritically; it may or may not be right in the details, but the focus is one you cannot afford to ignore.

February 20, 2012

Social Media Application (FBI RFI)

Filed under: Data Mining,RFI-RFP,Social Media — Patrick Durusau @ 8:35 pm

Social Media Application (FBI RFI)

Current Due Date: 11:00 AM, March 13, 2012

You have to read the Social Media Application.pdf document to prepare a response.

Be aware that as of 20 February 2012, that document has a blank page every other page. I suspect it is the complete document but have written to confirm and to request a corrected document be posted.

Out-Hoover Hoover: FBI wants massive data-mining capability for social media does mention:

Nowhere in this detailed RFI, however, does the FBI ask industry to comment on the privacy implications of such massive data collection and storage of social media sites. Nor does the FBI say how it would define the “bad actors” who would be subjected to this type of scrutiny.

I take that to mean that the FBI is not seeking your comments on privacy implications or possible definitions of “bad actors.”

I won’t be able to prepare an official response because I don’t meet the contractor suitability requirements, which include a cost estimate for an offsite server as a solution to the requirements.

I will be going over the requirements and publishing my response here as though I meet the contractor suitability requirements. Could be an interesting exercise.

February 19, 2012

MoleculaRnetworks

Filed under: Data Mining,Graphs,PageRank — Patrick Durusau @ 8:37 pm

MoleculaRnetworks: An integrated graph theoretic and data mining tool to explore solvent organization in molecular simulation by Barbara Logan Mooney, L. René Corrales and Aurora E. Clark.

Abstract:

This work discusses scripts for processing molecular simulations data written using the software package R: A Language and Environment for Statistical Computing. These scripts, named moleculaRnetworks, are intended for the geometric and solvent network analysis of aqueous solutes and can be extended to other H-bonded solvents. New algorithms, several of which are based on graph theory, that interrogate the solvent environment about a solute are presented and described. This includes a novel method for identifying the geometric shape adopted by the solvent in the immediate vicinity of the solute and an exploratory approach for describing H-bonding, both based on the PageRank algorithm of Google search fame. The moleculaRnetworks codes include a preprocessor, which distills simulation trajectories into physicochemical data arrays, and an interactive analysis script that enables statistical, trend, and correlation analysis, and other data mining. The goal of these scripts is to increase access to the wealth of structural and dynamical information that can be obtained from molecular simulations. © 2012 Wiley Periodicals, Inc.

Data mining, graph theory, PageRank, something for everyone in this article!

Not to mention innovative use of PageRank with non-WWW data.

MoleculaRnetworks code.
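The moleculaRnetworks scripts themselves are written in R. For readers who just want to see PageRank applied to a non-WWW graph, here is a small networkx sketch over a made-up hydrogen-bond network:

```python
# Not the moleculaRnetworks code (which is R): a networkx illustration of the
# underlying idea, running PageRank over a made-up hydrogen-bond network to
# rank waters by their structural importance around a solute.
import networkx as nx

hbonds = [("ion", "w1"), ("ion", "w2"), ("ion", "w3"),
          ("w1", "w4"), ("w2", "w4"), ("w3", "w5"), ("w4", "w5")]

G = nx.Graph()
G.add_edges_from(hbonds)

scores = nx.pagerank(G, alpha=0.85)
for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{node}\t{score:.3f}")
```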

Selling Data Mining to Management

Filed under: Data Management,Data Mining,Marketing — Patrick Durusau @ 8:36 pm

Selling Data Mining to Management by Sandro Saitta.

From the post:

Preparing data and building data mining models are two very well documented steps of analytics projects. However, no matter how interesting your results are, they are useless if no action is taken. Thus, the step from analytics to action is a crucial one in any analytics project. Imagine you have the best data and found the best model of all time. You need to industrialize the data mining solution to make your company benefit from them. Often, you will first need to sell your project to the management.

Sandro references three very good articles on pitching data management/mining/analytics to management.

I would rephrase Sandro’s opening line to read: “Preparing data [for a topic map] and building [a topic map] are two very well documented steps of [topic map projects]. However, no matter how interesting your results are, [there is no revenue if no one buys the map].”

OK, maybe I am being generous on the preparing data and building a topic map points but you can see where the argument is going.

And there are successful topic map merchants with active clients, just not enough of either one.

These papers may be the push in the right direction to get more of them.

February 15, 2012

Unstructured data is a myth

Filed under: Data,Data Mining — Patrick Durusau @ 8:33 pm

Unstructured data is a myth by Ram Subramanyam Gopalan.

From the post:

Couldn’t resist that headline! But seriously, if you peel the proverbial onion enough, you will see that the lack of tools to discover / analyze the structure of that data is the truth behind the opaqueness that is implied by calling the data “unstructured”.

This article will give you a firm basis for arguing with casual use of “unstructured” data as a phrase.

One point that stands above the others is that all the so-called “unstructured” data is generated by some process, automated or otherwise. That you may be temporarily ignorant of that process doesn’t mean that the data is “unstructured.” Worth reading, more than once.

February 14, 2012

Scienceography: the study of how science is written

Filed under: Data Mining,Information Retrieval — Patrick Durusau @ 5:05 pm

Scienceography: the study of how science is written by Graham Cormode, S. Muthukrishnan and Jinyun Yun.

Abstract:

Scientific literature has itself been the subject of much scientific study, for a variety of reasons: understanding how results are communicated, how ideas spread, and assessing the influence of areas or individuals. However, most prior work has focused on extracting and analyzing citation and stylistic patterns. In this work, we introduce the notion of ‘scienceography’, which focuses on the writing of science. We provide a first large scale study using data derived from the arXiv e-print repository. Crucially, our data includes the “source code” of scientific papers (the LaTeX source), which enables us to study features not present in the “final product”, such as the tools used and private comments between authors. Our study identifies broad patterns and trends in two example areas, computer science and mathematics, as well as highlighting key differences in the way that science is written in these fields. Finally, we outline future directions to extend the new topic of scienceography.

What content are you searching/indexing in a scientific context?

The authors discover what many of us have overlooked. The “source” of scientific papers. A source that can reflect a richer history than the final product.

Some questions:

Will searching the source give us finer grained access to the content? That is, can we separate portions of text that recite history, related research, and background from new insights/conclusions, accessing the other material only if needed? (Every graph paper starts off with nodes and edges, complete with citations. Anyone reading a graph paper is likely to know those terms.)

Other disciplines use LaTeX. Do those LaTeX files differ from the ones reported here? If so, in what way?
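To make the “source code” point concrete, here is a small sketch (my own, not the authors' pipeline) that pulls two things out of a .tex file that never survive into the PDF: the authors' private comments and the packages they rely on.

```python
# A small sketch (not the authors' pipeline) of mining features that never
# reach the "final product": private comments and the packages a paper uses,
# pulled straight from its LaTeX source. The sample document is invented.
import re

def mine_tex_source(tex):
    comments = [m.group(1).strip()
                for m in re.finditer(r"(?<!\\)%(.*)", tex)]
    packages = re.findall(r"\\usepackage(?:\[[^\]]*\])?\{([^}]+)\}", tex)
    return {"comments": comments, "packages": packages}

sample = r"""
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage{amsmath, tikz}
% TODO: tighten the bound in Lemma 3 before submission
\begin{document}Hello.\end{document}
"""
print(mine_tex_source(sample))
```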

February 6, 2012

Wikimeta Project’s Evolution…

Filed under: Annotation,Data Mining,Semantic Annotation,Semantic Web — Patrick Durusau @ 6:58 pm

Wikimeta Project’s Evolution Includes Commercial Ambitions and Focus On Text-Mining, Semantic Annotation Robustness by Jennifer Zaino.

From the post:

Wikimeta, the semantic tagging and annotation architecture for incorporating semantic knowledge within documents, websites, content management systems, blogs and applications, this month is incorporating itself as a company called Wikimeta Technologies. Wikimeta, which has a heritage linked with the NLGbAse project, last year was provided as its own web service.

The Semantic Web Blog interviews Dr. Eric Charton about Wikimeta and its future plans.

More interesting than the average interview piece. I have a weakness for academic projects and Wikimeta certainly has the credentials in that regard.

On the other hand, when I read statements like:

So when we said Wikimeta makes over 94 percent of good semantic annotation in the three first ranked suggested annotations, this is tested, evaluated, published, peer-reviewed and reproducible by third parties.

I have to wonder what standard for “…good semantic annotation…” was in play and for what application would 94 percent be acceptable?

Annotation of nuclear power plant documentation? Drug interaction documentation? Jet engine repair manual? Chemical reaction warning on product? None of those sound like 94% right situations.

That isn’t a criticism of this project but of the notion that “correctness” of semantic annotation can be measured separate and apart from some particular use case.

It could be the case that 94% correct is more than you need if we are talking about the content of Access Hollywood.

And your particular use case may lie somewhere in between those two extremes.

Do read the interview as this sounds like it will be an interesting project, whatever your thoughts on “correctness.”

February 4, 2012

Hacking Chess: Data Munging

Filed under: Data Mining,MongoDB — Patrick Durusau @ 3:36 pm

Hacking Chess: Data Munging

Kristina Chodorow specifies a conversion from Portable Game Notation (PGN) to JSON, for loading the chess games into MongoDB.

Useful for her Hacking Chess with the MongoDB Pipeline post.

Addressing data in situ would be more robust but conversion is far more common.
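For a sense of how small this kind of munging can be, here is a stdlib-only sketch (not Kristina's script) that turns one game's PGN tag pairs into a dict ready to dump as JSON or hand to pymongo's insert_one; the game record is a toy.

```python
# A minimal sketch of the same kind of munging (not Kristina's script): parse
# the PGN tag pairs of a toy game into a dict, ready to dump as JSON or insert
# into MongoDB (e.g. with pymongo's insert_one).
import json
import re

pgn_game = """[Event "Example Open"]
[White "Player, One"]
[Black "Player, Two"]
[Result "1-0"]

1. d4 Nf6 2. c4 g6 3. Nc3 Bg7 1-0
"""

def pgn_to_doc(pgn):
    doc = dict(re.findall(r'\[(\w+) "([^"]*)"\]', pgn))
    # everything after the blank line is the movetext
    doc["moves"] = pgn.split("\n\n", 1)[1].strip()
    return doc

print(json.dumps(pgn_to_doc(pgn_game), indent=2))
```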

When I get around to outlining a topic map book, I will have to include a chapter on data conversion techniques.

February 3, 2012

MINE: Maximal Information-based NonParametric Exploration

Filed under: Data Mining,R — Patrick Durusau @ 5:03 pm

MINE: Maximal Information-based NonParametric Exploration

From the post:

There was a lot of buzz in the blogosphere as well as the science community about a new family of algorithms that are able to find non-linear relationships over extremely large fields of data. What makes it particularly useful is that the measure(s) it uses are based upon mutual information rather than standard Pearson’s correlation type measures, which do not capture non-linear relationships well.

The (java based) software can be downloaded here: http://www.exploredata.net/ In addition, there is the capability to directly run the software from R.

A data exploration tool for your weekend enjoyment!
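If you want a quick sense of why mutual-information measures matter here, the sketch below (plain scikit-learn, nothing to do with the MINE/Java code) shows Pearson correlation missing a y = x² relationship that a mutual-information estimate picks up.

```python
# Not the MINE/Java code: just a quick illustration of why mutual-information
# style measures pick up non-linear relationships that Pearson correlation
# misses. y = x^2 on a symmetric interval has near-zero Pearson correlation.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 2000)
y = x ** 2 + rng.normal(0, 0.05, 2000)

pearson = np.corrcoef(x, y)[0, 1]
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]

print(f"Pearson r  : {pearson:+.3f}")   # close to zero
print(f"Mutual info: {mi:.3f}")         # clearly positive
```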

January 23, 2012

Mining Text Data

Filed under: Classification,Data Mining,Text Extraction — Patrick Durusau @ 7:46 pm

Mining Text Data by Charu Aggarwal and ChengXiang Zhai, Springer, February 2012, approximately 500 pages.

From the publisher’s description:

Text mining applications have experienced tremendous advances because of web 2.0 and social networking applications. Recent advances in hardware and software technology have led to a number of unique scenarios where text mining algorithms are learned.

Mining Text Data introduces an important niche in the text analytics field, and is an edited volume contributed by leading international researchers and practitioners focused on social networks & data mining. This book contains a wide swath of topics across social networks & data mining. Each chapter contains a comprehensive survey including the key research content on the topic, and the future directions of research in the field. There is a special focus on Text Embedded with Heterogeneous and Multimedia Data which makes the mining process much more challenging. A number of methods have been designed such as transfer learning and cross-lingual mining for such cases.

Mining Text Data simplifies the content, so that advanced-level students, practitioners and researchers in computer science can benefit from this book. Academic and corporate libraries, as well as ACM, IEEE, and Management Science focused on information security, electronic commerce, databases, data mining, machine learning, and statistics are the primary buyers for this reference book.

Not at the publisher’s site but you can see the Table of Contents and chapter 4, A SURVEY OF TEXT CLUSTERING ALGORITHMS and chapter 6, A SURVEY OF TEXT CLASSIFICATION ALGORITHMS at: www.charuaggarwal.net/text-content.pdf.

The two chapters you can download from Aggarwal’s website will give you a good idea of what to expect from the text.

While an excellent survey work, with chapters written by experts in various sub-fields, it also suffers from the survey work format.

For example, for the two sample chapters, there are overlaps in the bibliographies for both chapters. Not surprising given the closely related subject matter, but as a reader I would be interested in discovering which works are cited in both chapters, something that is possible, given the back-of-the-chapter bibliography format, only by repetitive manual inspection.

Although I rail against examples in standards, expanding the survey reference work format to include more details and examples would only increase its usefulness and possibly its life as a valued reference.

Which raises the question of having a print format for survey works at all. The research landscape is changing quickly and a shelf life of 2 to 3 years, if that long, seems a bit brief for the going rate for print editions. Printed versions of chapters as smaller and more timely works on demand, that is a value-add proposition that Springer is in a unique position to bring to its customers.

January 20, 2012

Exploring News

Filed under: Data Mining,News — Patrick Durusau @ 9:22 pm

Exploring News by Matthew Hurst.

From the post:

In experimenting with news aggregation and mining on the d8taplex site, I’ve come up with the following questions:

  • Why are some news articles picked up and others not? News sources such as Reuters create articles that are either directly consumed or which are picked up by other publications and passed along.
  • Who are these people writing these articles? What are their interests, areas of expertise and personalities?
  • What is the role of the editor and how do they influence the selection and form of the content produced by the news machine?
The next round of experimentation with news aggregation has resulted in the current new site. It has the following features.

    Drop by and give Matthew a hand.

    Story selection has many factors. Which ones do you think are important?

    January 14, 2012

    Faster reading through math

    Filed under: Data Mining,Natural Language Processing,Searching — Patrick Durusau @ 7:39 pm

    Faster reading through math

    John Johnson writes:

    Let’s face it, there is a lot of content on the web, and one thing I hate worse is reading halfway through an article and realizing that the title and first paragraph indicate little about the rest of the article. In effect, I check out the quick content first (usually after a link), and am disappointed.

    My strategy now is to use automatic summaries, which are now a lot more accessible than they used to be. The algorithm has been around since 1958 (!) by H. P. Luhn and is described in books such as Mining the Social Web by Matthew Russell (where a Python implementation is given). With a little work, you can create a program that scrapes text from a blog, provides short and long summaries, and links to the original post, and packages it up in a neat HTML page.

    Or you can use the cute interface in Safari, if you care to switch.
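For the curious, a Luhn-style summarizer really is small. Here is a compact frequency-based sketch in the same spirit (not the Mining the Social Web implementation): score sentences by how frequent their non-stopword terms are and keep the top few.

```python
# A compact, frequency-based extractive summary in the spirit of Luhn's 1958
# approach (not the Mining the Social Web implementation): score sentences by
# the frequency of their non-stopword terms and keep the highest scoring ones.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "that", "it", "for"}

def summarize(text, max_sentences=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)

    def score(sentence):
        terms = [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]
        return sum(freq[w] for w in terms) / (len(terms) or 1)

    ranked = sorted(sentences, key=score, reverse=True)[:max_sentences]
    return [s for s in sentences if s in ranked]   # keep original order

article = ("Text mining allows users to extract value from unstructured text. "
           "Summarization picks the sentences that best represent a document. "
           "The weather was pleasant on the day the paper was written.")
print(" ".join(summarize(article)))
```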

    The woes of ambiguity!

    I jumped to John’s post thinking it had some clever way to read math faster. 😉 Some of the articles I am reading take a lot longer than others. I have one on homology that I am just getting comfortable enough with to post about it.

    Any reading assistant tools that you would care to recommend?

Of particular interest would be software that I could feed a list of URLs that resolve to PDF files (possibly with authentication, although I could log in to start it off) and it produces a single HTML page summary.

    January 11, 2012

    New Techniques Turbo-Charge Data Mining

    Filed under: Data Mining,Spectral Feature Selection,Spectral Graph Theory — Patrick Durusau @ 8:08 pm

    New Techniques Turbo-Charge Data Mining by Nicole Hemsoth.

    From the post:

    While the phrase “spectral feature selection” may sound cryptic (if not ghostly) this concept is finding a welcome home in the realm of high performance data mining.

    We talked with an expert in the spectral feature selection for data mining arena, Zheng Zhao from the SAS Institute, about how trends like this, as well as a host of other new developments, are reshaping data mining for both researchers and industry users.

    Zhao says that when it comes to major trends in data mining, cloud and Hadoop represent the key to the future. These developments, he says, offer the high performance data mining tools required to tackle the types of large-scale problems that are becoming more prevalent.

    In an interview this week, Zhao predicted that over the next few years, large-scale analytics will be at the forefront of both academic research and industry R&D efforts. On one side, industry has strong requirements for new techniques, software and hardware for solving their real problems at the large scale, while on the other hand, academics find this to be an area laden with interesting new challenges to pursue.

    For more details, you may want to see our earlier posts:

    Spectral Feature Selection for Data Mining

    Spectral Graph Theory
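To give “spectral feature selection” some shape: one of the best-known criteria in this family is the Laplacian Score (He, Cai and Niyogi), which ranks features by how well they respect a sample-affinity graph. Below is a plain-numpy sketch, not Zhao's SAS implementation; smaller scores are better.

```python
# Spectral feature selection in miniature: the Laplacian Score (He, Cai &
# Niyogi) ranks features by how well they respect the sample-affinity graph.
# A plain-numpy sketch with toy data; smaller scores are better.
import numpy as np

def laplacian_scores(X, sigma=1.0):
    n = X.shape[0]
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    S = np.exp(-sq_dists / (2 * sigma ** 2))   # affinity graph over samples
    D = np.diag(S.sum(axis=1))
    L = D - S                                  # graph Laplacian
    ones = np.ones(n)
    scores = []
    for r in range(X.shape[1]):
        f = X[:, r]
        f_tilde = f - (f @ D @ ones) / (ones @ D @ ones) * ones
        scores.append((f_tilde @ L @ f_tilde) / (f_tilde @ D @ f_tilde))
    return np.array(scores)

rng = np.random.default_rng(1)
cluster = np.repeat([0.0, 5.0], 50)
informative = cluster + rng.normal(0, 0.3, 100)   # tracks the cluster structure
noise = rng.normal(0, 1.0, 100)                   # ignores it
print(laplacian_scores(np.column_stack([informative, noise])))
```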

    January 5, 2012

    Digging into Data Challenge

    Filed under: Archives,Contest,Data Mining,Library,Preservation — Patrick Durusau @ 4:09 pm

    Digging into Data Challenge

    From the homepage:

    What is the “challenge” we speak of? The idea behind the Digging into Data Challenge is to address how “big data” changes the research landscape for the humanities and social sciences. Now that we have massive databases of materials used by scholars in the humanities and social sciences — ranging from digitized books, newspapers, and music to transactional data like web searches, sensor data or cell phone records — what new, computationally-based research methods might we apply? As the world becomes increasingly digital, new techniques will be needed to search, analyze, and understand these everyday materials. Digging into Data challenges the research community to help create the new research infrastructure for 21st century scholarship.

    Winners for Round 2, some 14 projects out of 67, were announced on 3 January 2012.

Interested to hear your comments on the projects, as I am sure the project teams would be as well.

    January 3, 2012

    Mining Massive Data Sets – Update

    Filed under: BigData,Data Analysis,Data Mining,Dataset — Patrick Durusau @ 5:03 pm

    Mining Massive Data Sets by Anand Rajaraman and Jeff Ullman.

    Update of Mining of Massive Datasets – eBook.

The hard copy has been published by Cambridge University Press.

The electronic version remains available for download. (Hint: I suggest all of us who can should buy a hard copy to encourage this sort of publisher behavior.)

    Homework system for both instructors and self-guided study is available at this page.

    While I wait for a hard copy to arrive, I have downloaded the PDF version.

    January 1, 2012

    Zorba: The Most Complete XQuery Processor

    Filed under: Data Mining,XQuery — Patrick Durusau @ 5:57 pm

    Zorba: The Most Complete XQuery Processor

    From the homepage:

    All Flavors Available

    General purpose XQuery processor – written in C++.

Complete support for the W3C family of specifications: XPath, XQuery, Update, Scripting, Full-Text, XSLT, XQueryX, and more.

    Pluggable Store

    Seamlessly process XML data stored in different places.

    Main memory, mobile devices, browsers, disk-based, or cloud-based stores.

    Developer Friendly Tools

    Benefit from a rich ecosystem of tools.

    Eclipse plugins, command-line interface, and debugger.

    Rich Module Library

    Web mashups, cryptography, image processing, geo projections, emails, data cleaning… there is a module for that.

    Runs Everywhere

    Available on Windows, Linux, and Mac OS.

    Bindings available for 6 Programming Languages: C++, C, PHP, Ruby, Java and Python.

    Fun & Productive

XQuery unifies development for all tiers: database, content management, application logic, and presentation.

    I started to mention this under the Cutting Edge Data Processing with PHP & XQuery post (which uses Zorba) but XQuery is important enough to list it separately.

    In the draft Topic Map Tool Chain, I would put this under mining/analysis, but as was pointed out in comments, the mining/analysis phase can be informed by an ontology.

I would say “explicitly” informed by an ontology since there is always some ontology in play, whether explicit or not. (Formal ontologists, note the small “o” in ontology. An explicit ontology would have a name and be written <NAME> Ontology.)

    December 29, 2011

    Web scraping with Python – the dark side of data

    Filed under: Data,Data Mining,Python,Web Scrapers — Patrick Durusau @ 9:14 pm

    Web scraping with Python – the dark side of data

    From the post:

    In searching for some information on web-scrapers, I found a great presentation given at Pycon in 2010 by Asheesh Laroia. I thought this might be a valuable resource for R users who are looking for ways to gather data from user-unfriendly websites.

    “..user-unfriendly websites.”? What about “user-hostile websites?” 😉

    Looks like a good presentation up to “user-unfriendly.”

It will be useful for anyone who needs data from sites that are not configured to deliver it properly (that is, to users).
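A minimal example of the kind of scraping the presentation covers, using requests and BeautifulSoup (the URL and markup are hypothetical, and you should check a site's terms and robots.txt first):

```python
# A minimal scraping sketch: fetch a page and pull a table out of it with
# requests + BeautifulSoup. The URL and markup here are hypothetical.
import requests
from bs4 import BeautifulSoup

resp = requests.get("http://www.example.gov/spending-report.html", timeout=30)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
rows = []
for tr in soup.select("table tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all(["th", "td"])]
    if cells:
        rows.append(cells)

for row in rows[:10]:
    print(row)
```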

    I suppose “user-hostile” would fall under some prohibited activity.

Would make a great title for a book: “Penetration and Mapping of Hostile Hosts.” Could map vulnerable hosts with their exploits as a network graph.

    Spectral Feature Selection for Data Mining

    Filed under: Data Mining,Spectral Feature Selection — Patrick Durusau @ 9:13 pm

    Spectral Feature Selection for Data Mining by Zheng Alan Zhao and Huan Liu.

    I did not find the publisher’s description all that helpful.

    You may want to review:

    The supplemental page maintained by the authors, Spectral Feature Selection for Data Mining. There you will also find source code by chapter in Matlab format and some other materials.

    Earlier work by the authors, see:

Spectral feature selection for supervised and unsupervised learning (2007) by Zheng Zhao, Huan Liu.

    Slow going but the early work appears to hold a great deal of promise.

    If you have or get a copy of the book, please forward or point to your comments.

    December 26, 2011

    Pattern

    Filed under: Data Mining,Python — Patrick Durusau @ 8:21 pm

    Pattern

    From the webpage:

    Pattern is a web mining module for the Python programming language.

    It bundles tools for data retrieval (Google + Twitter + Wikipedia API, web spider, HTML DOM parser), text analysis (rule-based shallow parser, WordNet interface, syntactical + semantical n-gram search algorithm, tf-idf + cosine similarity + LSA metrics) and data visualization (graph networks).

    The module is bundled with 30+ example scripts.
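Pattern bundles tf-idf and cosine-similarity metrics; if those are unfamiliar, this is what they do, sketched with scikit-learn rather than Pattern's own API:

```python
# What tf-idf + cosine similarity buys you, sketched with scikit-learn rather
# than Pattern's own API: near-duplicate documents score high, unrelated ones low.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "graph databases store nodes and edges",
    "nodes and edges make up a property graph",
    "the recipe calls for two cups of flour",
]
tfidf = TfidfVectorizer().fit_transform(docs)
print(cosine_similarity(tfidf).round(2))   # docs 0 and 1 score high, doc 2 low
```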

    Consider it to be a late stocking stuffer. 😉

    December 21, 2011

    Thoughts on ICDM (the IEEE conference on Data Mining)

    Filed under: Data Mining,Graphs,Social Networks — Patrick Durusau @ 7:24 pm

    Thoughts on ICDM I: Negative Results (part A) by Suresh Venkatasubramanian.

    From (part A):

    I just got back from ICDM (the IEEE conference on Data Mining). Data mining conferences are quite different from theory conferences (and much more similar to ML or DB conferences): there are numerous satellite events (workshops, tutorials and panels in this case), many more people (551 for ICDM, and that’s on the smaller side), and a wide variety of papers that range from SODA-ish results to user studies and industrial case studies.

    While your typical data mining paper is still a string of techniques cobbled together without rhyme or reason (anyone for spectral manifold-based correlation clustering with outliers using MapReduce?), there are some general themes that might be of interest to an outside viewer. What I’d like to highlight here is a trend (that I hope grows) in negative results.

    It’s not particularly hard to invent a new method for doing data mining. It’s much harder to show why certain methods will fail, or why certain models don’t make sense. But in my view, the latter is exactly what the field needs in order to give it a strong inferential foundation to build on (I’ll note here that I’m talking specifically about data mining, NOT machine learning – the difference between the two is left for another post).

    From (part B):

    Continuing where I left off on the idea of negative results in data mining, there was a beautiful paper at ICDM 2011 on the use of Stochastic Kronecker graphs to model social networks. And in this case, the key result of the paper came from theory, so stay tuned !

    One of the problems that bedevils research in social networking is the lack of good graph models. Ideally, one would like a random graph model that evolves into structures that look like social networks. Having such a graph model is nice because

    • you can target your algorithms to graphs that look like this, hopefully making them more efficient
    • You can re-express an actual social network as a set of parameters to a graph model: it compacts the graph, and also gives you a better way of understanding different kinds of social networks: Twitter is a (0.8, 1, 2.5) and Facebook is a (1, 0.1, 0.5), and so on.
    • If you’re lucky, the model describes not just reality, but how it forms. In other words, the model captures the actual social processes that lead to the formation of a social network. This last one is of great interest to sociologists.

    But there aren’t even good graph models that capture known properties of social networks. For example, the classic Erdos-Renyi (ER) model of a random graph doesn’t have the heavy-tailed degree distribution that’s common in social networks. It also doesn’t have a property that’s common to large social networks: densification, or the fact that even as the network grows, the diameter stays small (implying that the network seems to get denser over time).

    Part C – forthcoming –

    I am perhaps more sceptical of modeling than the author but this is a very readable and interesting set of blog posts. I will be posting Part C as soon as it appears.

    Update: Thoughts on ICDM I: Negative results (part C)

    From Part C:

    If you come up with a better way of doing classification (for now let’s just consider classification, but these remarks apply to clustering and other tasks as well), you have to compare it to prior methods to see which works better. (note: this is a tricky problem in clustering that my student Parasaran Raman has been working on: more on that later.).

The obvious way to compare two classification methods is how well they do compared to some ground truth (i.e. labelled data), but this is a one-parameter system, because by changing the threshold of the classifier (or if you like, translating the hyperplane around), you can change the false positive and false negative rates.

    Now the more smug folks reading these are waiting with ‘ROC’ and “AUC” at the tip of their tongues, and they’d be right ! You can plot a curve of the false positive vs false negative rate and take the area under the curve (AUC) as a measure of the effectiveness of the classifier.

For example, if the y axis measured increasing false negatives, and the x-axis measured increasing false positives, you’d want a curve that looked like an L with the apex at the origin, and a random classifier would look like the line x+y = 1. The AUC score would be zero for the good classifier and 0.5 for the bad one (there are ways of scaling this to be between 0 and 1).

    The AUC is a popular way of comparing methods in order to balance the different error rates. It’s also attractive because it’s parameter-free and is objective: seemingly providing a neutral method for comparing classifiers independent of data sets, cost measures and so on.
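Note that the quoted passage describes a false-negative vs. false-positive plot where a perfect classifier scores 0; the more common convention plots true-positive rate against false-positive rate, where perfect is 1.0 and random sits near 0.5. A quick scikit-learn check of that convention:

```python
# The standard convention: AUC of TPR-vs-FPR, where an informative scorer
# approaches 1.0 and a random one hovers around 0.5. Toy data only.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 1000)

good_scores = y_true + rng.normal(0, 0.3, 1000)   # tracks the labels
random_scores = rng.normal(0, 1.0, 1000)          # ignores them

print("informative:", round(roc_auc_score(y_true, good_scores), 3))    # ~1.0
print("random     :", round(roc_auc_score(y_true, random_scores), 3))  # ~0.5
```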

    But is it ?

    December 19, 2011

    OSCAR4

    Filed under: Cheminformatics,Data Mining — Patrick Durusau @ 8:11 pm

    OSCAR4 Launch

    From the webpage:

    OSCAR (Open Source Chemistry Analysis Routines) is an open source extensible system for the automated annotation of chemistry in scientific articles. It can be used to identify chemical names, reaction names, ontology terms, enzymes and chemical prefixes and adjectives. In addition, where possible, any chemical names detected will be annotated with structures derived either by lookup, or name-to-structure parsing using OPSIN[1] or with identifiers from the ChEBI (`Chemical Entities of Biological Interest’) ontology.

The current version of OSCAR, OSCAR4, focuses on providing a core library that facilitates integration with other tools. Its simple-to-use API is modularised to promote extension into other domains and allows for its use within workflow systems like Taverna[2] and U-Compare [3].

We will be hosting a launch on the 13th of April to discuss the new architecture as well as demonstrate some applications that use OSCAR. Tutorial sessions on how to use the new API will also be provided.

    Archived videos from the launch are now online: http://sms.cam.ac.uk/collection/1130934

Just to put this into a topic map context, imagine that the annotation in question places a term in an association with mappings to other data, data held by your employer and leased to researchers.

    December 17, 2011

    IBM Redbooks Reveals Content Analytics

    Filed under: Analytics,Data Mining,Entity Extraction,Text Analytics — Patrick Durusau @ 6:31 am

    IBM Redbooks Reveals Content Analytics

    From Beyond Search:

    IBM Redbooks has put out some juicy reading for the azure chip consultants wanting to get smart quickly with IBM Content Analytics Version 2.2: Discovering Actionable Insight from Your Content. The sixteen chapters of this book take the reader from an overview of IBM content analytics, through understanding the details, to troubleshooting tips. The above link provides an abstract of the book, as well as links to download it as a PDF, view in HTML/Java, or order a hardcopy.

    Abstract:

    With IBM® Content Analytics Version 2.2, you can unlock the value of unstructured content and gain new business insight. IBM Content Analytics Version 2.2 provides a robust interface for exploratory analytics of unstructured content. It empowers a new class of analytical applications that use this content. Through content analysis, IBM Content Analytics provides enterprises with tools to better identify new revenue opportunities, improve customer satisfaction, and provide early problem detection.

    To help you achieve the most from your unstructured content, this IBM Redbooks® publication provides in-depth information about Content Analytics. This book examines the power and capabilities of Content Analytics, explores how it works, and explains how to design, prepare, install, configure, and use it to discover actionable business insights.

    This book explains how to use the automatic text classification capability, from the IBM Classification Module, with Content Analytics. It explains how to use the LanguageWare® Resource Workbench to create custom annotators. It also explains how to work with the IBM Content Assessment offering to timely decommission obsolete and unnecessary content while preserving and using content that has business value.

    The target audience of this book is decision makers, business users, and IT architects and specialists who want to understand and use their enterprise content to improve and enhance their business operations. It is also intended as a technical guide for use with the online information center to configure and perform content analysis with Content Analytics.

    The cover article points out the Redbooks have an IBM slant, which isn’t surprising. When you need big iron for an enterprise project, that IBM is one of a handful of possible players isn’t surprising either.

    December 16, 2011

    Detecting Novel Associations in Large Data Sets

    Filed under: Bioinformatics,Data Mining,Statistics — Patrick Durusau @ 8:23 am

    Detecting Novel Associations in Large Data Sets by David N. Reshef, Yakir A. Reshef, Hilary K. Finucane, Sharon R. Grossman, Gilean McVean, Peter J. Turnbaugh, Eric S. Lander, Michael Mitzenmacher, Pardis C. Sabeti.

    Abstract:

    Identifying interesting relationships between pairs of variables in large data sets is increasingly important. Here, we present a measure of dependence for two-variable relationships: the maximal information coefficient (MIC). MIC captures a wide range of associations both functional and not, and for functional relationships provides a score that roughly equals the coefficient of determination (R2) of the data relative to the regression function. MIC belongs to a larger class of maximal information-based nonparametric exploration (MINE) statistics for identifying and classifying relationships. We apply MIC and MINE to data sets in global health, gene expression, major-league baseball, and the human gut microbiota and identify known and novel relationships.

    Lay version: Tool detects patterns hidden in vast data sets by Haley Bridger.

    Data and software: http://exploredata.net/.

    From the article:

    Imagine a data set with hundreds of variables, which may contain important, undiscovered relationships. There are tens of thousands of variable pairs—far too many to examine manually. If you do not already know what kinds of relationships to search for, how do you efficiently identify the important ones? Data sets of this size are increasingly common in fields as varied as genomics, physics, political science, and economics, making this question an important and growing challenge (1, 2).

    One way to begin exploring a large data set is to search for pairs of variables that are closely associated. To do this, we could calculate some measure of dependence for each pair, rank the pairs by their scores, and examine the top-scoring pairs. For this strategy to work, the statistic we use to measure dependence should have two heuristic properties: generality and equitability.

    By generality, we mean that with sufficient sample size the statistic should capture a wide range of interesting associations, not limited to specific function types (such as linear, exponential, or periodic), or even to all functional relationships (3). The latter condition is desirable because not only do relationships take many functional forms, but many important relationships—for example, a superposition of functions—are not well modeled by a function (4–7).

    By equitability, we mean that the statistic should give similar scores to equally noisy relationships of different types. For example, we do not want noisy linear relationships to drive strong sinusoidal relationships from the top of the list. Equitability is difficult to formalize for associations in general but has a clear interpretation in the basic case of functional relationships: An equitable statistic should give similar scores to functional relationships with similar R2 values (given sufficient sample size).

    Here, we describe an exploratory data analysis tool, the maximal information coefficient (MIC), that satisfies these two heuristic properties. We establish MIC’s generality through proofs, show its equitability on functional relationships through simulations, and observe that this translates into intuitively equitable behavior on more general associations. Furthermore, we illustrate that MIC gives rise to a larger family of statistics, which we refer to as MINE, or maximal information-based nonparametric exploration. MINE statistics can be used not only to identify interesting associations, but also to characterize them according to properties such as nonlinearity and monotonicity. We demonstrate the application of MIC and MINE to data sets in health, baseball, genomics, and the human microbiota. (footnotes omitted)

    As you can imagine the line:

    MINE statistics can be used not only to identify interesting associations, but also to characterize them according to properties such as nonlinearity and monotonicity.

    caught my eye.
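For readers who want to poke at the idea before reading the paper: MIC searches over many grid resolutions with a clever optimization, but the underlying intuition (bin both variables onto a grid, measure the mutual information of the grid, normalize) can be gestured at crudely in a few lines of numpy. This is emphatically not the authors' algorithm.

```python
# Not the authors' MIC algorithm (which optimizes over many grid resolutions);
# just a crude gesture at the underlying idea: bin both variables onto one
# fixed grid and compute the normalized mutual information of that grid.
import numpy as np

def grid_mi(x, y, bins=8):
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = counts / counts.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    mi = (pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz])).sum()
    return mi / np.log(bins)        # normalize, roughly as MIC does

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 5000)
print(round(grid_mi(x, np.sin(3 * x) + rng.normal(0, 0.1, 5000)), 2))  # strong, non-linear
print(round(grid_mi(x, rng.normal(0, 1, 5000)), 2))                    # near zero
```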

    I usually don’t post until the evening but this looks very important. I wanted everyone to have a chance to grab the data and software before the weekend.

    New acronyms:

    MIC – maximal information coefficient

    MINE – maximal information-based nonparametric exploration

    Good thing they chose acronyms we would not be likely to confuse with other usages. 😉

    Full citation:

    Science 16 December 2011:
    Vol. 334 no. 6062 pp. 1518-1524
    DOI: 10.1126/science.1205438
