Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

May 29, 2012

ProseVis

Filed under: Data Mining,Graphics,Text Analytics,Text Mining,Visualization — Patrick Durusau @ 10:12 am

ProseVis

A tool for exploring texts on non-word basis.

Or in the words of the project:

ProseVis is a visualization tool developed as part of a use case supported by the Andrew W. Mellon Foundation through a grant titled “SEASR Services,” in which we seek to identify other features than the “word” to analyze texts. These features comprise sound including parts-of-speech, accent, phoneme, stress, tone, break index.

ProseVis allows a reader to map the features extracted from OpenMary (http://mary.dfki.de/) Text-to-speech System and predictive classification data to the “original” text. We developed this project with the ultimate goal of facilitating a reader’s ability to analyze and disseminate the results in human readable form. Research has shown that mapping the data to the text in its original form allows for the kind of human reading that literary scholars engage: words in the context of phrases, sentences, lines, stanzas, and paragraphs (Clement 2008). Recreating the context of the page not only allows for the simultaneous consideration of multiple representations of knowledge or readings (since every reader’s perspective on the context will be different) but it also allows for a more transparent view of the underlying data. If a human can see the data (the syllables, the sounds, the parts-of-speech) within the context in which they are used to reading, with the data mapped back onto the full text, then the reader is empowered within this familiar context to read what might otherwise be an unfamiliar tabular representation of the text. For these reasons, we developed ProseVis as a reader interface to allow scholars to work with the data in a language or context in which we are used to saying things about the world.
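If you want a feel for what those “non-word” features look like before they are mapped back onto a text, here is a minimal Python sketch using NLTK and the CMU Pronouncing Dictionary as stand-ins; ProseVis itself relies on OpenMary, so treat this as an illustration of the idea, not the project’s pipeline.

```python
# Rough sketch of extracting "non-word" features (POS, phonemes, stress) while
# keeping them aligned with the original text. NLTK + CMUdict stand in for
# OpenMary here; the feature set is only an approximation of ProseVis's.
import nltk
from nltk.corpus import cmudict

# One-time downloads (uncomment on first run):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger"); nltk.download("cmudict")

pron = cmudict.dict()

def annotate(text):
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)                      # part-of-speech per token
    rows = []
    for word, pos in tagged:
        phones = pron.get(word.lower(), [[]])[0]       # first pronunciation, if any
        stress = [ch for p in phones for ch in p if ch.isdigit()]
        rows.append({"word": word, "pos": pos, "phonemes": phones, "stress": stress})
    return rows

for row in annotate("Recreating the context of the page allows simultaneous readings."):
    print(row)
```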

Textual analysis tools are “smoking gun” detectors.

A CEO is unlikely to make inappropriate comments in a spreadsheet or data feed. Emails, on the other hand… 😉

Big or little data, the goal is to have the “right” data.

May 21, 2012

Call for Papers: PLoS Text Mining Collection

Filed under: Data Mining,Text Mining — Patrick Durusau @ 7:15 pm

Call for Papers: PLoS Text Mining Collection by Camron Assadi.

From the post:

The Public Library of Science (PLoS) seeks submissions in the broad field of text-mining research for a collection to be launched across all of its journals in 2013. All submissions submitted before October 30th, 2012 will be considered for the launch of the collection. Please read the following post for further information on how to submit your article.

The scientific literature is exponentially increasing in size, with thousands of new papers published every day. Few researchers are able to keep track of all new publications, even in their own field, reducing the quality of scholarship and leading to undesirable outcomes like redundant publication. While social media and expert recommendation systems provide partial solutions to the problem of keeping up with the literature, systematically identifying relevant articles and extracting key information from them can only come through automated text-mining technologies.

Research in text mining has made incredible advances over the last decade, driven through community challenges and increasingly sophisticated computational technologies. However, the promise of text mining to accelerate and enhance research largely has not yet been fulfilled, primarily since the vast majority of the published scientific literature is not published under an Open Access model. As Open Access publishing yields an ever-growing archive of unrestricted full-text articles, text mining will play an increasingly important role in drilling down to essential research and data in scientific literature in the 21st century scholarly landscape.

As part of its commitment to realizing the maximal utility of Open Access literature, PLoS is launching a collection of articles dedicated to highlighting the importance of research in the area of text mining. The launch of this Text Mining Collection complements related PLoS Collections on Open Access and Altmetrics (forthcoming), as well as the recent release of the PLoS Application Programming Interface, which provides an open API to PLoS journal content.

Highly recommend that you follow up on this publication opportunity.

I am less certain that: “…the promise of text mining to accelerate and enhance research largely has not yet been fulfilled, primarily since the vast majority of the published scientific literature is not published under an Open Access model.”

Don’t recall seeing any research on a connection between a lack of Open Access and failure of text mining to accelerate research.

CiteSeer and arXiv have long been freely available in full text. If research were going to leap forward simply because of open access, the opportunity has long been present.

Open access does advance research and discovery but it isn’t a magic bullet. Accelerating and enhancing research is going to require more than simply indexing literature. A lot more.

My Favorite Graphs

Filed under: Data Mining,Graphs — Patrick Durusau @ 9:49 am

My Favorite Graphs by Nina Zumel

From the post:

The important criterion for a graph is not simply how fast we can see a result; rather it is whether through the use of the graph we can see something that would have been harder to see otherwise or that could not have been seen at all. – William Cleveland, The Elements of Graphing Data, Chapter 2

In this article, I will discuss some graphs that I find extremely useful in my day-to-day work as a data scientist. While all of them are helpful (to me) for statistical visualization during the analysis process, not all of them will necessarily be useful for presentation of final results, especially to non-technical audiences.

I tend to follow Cleveland’s philosophy, quoted above; these graphs show me — and hopefully you — aspects of data and models that I might not otherwise see. Some of them, however, are non-standard, and tend to require explanation. My purpose here is to share with our readers some ideas for graphical analysis that are either useful to you directly, or will give you some ideas of your own.

I rather like that: “…can [we] see something that would have been harder to see otherwise or that could not have been seen at all.”

A good criterion for any data mining technique or approach.

You will like the graphs as well.

May 20, 2012

Top 10 challenging problems in data mining

Filed under: Data Mining — Patrick Durusau @ 1:12 pm

Top 10 challenging problems in data mining by Sandro Saitta (March 27, 2008)

I mention the date of this post because the most recent response to it was four days ago, May 15, 2012.

I should write a post that gets comments that long after publication!

Sandro writes:

In a previous post, I wrote about the top 10 data mining algorithms, a paper that was published in Knowledge and Information Systems. The “selective” process is the same as the one that has been used to identify the most important (according to answers of the survey) data mining problems. The paper by Yang and Wu has been published (in 2006) in the International Journal of Information Technology & Decision Making. The paper contains the following problems (in no specific order):

  • Developing a unifying theory of data mining
  • Scaling up for high dimensional data and high speed data streams
  • Mining sequence data and time series data
  • Mining complex knowledge from complex data
  • Data mining in a network setting
  • Distributed data mining and mining multi-agent data
  • Data mining for biological and environmental problems
  • Data Mining process-related problems
  • Security, privacy and data integrity
  • Dealing with non-static, unbalanced and cost-sensitive data

    It’s a little over five years later.

    Same list? Different list?

    BTW, the 2006 article by Yang and Wu, along with slides, can be found at: 10 Challenging Problems in Data Mining Research

    The full citation of the article is:

    Qiang Yang and Xindong Wu (Contributors: Pedro Domingos, Charles Elkan, Johannes Gehrke, Jiawei Han, David Heckerman, Daniel Keim, Jiming Liu, David Madigan, Gregory Piatetsky-Shapiro, Vijay V. Raghavan, Rajeev Rastogi, Salvatore J. Stolfo, Alexander Tuzhilin, and Benjamin W. Wah), 10 Challenging Problems in Data Mining Research, International Journal of Information Technology & Decision Making, Vol. 5, No. 4, 2006, 597-604.

    While searching for this paper I encountered:

    Xindong Wu’s Publications in Data Mining and Machine Learning

    Pick any paper at random and you are likely to learn something new.

    May 16, 2012

    Modeling vs Mining?

    Filed under: Data Mining,Data Models — Patrick Durusau @ 12:07 pm

    Steve Miller writes in Politics of Data Models and Mining:

    I recently came across an interesting thread, “Is data mining still a sin against the norms of econometrics?”, from the Advanced Business Analytics LinkedIn Discussion Group. The point of departure for the dialog is a paper entitled “Three attitudes towards data mining”, written by couple of academic econometricians.

    The data mining “attitudes” range from the extremes that DM techniques are to be avoided like the plague, to one where “data mining is essential and that the only hope that we have of using econometrics to uncover true economic relationships is to be found in the intelligent mining of data.” The authors note that machine learning phobia is currently the norm in economics research.

    Why is this? “Data mining is considered reprehensible largely because the world is full of accidental correlations, so that what a search turns up is thought to be more a reflection of what we want to find than what is true about the world.” In contrast, “Econometrics is regarded as hypothesis testing. Only a well specified model should be estimated and if it fails to support the hypothesis, it fails; and the economist should not search for a better specification.”

    In other words, econometrics focuses on explanation, expecting its practitioners to generate hypotheses for testing with regression models. ML, on the other hand, obsesses on discovery and prediction, often content to let the data talk directly, without the distraction of “theory.” Just as bad, the results of black-box ML might not be readily interpretable for tests of economic hypotheses.
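    The worry about accidental correlations is easy to reproduce. A minimal sketch (mine, not from the thread): generate one random-walk “target” and a few hundred random-walk “predictors,” then report the best correlation a blind search turns up.

```python
# Toy illustration of "accidental correlations": search enough unrelated series
# and some will correlate strongly with the target purely by chance.
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_candidates = 60, 500

target = rng.standard_normal(n_obs).cumsum()                      # a random-walk "economic" series
candidates = rng.standard_normal((n_candidates, n_obs)).cumsum(axis=1)

corrs = np.array([np.corrcoef(target, c)[0, 1] for c in candidates])
best = np.argmax(np.abs(corrs))
print(f"best of {n_candidates} random series: r = {corrs[best]:+.2f}")
# The best |r| can easily exceed 0.8 even though every series is pure noise,
# which is exactly what an undirected search over a large candidate set turns up.
```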

    Watching other communities fight over odd questions is always more enjoyable than serious disputes of grave concern in our own. (See Using “Punning” to Answer httpRange-14 for example.)

    I mention the economist’s dispute, not simply to make jests at the expense of “econometricians.” (Do topic map supporters need a difficult name? TopicMapologists? Too short.)

    The economist’s debate is missing an understanding that modeling requires some knowledge of the domain (mining whether formal or informal) and mining requires some idea of an output (models whether spoken or unspoken). A failing that is all too common across modeling/mining domains.

    To put it another way:

    We never stumble upon data that is “untouched by human hands.”

    We never build models without knowledge of the data we are modeling.

    The relevant question is: Does the model or data mining provide a useful result?

    (Typically measured by your client’s joy or sorrow over your results.)

    May 15, 2012

    SIAM Data Mining 2012 Conference

    Filed under: Conferences,Data Mining — Patrick Durusau @ 7:04 pm

    SIAM Data Mining 2012 Conference

    Ryan Rosario writes:

    From April 26-28 I had the pleasure to attend the SIAM Data Mining conference in Anaheim on the Disneyland Resort grounds. Aside from KDD2011, most of my recent conferences had been more “big data” and “data science” oriented, and I wanted to step away from the hype and just listen to talks that had more substance.

    Attending a conference on Disneyland property was quite a bizarre experience. I wanted to get everything I could out of the conference, but the weather was so nice that I also wanted to get everything out of Disneyland as I could. Seeing adults wearing Mickey ears carrying Mickey shaped balloons, and seeing girls dressed up as their favorite Disney princesses screams “fun” rather than “business”, but I managed to make time for both.

    The first two days started with a plenary talk from industry or research labs. After a coffee break, there were the usual breakout sessions followed by lunch. During my free 90 minutes, I ran over to Disneyland and California Adventure both days to eat lunch. I managed to run there, wait in line, guide myself through crowds, wait in line, get my food, eat it, and run back to the conference in 90 minutes on a weekend. After lunch on the first two days was another plenary session followed by breakout sessions. The evening of the first two days was reserved for poster sessions. Saturday hosted half-day and full-day workshops.

    Below is my summary of the conference. Of course, such a summary is very high level; my description may miss things, or may not be entirely correct if I misunderstood the speaker.

    I doubt Ryan would claim his summary is “as good as being there” but in the absence of attending, you could do far worse.

    Suggestions of papers from the conference that I should read first?

    May 14, 2012

    TREC Document Review Project on Hiatus, Recommind Asked to Withdraw

    Filed under: Data Mining,Data Source,Open Relevance Project,TREC — Patrick Durusau @ 12:47 pm

    TREC Document Review Project on Hiatus, Recommind Asked to Withdraw

    From the post:

    TREC Legal Track — part of the U.S. government’s Text Retrieval Conference — announced last week that the 2012 edition of its annual document review project for testing new systems is canceled, while prominent e-discovery software company Recommind confirmed that it’s been asked to leave the project for prematurely sharing results.

    These difficulties highlight the need for:

    • open data sets and
    • protocols for reporting of results as they occur.

    That requires a data set with relevance judgments and other work.

    Have you thought about the: Open Relevance Project at the Apache Foundation?

    Email archives from Apache projects, the backbone of the web as we know it, are ripe for your contributions.

    Let me be the first to ask Recommind to join in building a public data set for everyone.

    May 12, 2012

    Outlier detection in two review articles (Part 1)

    Filed under: Data Mining,Outlier Detection — Patrick Durusau @ 3:38 pm

    Outlier detection in two review articles (Part 1) by Sandro Saitta.

    Sandro writes:

    The first one, Outlier Detection: A Survey, is written by Chandola, Banerjee and Kumar. They define outlier detection as the problem of “[…] finding patterns in data that do not conform to expected normal behavior“. After an introduction to what outliers are, authors present current challenges in this field. In my experience, non-availability of labeled data is a major one.

    One of their main conclusions is that “[…] outlier detection is not a well-formulated problem“. It is your job, as a data miner, to formulate it correctly.
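    As a concrete illustration of working without labels, here is a minimal unsupervised sketch using scikit-learn’s IsolationForest; the method and the contamination guess are my choices, not something prescribed by the survey.

```python
# Minimal unsupervised outlier detection sketch: no labels needed, just a guess
# at the contamination rate.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))     # "expected normal behavior"
outliers = rng.uniform(low=-6, high=6, size=(10, 2))       # points that do not conform
X = np.vstack([normal, outliers])

clf = IsolationForest(contamination=0.02, random_state=0)
labels = clf.fit_predict(X)          # -1 = flagged as outlier, 1 = inlier
print("flagged indices:", np.where(labels == -1)[0])
```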

    The final quote seems particularly well suited to subject identity issues. While any one subject identity may be well defined, the question is how to find and manage other subject identifications that may not be well defined.

    As Sandro points out, it has nineteen (19) pages of references. However, only nine of those are as recent as 2007; all the rest are older. I am sure it remains an excellent reference source but suspect more recent review articles on outlier detection exist.

    Suggestions?

    May 10, 2012

    Book “R and Data Mining: Examples and Case Studies” on CRAN [blank chapters]

    Filed under: Data Mining,R — Patrick Durusau @ 2:03 pm

    Book “R and Data Mining: Examples and Case Studies” on CRAN [blank chapters]

    Yanchang Zhao, RDataMining.com, writes:

    My book in draft titled “R and Data Mining: Examples and Case Studies” is now available on CRAN at http://cran.r-project.org/other-docs.html. It is scheduled to be published by Elsevier in late 2012. Its latest version can be downloaded at http://www.rdatamining.com/docs.

    The book presents many examples on data mining with R, including data exploration, decision trees, random forest, regression, clustering, time series analysis & mining, and text mining. Some other chapters in progress are social network analysis, outlier detection, association rules and sequential patterns.

    Not to complain too much: what is present is good, and an author can choose his/her distribution model. But the above entry should be modified to add:

    Chapters 7, 9, 12 – 15 are blank and reserved for the book version to be published by Elsevier Inc.

    Just so you know.

    Cambridge and other presses have chosen to follow other access models. Perhaps someday Elsevier will as well.

    April 23, 2012

    Mining Basket Data

    Filed under: BigData,Data Mining — Patrick Durusau @ 6:00 pm

    Database mining is motivated by the decision support problem faced by most large retail organizations [S+93]. Progress in bar-code technology has made it possible for retail organizations to collect and store massive amounts of sales data, referred to as the basket data. A record in such data typically consists of transaction date and the items bought in the transaction. Successful organizations view such databases as important pieces of marketing infrastructure [Ass92]. They are interested in instituting information-driven marketing processes, managed by database technology, that enable marketers to develop and implement customized marketing programs and strategies [Ass90]. (emphasis added)

    Sounds like a marketing pitch for big data doesn’t it?

    In 1994, Rakesh Agrawal and Ramakrishnan Srikant had basket data and wrote: Fast algorithms for mining association rules (1994). Now listed as the 18th most cited computer science article by Citeseer.
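    For readers who have not worked with basket data, here is a bare-bones sketch of the support counting that underlies association-rule mining; it is an illustration, not the optimized Apriori algorithm of the paper.

```python
# Bare-bones support counting over basket data: the core bookkeeping behind
# association-rule mining. A sketch, not Agrawal and Srikant's optimized Apriori.
from itertools import combinations
from collections import Counter

baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def frequent_itemsets(baskets, size, min_support=0.4):
    counts = Counter()
    for basket in baskets:
        for itemset in combinations(sorted(basket), size):
            counts[itemset] += 1
    n = len(baskets)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

pairs = frequent_itemsets(baskets, size=2)
for itemset, support in sorted(pairs.items(), key=lambda kv: -kv[1]):
    print(itemset, f"support={support:.2f}")
# A rule like {diapers} -> {beer} would then be scored by confidence:
# support(diapers, beer) / support(diapers).
```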

    Mining “data” isn’t new nor is mining for association rules. Not to mention your prior experience with association rules.

    With a topic map you can capture that prior experience along side new association rules and methods. Marketing wasn’t started yesterday and isn’t going to stop tomorrow. Successful firms are going to build on their experience, not re-invent it with each technology change.

    ICDM 2012

    ICDM 2012 Brussels, Belgium | December 10 – 13, 2012

    From the webpage:

    The IEEE International Conference on Data Mining series (ICDM) has established itself as the world’s premier research conference in data mining. It provides an international forum for presentation of original research results, as well as exchange and dissemination of innovative, practical development experiences. The conference covers all aspects of data mining, including algorithms, software and systems, and applications.

    ICDM draws researchers and application developers from a wide range of data mining related areas such as statistics, machine learning, pattern recognition, databases and data warehousing, data visualization, knowledge-based systems, and high performance computing. By promoting novel, high quality research findings, and innovative solutions to challenging data mining problems, the conference seeks to continuously advance the state-of-the-art in data mining. Besides the technical program, the conference features workshops, tutorials, panels and, since 2007, the ICDM data mining contest.

    Important Dates:

    ICDM contest proposals: April 30
    Conference full paper submissions: June 18
    Demo and tutorial proposals: August 10
    Workshop paper submissions: August 10
    PhD Forum paper submissions: August 10
    Conference paper, tutorial, demo notifications: September 18
    Workshop paper notifications: October 1
    PhD Forum paper notifications: October 1
    Camera-ready copies and copyright forms: October 15

    April 19, 2012

    Knoema Launches the World’s First Knowledge Platform Leveraging Data

    Filed under: Data,Data Analysis,Data as Service (DaaS),Data Mining,Knoema,Statistics — Patrick Durusau @ 7:13 pm

    Knoema Launches the World’s First Knowledge Platform Leveraging Data

    From the post:

    DEMO Spring 2012 conference — Today at DEMO Spring 2012, Knoema launched publicly the world’s first knowledge platform that leverages data and offers tools to its users to harness the knowledge hidden within the data. Search and exploration of public data, its visualization and analysis have never been easier. With more than 500 datasets on various topics, gallery of interactive, ready to use dashboards and its user friendly analysis and visualization tools, Knoema does for data what YouTube did to videos.

    Millions of users interested in data, like analysts, students, researchers and journalists, struggle to satisfy their data needs. At the same time there are many organizations, companies and government agencies around the world collecting and publishing data on various topics. But still getting access to relevant data for analysis or research can take hours with final outcomes in many formats and standards that can take even longer to get it to a shape where it can be used. This is one of the issues that the search engines like Google or Bing face even after indexing the entire Internet due to the nature of statistical data and diversity and complexity of sources.

    One-stop shop for data. Knoema, with its state of the art search engine, makes it a matter of minutes if not seconds to find statistical data on almost any topic in easy to ingest formats. Knoema’s search instantly provides highly relevant results with chart previews and actual numbers. Search results can be further explored with Dataset Browser tool. In Dataset Browser tool, users can get full access to the entire public data collection, explore it, visualize data on tables/charts and download it as Excel/CSV files.

    Numbers made easier to understand and use. Knoema enables end-to-end experience for data users, allowing creation of highly visual, interactive dashboards with a combination of text, tables, charts and maps. Dashboards built by users can be shared to other people or on social media, exported to Excel or PowerPoint and embedded to blogs or any other web site. All public dashboards made by users are available in dashboard gallery on home page. People can collaborate on data related issues participating in discussions, exchanging data and content.

    Excellent!!!

    When “other” data becomes available, users will want to integrate it with their data.

    But “other” data will have different or incompatible semantics.

    So much for attempts to wrestle semantics to the ground (W3C) or build semantic prisons (unnamed vendors).

    What semantics are useful to you today? (patrick@durusau.net)

    April 17, 2012

    Data mining opens the door to predictive neuroscience (Google Hazing Rituals)

    Filed under: Data Mining,Neuroinformatics,Predictive Analytics — Patrick Durusau @ 7:10 pm

    Data mining opens the door to predictive neuroscience

    From the post:

    Ecole Polytechnique Fédérale de Lausanne (EPFL) researchers have discovered rules that relate the genes that a neuron switches on and off to the shape of that neuron, its electrical properties, and its location in the brain.

    The discovery, using state-of-the-art computational tools, increases the likelihood that it will be possible to predict much of the fundamental structure and function of the brain without having to measure every aspect of it.

    That in turn makes modeling the brain in silico — the goal of the proposed Human Brain Project — a more realistic, less Herculean, prospect.

    The fulcrum of predictive analytics is finding the “basis” for prediction and knowing within what margin of error it holds.

    Curious how that would work in an employment situation?

    Rather than Google’s intellectual hazing rituals, project a thirty-minute questionnaire on Google hires against their evaluations at six-month intervals. Give prospective hires the same questionnaire and then make “up” or “down” decisions on hiring. Likely to be as accurate as the current rituals.
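    To make the “basis for prediction within some margin of error” point concrete, here is a hedged sketch of the exercise: fit a simple model of evaluations against questionnaire scores and report its error on held-out hires. Everything below, data included, is synthetic.

```python
# Sketch of the proposed check: predict six-month evaluations from questionnaire
# scores and measure the error on held-out cases. All data is synthetic; this
# illustrates measuring a prediction "basis", it is not a hiring tool.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))                    # 300 hires, 10 questionnaire scores
true_weights = rng.normal(size=10)
y = (X @ true_weights + rng.normal(scale=2.0, size=300) > 0).astype(int)   # "good evaluation"

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```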

    April 15, 2012

    Announcing Fech 1.0

    Filed under: Data Mining,Government Data,News — Patrick Durusau @ 7:15 pm

    Announcing Fech 1.0 by Derek Willis.

    From the post:

    Fech now retrieves a whole lot more campaign finance data.

    We’re excited to announce the 1.0 release of Fech, our Ruby library for parsing Federal Election Commission electronic campaign filings. Fech 1.0 now covers all of the current form types that candidates and committees submit. Originally developed to parse presidential committee filings, Fech now can be used for almost any kind of report (Senate candidates file on paper, so Fech can’t help there). The updated documentation, made with Github Pages, has a full listing of the supported formats.

    Now it’s possible to use Fech to parse the pre-election filings of candidates receiving contributions of $1,000 or more — one way to see the late money in politics — or to dig through political party and political action committee reports to see how committees spend their funds. At The Times, Fech now plays a much greater role in powering our Campaign Finance API and in interactives that make use of F.E.C. data.

    The additions to Fech include the ability to compare two filings and examine the differences between them. Since the F.E.C. requires that amendments replace the entire original filing, the comparison feature is especially useful for seeing what has changed between an original filing and an amendment to it. Another feature allows users to pass in a specific quote character (or parse a filing’s data without one at all) in order to avoid errors parsing comma-separated values that occasionally appear in filings.
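    Fech itself is a Ruby library, so the following is only a rough, library-agnostic Python sketch of the comparison idea, assuming both filings have been exported to CSV with a hypothetical transaction_id column; it is not Fech’s API.

```python
# Library-agnostic sketch of the "compare two filings" idea. Since FEC amendments
# replace the entire original filing, diff the two by a key column to see what
# changed. File names and the "transaction_id" column are placeholders.
import csv

def load_filing(path, key="transaction_id"):
    with open(path, newline="") as f:
        return {row[key]: row for row in csv.DictReader(f)}

def diff_filings(original_path, amendment_path):
    original = load_filing(original_path)
    amendment = load_filing(amendment_path)
    added = amendment.keys() - original.keys()
    removed = original.keys() - amendment.keys()
    changed = {k for k in original.keys() & amendment.keys() if original[k] != amendment[k]}
    return added, removed, changed

added, removed, changed = diff_filings("filing_original.csv", "filing_amended.csv")
print(f"{len(added)} added, {len(removed)} removed, {len(changed)} changed rows")
```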

    Kudos to the New York Times for development of software and Fech in particular, to give the average person access to “public” information. Without meaningful access, it can hardly qualify as “public” can it?

    Something the U.S. Senate should keep in mind as it remains mired in 19th century pomp and privilege. Or diplomats. The other remaining class of privilege. Transparency is coming.


    Update: Fech 1.1 Released.

    April 14, 2012

    Everything You Wanted to Know About Data Mining but Were Afraid to Ask

    Filed under: Data Mining,Marketing — Patrick Durusau @ 6:24 pm

    Everything You Wanted to Know About Data Mining but Were Afraid to Ask by Alexander Furnas.

    Interesting piece from the Atlantic that you can use to introduce a client to the concepts of data mining. And at the same time, use as the basis for discussing topic maps.

    For example, Furnas says:

    For the most part, data mining tells us about very large and complex data sets, the kinds of information that would be readily apparent about small and simple things. For example, it can tell us that “one of these things is not like the other” a la Sesame Street or it can show us categories and then sort things into pre-determined categories. But what’s simple with 5 datapoints is not so simple with 5 billion datapoints.

    Topic maps being more about things that are “like the other” so that we can have them all in one place. Or at least all the information about them in one place.

    See, that wasn’t hard.

    The editorial and technical side of it, how information is gathered for useful presentation to a user, is hard.

    But the client, like someone watching cable TV, is more concerned with the result than how it arrived.

    Perhaps a different marketing strategy, results first.

    Thoughts?

    April 12, 2012

    The most important decision in data mining

    Filed under: Data Mining,Topic Maps — Patrick Durusau @ 7:06 pm

    The most important decision in data mining

    A whimsical post that includes this pearl:

    It is a fact that a prediction model of the right target is much better than a good prediction model of the wrong or suboptimal target.

    Same is true for a topic map except there we would say: A topic map of the right subject(s) is much better than a good topic map of the wrong subject(s).

    That means understanding what your clients/users want to talk about. Not what hypothetical Martians might want to talk about. Unless they land and have something of value to trade for modifications to an existing topic map. 😉

    April 9, 2012

    The New World of Massive Data Mining

    Filed under: BigData,Data Mining — Patrick Durusau @ 4:32 pm

    The New World of Massive Data Mining

    From the webpage:

    Every time you go on the Internet, make a phone call, send an email, pass a traffic camera or pay a bill, you create data, electronic information. In all, 2.5 quintillion bytes of data are created each day. This massive pile of information from all sources is called “Big Data.” It gets stored somewhere, and everyday the pile gets bigger. Government and industry are finding new ways to analyze it. Last week the administration announced an initiative to aid the development of Big Data computing. A panel of experts join guest host Tom Gjelten to discuss the opportunities — for business, science, medicine, education, and security … but also the privacy concerns.

    Guests

    John Villasenor, senior fellow at the Brookings Institution and professor of electrical engineering at UCLA.

    Michael Leiter, senior counselor, Palantir Technologies; former director, National Counterterrorism Center.

    Dr. Suzanne Iacono, co-chair, Big Data Senior Steering Group and senior science adviser, Directorate for Computer and Information Science and Engineering at the National Science Foundation.

    Daphne Koller, professor, Stanford Artificial Intelligence Laboratory.

    You can listen to the show, download the podcast or a transcript of the discussion.

    May help shape your rhetoric with NPR listeners who caught the show.

    April 3, 2012

    Tracking Video Game Buzz

    Filed under: Blogs,Clustering,Data Mining,Tweets — Patrick Durusau @ 4:17 pm

    Tracking Video Game Buzz

    Matthew Hurst writes:

    Briefly, I pushed out an experimental version of track // games to track topics in the blogosphere relating to video games. As with track // microsoft, it gathers posts from blogs, clusters them, and uses an attention metric based on Bitly and Twitter to rank the clusters, new posts and videos.

    Currently at the top of the stack is Bungie Waves Goodbye To Halo.

    Wonder if Matthew could be persuaded to do the same for the elections this Fall in the United States? 😉

    April 2, 2012

    iSAX

    Filed under: Data Mining,iSAX,SAX,Time Series — Patrick Durusau @ 5:47 pm

    iSAX

    An extension of the SAX software for larger data sets. Detailed in: iSAX: Indexing and Mining Terabyte Sized Time Series.

    Abstract:

    Current research in indexing and mining time series data has produced many interesting algorithms and representations. However, it has not led to algorithms that can scale to the increasingly massive datasets encountered in science, engineering, and business domains. In this work, we show how a novel multiresolution symbolic representation can be used to index datasets which are several orders of magnitude larger than anything else considered in the literature. Our approach allows both fast exact search and ultra fast approximate search. We show how to exploit the combination of both types of search as sub-routines in data mining algorithms, allowing for the exact mining of truly massive real world datasets, containing millions of time series.

    There are a number of data sets at this page with “…warning 500meg file.”

    SAX (Symbolic Aggregate approXimation)

    Filed under: Data Mining,SAX,Time Series — Patrick Durusau @ 5:47 pm

    SAX (Symbolic Aggregate approXimation)

    From the webpage:

    SAX is the first symbolic representation for time series that allows for dimensionality reduction and indexing with a lower-bounding distance measure. In classic data mining tasks such as clustering, classification, index, etc., SAX is as good as well-known representations such as Discrete Wavelet Transform (DWT) and Discrete Fourier Transform (DFT), while requiring less storage space. In addition, the representation allows researchers to avail of the wealth of data structures and algorithms in bioinformatics or text mining, and also provides solutions to many challenges associated with current data mining tasks. One example is motif discovery, a problem which we defined for time series data. There is great potential for extending and applying the discrete representation on a wide class of data mining tasks.
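    For the curious, the core of SAX fits in a few lines: z-normalize the series, average it down to a handful of segments (piecewise aggregate approximation), and map each segment mean to a symbol using Gaussian breakpoints. A minimal sketch, using the standard breakpoints for a four-symbol alphabet:

```python
# Compact SAX sketch: z-normalize, reduce with piecewise aggregate approximation
# (PAA), then map segment means to symbols via Gaussian breakpoints. The
# breakpoints below are the standard ones for a 4-symbol alphabet.
import numpy as np

BREAKPOINTS = [-0.6745, 0.0, 0.6745]   # equiprobable regions of N(0, 1) for alphabet size 4
ALPHABET = "abcd"

def sax(series, n_segments):
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / x.std()                     # z-normalize
    segments = np.array_split(x, n_segments)         # PAA: average within each segment
    means = [seg.mean() for seg in segments]
    return "".join(ALPHABET[np.searchsorted(BREAKPOINTS, m)] for m in means)

t = np.linspace(0, 4 * np.pi, 128)
print(sax(np.sin(t), n_segments=8))   # an 8-letter word, roughly 'ddaaddaa': shape, not raw values
```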

    From a testimonial on the webpage:

    the performance SAX enables is amazing, and I think a real breakthrough. As an example, we can find similarity searches using edit distance over 10,000 time series in 50 milliseconds. Ray Cromwell, Timepedia.org

    Don’t usually see “testimonials” on an academic website but they appear to be merited in this case.

    Serious similarity software. Take the time to look.

    BTW, you may also be interested in a SAX time series/Shape tutorial. (120 slides about what makes SAX special.)

    SAXually Explicit Images: Data Mining Large Shape Databases

    Filed under: Data Mining,Image Processing,Image Recognition,Shape — Patrick Durusau @ 5:46 pm

    SAXually Explicit Images: Data Mining Large Shape Databases by Eamonn Keogh.

    ABSTRACT

    The problem of indexing large collections of time series and images has received much attention in the last decade, however we argue that there is potentially great untapped utility in data mining such collections. Consider the following two concrete examples of problems in data mining.

    Motif Discovery (duplication detection): Given a large repository of time series or images, find approximately repeated patterns/images.

    Discord Discovery: Given a large repository of time series or images, find the most unusual time series/image.

    As we will show, both these problems have applications in fields as diverse as anthropology, crime…

    Ancient history in the view of some, this is a Google talk from 2006!

    But, it is quite well done and I enjoyed the unexpected application of time series representation to shape data for purposes of evaluating matches. It is one of those insights that will stay with you and that seems obvious after they say it.
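    The trick, roughly, is to unroll a closed contour into a one-dimensional series, for example the distance from the shape’s centroid to each boundary point in order, and then hand that series to ordinary time-series machinery (SAX included). A rough sketch with a toy contour, not Keogh’s code:

```python
# Rough sketch of the shape-to-time-series trick: walk a closed contour and
# record the distance from the centroid to each boundary point. The resulting
# 1D series can then be indexed/compared with ordinary time-series methods.
import numpy as np

def contour_to_series(points):
    pts = np.asarray(points, dtype=float)   # (n, 2) boundary points, in order
    centroid = pts.mean(axis=0)
    d = np.linalg.norm(pts - centroid, axis=1)
    return d / d.mean()                     # scale-invariant profile

# Toy contour: an ellipse sampled at 64 boundary points.
theta = np.linspace(0, 2 * np.pi, 64, endpoint=False)
ellipse = np.column_stack([2.0 * np.cos(theta), 1.0 * np.sin(theta)])
series = contour_to_series(ellipse)
print(series.round(2))   # two peaks per revolution: the long axis shows up twice
```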

    I think topic map authors (semantic investigators generally) need to report such insights for the benefit of others.

    March 31, 2012

    DS 2012 : The 15th International Conference on Discovery Science

    Filed under: Conferences,Data Mining,Machine Learning — Patrick Durusau @ 4:09 pm

    DS 2012 : The 15th International Conference on Discovery Science

    Important Dates:

    Important Dates for Submissions

    Full paper submission: 17th May, 2012
    Author notification: 8th July, 2012
    Camera-ready papers due: 20th July, 2012

    Important dates for all DS 2012 attendees

    Deadline for early registration: 30th August, 2012
    DS 2012 conference dates: 29-31 October, 2012

    From the call for papers:

    DS-2012 will be collocated with ALT-2012, the 23rd International Conference on Algorithmic Learning Theory. The two conferences will be held in parallel, and will share their invited talks.

    DS 2012 provides an open forum for intensive discussions and exchange of new ideas among researchers working in the area of Discovery Science. The scope of the conference includes the development and analysis of methods for automatic scientific knowledge discovery, machine learning, intelligent data analysis, theory of learning, as well as their application to knowledge discovery. Very welcome are papers that focus on dynamic and evolving data, models and structures.

    We invite submissions of research papers addressing all aspects of discovery science. We particularly welcome contributions that discuss the application of data analysis, data mining and other support techniques for scientific discovery including, but not limited to, biomedical, astronomical and other physics domains.

    Possible topics include, but are not limited to:

    • Logic and philosophy of scientific discovery
    • Knowledge discovery, machine learning and statistical methods
    • Ubiquitous Knowledge Discovery
    • Data Streams, Evolving Data and Models
    • Change Detection and Model Maintenance
    • Active Knowledge Discovery
    • Learning from Text and web mining
    • Information extraction from scientific literature
    • Knowledge discovery from heterogeneous, unstructured and multimedia data
    • Knowledge discovery in network and link data
    • Knowledge discovery in social networks
    • Data and knowledge visualization
    • Spatial/Temporal Data
    • Mining graphs and structured data
    • Planning to Learn
    • Knowledge Transfer
    • Computational Creativity
    • Human-machine interaction for knowledge discovery and management
    • Biomedical knowledge discovery, analysis of micro-array and gene deletion data
    • Machine Learning for High-Performance Computing, Grid
      and Cloud Computing
    • Applications of the above techniques to natural or social sciences

    I looked very briefly at prior proceedings. If those are any indication, this should be a very good conference.

    HotSocial 2012

    Filed under: Conferences,Data Mining,Social Media — Patrick Durusau @ 4:09 pm

    HotSocial 2012: First ACM International Workshop on Hot Topics on Interdisciplinary Social Networks Research August 12, 2012, Beijing, China (in conjunction with ACM KDD 2012, August 12-16, 2012) http://user.informatik.uni-goettingen.de/~fu/hotsocial/

    Important Dates:

    Deadline for submissions: May 9, 2012 (11:59 PM, EST)
    Notification of acceptance: June 1, 2012
    Camera-ready version: June 12, 2012
    HotSocial Workshop Day: Aug 12, 2012

    From the post:

    Among the fundamental open questions are:

    • How to access social networks data? Different communities have different means, each with pros and cons. Experience exchanges from different communities will be beneficial.
    • How to protect these data? Privacy and data protection techniques considering social and legal aspects are required.
    • How the complex systems and graph theory algorithms can be used for understanding social networks? Interdisciplinary collaboration are necessary.
    • Can social network features be exploited for a better computing and social network system design?
    • How do online social networks play a role in real-life (offline) community forming and evolution?
    • How does the human mobility and human interaction influence human behaviors and thus public health? How can we develop methodologies to investigate the public health and their correlates in the context of the social networks?

    Topics of Interest:

    Main topics of this workshop include (but are not limited to) the following:

    • methods for accessing social networks (e.g., sensor nets, mobile apps, crawlers) and bias correction for use in different communities (e.g., sociology, behavior studies, epidemiology)
    • privacy and ethic issues of data collection and management of large social graphs, leveraging social network properties as well as legal and social constraints
    • application of data mining and machine learning in the context of specific social networks
    • information spread models and campaign detection
    • trust and reputation and community evolution in the online and offline interacted social networks, including the presence and evolution of social identities and social capital in OSNs
    • understanding complex systems and scale-free networks from an interdisciplinary angle
    • interdisciplinary experiences and intermediate results on social network research

    Sounds relevant to the “big data” stuff of interest to the White House.

    PS: Have you noticed how some blogging software really sucks when you do “view source” on pages? Markup and data should be present. It makes content reuse easier. WordPress does it. How about your blogging software?

    March 27, 2012

    Data Mining Bitly

    Filed under: Bitly,Data Mining — Patrick Durusau @ 7:17 pm

    I was reading a highly entertaining post by Nathan Yau, What News Sites People are Reading, by State that had the following quote:

    Bitly’s dataset, wrangled by data scientists Hilary Mason and Anna Smith, consists of every click on every Bitly link on the Web. Bitly makes its data available publicly—just add ‘+’ to the end of any Bitly link to see how many clicks it’s gotten.

    It’s a little more complicated than that but not by much.

    From the Bitly help page:

    Beyond basics: Capturing data and using metrics

    How do I see how many times a bitly link was clicked on?

    Every bitly link has an info page, which reveals the number of related clicks and other relevant data. You can get to the info page in a few different ways. For example, to view the info page for the bitly link http://bit.ly/CUjV, add a “+” to the end: http://bit.ly/CUjV+.

    You can also use the sidebar bookmarklet to instantly get information for your bitly link, or you can see basic information about all of your links on your Manage page.

    What do the numbers “x” out of “x” mean next to my links?

    The numbers next to your links might say “8 out of 8” or “14 out of 648,” or something else. The top number is the number of clicks that your bitly link specifically generated, for example: 30. The bottom number is the total number of bitly clicks generated for all bitly links created for that URL as a whole, for example: 100. So if you see “30 out of 100” next to your link, that means the bitly link you created generated 30 clicks and 70 clicks were generated by other bitly links (from other bitly users) to that URL.

    Why does the number on top always match the number of total clicks, even when I’m not the one who was responsible for the clicks?

    The numbers displayed are total decodes (not total click-throughs), which JavaScript measures on the page. Decodes can be caused by bots or applications, like browser plug-ins, which expand the underlying URL without causing a click-through. If you download a browser plug-in that automatically expands short URLs, for example, it looks a lot like a human user to an analytics program. Absent JavaScript on the page, it’s hard to distinguish between a decode and an intentional click-through. Ultimately, bitly complements rather than replaces JavaScript-based analytics utilities such as Google Analytics or Chartbeat.

    If someone else shortens the same URL, do we both see the same number of clicks?

    It depends on whether a user is signed in. bitly tracks the total number of clicks pointing to a single long link. Signed-in bitly users receive a unique bitly link that lets them track clicks and other data separately, while still seeing totals for all bitly links pointing to the same long link. But users who are not signed in all share the same bitly link.

    Is all bitly tracking data publicly available? Where can I view it?

    To learn more about the life of any given bitly url, simply add a “+” sign to the end of that link and you will be directed to a page with that link’s statistics.

    The permanent 301 redirects of bitly mean that multiple bitly urls can point towards a single webpage.

    Sounds like having multiple identifiers doesn’t it?

    What’s more, I can create a bitly redirect for a webpage and then by adding “+” to the end, see if there are other redirects for that page.
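    A quick way to see the “multiple identifiers, one subject” effect for yourself: follow each short link to its final destination and group the links by target page. A minimal sketch using the requests library; the short URLs below are placeholders.

```python
# Sketch: follow short links to their final destination and group them by target
# page - several identifiers resolving to one subject. Short URLs are placeholders.
from collections import defaultdict
import requests

short_links = ["http://bit.ly/CUjV", "http://bit.ly/example1", "http://bit.ly/example2"]

by_target = defaultdict(list)
for link in short_links:
    try:
        resp = requests.head(link, allow_redirects=True, timeout=10)
        by_target[resp.url].append(link)      # resp.url is the URL after all redirects
    except requests.RequestException as exc:
        print(f"skipping {link}: {exc}")

for target, links in by_target.items():
    print(target, "<-", links)
```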

    March 26, 2012

    The unreasonable necessity of subject experts

    Filed under: Data Mining,Domain Expertise,Subject Experts — Patrick Durusau @ 6:40 pm

    The unreasonable necessity of subject experts – Experts make the leap from correct results to understood results by Mike Loukides.

    From the post:

    One of the highlights of the 2012 Strata California conference was the Oxford-style debate on the proposition “In data science, domain expertise is more important than machine learning skill.” If you weren’t there, Mike Driscoll’s summary is an excellent overview (full video of the debate is available here). To make the story short, the “cons” won; the audience was won over to the side that machine learning is more important. That’s not surprising, given that we’ve all experienced the unreasonable effectiveness of data. From the audience, Claudia Perlich pointed out that she won data mining competitions on breast cancer, movie reviews, and customer behavior without any prior knowledge. And Pete Warden (@petewarden) made the point that, when faced with the problem of finding “good” pictures on Facebook, he ran a data mining contest at Kaggle.

    A good impromptu debate necessarily raises as many questions as it answers. Here’s the question that I was left with. The debate focused on whether domain expertise was necessary to ask the right questions, but a recent Guardian article,”The End of Theory,” asked a different but related question: Do we need theory (read: domain expertise) to understand the results, the output of our data analysis? The debate focused on a priori questions, but maybe the real value of domain expertise is a posteriori: after-the-fact reflection on the results and whether they make sense. Asking the right question is certainly important, but so is knowing whether you’ve gotten the right answer and knowing what that answer means. Neither problem is trivial, and in the real world, they’re often closely coupled. Often, the only way to know you’ve put garbage in is that you’ve gotten garbage out.

    By the same token, data analysis frequently produces results that make too much sense. It yields data that merely reflects the biases of the organization doing the work. Bad sampling techniques, overfitting, cherry picking datasets, overly aggressive data cleaning, and other errors in data handling can all lead to results that are either too expected or unexpected. “Stupid Data Miner Tricks” is a hilarious send-up of the problems of data mining: It shows how to “predict” the value of the S&P index over a 10-year period based on butter production in Bangladesh, cheese production in the U.S., and the world sheep population.

    An interesting post and debate. Both worth the time to read/watch.

    I am not surprised the “cons” won, saying that machine learning is more important than subject expertise, but not for the reasons Mike gives.

    True enough, data is said to be “unreasonably” effective, but when judged against what?

    When asked, 90% of all drivers think they are better than average drivers. If I remember averages, there is something wrong with that result. 😉

    The trick, according to Daniel Kahneman, is that drivers create an imaginary average and then say they are better than that.

    I wonder what “average” data is being evaluated against?

    March 25, 2012

    CS 194-16: Introduction to Data Science

    Filed under: Data Mining,Data Science — Patrick Durusau @ 7:16 pm

    CS 194-16: Introduction to Data Science

    From the homepage:

    Organizations use their data for decision support and to build data-intensive products and services. The collection of skills required by organizations to support these functions has been grouped under the term “Data Science”. This course will attempt to articulate the expected output of Data Scientists and then equip the students with the ability to deliver against these expectations. The assignments will involve web programming, statistics, and the ability to manipulate data sets with code.

    Tip: Look closely at the resources page and the notes from the 2011 course.

    March 23, 2012

    Trouble at the text mine

    Filed under: Data Mining,Search Engines,Searching — Patrick Durusau @ 7:24 pm

    Trouble at the text mine by Richard Van Noorden.

    From the post:

    When he was a keen young biology graduate student in 2006, Max Haeussler wrote a computer program that would scan, or ‘crawl’, plain text and pull out any DNA sequences. To test his invention, the naive text-miner downloaded around 20,000 research papers that his institution had paid to access — and promptly found his IP address blocked by the papers’ publisher.

    It was not until 2009 that Haeussler, then at the University of Manchester, UK, and now at the University of California, Santa Cruz, returned to the project in earnest. He had come to realize that standard site licences do not permit systematic downloads, because publishers fear wholesale theft of their content. So Haeussler began asking for licensing terms to crawl and text-mine articles. His goal was to serve science: his program is a key part of the text2genome project, which aims to use DNA sequences in research papers to link the publications to an online record of the human genome. This could produce an annotated genome map linked to millions of research articles, so that biologists browsing a genomic region could immediately click through to any relevant papers.

    But Haeussler and his text2genome colleague Casey Bergman, a genomicist at the University of Manchester, have spent more than two years trying to agree terms with publishers — and often being ignored or rebuffed. “We’ve learned it’s a long, hard road with every journal,” says Bergman.

    What Haeussler and Bergman don’t seem to “get” is that publishers have no interest in advancing science. Their sole and only goal is profiting from the content they have published. (I am not going to argue right or wrong but am simply trying to call out the positions in question.)

    The question that Haeussler and Bergman should answer for publishers is this one: What is in this “indexing” for the publishers?

    I suspect one acceptable answer would run along the lines of:

    • The full content of articles cannot be reconstructed from the indexes. The largest block of content delivered will be the article abstract, along with bibliographic reference data.
    • Pointers to the articles will point towards either the publisher’s content site and/or other commercial content providers that carry the publisher’s content.
    • The publisher’s designated journal logo (of some specified size) will appear with every reported citation.
    • The indexed content will be provided to the publishers at no charge.

    Does this mean that publishers will benefit from allowing the indexing of their content? Yes. Next question.

    Building a Bigger Haystack

    Filed under: Data Mining,Marketing,Topic Maps — Patrick Durusau @ 7:23 pm

    Counterterrorism center increases data retention time to five years by Mark Rockwell.

    From the post:

    The National Counterterrorism Center, which acts as the government’s clearinghouse for terrorist data, has moved to hold onto certain types of data for up to five years to improve its ability to keep track of it across government databases.

    On March 22, NCTC implemented new guidelines allowing a much lengthier data retention period for “terrorism information” in federal datasets including non-terrorism information. NCTC had previously been required to destroy data on citizens within three months if no ties were found to terrorism. Those rules, according to NCTC, limited the effectiveness of the data, since in some instances, the ability to link across data sets over time could help track threats that weren’t immediate, or immediately evident. According to the center, the longer retention time can aid in connecting dots that aren’t immediately evident when the initial data is collected.

    Director of National Intelligence James Clapper, Attorney General Eric Holder, and National Counterterrorism Center (NCTC) Director Matthew Olsen signed the updated guidelines designed on March 22 to allow NCTC to obtain and more effectively analyze certain data in the government’s possession to better address terrorism-related threats.

    I looked for the new guidelines but apparently they are not posted to the NCTC website.

    Here is the justification for the change:

    One of the issues identified by congress and the intelligence community after the 2009 Fort Hood shootings and the Christmas Day 2009 bombing attempt was the government’s limited ability to query multiple federal datasets and to correlate information from many sources that might relate to a potential attack, said the center. A review of those attacks recommended the intelligence community push for the use of state-of-the-art search and correlation capabilities, including techniques that would provide a single point of entry to various government databases, it said.

    “Following the failed terrorist attack in December 2009, representatives of the counterterrorism community concluded it is vital for NCTC to be provided with a variety of datasets from various agencies that contain terrorism information,” said Clapper in a March 22 statement. “The ability to search against these datasets for up to five years on a continuing basis as these updated Guidelines permit will enable NCTC to accomplish its mission more practically and effectively than the 2008 Guidelines allowed.”

    OK, so for those two cases, what evidence would having search capabilities over five years’ worth of data uncover? Even with the clarity of hindsight, there has been no showing of what data could have been uncovered.

    The father of the attacker reported his son’s intentions to the CIA on November 19, 2009. That’s right, 36 days before the attack.

    Building a bigger haystack is a singularly ineffectual way to fight terrorism. It will generate more data and more IT systems, with the personnel to man and sustain them, all of which serve agency growth rather than the goal of fighting terrorism.

    Cablegate was the result of a “bigger haystack” project. Do you think we need another one?

    Topic maps and other semantic technologies can produce smaller, relevant haystacks.

    I guess that is the question:

    Do you want more staff and a larger budget or to have the potential to combat terrorism? (The latter is only potential given that US intelligence can’t intercept bombers on 36 days’ notice.)

    March 17, 2012

    Lifebrowser: Data mining gets (really) personal at Microsoft

    Filed under: Data Mining,Microsoft,Privacy — Patrick Durusau @ 8:20 pm

    Lifebrowser: Data mining gets (really) personal at Microsoft

    Nancy Owano writes:

    Microsoft Research is doing research on software that could bring you your own personal data mining center with a touch of Proust for returns. In a recent video, Microsoft scientist Eric Horvitz demonstrated the Lifebrowser, which is prototype software that helps put your digital life in meaningful shape. The software uses machine learning to help a user place life events, which may span months or years, to be expanded or contracted selectively, in better context.

    Navigating the large stores of personal information on a user’s computer, the program goes through the piles of personal data, including photos, emails and calendar dates. A search feature can pull up landmark events on a certain topic. Filtering the data, the software calls up memory landmarks and provides a timeline interface. Lifebrowser’s timeline shows items that the user can associate with “landmark” events with the use of artificial intelligence algorithms.

    A calendar crawler, working with Microsoft Outlook, extracts various properties from calendar events, such as location, organizer, and relationships between participants. The system then applies Bayesian machine learning and reasoning to derive atypical features from events that make them memorable. Images help human memory, and an image crawler analyzes a photo library. By associating an email with a relevant calendar date with a relevant document and photos, significance is gleaned from personal life events. With a timeline in place, a user can zoom in on details of the timeline around landmarks with a “volume control” or search across the full body of information.
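    As a toy stand-in for that atypicality modeling (not Microsoft’s algorithm), one can score each calendar event by how rare its features are in the user’s own history and surface the most surprising events as candidate landmarks:

```python
# Toy stand-in for the "atypical event" idea: score each calendar event by how
# rare its features are in the user's history; rare features -> high "surprise".
import math
from collections import Counter

events = [
    {"title": "Team standup", "location": "Room 4", "organizer": "boss"},
    {"title": "Team standup", "location": "Room 4", "organizer": "boss"},
    {"title": "Team standup", "location": "Room 4", "organizer": "boss"},
    {"title": "Conference keynote", "location": "Anaheim", "organizer": "SIAM"},
    {"title": "Team standup", "location": "Room 4", "organizer": "boss"},
]

loc_counts = Counter(e["location"] for e in events)
org_counts = Counter(e["organizer"] for e in events)
n = len(events)

def surprise(event):
    # Sum of negative log-probabilities of the event's features.
    return (-math.log(loc_counts[event["location"]] / n)
            - math.log(org_counts[event["organizer"]] / n))

for e in sorted(events, key=surprise, reverse=True)[:2]:
    print(round(surprise(e), 2), e["title"])
```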

    Sounds like the start towards a “personal” topic map authoring application.

    One important detail: With MS Lifebrowser the user is gathering information on themselves.

    Not the same as having Google or Facebook gathering information on you. Is it?

    NASA Releases Atlas Of Entire Sky

    Filed under: Astroinformatics,Data Mining,Dataset — Patrick Durusau @ 8:19 pm

    NASA Releases Atlas Of Entire Sky

    J. Nicholas Hoover (InformationWeek) writes:

    NASA this week released to the Web an atlas and catalog of 18,000 images consisting of more than 563 million stars, galaxies, asteroids, planets, and other objects in the sky–many of which have never been seen or identified before–along with data on all of those objects.

    The space agency’s Wide-field Infrared Survey Explorer (WISE) mission, which was a collaboration of NASA’s Jet Propulsion Laboratory and the University of California Los Angeles, collected the data over the past two years, capturing more than 2.7 million images and processing more than 15 TB of astronomical data along the way. In order to make the data easier to use, NASA condensed the 2.7 million digital images down to 18,000 that cover the entire sky.

    The WISE mission, which mapped the entire sky, uncovered a number of never-before-seen objects in the night sky, including an entirely new class of stars and the first “Trojan” asteroid that shares the Earth’s orbital path. The study also determined that there were far fewer mid-sized asteroids near Earth than had been previously thought. Even before the mass release of data to the Web, there have already been at least 100 papers published detailing the more limited results that NASA had already released.

    Hoover also says that NASA has developed tutorials to assist developers in working with the data and that the entire database will be available in the not too distant future.

    When I see releases like this one, I am reminded of Jim Gray (MS). Jim was reported to like astronomy data sets because they are big and free. See what you think about this one.
