Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

October 6, 2011

KDD and MUCMD 2011

Filed under: Bioinformatics,Biomedical,Data Mining,Knowledge Discovery — Patrick Durusau @ 5:33 pm

KDD and MUCMD 2011

An interesting review of KDD and MUCMD (Meaningful Use of Complex Medical Data) 2011:

At KDD I enjoyed Stephen Boyd’s invited talk about optimization quite a bit. However, the most interesting talk for me was David Haussler’s. His talk started out with a formidable load of biological complexity. About half-way through you start wondering, “can this be used to help with cancer?” And at the end he connects it directly to use with a call to arms for the audience: cure cancer. The core thesis here is that cancer is a complex set of diseases which can be disentangled via genetic assays, allowing attacking the specific signature of individual cancers. However, the data quantity and complex dependencies within the data require systematic and relatively automatic prediction and analysis algorithms of the kind that we are best familiar with.

Cites a number of their favorite papers. Which ones are yours?

October 5, 2011

Datawrangler

Filed under: Data Analysis,Data Mining — Patrick Durusau @ 6:50 pm

Datawrangler

From the post:

Formatting data is a necessary pain, so anything that makes formatting easier is always welcome. Data Wrangler, from the Stanford Visualization Group, is the latest in the growing set of tools to get your data the way you need it (so that you can get to the fun part already). It’s similar to Google Refine in that they’re both browser-based, but my first impression is that Data Wrangler is more lightweight and it feels more responsive.

Data Wrangler also seems to do more guesswork, so you can set less specific parameters. Just roll over stuff, and it’ll show a preview of possible changes or formatting. Keep the change or easily undo it.

The video below describes what all the tool can do, but it’s better to just try it out. Copy and paste your own mangled data or give Data Wrangler a whirl with the sample provided.

From our friends at FlowingData. Perhaps we should ask: Does data exist if it isn’t visualized?

October 2, 2011

DMO (Data Mining Ontology) Foundry

Filed under: Data Mining,Ontology — Patrick Durusau @ 6:37 pm

Email from Agnieszka Lawrynowicz advises:

We are happy to announce the opening of the DMO (Data Mining Ontology) Foundry (http://www.dmo-foundry.org/), an initiative designed to promote the development of ontologies representing the data mining domain. The DMO Foundry will gather the most significant ontologies concerning data mining and the different algorithms and resources that have been developed to support the knowledge discovery process.

Each ontology in the DMO Foundry is freely available for browsing and open discussion, as well as collaborative development, by data mining specialists all over the world. We cordially welcome all interested researchers and practitioners to join the initiative. To find out how you can participate in ontology development, click on the “How to join” tab at the top of the DMO-Foundry page.

To access and navigate an ontology, and contribute to it, click on the “Ontologies” tab, then on your selected ontology and its OWL Browser tool. As you browse, you can click on the “Comment” button to share your insights, criticisms, and suggestions on the concept or relation you are currently exploring. For more general comments, go to the “Forum” tab and post a message to initiate a discussion thread. Please note that until the end of March 2012, this site is being road-tested on the Data Mining OPtimization (DMOP) Ontology developed in the EU FP7 ICT project e-LICO (2009-2012). We are in contact with authors of other DM ontologies, but if you are developing a relevant ontology that you think we are not aware of, please set up a post in the Forum. You are also invited to contact us by writing to info@dmo-foundry.org.

Sad to say, but they have omitted topic maps from their ontology. I am writing up a post for the authors suggesting, at a minimum, the terms with PSIs at http://psi.topicmaps.org. Others?

This sounds like a link I need to forward to the astronomy folks I mentioned in > 100 New KDD Models/Methods Appear Every Month. Could at least use the class listing as a starter set for mining journal literature.

Machine Learning with Hadoop

Filed under: Data Mining,Hadoop,Machine Learning — Patrick Durusau @ 6:34 pm

Machine Learning with Hadoop by Josh Patterson.

Very current (Sept. 2011) review of Hadoop, data mining and related issues. Plus pointers to software projects such as Lumberyard, which deals with terabyte-sized time series data.

September 30, 2011

Essential Elements of Data Mining

Filed under: Data Mining — Patrick Durusau @ 7:06 pm

Essential Elements of Data Mining by Keith McCormick

From the post:

This is my attempt to clarify what Data Mining is and what it isn’t. According to Wikipedia, “In philosophy, essentialism is the view that, for any specific kind of entity, there is a set of characteristics or properties all of which any entity of that kind must possess.” I do not seek the Platonic form of Data Mining, but I do seek clarity where it is often lacking. There is much confusion surrounding how Data Mining is distinct from related areas like Statistics and Business Intelligence. My primary goal is to clarify the characteristics that a project must have to be a Data Mining project. By implication, Statistical Analysis (hypothesis testing), Business Intelligence reporting, Exploratory Data Analysis, etc., do not have all of these defining properties. They are highly valuable, but have their own unique characteristics. I have come up with ten. It is quite appropriate to emphasize the first and the last. They are the bookends of the list, and they capture the heart of the matter.

Comments? Characteristics you would add or take away?

How important is it to have a definition? Recall that creeds are created to separate sheep from goats, wheat from chaff. Are “essential characteristics” any different from a creed? If so, how?

September 26, 2011

> 100 New KDD Models/Methods Appear Every Month

Filed under: Astroinformatics,Data Mining,KDD,Knowledge Discovery — Patrick Durusau @ 7:00 pm

Got your attention? It certainly got mine when I read:

Make an inventory of existing methods relevant for astrophysical applications (more than 100 new KDD models and methods appear every month on specialized journals).

A line from the charter of the KDD-IG (Knowledge Discovery and Data Mining-Interest Group) of IVOA (International Virtual Observatory Alliance).

See: IVOA Knowledge Discovery in Databases

I checked the “A census of Data Mining and Machine Learning methods for astronomy” wiki page, but it had no takers, much less any content.

I have written to Professor Giuseppe Longo of University Federico II in Napoli, the chair of this activity, to inquire about opportunities to participate in the KDD census. I will post an updated entry when I have more information.

Separate and apart from the census, over 1,200 new KDD models/methods a year is an impressive number. I don’t think a census will slow that down. If anything, greater knowledge of other efforts may spur the creation of even more new models/methods.

DAta Mining & Exploration (DAME)

Filed under: Astroinformatics,Data Mining,Machine Learning — Patrick Durusau @ 7:00 pm

DAta Mining & Exploration (DAME)

From the website:

What is DAME

Nowadays, many scientific areas share the same need of being able to deal with massive and distributed datasets and to perform on them complex knowledge extraction tasks. This simple consideration is behind the international efforts to build virtual organizations such as, for instance, the Virtual Observatory (VObs). DAME (DAta Mining & Exploration) is an innovative, general purpose, Web-based, distributed data mining infrastructure specialized in Massive Data Sets exploration with machine learning methods.

Initially fine tuned to deal with astronomical data only, DAME has evolved in a general purpose platform program, hosting a cloud of applications and services useful also in other domains of human endeavor.

DAME is an evolving platform and new services as well as additional features are continuously added. The modular architecture of DAME can also be exploited to build applications, finely tuned to specific needs.

Follow DAME on YouTube

The project represents what is commonly considered an important element of e-science: a stronger multi-disciplinary approach based on the mutual interaction and interoperability between different scientific and technological fields (nowadays defined as X-Informatics, such as Astro-Informatics). Such an approach may have significant implications in the Knowledge Discovery in Databases process, where even near-term developments in the computing infrastructure which links data, knowledge and scientists will lead to a transformation of the scientific communication paradigm and will improve the discovery scenario in all sciences.

So far there is only one video at YouTube and it could lose the background music with no ill-effect.

The lessons learned (or applied) here should be applicable to other situations with very large data sets, say from satellites orbiting the Earth?

September 23, 2011

Oresoft Live Web Class

Filed under: CS Lectures,Data Mining — Patrick Durusau @ 6:11 pm

Oresoft Live Web Class YouTube Channel

I ran across this YouTube channel on a data mining alert I get from a search service. The data mining course looks like one of the more complete ones.

It stems from the Oresoft Academy, which conducts live virtual classes. If you have an interest in teaching, see the FAQ to learn what is required to contribute to this effort.

The Oresoft playlist offers (as of 22 September 2011):

  • Algorithms (101 sessions)
  • Compiler Design (42 sessions)
  • Computer Graphics (7 sessions)
  • Finite Automata (5 sessions)
  • Graph Theory (9 sessions)
  • Heap Sort (13 sessions)
  • Java Tutorials (16 sessions)
  • Non-Deterministic Finite Automata (14 sessions)
  • Oracle PL/SQL (27 sessions)
  • Oracle Server Concept (48 sessions; slides numbered to 49 due to a numbering error)
  • Oracle SQL (17 sessions)
  • Pumping Lemma (6 sessions)
  • Regular Expression (14 sessions)
  • Turing Machines (10 sessions)
  • Web Data Mining (127 sessions)

September 19, 2011

DiscoverText

Filed under: Data Mining,Text Analytics — Patrick Durusau @ 7:51 pm

DiscoverText

From the webpage:

DiscoverText helps you gain valuable insight about customers, products, employees, citizens, research data, and more through powerful text analytic methods. DiscoverText combines search, human judgments and inferences with automated software algorithms to create an active machine-learning loop.

DiscoverText is currently used for text analytics, market research, eDiscovery, FOIA processing, employee engagement analytics, health informatics, processing public comments by government agencies and university basic research.
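The “active machine-learning loop” is not spelled out on the site, but it presumably resembles uncertainty-sampling active learning. Here is a minimal, generic sketch in Python with scikit-learn; the function names and parameters are mine, not DiscoverText’s API:

```python
# Generic uncertainty-sampling active-learning loop (illustration only,
# not DiscoverText's implementation).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def active_learning_loop(texts, oracle_label, seed_idx, rounds=5, batch=10):
    """Repeatedly ask a human 'oracle' to label the documents the current
    model is least certain about. seed_idx must cover at least two classes."""
    vec = TfidfVectorizer(min_df=2)
    X = vec.fit_transform(texts)
    labeled = {i: oracle_label(texts[i]) for i in seed_idx}

    clf = None
    for _ in range(rounds):
        idx = list(labeled)
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X[idx], [labeled[i] for i in idx])

        # Uncertainty = how far the top class probability falls short of 1.0.
        uncertainty = 1.0 - clf.predict_proba(X).max(axis=1)
        uncertainty[idx] = -1.0              # skip already-labeled documents
        for i in np.argsort(uncertainty)[-batch:]:
            labeled[i] = oracle_label(texts[i])  # human codes the hard cases
    return clf, vec
```

The point of such a loop is that human coding effort goes only where the classifier is unsure, which is presumably what makes the training fast.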

Before I sign up for the free trial version, do you have any experience with this product? Suggested data sets that make it shine or not shine so much?

September 17, 2011

Open Data Tools

Filed under: Data Mining,Visualization — Patrick Durusau @ 8:14 pm

Open Data Tools

Not much in the way of tools yet, but it is a site worth watching.

I remain uneasy about the emphasis on tools for “open data.” Anyone can use tools to manipulate “open data,” but if you don’t know the semantics of the data, the results are problematic.

We Feel Fine

Filed under: Data Mining,Semantics — Patrick Durusau @ 8:14 pm

We Feel Fine – An exploration of human emotions, in six movements.

From the “mission” page:

Since August 2005, We Feel Fine has been harvesting human feelings from a large number of weblogs. Every few minutes, the system searches the world’s newly posted blog entries for occurrences of the phrases “I feel” and “I am feeling”. When it finds such a phrase, it records the full sentence, up to the period, and identifies the “feeling” expressed in that sentence (e.g. sad, happy, depressed, etc.). Because blogs are structured in largely standard ways, the age, gender, and geographical location of the author can often be extracted and saved along with the sentence, as can the local weather conditions at the time the sentence was written. All of this information is saved.

The result is a database of several million human feelings, increasing by 15,000 – 20,000 new feelings per day. Using a series of playful interfaces, the feelings can be searched and sorted across a number of demographic slices, offering responses to specific questions like: do Europeans feel sad more often than Americans? Do women feel fat more often than men? Does rainy weather affect how we feel? What are the most representative feelings of female New Yorkers in their 20s? What do people feel right now in Baghdad? What were people feeling on Valentine’s Day? Which are the happiest cities in the world? The saddest? And so on.

The interface to this data is a self-organizing particle system, where each particle represents a single feeling posted by a single individual. The particles’ properties – color, size, shape, opacity – indicate the nature of the feeling inside, and any particle can be clicked to reveal the full sentence or photograph it contains. The particles careen wildly around the screen until asked to self-organize along any number of axes, expressing various pictures of human emotion. We Feel Fine paints these pictures in six formal movements titled: Madness, Murmurs, Montage, Mobs, Metrics, and Mounds.

At its core, We Feel Fine is an artwork authored by everyone. It will grow and change as we grow and change, reflecting what’s on our blogs, what’s in our hearts, what’s in our minds. We hope it makes the world seem a little smaller, and we hope it helps people see beauty in the everyday ups and downs of life.

I mention this as an interesting data set and possible approach to discovering the semantic range in the use of particular terms.
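The harvesting step is easy to approximate. A minimal sketch (my own illustration, not the project’s code; the regular expression and function name are mine) of capturing “I feel” / “I am feeling” sentences and a candidate feeling word:

```python
# Naive "I feel ..." harvesting (illustration only, not We Feel Fine's code).
# Captures the sentence up to the terminating punctuation plus the word
# immediately following the phrase as the candidate "feeling".
import re

FEEL_RE = re.compile(
    r"([^.?!]*\bI\s+(?:feel|am\s+feeling)\s+(\w+)[^.?!]*[.?!]?)",
    re.IGNORECASE,
)

def harvest_feelings(text):
    """Return (sentence, feeling_word) pairs found in a blog post."""
    return [(m.group(1).strip(), m.group(2).lower())
            for m in FEEL_RE.finditer(text)]

print(harvest_feelings("Long day. I am feeling sad about the rain, honestly."))
# [('I am feeling sad about the rain, honestly.', 'sad')]
```

A real system would also need to filter non-feeling words (“I feel like…”), which is where a curated feeling lexicon earns its keep.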

Clearly we use a common enough vocabulary for Google and similar applications to be useful to most people a large part of the time. But they fail with alarming regularity and without warning as well. And therein lies the rub. How do I know that the information in the first ten (10) hits is the most important information about my query? Or even relevant, without hand-examining each hit? To say nothing of the “hits” at 100+ and beyond.

The “problem” terms are going to vary by domain, but I am curious whether identification of domains, along with the use of domain-based vocabularies, might improve searches, at least of professional literature. I suspect there are norms of usage in professional literature that may make it a “special” case. Perhaps most of the searches of interest to enterprise searchers are “special” cases in some sense of the word.

September 15, 2011

How to kill a patent with Python

Filed under: Data Mining,Graphs,Python,Visualization — Patrick Durusau @ 7:56 pm

How to kill a patent with Python: or using NLP and graph theory for great good! by Van Lindberg.

From the description:

Finding the right piece of “prior art” – technical documentation that described a patented piece of technology before the patent was filed – is like finding a needle in a very big haystack. This session will talk about how I am making that process faster and more accurate through the use of natural language processing, graph theory, machine learning, and lots of Python.

A fascinating presentation with practical suggestions on mining patents.

Key topic map statement: “People are inconsistent in the use of language.”

From the OSCON description of the same presentation:

When faced with a patent case, it is essential to find “prior art” – patents and publications that describe a technology before a certain date. The problem is that the indexing mechanisms for patents and publications are not as good as they could be, making good prior art searching more of an art than a science. We can apply some of our natural language processing and “big data” techniques to the US patent database, getting us better results more quickly.

  • Part I: The USPTO as a data source. The full-text of each patent is available from the USPTO (and now from Google.) What does this data look like? How can it be harvested and normalized to create data structures that we can work with?
  • Part II: Once the patents have been cleaned and normalized, they can be turned into data structures that we can use to evaluate their relationship to other documents. This is done in two ways – by modeling each patent as a document vector and a graph node.
  • Part IIA: Patents as document vectors. Once we have a patent as a data structure, we can treat the patent as a vector in an n-dimensional space. In moving from a document into a vector space, we will touch on normalization, stemming, TF/IDF, Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA).
  • Part IIB: Patents as technology graphs. This will show building graph structures using the connections between patents – both the built-in connections in the patents themselves as well as the connections discovered while working with the patents as vectors. We apply some social network analysis to partition the patent graph and find other documents in the same technology space.
  • Part III: What have we built? Now that we have done all this analysis, we can see some interesting things about the patent database as a whole. How does the patent database act as a map to the world of technology? And how has this helped with the original problem – finding better prior art?
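Part IIA is easy to approximate with off-the-shelf tools. A minimal sketch (mine, not Lindberg’s pipeline; the function name and topic count are placeholders) of turning patent texts into TF-IDF vectors, reducing them with LSI (truncated SVD), and ranking candidate prior art by cosine similarity:

```python
# Sketch of "patents as document vectors" (illustration only): TF-IDF,
# LSI via truncated SVD, then cosine ranking against a query patent.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def rank_prior_art(patent_texts, query_text, n_topics=100):
    """Return indices of patent_texts, most similar to query_text first."""
    vec = TfidfVectorizer(stop_words="english", sublinear_tf=True)
    X = vec.fit_transform(patent_texts + [query_text])

    # LSI: project the sparse TF-IDF space onto a small latent space.
    lsi = TruncatedSVD(n_components=min(n_topics, X.shape[1] - 1),
                       random_state=0)
    Z = lsi.fit_transform(X)

    sims = cosine_similarity(Z[-1:], Z[:-1]).ravel()
    return sims.argsort()[::-1]          # best candidate prior art first
```

Swapping LSI for LDA, or feeding the pairwise similarities into a graph for the Part IIB analysis, fits the same skeleton.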

My suggestion was to use topic maps to capture the human analysis of the clusters at the end of the day, so that it can be merged with the human analysis of other clusters.

Waiting to learn more about this project!

September 13, 2011

Discovering, Summarizing and Using Multiple Clusterings

Filed under: Clustering,Data Analysis,Data Mining — Patrick Durusau @ 7:16 pm

Proceedings of the 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings
Athens, Greece, September 5, 2011.

This collection of papers reflects what I think is rapidly becoming the consensus view: There is no one/right way to look at data.

That is important because by the application of multiple techniques, in these papers clustering techniques, you may make unanticipated discoveries about your data. Recording the trail you followed, as all explorers should, will help others duplicate your steps, to test them or to go further. In topic map terms, I would say you would be discovering and identifying subjects.

Edited by

Emmanuel Müller *
Stephan Günnemann **
Ira Assent ***
Thomas Seidl **

* Karlsruhe Institute of Technology, Germany
** RWTH Aachen University, Germany
*** Aarhus University, Denmark


Complete workshop proceedings as one file (~16 MB).

Table of Contents

    Invited Talks

  1. Combinatorial Approaches to Clustering and Feature Selection
    Michael E. Houle
  2. Cartification: Turning Similarities into Itemset Frequencies
    Bart Goethals

    Research Papers

  1. When Pattern Met Subspace Cluster
    Jilles Vreeken, Arthur Zimek
  2. Fast Multidimensional Clustering of Categorical Data
    Tengfei Liu, Nevin L. Zhang, Kin Man Poon, Yi Wang, Hua Liu
  3. Factorial Clustering with an Application to Plant Distribution Data
    Manfred Jaeger, Simon Lyager, Michael Vandborg, Thomas Wohlgemuth
  4. Subjectively Interesting Alternative Clusters
    Tijl De Bie
  5. Evaluation of Multiple Clustering Solutions
    Hans-Peter Kriegel, Erich Schubert, Arthur Zimek
  6. Browsing Robust Clustering-Alternatives
    Martin Hahmann, Dirk Habich, Wolfgang Lehner
  7. Generating a Diverse Set of High-Quality Clusterings
    Jeff M. Phillips, Parasaran Raman, Suresh Venkatasubramanian

September 10, 2011

GTD – Global Terrorism Database

Filed under: Authoring Topic Maps,Data,Data Integration,Data Mining,Dataset — Patrick Durusau @ 6:08 pm

GTD – Global Terrorism Database

From the homepage:

The Global Terrorism Database (GTD) is an open-source database including information on terrorist events around the world from 1970 through 2010 (with annual updates planned for the future). Unlike many other event databases, the GTD includes systematic data on domestic as well as international terrorist incidents that have occurred during this time period and now includes more than 98,000 cases.

While chasing down a paper that didn’t make the cut I ran across this data source.

Lacking an agreed upon definition of terrorism (see Chomsky for example), you may or may not find what you consider to be incidents of terrorism in this dataset.

Nevertheless, it is a dataset of events of popular interest and can be used to attract funding for your data integration project using topic maps.

September 7, 2011

A Performance Study of Data Mining Techniques: Multiple Linear Regression vs. Factor Analysis

Filed under: Data Mining,Factor Analysis,Linear Regression — Patrick Durusau @ 7:01 pm

A Performance Study of Data Mining Techniques: Multiple Linear Regression vs. Factor Analysis by Abhishek Taneja and R.K.Chauhan.

Abstract:

The growing volume of data usually creates an interesting challenge for the need of data analysis tools that discover regularities in these data. Data mining has emerged as disciplines that contribute tools for data analysis, discovery of hidden knowledge, and autonomous decision making in many application domains. The purpose of this study is to compare the performance of two data mining techniques viz., factor analysis and multiple linear regression for different sample sizes on three unique sets of data. The performance of the two data mining techniques is compared on following parameters like mean square error (MSE), R-square, R-Square adjusted, condition number, root mean square error(RMSE), number of variables included in the prediction model, modified coefficient of efficiency, F-value, and test of normality. These parameters have been computed using various data mining tools like SPSS, XLstat, Stata, and MS-Excel. It is seen that for all the given dataset, factor analysis outperform multiple linear regression. But the absolute value of prediction accuracy varied between the three datasets indicating that the data distribution and data characteristics play a major role in choosing the correct prediction technique.

I had to do a double-take when I saw “factor analysis” in the title of this article. I remember factor analysis from Schubert’s The Judicial Mind Revisited: Psychometric Analysis of Supreme Court Ideology, where Schubert used factor analysis to model the relative positions of the Supreme Court Justices. Schubert taught himself factor analysis on a Frieden rotary calculator. (I had one of those too but that’s a different story.)

The real lesson of this article comes at the end of the abstract: the data distribution and data characteristics play a major role in choosing the correct prediction technique.
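If you want a feel for the comparison on your own data, a rough sketch follows (not the authors’ protocol; the diabetes dataset, factor count, and split are placeholders):

```python
# Rough comparison: multiple linear regression vs. regression on factor
# scores (illustration only, not the paper's datasets or settings).
from sklearn.datasets import load_diabetes
from sklearn.decomposition import FactorAnalysis
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Plain multiple linear regression on the raw predictors.
mlr = LinearRegression().fit(X_tr, y_tr)
pred_mlr = mlr.predict(X_te)

# Factor analysis first, then regression on the factor scores.
fa = FactorAnalysis(n_components=4, random_state=0).fit(X_tr)
far = LinearRegression().fit(fa.transform(X_tr), y_tr)
pred_fa = far.predict(fa.transform(X_te))

for name, pred in [("MLR", pred_mlr), ("FA + regression", pred_fa)]:
    print(name, "MSE:", mean_squared_error(y_te, pred),
          "R2:", r2_score(y_te, pred))
```

Which approach wins will depend, as the authors say, on the distribution and characteristics of your data.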

September 6, 2011

First Look – Oracle Data Mining Update

Filed under: Data Mining,Database,Information Retrieval,SQL — Patrick Durusau @ 7:18 pm

First Look – Oracle Data Mining Update by James Taylor.

From the post:

I got an update from Oracle on Oracle Data Mining (ODM) recently. ODM is an in-database data mining and predictive analytics engine that allows you to build and use advanced predictive analytic models on data that can be accessed through your Oracle data infrastructure. I blogged about ODM extensively last year in this First Look – Oracle Data Mining and since then they have released ODM 11.2.

The fundamental architecture has not changed, of course. ODM remains a “database-out” solution surfaced through SQL and PL-SQL APIs and executing in the database. It has the 12 algorithms and 50+ statistical functions I discussed before and model building and scoring are both done in-database. Oracle Text functions are integrated to allow text mining algorithms to take advantage of them. Additionally, because ODM mines star schema data it can handle an unlimited number of input attributes, transactional data and unstructured data such as CLOBs, tables or views.

This release takes the preview GUI I discussed last time and officially releases it. This new GUI is an extension to SQL Developer 3.0 (which is available for free and downloaded by millions of SQL/database people). The “Classic” interface (wizard-based access to the APIs) is still available but the new interface is much more in line with the state of the art as far as analytic tools go.

BTW, the correct link to: First Look – Oracle Data Mining. (Taylor’s post last year on Oracle Data Mining.)

For all the buzz about NoSQL, topic map mavens should be aware of the near universal footprint of SQL and prepare accordingly.

JT on EDM

Filed under: Data Mining,Decision Making — Patrick Durusau @ 7:17 pm

JT on EDM – James Taylor on Everything Decision Management

From the about page:

James Taylor is a leading expert in Decision Management and an independent consultant specializing in helping companies automate and improve critical decisions. Previously James was a Vice President at Fair Isaac Corporation where he developed and refined the concept of enterprise decision management or EDM. Widely credited with the invention of the term and the best known proponent of the approach, James helped create the Decision Management market and is its most passionate advocate.

James has 20 years experience in all aspects of the design, development, marketing and use of advanced technology including CASE tools, project planning and methodology tools as well as platform development in PeopleSoft’s R&D team and consulting with Ernst and Young. He has consistently worked to develop approaches, tools and platforms that others can use to build more effective information systems.

Another mainstream IT/data site that you would do well to read.

Improving Entity Resolution with Global Constraints

Filed under: Data Integration,Data Mining,Entity Resolution — Patrick Durusau @ 7:00 pm

Improving Entity Resolution with Global Constraints by Jim Gemmell, Benjamin I. P. Rubinstein, and Ashok K. Chandra.

Abstract:

Some of the greatest advances in web search have come from leveraging socio-economic properties of online user behavior. Past advances include PageRank, anchor text, hubs-authorities, and TF-IDF. In this paper, we investigate another socio-economic property that, to our knowledge, has not yet been exploited: sites that create lists of entities, such as IMDB and Netflix, have an incentive to avoid gratuitous duplicates. We leverage this property to resolve entities across the different web sites, and find that we can obtain substantial improvements in resolution accuracy. This improvement in accuracy also translates into robustness, which often reduces the amount of training data that must be labeled for comparing entities across many sites. Furthermore, the technique provides robustness when resolving sites that have some duplicates, even without first removing these duplicates. We present algorithms with very strong precision and recall, and show that max weight matching, while appearing to be a natural choice turns out to have poor performance in some situations. The presented techniques are now being used in the back-end entity resolution system at a major Internet search engine.

Relies on entity resolution that has been performed in another context. I rather like that, as opposed to starting at ground zero.
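The “no gratuitous duplicates” property is what turns pairwise similarity into an assignment problem. A minimal sketch of one-to-one resolution by maximum-weight matching (mine, not the paper’s system; the string-similarity score and threshold are placeholders):

```python
# One-to-one entity resolution as maximum-weight bipartite matching
# (illustration only). Assumes both sites keep duplicate-free lists, so
# each entity on one side matches at most one entity on the other.
import numpy as np
from difflib import SequenceMatcher
from scipy.optimize import linear_sum_assignment

def resolve(site_a, site_b, threshold=0.5):
    """Return (i, j) index pairs matching titles in site_a to site_b."""
    scores = np.array([[SequenceMatcher(None, a.lower(), b.lower()).ratio()
                        for b in site_b] for a in site_a])
    rows, cols = linear_sum_assignment(-scores)   # negate for max weight
    return [(i, j) for i, j in zip(rows, cols) if scores[i, j] >= threshold]
```

Negating the score matrix turns SciPy’s minimum-cost assignment into maximum-weight matching; the threshold drops forced pairings whose similarity is too low to trust.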

I was amused that “adult titles” were excluded from the data set. I don’t have the numbers right off hand but “adult titles” account for a large percentage of movie income. Not unlike using stock market data but excluding all finance industry stocks. Seems incomplete.

September 3, 2011

Decision Support for e-Governance: A Text Mining Approach

Filed under: Data Mining,eGov,Text Extraction — Patrick Durusau @ 6:47 pm

Decision Support for e-Governance: A Text Mining Approach by G.Koteswara Rao, and Shubhamoy Dey.

Abstract:

Information and communication technology has the capability to improve the process by which governments involve citizens in formulating public policy and public projects. Even though much of government regulations may now be in digital form (and often available online), due to their complexity and diversity, identifying the ones relevant to a particular context is a non-trivial task. Similarly, with the advent of a number of electronic online forums, social networking sites and blogs, the opportunity of gathering citizens’ petitions and stakeholders’ views on government policy and proposals has increased greatly, but the volume and the complexity of analyzing unstructured data makes this difficult. On the other hand, text mining has come a long way from simple keyword search, and matured into a discipline capable of dealing with much more complex tasks. In this paper we discuss how text-mining techniques can help in retrieval of information and relationships from textual data sources, thereby assisting policy makers in discovering associations between policies and citizens’ opinions expressed in electronic public forums and blogs etc. We also present here, an integrated text mining based architecture for e-governance decision support along with a discussion on the Indian scenario.

The principles of subject identity could usefully inform many aspects of this “project.” I hesitate to use the word “project” for an effort that will eventually involve twenty-two (22) official languages, several scripts and governance of several hundred million people.

A good starting point for learning about the issues facing e-Governance in India.

DiscoverText

Filed under: Data Analysis,Data Mining,DiscoverText — Patrick Durusau @ 6:46 pm

DiscoverText

From the website:

DiscoverText helps you gain valuable insight about customers, products, employees, citizens, research data, and more through powerful text analytic methods. DiscoverText combines search, human judgments and inferences with automated software algorithms to create an active machine-learning loop.

DiscoverText is currently used for text analytics, market research, eDiscovery, FOIA processing, employee engagement analytics, health informatics, processing public comments by government agencies and university basic research.

Interesting tool set, based in the cloud.

PCAT

Filed under: Data Analysis,Data Mining,PCAT — Patrick Durusau @ 6:45 pm

PCAT – Public Comment Analysis Toolkit

A cloud based analysis service.

PCAT can import:

Federal Docket Management System Archives
Email, Blog and Wiki Content
Plain text, HTML, or XML Documents
Microsoft Word and Adobe PDFs
Excel or CSV Spreadsheets
Archived RSS Feeds
CAT-style Datasets

PCAT capabilities:

Search for key concepts & code text
Remove duplicates & cluster similar comments
Form peer & project networks
Establish credentials & permissions
Assign multiple coders to tasks
Annotate coding with shared memos
Easily measure inter-coder reliability
Adjudicate valid & invalid coder decisions
Generate reports in RTF, CSV, PDF or XML format
Archive or share completed projects online

If you have used PCAT, please comment.

September 2, 2011

Welcome to The Matrix Factorization Jungle

Filed under: Data Mining,Matrix — Patrick Durusau @ 7:55 pm

Welcome to The Matrix Factorization Jungle [A living documentation on the state-of-the-art algorithms dedicated to matrix factorization]

From the webpage:

Matrix Decompositions has a long history and generally centers around a set of known factorizations such as LU, QR, SVD and eigendecompositions. With the advent of new methods based on random projections and convex optimization that started in part in the compressive sensing literature, we are seeing a surge of very diverse algorithms dedicated to many different kinds of matrix factorizations with constraints based on rank, positivity, sparsity,… As a result of this large increase in interest, I have decided to keep a list of them here following the success of the big picture in compressive sensing.

If you are unfamiliar with the use of matrices in data mining, consider Non-negative matrix factorization and the examples cited under Text mining.
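Or try it directly. A minimal sketch of NMF over a TF-IDF matrix (my own toy documents, not an example from the Jungle page):

```python
# Minimal non-negative matrix factorization over a TF-IDF matrix
# (illustration only). Rows of H are "topics"; W gives each document's mix.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [
    "graph partitioning and spectral clustering of social networks",
    "matrix factorization for collaborative filtering and recommendations",
    "community detection in large social graphs",
    "low rank matrix completion for recommender systems",
]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)               # documents x terms, non-negative

nmf = NMF(n_components=2, init="nndsvd", random_state=0)
W = nmf.fit_transform(X)                  # documents x topics
H = nmf.components_                       # topics x terms

terms = vec.get_feature_names_out()
for k, row in enumerate(H):
    top = row.argsort()[::-1][:4]
    print("topic", k, [terms[i] for i in top])
```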

Mining Associations and Patterns from Semantic Data

Filed under: Conferences,Data Mining,Pattern Matching,Pattern Recognition,Semantic Web — Patrick Durusau @ 7:52 pm

The editors of a special issue of the International Journal on Semantic Web and Information Systems on Mining Associations and Patterns from Semantic Data have issued the following call for papers:

Guest editors: Kemafor Anyanwu, Ying Ding, Jie Tang, and Philip Yu

Large amounts of Semantic Data is being generated through semantic extractions from and annotation of traditional Web, social and sensor data. Linked Open Data has provided excellent vehicle for representation and sharing of such data. Primary vehicle to get semantics useful for better integration, search and decision making is to find interesting relationships or associations, expressed as meaningful paths, subgraphs and patterns. This special issue seeks theories, algorithms and applications of extracting such semantic relationships from large amount of semantic data. Example topics include:

  • Theories to ground associations and patterns with social, socioeconomic, biological semantics
  • Representation (e.g. language extensions) to express meaningful relationships and patterns
  • Algorithms to efficiently compute and mine semantic associations and patterns
  • Techniques for filtering, ranking and/or visualization of semantic associations and patterns
  • Application of semantic associations and patterns in a domain with significant social or society impact

IJSWIS is included in most major indices including CSI, with Thomson Scientific impact factor 2.345. We seek high quality manuscripts suitable for an archival journal based on original research. If the manuscript is based on a prior workshop or conference submission, submissions should reflect significant novel contribution/extension in conceptual terms and/or scale of implementation and evaluation (authors are highly encouraged to clarify new contributions in a cover letter or within the submission).

Important Dates:
Submission of full papers: Feb 29, 2012
Notification of paper acceptance: May 30, 2012
Publication target: 3Q 2012

Details of the journal, manuscript preparation, and recent articles are available on the website:
http://www.igi-global.com/bookstore/titledetails.aspx?titleid=1092 or http://ijswis.org

Guest Editors: Prof. Kemafor Anyanwu, North Carolina State University
Prof. Ying Ding, Indiana University
Prof. Jie Tang, Tsinghua University
Prof. Philip Yu, University of Illinois, Chicago
Contact Guest Editor: Ying Ding <dingying@indiana.edu>

August 26, 2011

My Data Mining Weblog

Filed under: Data Mining — Patrick Durusau @ 6:30 pm

My Data Mining Weblog by Ridzuan Daud.

I stumbled across this blog when I found the Python Data Mining Resources post.

I poked around after reading that post and thought the blog itself needed separate mention. Appears to be a good source of current information as well as listings of books, software, tutorials, etc. Definitely a place to spend some time if you are interested in data mining.

Python Data Mining Resources

Filed under: Data Mining,Python — Patrick Durusau @ 6:29 pm

Python Data Mining Resources

From the post:

Python for data mining has been gaining some interest from data miner community due to its open source, general purpose programming and web scripting language. Below are some resources to kick start doing data mining using Python:

A resource for Lars Marius to point others to when they have questions about his data mining techniques. Errant Perl programmers for instance. 😉

August 25, 2011

Learn to Use DiscoverText – Free Tutorial Webinar

Filed under: Data Mining,DiscoverText,Text Extraction — Patrick Durusau @ 7:00 pm

Learn to Use DiscoverText – Free Tutorial Webinar

From the announcement:

This free, live Webinar introduces DiscoverText and key features used to ingest, filter, search & code text. We take your questions and demonstrate the newest tools, including a Do-It-Yourself (DIY) machine-learning classifier. You can create a classification scheme, train the system, and run the classifier in less than 20 minutes.

DiscoverText’s latest feature additions can be easily trained to perform customized mood, sentiment and topic classification. Any custom classification scheme or topic model can be created and implemented by the user. Once a classification scheme is created, you can then use advanced, threshold-sensitive filters to look at just the documents you want.

You can also generate interactive, custom, salient word clouds using the “Cloud Explorer” and drill into the most frequently occurring terms or use advanced search and filters to create “buckets” of text.

The system makes it possible to capture, share and crowd source text data analysis in novel ways. For example, you can collect text content off Facebook, Twitter & YouTube, as well as other social media or RSS feeds.

Apologies, but as you can see from the posting date, this announcement went up only the day before the webinar. I posted this late.

It puzzles me why there is a tendency to announce webinars only a day or two in advance. Why not a week?

They have recorded prior versions of this presentation so you can still learn something about DiscoverText.

August 19, 2011

MONK

Filed under: Data Mining,Digital Library,Semantics,Text Analytics — Patrick Durusau @ 8:32 pm

MONK

From the Introduction:

The MONK Project provides access to the digitized texts described above along with tools to enable literary research through the discovery, exploration, and visualization of patterns. Users typically start a project with one of the toolsets that has been predefined by the MONK team. Each toolset is made up of individual tools (e.g. a search tool, a browsing tool, a rating tool, and a visualization), and these tools are applied to worksets of texts selected by the user from the MONK datastore. Worksets and results can be saved for later use or modification, and results can be exported in some standard formats (e.g., CSV files).

The public data set:

This instance of the MONK Project includes approximately 525 works of American literature from the 18th and 19th centuries, and 37 plays and 5 works of poetry by William Shakespeare provided by the scholars and libraries at Northwestern University, Indiana University, the University of North Carolina at Chapel Hill, and the University of Virginia. These texts are available to all users, regardless of institutional affiliation.

Digging a bit further:

Each of these texts is normalized (using Abbot, a complex XSL stylesheet) to a TEI schema designed for analytic purposes (TEI-A), and each text has been “adorned” (using Morphadorner) with tokenization, sentence boundaries, standard spellings, parts of speech and lemmata, before being ingested (using Prior) into a database that provides Java access methods for extracting data for many purposes, including searching for objects; direct presentation in end-user applications as tables, lists, concordances, or visualizations; getting feature counts and frequencies for analysis by data-mining and other analytic procedures; and getting tokenized streams of text for working with n-gram and other colocation analyses, repetition analyses, and corpus query-language pattern-matching operations. Finally, MONK’s quantitative analytics (naive Bayesian analysis, support vector machines, Dunnings log likelihood, and raw frequency comparisons), are run through the SEASR environment.
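Of the quantitative analytics listed, Dunning’s log likelihood is the easiest to show in a few lines. A minimal sketch (mine, not MONK’s SEASR implementation; the counts in the example are made up) of asking whether a word is over-represented in one corpus relative to another:

```python
# Dunning's log-likelihood (G2) for comparing a word's frequency in two
# corpora (illustration only, not MONK's implementation).
from math import log

def dunning_g2(count_a, total_a, count_b, total_b):
    """G2: how surprising is the split of a word's occurrences between
    corpus A (count_a of total_a tokens) and corpus B (count_b of total_b)?"""
    expected_a = total_a * (count_a + count_b) / (total_a + total_b)
    expected_b = total_b * (count_a + count_b) / (total_a + total_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:
            g2 += observed * log(observed / expected)
    return 2.0 * g2

# "whale" in a Moby-Dick-sized corpus A vs. a general corpus B (made-up counts).
print(dunning_g2(count_a=1200, total_a=210_000, count_b=300, total_b=1_000_000))
```

Higher G2 means the split between the two corpora is harder to explain by chance.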

Here’s my topic maps question: So, how do I reliably combine the results from a subfield that uses a different vocabulary than my own? For that matter, how do I discover it in the first place?

I think the MONK project is quite remarkable but lament the impending repetition of research across such a vast archive simply because it is unknown or expressed in a “foreign” tongue.

August 17, 2011

Biodata Mining

Filed under: Bioinformatics,Biomedical,Data Mining — Patrick Durusau @ 6:47 pm

Biodata Mining

From the webpage:

BioData Mining is an open access, peer reviewed, online journal encompassing research on all aspects of data mining applied to high-dimensional biological and biomedical data, focusing on computational aspects of knowledge discovery from large-scale genetic, transcriptomic, genomic, proteomic, and metabolomic data.

What you would have seen since 1 July 2011:

An R Package Implementation of Multifactor Dimensionality Reduction

Hill-Climbing Search and Diversification within an Evolutionary Approach to Protein Structure Prediction

Detection of putative new mutacins by bioinformatic analysis using available web tools

Evolving hard problems: Generating human genetics datasets with a complex etiology

Taxon ordering in phylogenetic trees by means of evolutionary algorithms

Enjoy!

August 16, 2011

Reflective Random Indexing and indirect inference…

Reflective Random Indexing and indirect inference: A scalable method for discovery of implicit connections by Trevor Cohen, Roger Schvaneveldt, Dominic Widdows.

Abstract:

The discovery of implicit connections between terms that do not occur together in any scientific document underlies the model of literature-based knowledge discovery first proposed by Swanson. Corpus-derived statistical models of semantic distance such as Latent Semantic Analysis (LSA) have been evaluated previously as methods for the discovery of such implicit connections. However, LSA in particular is dependent on a computationally demanding method of dimension reduction as a means to obtain meaningful indirect inference, limiting its ability to scale to large text corpora. In this paper, we evaluate the ability of Random Indexing (RI), a scalable distributional model of word associations, to draw meaningful implicit relationships between terms in general and biomedical language. Proponents of this method have achieved comparable performance to LSA on several cognitive tasks while using a simpler and less computationally demanding method of dimension reduction than LSA employs. In this paper, we demonstrate that the original implementation of RI is ineffective at inferring meaningful indirect connections, and evaluate Reflective Random Indexing (RRI), an iterative variant of the method that is better able to perform indirect inference. RRI is shown to lead to more clearly related indirect connections and to outperform existing RI implementations in the prediction of future direct co-occurrence in the MEDLINE corpus.

The term “direct inference” is used for establishing a relationship between terms with a shared “bridging” term. That is, the terms don’t co-occur in a text but share a third term that occurs in both texts. “Indirect inference,” that is, finding terms with no shared “bridging” term, is the focus of this paper.

BTW, if you don’t have access to the Journal of Biomedical Informatics version, try the draft: Reflective Random Indexing and indirect inference: A scalable method for discovery of implicit connections
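For readers new to Random Indexing itself, a bare-bones sketch (mine, not the authors’ RRI implementation; dimensions and seeds are arbitrary): give every term a sparse random “index vector” and accumulate, for each term, the index vectors of the terms it co-occurs with.

```python
# Bare-bones Random Indexing (illustration only, not the paper's RRI code).
# Each term gets a fixed sparse ternary index vector; its context vector is
# the sum of the index vectors of terms it co-occurs with in a document.
import numpy as np
from collections import defaultdict

DIM, NONZERO = 1000, 10
rng = np.random.default_rng(0)

def index_vector():
    v = np.zeros(DIM)
    pos = rng.choice(DIM, size=NONZERO, replace=False)
    v[pos] = rng.choice([-1.0, 1.0], size=NONZERO)
    return v

def train(docs):
    index = defaultdict(index_vector)             # random signature per term
    context = defaultdict(lambda: np.zeros(DIM))  # learned term vectors
    for doc in docs:
        terms = doc.lower().split()
        for t in terms:
            for other in terms:
                if other != t:
                    context[t] += index[other]
    return context

def similarity(context, a, b):
    va, vb = context[a], context[b]
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb) + 1e-9))
```

As I understand the paper, the “reflective” variant repeats this training cycle with the learned context vectors standing in for the random index vectors, which is what gives it the capacity for indirect inference.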

MKEM: a Multi-level Knowledge Emergence Model for mining undiscovered public knowledge

Filed under: Associations,Data Mining — Patrick Durusau @ 7:03 pm

MKEM: a Multi-level Knowledge Emergence Model for mining undiscovered public knowledge

Abstract:

Background

Since Swanson proposed the Undiscovered Public Knowledge (UPK) model, there have been many approaches to uncover UPK by mining the biomedical literature. These earlier works, however, required substantial manual intervention to reduce the number of possible connections and are mainly applied to disease-effect relation. With the advancement in biomedical science, it has become imperative to extract and combine information from multiple disjoint researches, studies and articles to infer new hypotheses and expand knowledge.

Methods

We propose MKEM, a Multi-level Knowledge Emergence Model, to discover implicit relationships using Natural Language Processing techniques such as Link Grammar and Ontologies such as Unified Medical Language System (UMLS) MetaMap. The contribution of MKEM is as follows: First, we propose a flexible knowledge emergence model to extract implicit relationships across different levels such as molecular level for gene and protein and Phenomic level for disease and treatment. Second, we employ MetaMap for tagging biological concepts. Third, we provide an empirical and systematic approach to discover novel relationships.

Results

We applied our system on 5000 abstracts downloaded from PubMed database. We performed the performance evaluation as a gold standard is not yet available. Our system performed with a good precision and recall and we generated 24 hypotheses.

Conclusions

Our experiments show that MKEM is a powerful tool to discover hidden relationships residing in extracted entities that were represented by our Substance-Effect-Process-Disease-Body Part (SEPDB) model.

From the article:

Swanson defined UPK as knowledge that is public and yet undiscovered: two complementary and non-interactive literature sets of articles (independently created fragments of knowledge), when they are considered together, can reveal useful information of scientific interest not apparent in either of the two sets alone [cites omitted].

Basis of UPK:

The underlying discovery method is based on the following principle: some links between two complementary passages of natural language texts can be largely a matter of form “A cause B” (association AB) and “B causes C” (association BC) (See Figure 1). From this, it can be seen that they are linked by B irrespective of the meaning of A, B, or C. However, perhaps nothing at all has been published concerning a possible connection between A and C, even though such link if validated would be of scientific interest. This allowed for the generation of several hypotheses such as “Fish’s oil can be used for treatment of Raynaud’s Disease” [cite omitted].

Fairly easy reading and interesting as well.
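The ABC pattern is simple enough to sketch (my illustration, not the MKEM system; the toy documents echo the fish oil example): record which terms co-occur, then propose A-C pairs that share a bridging B but never co-occur themselves.

```python
# Swanson's ABC pattern over term co-occurrence (illustration only, not MKEM):
# propose A-C links that share a bridging term B but never co-occur directly.
from collections import defaultdict
from itertools import combinations

def abc_hypotheses(docs):
    cooccur = defaultdict(set)
    for doc in docs:
        terms = set(doc.lower().split())
        for a, b in combinations(terms, 2):
            cooccur[a].add(b)
            cooccur[b].add(a)

    hypotheses = []
    for a in cooccur:
        for b in cooccur[a]:                  # A-B observed
            for c in cooccur[b]:              # B-C observed
                if c != a and c not in cooccur[a]:
                    hypotheses.append((a, b, c))   # A-C never observed
    return hypotheses

docs = ["fish_oil reduces blood_viscosity",
        "blood_viscosity is elevated in raynauds_disease"]
# Output includes ('fish_oil', 'blood_viscosity', 'raynauds_disease'),
# along with plenty of noise from stop words this sketch does not filter.
print(abc_hypotheses(docs))
```

Everything interesting in UPK mining is in pruning and ranking that candidate list, which is where the statistical weighting discussed below comes in.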

If you recognize TF*IDF, the primary basis for Lucene, you will be interested to learn it has some weaknesses for UPK. If I understand the authors correctly, ranking terms statistically is insufficient to mine implied relationships. Related terms aren’t ranked high enough. I don’t think “boosting” would help because the terms are not known ahead of time. I say that, although I suppose you could “boost” on the basis of implied relationships. Will have to think about that some more.

You will find “non-interactive literature sets of articles” in computer science, library science, mathematics, law, just about any field you can name. Although you can mine across those “literature sets,” it would be interesting to identify those sets, perhaps with a view towards refining UPK mining. Can you suggest ways to distinguish such “literature sets?”

Oh, link to the software: MKEM (Note to authors: Always include a link to your software, assuming it is available. Make it easy on readers to find and use your hard work!)
