Archive for the ‘Knowledge Discovery’ Category

Memantic: A Medical Knowledge Discovery Engine

Saturday, March 21st, 2015

Memantic: A Medical Knowledge Discovery Engine by Alexei Yavlinsky.


We present a system that constructs and maintains an up-to-date co-occurrence network of medical concepts based on continuously mining the latest biomedical literature. Users can explore this network visually via a concise online interface to quickly discover important and novel relationships between medical entities. This enables users to rapidly gain contextual understanding of their medical topics of interest, and we believe this constitutes a significant user experience improvement over contemporary search engines operating in the biomedical literature domain.

Alexei takes advantage of prior work on medical literature to index and display searches of medical literature in an “economical” way that can enable researchers to discover new relationships in the literature without being overwhelmed by bibliographic detail.

You will need to check my summary against the article but here is how I would describe Memantic:

Memantic indexes medical literature and records the co-occurrences of terms in every text. Those terms are mapped into a standard medical ontology (which reduces screen clutter). When a search is performed, the results are displayed as nodes based on the medical ontology, together with the relationships established by the co-occurrences found during indexing. This enables users to find relationships without searching through multiple articles or deduping their search results manually.
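To make that description concrete, here is a minimal sketch of a co-occurrence index, assuming a tiny hand-made term-to-concept table in place of a real medical ontology. The names `ONTOLOGY`, `concepts_in`, `build_cooccurrence` and `neighbours` are mine, not Memantic's, and the paper's actual indexing is certainly more involved.

```python
from collections import Counter
from itertools import combinations

# Hypothetical term-to-concept mapping standing in for a medical ontology.
ONTOLOGY = {
    "heart attack": "myocardial infarction",
    "myocardial infarction": "myocardial infarction",
    "aspirin": "aspirin",
    "acetylsalicylic acid": "aspirin",
    "stroke": "stroke",
}

def concepts_in(text):
    """Return the set of ontology concepts whose terms appear in the text."""
    lowered = text.lower()
    return {concept for term, concept in ONTOLOGY.items() if term in lowered}

def build_cooccurrence(documents):
    """Count, for each unordered concept pair, the documents mentioning both."""
    counts = Counter()
    for doc in documents:
        for pair in combinations(sorted(concepts_in(doc)), 2):
            counts[pair] += 1
    return counts

def neighbours(counts, concept):
    """List concepts co-occurring with `concept`, strongest link first."""
    linked = Counter()
    for (a, b), n in counts.items():
        if a == concept:
            linked[b] += n
        elif b == concept:
            linked[a] += n
    return linked.most_common()

docs = [
    "Aspirin after heart attack reduces mortality.",
    "Acetylsalicylic acid and myocardial infarction: a review.",
    "Aspirin for secondary prevention of stroke.",
]
network = build_cooccurrence(docs)
print(neighbours(network, "aspirin"))
```

Mapping synonyms to one concept before counting is what keeps the graph small: the first two documents use different words but strengthen the same edge.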

As I understand it, Memantic is as much an effort at efficient visualization as it is an improvement in search technique.

Very much worth a slow read over the weekend!

I first saw this in a tweet by Sami Ghazali.

PS: I tried viewing the videos listed in the paper but wasn’t able to get any sound. Maybe you will have better luck.

Knowledge Base Completion…

Thursday, February 6th, 2014

Knowledge Base Completion via Search-Based Question Answering by Robert West,


Over the past few years, massive amounts of world knowledge have been accumulated in publicly available knowledge bases, such as Freebase, NELL, and YAGO. Yet despite their seemingly huge size, these knowledge bases are greatly incomplete. For example, over 70% of people included in Freebase have no known place of birth, and 99% have no known ethnicity. In this paper, we propose a way to leverage existing Web-search–based question-answering technology to fill in the gaps in knowledge bases in a targeted way. In particular, for each entity attribute, we learn the best set of queries to ask, such that the answer snippets returned by the search engine are most likely to contain the correct value for that attribute. For example, if we want to find Frank Zappa’s mother, we could ask the query “who is the mother of Frank Zappa”. However, this is likely to return “The Mothers of Invention”, which was the name of his band. Our system learns that it should (in this case) add disambiguating terms, such as Zappa’s place of birth, in order to make it more likely that the search results contain snippets mentioning his mother. Our system also learns how many different queries to ask for each attribute, since in some cases, asking too many can hurt accuracy (by introducing false positives). We discuss how to aggregate candidate answers across multiple queries, ultimately returning probabilistic predictions for possible values for each attribute. Finally, we evaluate our system and show that it is able to extract a large number of facts with high confidence.

I was glad to see this paper was relevant to searching because any paper with Frank Zappa and “The Mothers of Invention” in the abstract deserves to be cited. 😉 I will tell you that story another day.

It’s heavy reading and I have just begun but I wanted to mention something from early in the paper:

We show that it is better to ask multiple queries and aggregate the results, rather than rely on the answers to a single query, since integrating several pieces of evidence allows for more robust estimates of answer correctness.
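A rough sketch of why aggregation helps, assuming each query returns candidate answers with a score between 0 and 1. The averaging rule and the scores below are invented for illustration; the paper learns its own aggregation, which is not reproduced here.

```python
from collections import defaultdict

def aggregate(answers_per_query):
    """Combine scored candidate answers from several queries.

    `answers_per_query` holds one dict per query, mapping a candidate
    answer to a score in [0, 1]. Scores are averaged over all queries
    (a candidate missing from a query counts as 0) and then normalised
    so the results sum to 1.
    """
    totals = defaultdict(float)
    for answers in answers_per_query:
        for candidate, score in answers.items():
            totals[candidate] += score
    n = len(answers_per_query)
    means = {c: s / n for c, s in totals.items()}
    z = sum(means.values())
    return {c: m / z for c, m in means.items()} if z else {}

# Made-up scores for three differently worded queries about the same attribute.
results = [
    {"The Mothers of Invention": 0.8, "<correct name>": 0.25},
    {"<correct name>": 0.75},
    {"<correct name>": 0.5, "The Mothers of Invention": 0.2},
]
probs = aggregate(results)
print(max(probs, key=probs.get))
```

The first query alone would pick the band. Only after pooling all three does the placeholder for the right answer come out on top.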

Does the use of multiple queries run counter to the view that querying a knowledge base, be it RDF or topic maps or other, should result in a single answer?

If you were to ask me a non-trivial question five (5) days in a row (same question) you would get at least five different answers. All in response to the same question but eliciting slightly different information.

Should we take the same approach to knowledge bases? Or do we in fact already take that approach by querying search engines with slightly different queries?


I first saw this in a tweet by Stefano Bertolo.

Hands-On Knowledge Co-Creation and Sharing

Saturday, November 30th, 2013

Hands-On Knowledge Co-Creation and Sharing, Abdul Samad Kazi, Liza Wohlfart, Patricia Wolf, editors.

From the preface:

The content management team of KnowledgeBoard launched its first book entitled “Real-Life Knowledge Management: Lessons from the Field” in April, 2006. This book was a collection of eighteen industrial case studies from twelve different countries. What differentiated this book from others lay in the fact that most of the case studies were a recording of the vast experiences of knowledge workers: the real people on the field. The book was and continues to remain a success and is used in numerous large and small organisations to solve real-life problems today based on learnings from and adaptation of the case studies to the operational norms of these organisations. It is furthermore used as valuable teaching, training and reference material, at different universities and training centres.

During a Contactivity event in 2006, participants of the event mentioned the need for a set of practical methods and techniques for effective knowledge co-creation and sharing. The initial idea was to prepare a list of existing methods and techniques in the form of a short article. During this process, we noted that while existing methods were reasonably well-documented, there existed several undocumented methods and techniques that were developed and used for specific organisational contexts by knowledge workers. Through further requests from different KnowledgeBoard community members for a new book on practical methods and techniques for knowledge creation and sharing, the content management team of KnowledgeBoard launched a call for KnowledgeBoard’s second book. “Hands-On Knowledge Co-Creation and Sharing: Practical Methods and Techniques”, the book you now hold in your hands, or browse on your screen is the result.

This book presents thirty different hands-on methods and techniques for knowledge co-creation and sharing within collaborative settings. It showcases a wide range of moderation, facilitation, collaboration, and interaction mechanisms through the use of different face-to-face and online methods and techniques. Each presented method/technique is augmented with real-life cases on its use; provides directions on what needs to be done before, during, and after the use of each method/technique to achieve tangible and measurable results; provides a set of tips and tricks on the use and adaptation of the method/technique for different contexts and settings; and provides a list of potholes to avoid when using the method/technique.

The prime audience of this book is industry practitioners, event moderators, facilitators, consultants, researchers, and academia with an interest in the use and development of effective techniques and mechanisms to foster knowledge co-creation and sharing. This book is expected to equip them with a set of usable practical methods and techniques for knowledge co-creation and sharing.

You will have to select, adapt and modify these techniques to suit your particular situation but it does offer a wide range of approaches.

I am not as confident as the editors and their authors that people will share knowledge.

My experience with non-profit organizations could be called a cult of orality. There is little or no written documentation, be it requirements for projects, procedures for backups, installation details on applications, database schemas, etc.

Questions both large and small are answered only with oral and incomplete answers.

If answers to questions were in writing, it would be possible to hold people accountable for their answers.

Not to mention the job security that comes from being the only person who knows how applications are configured.

One reason for a lack of knowledge sharing is the lack of benefit for the person sharing the knowledge.

I would think continued employment would be benefit enough but that is a management choice.

SIGKDD explorations December 2012

Saturday, June 1st, 2013

SIGKDD explorations December 2012

The hard copy of SIGKDD explorations arrived in the last week.

Comments to follow on several of the articles but if you are not a regular SIGKDD explorations reader, this issue may convince you to change.

Quick peek:

  • War stories from Twitter (Would you believe semantic issues persist in modern IT organizations?)
  • Analyzing heterogeneous networks (Heterogeneity, everybody talks about it….)
  • “Big Graph” (Will “Big Graph” replace “Big Data?”)
  • Mining large data streams (Will “Big Streams” replace “Big Graph?”)

Along with the current state of Big Data mining, its future and other goodies.

Posts will follow on some of the articles but I wanted to give you a heads-up.

The hard copy?

I read it while our chickens are in the yard.

Local ordinance prohibits unleashed chickens on the street so I have to keep them in the yard.

Apache cTAKES

Wednesday, April 10th, 2013

Apache cTAKES

From the webpage:

Apache clinical Text Analysis and Knowledge Extraction System (cTAKES) is an open-source natural language processing system for information extraction from electronic medical record clinical free-text. It processes clinical notes, identifying types of clinical named entities from various dictionaries including the Unified Medical Language System (UMLS) – medications, diseases/disorders, signs/symptoms, anatomical sites and procedures. Each named entity has attributes for the text span, the ontology mapping code, subject (patient, family member, etc.) and context (negated/not negated, conditional, generic, degree of certainty). Some of the attributes are expressed as relations, for example the location of a clinical condition (locationOf relation) or the severity of a clinical condition (degreeOf relation).

Apache cTAKES was built using the Apache UIMA Unstructured Information Management Architecture engineering framework and Apache OpenNLP natural language processing toolkit. Its components are specifically trained for the clinical domain out of diverse manually annotated datasets, and create rich linguistic and semantic annotations that can be utilized by clinical decision support systems and clinical research. cTAKES has been used in a variety of use cases in the domain of biomedicine such as phenotype discovery, translational science, pharmacogenomics and pharmacogenetics.

Apache cTAKES employs a number of rule-based and machine learning methods. Apache cTAKES components include:

  1. Sentence boundary detection
  2. Tokenization (rule-based)
  3. Morphologic normalization
  4. POS tagging
  5. Shallow parsing
  6. Named Entity Recognition
    • Dictionary mapping
    • Semantic typing is based on these UMLS semantic types: diseases/disorders, signs/symptoms, anatomical sites, procedures, medications
  7. Assertion module
  8. Dependency parser
  9. Constituency parser
  10. Semantic Role Labeler
  11. Coreference resolver
  12. Relation extractor
  13. Drug Profile module
  14. Smoking status classifier

The goal of cTAKES is to be a world-class natural language processing system in the healthcare domain. cTAKES can be used in a great variety of retrievals and use cases. It is intended to be modular and expandable at the information model and method level.
The cTAKES community is committed to best practices and R&D (research and development) by using cutting edge technologies and novel research. The idea is to quickly translate the best performing methods into cTAKES code.

Processing a text with cTAKES is a process of adding semantic information to the text.
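To give a feel for what "adding semantic information" means, here is a toy sketch in plain Python. It is not the cTAKES API (cTAKES is built on Apache UIMA, as the quoted text says); the dictionary, the placeholder codes, the cue words and the `annotate` function are all hypothetical stand-ins for the dictionary mapping and assertion steps listed above.

```python
import re
from dataclasses import dataclass

# Hypothetical dictionary: term -> (semantic type, placeholder ontology code).
DICTIONARY = {
    "chest pain": ("sign/symptom", "CODE-001"),
    "pneumonia": ("disease/disorder", "CODE-002"),
    "aspirin": ("medication", "CODE-003"),
}
# Hypothetical negation cues; real assertion modules are far richer.
NEGATION_CUES = ("no", "denies", "without")

@dataclass
class Entity:
    text: str
    begin: int
    end: int
    semantic_type: str
    code: str
    negated: bool

def annotate(note):
    """Find dictionary terms in each sentence and flag simple negation."""
    entities = []
    # Naive sentence boundary detection: split after '.', '!' or '?'.
    for sentence in re.finditer(r"[^.!?]+[.!?]?", note):
        offset = sentence.start()
        lowered = sentence.group().lower()
        for term, (sem_type, code) in DICTIONARY.items():
            for hit in re.finditer(r"\b" + re.escape(term) + r"\b", lowered):
                # A cue word earlier in the same sentence marks negation.
                before = lowered[:hit.start()].split()
                negated = any(cue in before for cue in NEGATION_CUES)
                begin, end = offset + hit.start(), offset + hit.end()
                entities.append(
                    Entity(note[begin:end], begin, end, sem_type, code, negated))
    return sorted(entities, key=lambda e: e.begin)

note = "Patient denies chest pain. Started aspirin for pneumonia."
entities = annotate(note)
for e in entities:
    print(e)
```

Each `Entity` carries the attributes the webpage describes: a text span, an ontology code and a context flag. Once those are attached, a search for patients with chest pain can skip the note where it is denied.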

As you can imagine, the better the semantics that are added, the better searching and other functions become.

In order to make added semantic information interoperable, well, that’s a topic map question.

I first saw this in a tweet by Tim O’Reilly.

Knowledge Discovery from Mining Big Data [Astronomy]

Tuesday, March 19th, 2013

Knowledge Discovery from Mining Big Data – Presentation by Kirk Borne by Bruce Berriman.

From the post:

My friend and colleague Kirk Borne, of George Mason University, is a specialist in the modern field of data mining and astroinformatics. I was delighted to learn that he was giving a talk on an introduction to this topic as part of the Space Telescope Engineering and Technology Colloquia, and so I watched on the webcast. You can watch the presentation on-line, and you can download the slides from the same page. The presentation is a comprehensive introduction to data mining in astronomy, and I recommend it if you want to grasp the essentials of the field.

Kirk began by reminding us that responding to the data tsunami is a national priority in essentially all fields of science – a number of nationally commissioned working groups have been unanimous in reaching this conclusion and in emphasizing the need for scientific and educational programs in data mining. The slides give a list of publications in this area.

Deeply entertaining presentation on big data.

The first thirty minutes or so are good for “big data” quotes and hype but the real meat comes at about slide 22.

Extends the 3 V’s (Volume, Variety, Velocity) to include Veracity, Variability, Venue, Vocabulary, Value.

And outlines classes of discovery:

  • Class Discovery
    • Finding new classes of objects and behaviors
    • Learning the rules that constrain the class boundaries
  • Novelty Discovery
    • Finding new, rare, one-in-a-million(billion)(trillion) objects and events
  • Correlation Discovery
    • Finding new patterns and dependencies, which reveal new natural laws or new scientific principles
  • Association Discovery
    • Finding unusual (improbable) co-occurring associations
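As a small illustration of novelty discovery, the sketch below flags values that sit far from the bulk of a sample, using the modified z-score based on the median absolute deviation. That choice of method, the 3.5 cutoff and the sample values are my assumptions; the presentation does not prescribe a particular technique.

```python
from statistics import median

def rare_items(values, threshold=3.5):
    """Return (index, value) pairs whose modified z-score exceeds threshold.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation from the median.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # No spread to measure against.
    return [(i, v) for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]

# Made-up measurements with one obvious outlier.
values = [10.1, 9.8, 10.0, 10.3, 9.9, 25.0, 10.2]
print(rare_items(values))
```

Median-based scores are used here because a single extreme value barely moves the median, whereas it would inflate a mean and standard deviation and partly hide itself.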

A great presentation with references and other names you will want to follow on big data and astroinformatics.

Call for KDD Cup Competition Proposals

Sunday, February 10th, 2013

Call for KDD Cup Competition Proposals

From the post:

Please let us know if you are interested in being considered for the 2013 KDD Cup Competition by filling out the form below.

This is the official call for proposals for the KDD Cup 2013 competition. The KDD Cup is the well known data mining competition of the annual ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD-2013 conference will be held in Chicago from August 11 – 14, 2013. The competition will last between 6 and 8 weeks and the winners should be notified by end-June. The winners will be announced in the KDD-2013 conference and we are planning to run a workshop as well.

A good competition task is one that is practically useful, scientifically or technically challenging, can be done without extensive application domain knowledge, and can be evaluated objectively. Of particular interest are non-traditional tasks/data that require novel techniques and/or thoughtful feature construction.

Proposals should involve data and a problem whose successful completion will result in a contribution of some lasting value to a field or discipline. You may assume that Kaggle will provide the technical support for running the contest. The data needs to be available no later than mid-March.

If you have initial questions about the suitability of your data/problem feel free to reach out to claudia.perlich [at]

Do you have:

non-traditional tasks/data that require[s] novel techniques and/or thoughtful feature construction?

Is collocation of information on the basis of multi-dimensional subject identity a non-traditional task?

Does extraction of multiple dimensions of a subject identity from users require novel techniques?

If so, what data sets would you suggest using in this challenge?

I first saw this at: 19th ACM SIGKDD Knowledge Discovery and Data Mining Conference.

BIOKDD 2013 :…Biological Knowledge Discovery and Data Mining

Saturday, November 24th, 2012

BIOKDD 2013 : 4th International Workshop on Biological Knowledge Discovery and Data Mining

When Aug 26, 2013 – Aug 30, 2013
Where Prague, Czech Republic
Abstract Registration Due Apr 3, 2013
Submission Deadline Apr 10, 2013
Notification Due May 10, 2013
Final Version Due May 20, 2013

From the call for papers:

With the development of Molecular Biology during the last decades, we are witnessing an exponential growth of both the volume and the complexity of biological data. For example, the Human Genome Project provided the sequence of the 3 billion DNA bases that constitute the human genome. And, consequently, we are provided too with the sequences of about 100,000 proteins. Therefore, we are entering the post-genomic era: after having focused so many efforts on the accumulation of data, we have now to focus as much effort, and even more, on the analysis of these data. Analyzing this huge volume of data is a challenging task because, not only, of its complexity and its multiple and numerous correlated factors, but also, because of the continuous evolution of our understanding of the biological mechanisms. Classical approaches of biological data analysis are no longer efficient and produce only a very limited amount of information, compared to the numerous and complex biological mechanisms under study. From here comes the necessity to use computer tools and develop new in silico high performance approaches to support us in the analysis of biological data and, hence, to help us in our understanding of the correlations that exist between, on one hand, structures and functional patterns of biological sequences and, on the other hand, genetic and biochemical mechanisms. Knowledge Discovery and Data Mining (KDD) are a response to these new trends.

Topics of BIOKDD’13 workshop include, but not limited to:

Data Preprocessing: Biological Data Storage, Representation and Management (data warehouses, databases, sequences, trees, graphs, biological networks and pathways, …), Biological Data Cleaning (errors removal, redundant data removal, completion of missing data, …), Feature Extraction (motifs, subgraphs, …), Feature Selection (filter approaches, wrapper approaches, hybrid approaches, embedded approaches, …)

Data Mining: Biological Data Regression (regression of biological sequences…), Biological data clustering/biclustering (microarray data biclustering, clustering/biclustering of biological sequences, …), Biological Data Classification (classification of biological sequences…), Association Rules Learning from Biological Data, Text mining and Application to Biological Sequences, Web mining and Application to Biological Data, Parallel, Cloud and Grid Computing for Biological Data Mining

Data Postprocessing: Biological Nuggets of Knowledge Filtering, Biological Nuggets of Knowledge Representation and Visualization, Biological Nuggets of Knowledge Evaluation (calculation of the classification error rate, evaluation of the association rules via numerical indicators, e.g. measurements of interest, … ), Biological Nuggets of Knowledge Integration

Being held in conjunction with 24th International Conference on Database and Expert Systems Applications – DEXA 2013.

In case you are wondering about BIOKDD, consider the BIOKDD Programme for 2012.

Or the DEXA program for 2012.

Looks like a very strong set of conferences and workshops.

…Knowledge Extraction From Complex Astronomical Data Sets

Friday, November 23rd, 2012

CLaSPS: A New Methodology For Knowledge Extraction From Complex Astronomical Data Sets by R. D’Abrusco, G. Fabbiano, G. Djorgovski, C. Donalek, O. Laurino and G. Longo. (R. D’Abrusco et al. 2012 ApJ 755 92 doi:10.1088/0004-637X/755/2/92)


In this paper, we present the Clustering-Labels-Score Patterns Spotter (CLaSPS), a new methodology for the determination of correlations among astronomical observables in complex data sets, based on the application of distinct unsupervised clustering techniques. The novelty in CLaSPS is the criterion used for the selection of the optimal clusterings, based on a quantitative measure of the degree of correlation between the cluster memberships and the distribution of a set of observables, the labels, not employed for the clustering. CLaSPS has been primarily developed as a tool to tackle the challenging complexity of the multi-wavelength complex and massive astronomical data sets produced by the federation of the data from modern automated astronomical facilities. In this paper, we discuss the applications of CLaSPS to two simple astronomical data sets, both composed of extragalactic sources with photometric observations at different wavelengths from large area surveys. The first data set, CSC+, is composed of optical quasars spectroscopically selected in the Sloan Digital Sky Survey data, observed in the x-rays by Chandra and with multi-wavelength observations in the near-infrared, optical, and ultraviolet spectral intervals. One of the results of the application of CLaSPS to the CSC+ is the re-identification of a well-known correlation between the αOX parameter and the near-ultraviolet color, in a subset of CSC+ sources with relatively small values of the near-ultraviolet colors. The other data set consists of a sample of blazars for which photometric observations in the optical, mid-, and near-infrared are available, complemented for a subset of the sources, by Fermi γ-ray data. 
The main results of the application of CLaSPS to such data sets have been the discovery of a strong correlation between the multi-wavelength color distribution of blazars and their optical spectral classification in BL Lac objects and flat-spectrum radio quasars, and a peculiar pattern followed by blazars in the WISE mid-infrared colors space. This pattern and its physical interpretation have been discussed in detail in other papers by one of the authors.
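The selection idea in the abstract, picking the clustering whose memberships line up best with labels that were not used for clustering, can be sketched as follows. I have swapped in plain mutual information as the score; the paper defines its own measure, which is not reproduced here, and the candidate clusterings are invented.

```python
from collections import Counter
from math import log2

def mutual_information(clusters, labels):
    """Mutual information (in bits) between cluster ids and external labels."""
    n = len(clusters)
    joint = Counter(zip(clusters, labels))
    cluster_sizes = Counter(clusters)
    label_sizes = Counter(labels)
    return sum(
        (nij / n) * log2(nij * n / (cluster_sizes[c] * label_sizes[l]))
        for (c, l), nij in joint.items()
    )

def best_clustering(candidates, labels):
    """Name of the candidate clustering that best matches the labels."""
    return max(candidates,
               key=lambda name: mutual_information(candidates[name], labels))

# Labels held out from clustering (the two blazar classes from the abstract).
labels = ["BL Lac", "BL Lac", "FSRQ", "FSRQ"]
# Two made-up clusterings of the same four sources.
candidates = {
    "a": [0, 0, 1, 1],  # clusters coincide with the spectral classes
    "b": [0, 1, 0, 1],  # clusters cut across the classes
}
print(best_clustering(candidates, labels))
```

One caveat with this stand-in: raw mutual information tends to favour clusterings with more clusters, so a real comparison across different cluster counts would need a normalised or chance-corrected score.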

A new approach for mining “…correlations in complex and massive astronomical data sets produced by the federation of the data from modern automated astronomical facilities.”

Mining complex and massive data sets. I have heard that somewhere recently. Sure it will come back to me.

Service-Oriented Distributed Knowledge Discovery

Thursday, October 25th, 2012

Service-Oriented Distributed Knowledge Discovery by Domenico Talia, University of Calabria, Rende, Italy; Paolo Trunfio.

The publisher’s summary reads:

A new approach to distributed large-scale data mining, service-oriented knowledge discovery extracts useful knowledge from today’s often unmanageable volumes of data by exploiting data mining and machine learning distributed models and techniques in service-oriented infrastructures. Service-Oriented Distributed Knowledge Discovery presents techniques, algorithms, and systems based on the service-oriented paradigm. Through detailed descriptions of real software systems, it shows how the techniques, models, and architectures can be implemented.

The book covers key areas in data mining and service-oriented computing. It presents the concepts and principles of distributed knowledge discovery and service-oriented data mining. The authors illustrate how to design services for data analytics, describe real systems for implementing distributed knowledge discovery applications, and explore mobile data mining models. They also discuss the future role of service-oriented knowledge discovery in ubiquitous discovery processes and large-scale data analytics.

Highlighting the latest achievements in the field, the book gives many examples of the state of the art in service-oriented knowledge discovery. Both novices and more seasoned researchers will learn useful concepts related to distributed data mining and service-oriented data analysis. Developers will also gain insight on how to successfully use service-oriented knowledge discovery in databases (KDD) frameworks.

The idea of service-oriented data mining/analysis is very compatible with topic maps as marketable information sets.

It is not available through any of my usual channels, yet, but I would be cautious at $89.95 for 230 pages of text.

More comments to follow when I have a chance to review the text.

I first saw this at KDNuggets.

Bisociative Knowledge Discovery

Monday, July 2nd, 2012

Bisociative Knowledge Discovery: An Introduction to Concept, Algorithms, Tools, and Applications by Michael R. Berthold. (Lecture Notes in Computer Science, Volume 7250, 2012, DOI: 10.1007/978-3-642-31830-6)

The volume where Berthold’s Towards Bisociative Knowledge Discovery appears.

Follow the links for article abstracts and additional information. “PDFs” are available under Springer Open Access.

If you are familiar with Steve Newcomb’s universes of discourse, this will sound hauntingly familiar.

How will diverse methodologies of bisociative knowledge discovery, being in different universes of discourse, interchange information?

Topic maps anyone?

Towards Bisociative Knowledge Discovery

Monday, July 2nd, 2012

Towards Bisociative Knowledge Discovery by Michael R. Berthold.


Knowledge discovery generally focuses on finding patterns within a reasonably well connected domain of interest. In this article we outline a framework for the discovery of new connections between domains (so called bisociations), supporting the creative discovery process in a more powerful way. We motivate this approach, show the difference to classical data analysis and conclude by describing a number of different types of domain-crossing connections.

What is a bisociation you ask?

Informally, bisociation can be defined as (sets of) concepts that bridge two otherwise not –or only very sparsely– connected domains whereas an association bridges concepts within a given domain. Of course, not all bisociation candidates are equally interesting and in analogy to how Boden assesses the interestingness of a creative idea as being new, surprising, and valuable [4], a similar measure for interestingness can be specified when the underlying set of domains and their concepts are known.

Berthold describes two forms of bisociation as bridging concepts and graphs, although saying subject identity and associations would be more familiar to topic map users.
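A bridging concept is easy to picture in code. The sketch below assumes a graph given as a list of edges plus a table assigning some concepts to a domain, and reports concepts whose neighbours come from more than one domain. The function name, the toy graph and the "more than one domain" test are my simplification, not Berthold's formal definition.

```python
def bridging_concepts(edges, domain_of):
    """Return concepts linked to concepts from more than one domain.

    `edges` is a list of (concept, concept) pairs; `domain_of` maps a
    concept to its domain. Concepts without a known domain are ignored
    when collecting neighbour domains.
    """
    seen = {}
    for a, b in edges:
        for node, other in ((a, b), (b, a)):
            domain = domain_of.get(other)
            if domain is not None:
                seen.setdefault(node, set()).add(domain)
    return sorted(c for c, domains in seen.items() if len(domains) > 1)

# Toy graph: two small domains joined only through "network".
edges = [
    ("neuron", "synapse"),
    ("neuron", "network"),
    ("router", "network"),
    ("router", "packet"),
]
domain_of = {
    "neuron": "biology",
    "synapse": "biology",
    "router": "computing",
    "packet": "computing",
}
print(bridging_concepts(edges, domain_of))
```

Every other link in the toy graph is an ordinary association within one domain; only "network" connects the two, which is the sense of bridging in the definition quoted above.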

This essay introduces more than four hundred pages of papers so there is much more to explore.

These materials are “open access” so take the opportunity to learn more about this developing field.

As always, terminology/identification is going to vary so there will be a role for topic maps.

Knowledge Discovery Using Cloud and Distributed Computing Platforms (KDCloud, 2012)

Thursday, June 21st, 2012

Knowledge Discovery Using Cloud and Distributed Computing Platforms (KDCloud, 2012)

From the website:

Paper Submission August 10, 2012

Acceptance Notice October 01, 2012

Camera-Ready Copy October 15, 2012

Workshop December 10, 2012 Brussels, Belgium

Collocated with the IEEE International Conference on Data Mining, ICDM 2012

From the website:

The 3rd International Workshop on Knowledge Discovery Using Cloud and Distributed Computing Platforms (KDCloud, 2012) provides an international platform to share and discuss recent research results in adopting cloud and distributed computing resources for data mining and knowledge discovery tasks.

Synopsis: Processing large datasets using dedicated supercomputers alone is not an economical solution. Recent trends show that distributed computing is becoming a more practical and economical solution for many organizations. Cloud computing, which is a large-scale distributed computing, has attracted significant attention of both industry and academia in recent years. Cloud computing is fast becoming a cheaper alternative to costly centralized systems. Many recent studies have shown the utility of cloud computing in data mining, machine learning and knowledge discovery. This workshop intends to bring together researchers, developers, and practitioners from academia, government, and industry to discuss new and emerging trends in cloud computing technologies, programming models, and software services and outline the data mining and knowledge discovery approaches that can efficiently exploit this modern computing infrastructures. This workshop also seeks to identify the greatest challenges in embracing cloud computing infrastructure for scaling algorithms to petabyte sized datasets. Thus, we invite all researchers, developers, and users to participate in this event and share, contribute, and discuss the emerging challenges in developing data mining and knowledge discovery solutions and frameworks around cloud and distributed computing platforms.

Topics: The major topics of interest to the workshop include but are not limited to:

  • Programing models and tools needed for data mining, machine learning, and knowledge discovery
  • Scalability and complexity issues
  • Security and privacy issues relevant to KD community
  • Best use cases: are there a class of algorithms that best suit to cloud and distributed computing platforms
  • Performance studies comparing clouds, grids, and clusters
  • Performance studies comparing various distributed file systems for data intensive applications
  • Customizations and extensions of existing software infrastructures such as Hadoop for streaming, spatial, and spatiotemporal data mining
  • Applications: Earth science, climate, energy, business, text, web and performance logs, medical, biology, image and video.

It’s December, Belgium and an interesting workshop. Can’t ask for much more than that!

KDIR 2012 : International Conference on Knowledge Discovery and Information

Wednesday, February 15th, 2012

KDIR 2012 : International Conference on Knowledge Discovery and Information

Regular Paper Submission: April 17, 2012
Authors Notification (regular papers): June 12, 2012
Final Regular Paper Submission and Registration: July 4, 2012

From the call for papers:

Knowledge Discovery is an interdisciplinary area focusing upon methodologies for identifying valid, novel, potentially useful and meaningful patterns from data, often based on underlying large data sets. A major aspect of Knowledge Discovery is data mining, i.e. applying data analysis and discovery algorithms that produce a particular enumeration of patterns (or models) over the data. Knowledge Discovery also includes the evaluation of patterns and identification of which add to knowledge. This has proven to be a promising approach for enhancing the intelligence of software systems and services. The ongoing rapid growth of online data due to the Internet and the widespread use of large databases have created an important need for knowledge discovery methodologies. The challenge of extracting knowledge from data draws upon research in a large number of disciplines including statistics, databases, pattern recognition, machine learning, data visualization, optimization, and high-performance computing, to deliver advanced business intelligence and web discovery solutions.

Information retrieval (IR) is concerned with gathering relevant information from unstructured and semantically fuzzy data in texts and other media, searching for information within documents and for metadata about documents, as well as searching relational databases and the Web. Automation of information retrieval enables the reduction of what has been called “information overload”.

Information retrieval can be combined with knowledge discovery to create software tools that empower users of decision support systems to better understand and use the knowledge underlying large data sets.

Part of IC3K 2012 – International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management.

Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD)

Monday, January 16th, 2012

The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD) will take place in Bristol, UK from September 24th to 28th, 2012.


Abstract submission deadline: Thu 19 April 2012
Paper submission deadline: Mon 23 April 2012
Early author notification: Mon 28 May 2012
Author notification: Fri 15 June 2012
Camera ready submission: Fri 29 June 2012
Conference: Mon – Fri, 24-28 September, 2012.

From the call for papers:

The European Conference on “Machine Learning” and “Principles and Practice of Knowledge Discovery in Databases” (ECML-PKDD) provides an international forum for the discussion of the latest high-quality research results in all areas related to machine learning and knowledge discovery in databases and other innovative application domains.

Submissions are invited on all aspects of machine learning, knowledge discovery and data mining, including real-world applications.

The overriding criteria for acceptance will be a paper’s:

  • potential to inspire the research community by introducing new and relevant problems, concepts, solution strategies, and ideas;
  • contribution to solving a problem widely recognized as both challenging and important;
  • capability to address a novel area of impact of machine learning and data mining.

Other criteria are scientific rigour and correctness, challenges overcome, quality and reproducibility of the experiments, and presentation.

I rather like that: quality and reproducibility of the experiments.

As opposed to the “just believe in the power of ….” pitch, where you will supposedly get all manner of benefits but no one can produce data to prove those claims.

Reminds me of the astronomer in Samuel Johnson’s Rasselas, who claimed:

I have possessed for five years the regulation of the weather and the distribution of the seasons. The sun has listened to my dictates, and passed from tropic to tropic by my direction; the clouds at my call have poured their waters, and the Nile has overflowed at my command. I have restrained the rage of the dog-star, and mitigated the fervours of the crab. The winds alone, of all the elemental powers, have hitherto refused my authority, and multitudes have perished by equinoctial tempests which I found myself unable to prohibit or restrain. I have administered this great office with exact justice, and made to the different nations of the earth an impartial dividend of rain and sunshine. What must have been the misery of half the globe if I had limited the clouds to particular regions, or confined the sun to either side of the equator?

And when asked how he knew this to be true, replied:

“‘Because,’ said he, ‘I cannot prove it by any external evidence; and I know too well the laws of demonstration to think that my conviction ought to influence another, who cannot, like me, be conscious of its force. I therefore shall not attempt to gain credit by disputation. It is sufficient that I feel this power that I have long possessed, and every day exerted it. But the life of man is short; the infirmities of age increase upon me, and the time will soon come when the regulator of the year must mingle with the dust. The care of appointing a successor has long disturbed me; the night and the day have been spent in comparisons of all the characters which have come to my knowledge, and I have yet found none so worthy as thyself.’” (emphasis added)

Project Gutenberg has a copy online: Rasselas, Prince of Abyssinia, by Samuel Johnson.

For my part, I think semantic integration has been, is and will be hard, not to mention expensive.

Determining your ROI is just as necessary for a semantic integration project, whatever technology you choose, as for any other project.

KDD and MUCMD 2011

Thursday, October 6th, 2011

KDD and MUCMD 2011

An interesting review of KDD and MUCMD (Meaningful Use of Complex Medical Data) 2011:

At KDD I enjoyed Stephen Boyd’s invited talk about optimization quite a bit. However, the most interesting talk for me was David Haussler’s. His talk started out with a formidable load of biological complexity. About half-way through you start wondering, “can this be used to help with cancer?” And at the end he connects it directly to use with a call to arms for the audience: cure cancer. The core thesis here is that cancer is a complex set of diseases which can be disentangled via genetic assays, allowing attacking the specific signature of individual cancers. However, the data quantity and complex dependencies within the data require systematic and relatively automatic prediction and analysis algorithms of the kind that we are best familiar with.

The review also cites a number of the author’s favorite papers. Which ones are yours?

> 100 New KDD Models/Methods Appear Every Month

Monday, September 26th, 2011

Got your attention? It certainly got mine when I read:

Make an inventory of existing methods relevant for astrophysical applications (more than 100 new KDD models and methods appear every month on specialized journals).

A line from the charter of the KDD-IG (Knowledge Discovery and Data Mining-Interest Group) of IVOA (International Virtual Observatory Alliance).

See: IVOA Knowledge Discovery in Databases

I checked the “A census of Data Mining and Machine Learning methods for astronomy” wiki page but it had no takers, much less any content.

I have written to Professor Giuseppe Longo of University Federico II in Napoli, the chair of this activity, to inquire about opportunities to participate in the KDD census. I will post an updated entry when I have more information.

Separate and apart from the census, over 1,200 new KDD models/methods a year is an impressive number. I don’t think a census will slow that down. If anything, greater knowledge of other efforts may spur the creation of even more new models/methods.