Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

September 12, 2013

Essential Collection of Visualisation Resources

Filed under: Data Mining,Graphics,Visualization — Patrick Durusau @ 3:27 pm

Essential Collection of Visualisation Resources by Andy Kirk.

The categories are:

Some of the resources you will have seen before but this site comes as close to being “essential” as any I have seen for visualization resources.

If you discover new or improved visualization resources, do us all a favor and send Andy a note.

September 11, 2013

Mikut Data Mining Tools Big List – Update

Filed under: Data Mining,Software — Patrick Durusau @ 5:14 pm

Mikut Data Mining Tools Big List – Update

From the post:

An update of the Excel table describing 325 recent and historical data mining tools is now online (Excel format), 31 of them were added since the last update in November 2012. These new updated tools include new published tools and some well-established tools with a statistical background.

Here is the full updated table of tools, (XLS format) which contains additional material to the paper

R. Mikut, M. Reischl: “Data Mining Tools“. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. DOI: 10.1002/widm.24., September/October 2011, Vol. 1

Please help the authors to improve this Excel table:
Contact: ralf.mikut@kit.edu

The post includes a table of the active tools with hyperlinks.

After looking at the spreadsheet, I was puzzled to find that “active and relevant” tools number only one hundred (100).

Does that seem low to you? Especially with the duplication of basic capabilities in different languages?

If you spot any obvious omissions, please send them to: ralf.mikut@kit.edu

August 30, 2013

OpenAGRIS 0.9 released:…

Filed under: Agriculture,Data Mining,Linked Data,Open Data — Patrick Durusau @ 7:25 pm

OpenAGRIS 0.9 released: new functionalities, resources & look by Fabrizio Celli.

From the post:

The AGRIS team has released OpenAGRIS 0.9, a new version of the Web application that aggregates information from different Web sources to expand the AGRIS knowledge, providing as much data as possible about a topic or a bibliographical resource within the agricultural domain.

OpenAGRIS 0.9 contains new functionalities and resources, and received a new interface in English and Spanish, with French, Arabic, Chinese and Russian translations on their way.

Mission: To make information on agricultural research globally available, interlinked with other data resources (e.g. DBPedia, World Bank, Geopolitical Ontology, FAO fisheries dataset, AGRIS serials dataset etc.) following Linked Open Data principles, allowing users to access the full text of a publication and all the information the Web holds about a specific research area in the agricultural domain (1).

Curious what agricultural experts make of this resource?

As of today, the site claims 5,076,594 records. And with all the triple bulking up, some 134,276,804 triples based on those records.

That works out to roughly 26 triples per record.

That is no mean feat, but I wonder about the granularity of the information being offered.

That is, how useful is it to find 10,000 resources when each one takes an hour to read?

More granular retrieval, that is far below the level of a file or document, is going to be necessary to avoid repetition of human data mining.

Repetitive human data mining is one of the earmarks of today’s search technology.

August 29, 2013

Data Mining with Weka [Free MOOC]

Filed under: Data Mining,Machine Learning,Weka — Patrick Durusau @ 6:25 pm

Data Mining with Weka

From the webpage:

Welcome to the free online course Data Mining with Weka

This 5 week MOOC will introduce data mining concepts through practical experience with the free Weka tool.

The course features:

The course will start September 9, 2013, with enrolments now open.

An opportunity to both keep your mind in shape and learn something useful.

The need for people with data intuition who also know machine learning is increasing.

Are you going to be the pro from Dover or not?

August 24, 2013

Information Extraction from the Internet

Filed under: Data Mining,Information Retrieval,Information Science,Publishing — Patrick Durusau @ 3:30 pm

Information Extraction from the Internet by Nan Tang.

From the description at Amazon ($116.22):

As the Internet continues to become part of our lives, there now exists an overabundance of reliable information sources on this medium. The temporal and cognitive resources of human beings, however, do not change. “Information Extraction from the Internet” provides methods and tools for Web information extraction and retrieval. Success in this area will greatly enhance business processes and provide information seekers new tools that allow them to reduce their searching time and cost involvement. This book focuses on the latest approaches for Web content extraction, and analyzes the limitations of existing technology and solutions. “Information Extraction from the Internet” includes several interesting and popular topics that are being widely discussed in the area of information extraction: data sparsity and field-associated knowledge (Chapters 1–2), Web agent design and mining components (Chapters 3–4), extraction skills on various documents (Chapters 5–7), duplicate detection for music documents (Chapter 8), name disambiguation in digital libraries using Web information (Chapter 9), Web personalization and user-behavior issues (Chapters 10–11), and information retrieval case studies (Chapters 12–14). “Information Extraction from the Internet” is suitable for advanced undergraduate students and postgraduate students. It takes a practical approach rather than a conceptual approach. Moreover, it offers a truly reader-friendly way to get to the subject related to information extraction, making it the ideal resource for any student new to this subject, and providing a definitive guide to anyone in this vibrant and evolving discipline. This book is an invaluable companion for students, from their first encounter with the subject to more advanced studies, while the full-color artworks are designed to present the key concepts with simplicity, clarity, and consistency.

I discovered this volume while searching for the publisher of: On-demand Synonym Extraction Using Suffix Arrays.

As you can see from the description, a wide ranging coverage of information extraction interests.

All of the chapters are free for downloading at the publisher’s site.

iConcepts Press has a number of books and periodicals you may find interesting.

August 23, 2013

Data miners strike gold on copyright

Filed under: Data Mining,Licensing,NSA — Patrick Durusau @ 5:40 pm

Data miners strike gold on copyright by Paul Jump.

From the post:

From early September, the biomedical publisher, which is owned by Springer, will publish all datasets under a Creative Commons CC0 licence, which waives all rights to the material.

Data miners, who use software to analyse data drawn from numerous papers, have called for CC0, also known as “no rights reserved”, to be the standard licence for datasets. Even the CC-BY licence, which is required by the UK research councils, is deemed to be a hindrance to data mining: although it does not impose restrictions on reuse, it requires every paper mined to be credited.

In a statement, the publisher says that “the true research potential of knowledge that is captured in data will only be released if data mining and other forms of data analysis and re-use are not in any form restricted by licensing requirements.

“The inclusion of the Creative Commons CC0 public domain dedication will make it clear that data from articles in BioMed Central journals is clearly and unambiguously available for sharing, integration and re-use without legal restrictions.”

As of September, the NSA won’t be violating copyright restrictions when it mines BioMed Central.

Illegality does not bother the NSA, but the BioMed Central news reduces the number of potential plaintiffs to something less than the world population + N (where N = legal entities entitled to civil damages).

You will be able to mine, manipulate and merge data from BioMed Central as well.

August 4, 2013

Web Scale? Or do you want to try for human scale?

Filed under: Data Mining,Machine Learning,Ontology — Patrick Durusau @ 4:41 pm

How often have you heard the claim that this or that technology is “web scale?”

How big is “web scale?”

Visit http://www.worldwidewebsize.com/ to get an estimate of the size of the Web.

As of today, the estimated number of indexed web pages for Google is approximately 47 billion pages.

How does that compare, say, to scholarly literature?

Would you believe 1 trillion pages of scholarly journal literature?

An incomplete inventory (Fig. 1), divided into biological, social, and physical sciences, contains 400, 200, and 65 billion pages, respectively (see supplemental data*).

Or better with an image:

[Chart: indexed Web pages vs. scholarly journal literature (biological, social and physical sciences)]

I didn’t bother putting the trillion-page figure into the chart, but for your information, the indexed Web is < 5% of all scholarly journal literature.

Nor did I try to calculate the data that Chicago is collecting every day with 10,000 video cameras.

Is your app ready to step up to human scale information retrieval?

*Advancing science through mining libraries, ontologies, and communities by JA Evans, A. Rzhetsky. J Biol Chem. 2011 Jul 8;286(27):23659-66. doi: 10.1074/jbc.R110.176370. Epub 2011 May 12.

July 30, 2013

Large File/Data Tools

Filed under: BigData,Data,Data Mining — Patrick Durusau @ 3:12 pm

Essential tools for manipulating big data files by Daniel Rubio.

From the post:

You can leverage several tools that are commonly used to manipulate big data files, which include: Regular expressions, sed, awk, WYSIWYG editors (e.g. Emacs, vi and others), scripting languages (e.g. Bash, Perl, Python and others), parsers (e.g. Expat, DOM, SAX and others), compression utilities (e.g. zip, tar, bzip2 and others) and miscellaneous Unix/Linux utilities (e.g. split, wc, sort, grep)

And,

10 Awesome Examples for Viewing Huge Log Files in Unix by Ramesh Natarajan.

Viewing huge log files for troubleshooting is a mundane routine task for sysadmins and programmers.

In this article, let us review how to effectively view and manipulate huge log files using 10 awesome examples.

The two articles cover the same topic but with very little overlap (only grep, as far as I can determine).
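For a sense of how far plain streaming gets you, here is a minimal Python sketch of the same kind of work grep | sort | uniq -c does on the command line. The file name and pattern are invented; the point is that the file is read line by line, so memory stays constant no matter how large the log is:

```python
import gzip
import re
from collections import Counter

LOG_PATH = "app.log.gz"               # hypothetical file; substitute your own
PATTERN = re.compile(r"ERROR (\w+)")  # hypothetical pattern to tally

# Stream the file line by line: constant memory, however large the file is.
opener = gzip.open if LOG_PATH.endswith(".gz") else open
counts = Counter()
with opener(LOG_PATH, "rt", errors="replace") as f:
    for line in f:
        match = PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1

# Top 10 error codes, most frequent first
for code, n in counts.most_common(10):
    print(n, code)
```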

Are there other compilations of “tools” that would be handy for large data files?

June 5, 2013

International Tracing Service Archive

Filed under: Data Mining,Dataset,Topic Maps — Patrick Durusau @ 10:45 am

International Tracing Service Archive (U.S. Holocaust Memorial Museum)

The posting on Crowdsourcing + Machine Learning… reminded me to check on access to the archives of the International Tracing Service.

Let’s just say the International Tracing Service has a poor track record on accessibility to its archives. An archive of documents the ITS describes as:

Placed end-to-end, the documents in the ITS archives would extend to a length of about 26,000 metres.

Fortunately digitized copies of portions of the archives are available at other locations, such as the U.S. Holocaust Memorial Museum.

The FAQ on the archives answers the question “Are the records going to be on the Internet?” this way:

Regrettably, the collection was neither organized nor digitized to be directly searchable online. Therefore, the Museum’s top priority is to develop software and a database that will efficiently search the records so we can quickly respond to survivor requests for information.

Only a small fraction of the records are machine readable. In order to be searched by Google or Yahoo! search engines, all of the data must be machine readable.

Searching the material is an arduous task in any event. The ITS records are in some 25 different languages and contain millions of names, many with multiple spellings. Many of the records are entirely handwritten. In cases where forms were used, the forms are written in German and the entries are often handwritten in another language.

The best way to ensure that survivors receive accurate information quickly and easily will be by submitting requests to the Museum by e-mail, regular mail, or fax, and trained Museum staff will assist with the research. The Museum will provide copies of all relevant original documents to survivors who wish to receive them via e-mail or regular mail.

The priority of the Museum is in answering requests for information from survivors.

However, we do know that multiple languages and handwritten texts are not barriers to creating machine readable texts for online searching.

The searches would not be perfect but even double-key entry of all the data would not be perfect.

What better way to introduce digital literate generations to the actuality of the Holocaust than to involve them in crowd-sourcing the proofing of a machine transcription of this archive?

Then the Holocaust would not be a few weeks in history class or a museum or memorial to visit, but an experience with the documents recording the fates of millions.

PS: Creating trails through the multiple languages, spellings, locations, etc., by researchers, trails that can be enhanced by other researchers, would highlight the advantages of topic maps in historical research.

June 1, 2013

SIGKDD explorations December 2012

Filed under: BigData,Data Mining,Graphs,Knowledge Discovery,Tweets — Patrick Durusau @ 9:29 am

SIGKDD explorations December 2012

The hard copy of SIGKDD explorations arrived in the last week.

Comments to follow on several of the articles, but if you are not a regular SIGKDD explorations reader, this issue may convince you to become one.

Quick peek:

  • War stories from Twitter (Would you believe semantic issues persist in modern IT organizations?)
  • Analyzing heterogeneous networks (Heterogeneity, everybody talks about it….)
  • “Big Graph” (Will “Big Graph” replace “Big Data?”)
  • Mining large data streams (Will “Big Streams” replace “Big Graph?”)

Along with the current state of Big Data mining, its future and other goodies.

Posts will follow on some of the articles but I wanted to give you a heads-up.

The hard copy?

I read it while our chickens are in the yard.

Local ordinance prohibits unleashed chickens on the street so I have to keep them in the yard.

May 13, 2013

Seventh ACM International Conference on Web Search and Data Mining

Filed under: Conferences,Data Mining,Searching,WWW — Patrick Durusau @ 10:08 am

WSDM 2014 : Seventh ACM International Conference on Web Search and Data Mining

Abstract submission deadline: August 19, 2013
Paper submission deadline: August 26, 2013
Tutorial proposals due: September 9, 2013
Tutorial and paper acceptance notifications: November 25, 2013
Tutorials: February 24, 2014
Main Conference: February 25-28, 2014

From the call for papers:

WSDM (pronounced “wisdom”) is one of the premier conferences covering research in the areas of search and data mining on the Web. The Seventh ACM WSDM Conference will take place in New York City, USA during February 25-28, 2014.

WSDM publishes original, high-quality papers related to search and data mining on the Web and the Social Web, with an emphasis on practical but principled novel models of search, retrieval and data mining, algorithm design and analysis, economic implications, and in-depth experimental analysis of accuracy and performance.

WSDM 2014 is a highly selective, single track meeting that includes invited talks as well as refereed full papers. Topics covered include but are not limited to:

(…)

Papers emphasizing novel algorithmic approaches are particularly encouraged, as are empirical/analytical studies of specific data mining problems in other scientific disciplines, in business, engineering, or other application domains. Application-oriented papers that make innovative technical contributions to research are welcome. Visionary papers on new and emerging topics are also welcome.

Authors are explicitly discouraged from submitting papers that do not present clearly their contribution with respect to previous works, that contain only incremental results, and that do not provide significant advances over existing approaches.

Sets a high bar but one that can be met.

Would be very nice PR to have a topic map paper among those accepted.

May 11, 2013

Data-rich astronomy: mining synoptic sky surveys [Data Bombing]

Filed under: Astroinformatics,BigData,Data Mining,GPU — Patrick Durusau @ 10:11 am

Data-rich astronomy: mining synoptic sky surveys by Stefano Cavuoti.

Abstract:

In the last decade a new generation of telescopes and sensors has allowed the production of a very large amount of data and astronomy has become a data-rich science; this transition is often labeled as: “data revolution” and “data tsunami”. The first locution puts emphasis on the expectations of the astronomers while the second stresses, instead, the dramatic problem arising from this large amount of data: which is no longer computable with traditional approaches to data storage, data reduction and data analysis. In a new age, new instruments are necessary, as it happened in the Bronze age when mankind left the old instruments made out of stone to adopt the new, better ones made with bronze. Everything changed, even the social structure. In a similar way, this new age of Astronomy calls for a new generation of tools, for a new methodological approach to many problems, and for the acquisition of new skills. The attempts to find a solution to these problems fall under the umbrella of a new discipline which originated from the intersection of astronomy, statistics and computer science: Astroinformatics (Borne, 2009; Djorgovski et al., 2006).

Dissertation by the same Stefano Cavuoti of: Astrophysical data mining with GPU….

Along with every new discipline comes semantics that are transparent to insiders and opaque to others.

Not out of malice but economy. Why explain a term if all those attending the discussion understand what it means?

But that lack of explanation, like our current ignorance about the means used to construct the pyramids, can come back to bite you.

In some cases far more quickly than intellectual curiosity about ancient monuments by the tin hat crowd.

Take the continuing failure of data integration by the U.S. intelligence services for example.

Rather than the current mule-like resistance to sharing, I would data bomb the other intelligence services with incompatible data exports every week.

Full sharing, for all they would be able to do with it.

Unless they had a topic map.

May 2, 2013

HyperLogLog — Cornerstone of a Big Data Infrastructure

Filed under: Algorithms,BigData,Data Mining,HyperLogLog,Scalability — Patrick Durusau @ 10:44 am

HyperLogLog — Cornerstone of a Big Data Infrastructure

From the introduction:

In the Zipfian world of AK, the HyperLogLog distinct value (DV) sketch reigns supreme. This DV sketch is the workhorse behind the majority of our DV counters (and we’re not alone) and enables us to have a real time, in memory data store with incredibly high throughput. HLL was conceived of by Flajolet et. al. in the phenomenal paper HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. This sketch extends upon the earlier Loglog Counting of Large Cardinalities (Durand et. al.) which in turn is based on the seminal AMS work FM-85, Flajolet and Martin’s original work on probabilistic counting. (Many thanks to Jérémie Lumbroso for the correction of the history here. I am very much looking forward to his upcoming introduction to probabilistic counting in Flajolet’s complete works.) UPDATE – Rob has recently published a blog about PCSA, a direct precursor to LogLog counting which is filled with interesting thoughts. There have been a few posts on HLL recently so I thought I would dive into the intuition behind the sketch and into some of the details.

After seeing the HyperLogLog references in Approximate Methods for Scalable Data Mining I started looking for a fuller explanation/illustration of HyperLogLog.

Stumbled on this posting.

Includes a great HyperLogLog (HLL) simulation written in JavaScript.
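If you want to go one step past simulations, the core of the sketch fits in a page of Python. This is a bare-bones version for intuition only (no small-range or bias corrections; the hash choice and register count are arbitrary), not the algorithm as tuned in the paper:

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog sketch: intuition only, no corrections."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p                 # number of registers (4096 for p=12)
        self.registers = [0] * self.m
        # alpha_m constant from Flajolet et al. (valid for m >= 128)
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, value):
        h = int(hashlib.sha1(str(value).encode()).hexdigest(), 16) & ((1 << 64) - 1)
        idx = h >> (64 - self.p)                  # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining bits
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        # harmonic mean of 2^register values, scaled by alpha * m^2
        z = 1.0 / sum(2.0 ** -r for r in self.registers)
        return self.alpha * self.m * self.m * z

hll = HyperLogLog(p=12)
for i in range(100_000):
    hll.add(f"user-{i}")
print(round(hll.count()))   # roughly 100,000, typically within a few percent
```

The whole trick is right there: 4,096 single-byte registers stand in for a 100,000-entry set.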

Enjoy!

Approximate Methods for Scalable Data Mining

Filed under: BigData,Data Mining,HyperLogLog,Scalability — Patrick Durusau @ 10:32 am

Approximate Methods for Scalable Data Mining by Andrew Clegg.

Slides from a presentation at: Data Science London 24/04/13.

To get your interest, a nice illustration of the HyperLogLog algorithm: “Billions of distinct values in 1.5KB of RAM with 2% relative error.”

Has a number of other useful illustrations and great references.
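Where does “1.5KB with 2% error” come from? A back-of-the-envelope check, assuming roughly 6 bits per register and the 1.04/√m standard error from the Flajolet et al. paper (the byte budget is the only input):

```python
import math

budget_bits = 1.5 * 1024 * 8          # 1.5 KB of RAM
m = int(budget_bits / 6)              # ~6 bits per register -> ~2048 registers
rel_error = 1.04 / math.sqrt(m)
print(m, f"{rel_error:.1%}")          # 2048 registers, about 2.3% relative error
```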

April 27, 2013

Extracting and connecting chemical structures…

Filed under: Cheminformatics,Data Mining,Text Mining — Patrick Durusau @ 6:00 pm

Extracting and connecting chemical structures from text sources using chemicalize.org by Christopher Southan and Andras Stracz.

Abstract:

Background

Exploring bioactive chemistry requires navigating between structures and data from a variety of text-based sources. While PubChem currently includes approximately 16 million document-extracted structures (15 million from patents) the extent of public inter-document and document-to-database links is still well below any estimated total, especially for journal articles. A major expansion in access to text-entombed chemistry is enabled by chemicalize.org. This on-line resource can process IUPAC names, SMILES, InChI strings, CAS numbers and drug names from pasted text, PDFs or URLs to generate structures, calculate properties and launch searches. Here, we explore its utility for answering questions related to chemical structures in documents and where these overlap with database records. These aspects are illustrated using a common theme of Dipeptidyl Peptidase 4 (DPPIV) inhibitors.

Results

Full-text open URL sources facilitated the download of over 1400 structures from a DPPIV patent and the alignment of specific examples with IC50 data. Uploading the SMILES to PubChem revealed extensive linking to patents and papers, including prior submissions from chemicalize.org as submitting source. A DPPIV medicinal chemistry paper was completely extracted and structures were aligned to the activity results table, as well as linked to other documents via PubChem. In both cases, key structures with data were partitioned from common chemistry by dividing them into individual new PDFs for conversion. Over 500 structures were also extracted from a batch of PubMed abstracts related to DPPIV inhibition. The drug structures could be stepped through each text occurrence and included some converted MeSH-only IUPAC names not linked in PubChem. Performing set intersections proved effective for detecting compounds-in-common between documents and/or merged extractions.

Conclusion

This work demonstrates the utility of chemicalize.org for the exploration of chemical structure connectivity between documents and databases, including structure searches in PubChem, InChIKey searches in Google and the chemicalize.org archive. It has the flexibility to extract text from any internal, external or Web source. It synergizes with other open tools and the application is undergoing continued development. It should thus facilitate progress in medicinal chemistry, chemical biology and other bioactive chemistry domains.

A great example of building a resource to address identity issues in a specific domain.

The result speaks for itself.
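The “set intersections … for detecting compounds-in-common” step is easy to mimic once you have structures in hand. A sketch using RDKit (assumed installed; the SMILES lists are invented stand-ins for chemicalize.org extractions from two documents):

```python
from rdkit import Chem  # open-source cheminformatics toolkit

def canonical_set(smiles_list):
    """Map raw SMILES strings to canonical SMILES, skipping anything unparsable."""
    canon = set()
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            canon.add(Chem.MolToSmiles(mol))  # canonical form
    return canon

# Invented extractions from two hypothetical documents (say, a patent and a paper)
doc_a = ["CCO", "CC(=O)Oc1ccccc1C(=O)O", "CN1CCC[C@H]1c1cccnc1"]
doc_b = ["OCC", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]

# Different surface strings ("CCO" vs "OCC") collapse to one structure,
# so the intersection finds the compounds the two documents share.
print(canonical_set(doc_a) & canonical_set(doc_b))
```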

PS: The results were not delayed awaiting a reformation of chemistry to use a common identifier.

April 23, 2013

Resources and Readings for Big Data Week DC Events

Filed under: BigData,Data,Data Mining,Natural Language Processing — Patrick Durusau @ 6:33 pm

Resources and Readings for Big Data Week DC Events

This is Big Data week in DC and Data Community DC has put together a list of books, articles and posts to keep you busy all week.

Very cool!

April 21, 2013

Collaborative annotation… [Human + Machine != Semantic Monotony]

Collaborative annotation for scientific data discovery and reuse by Kirk Borne. (Borne, K. (2013), Collaborative annotation for scientific data discovery and reuse. Bul. Am. Soc. Info. Sci. Tech., 39: 44–45. doi: 10.1002/bult.2013.1720390414)

Abstract:

Human classification alone, unable to handle the enormous quantity of project data, requires the support of automated machine-based strategies. In collaborative annotation, humans and machines work together, merging editorial strengths in semantics and pattern recognition with the machine strengths of scale and algorithmic power. Discovery informatics can be used to generate common data models, taxonomies and ontologies. A proposed project of massive scale, the Large Synoptic Survey Telescope (LSST) project, will systematically observe the southern sky over 10 years, collecting petabytes of data for analysis. The combined work of professional and citizen scientists will be needed to tag the discovered astronomical objects. The tag set will be generated through informatics and the collaborative annotation efforts of humans and machines. The LSST project will demonstrate the development and application of a classification scheme that supports search, curation and reuse of a digital repository.

A persuasive call to arms to develop “collaborative annotation:”

Humans and machines working together to produce the best possible classification label(s) is collaborative annotation. Collaborative annotation is a form of human computation [1]. Humans can see patterns and semantics (context, content and relationships) more quickly, accurately and meaningfully than machines. Human computation therefore applies to the problem of annotating, labeling and classifying voluminous data streams.

And more specifically for the Large Synoptic Survey Telescope (LSST):

The discovery potential of this data collection would be enormous, and its long-term value (through careful data management and curation) would thus require (for maximum scientific return) the participation of scientists and citizen scientists as well as science educators and their students in a collaborative knowledge mark-up (annotation and tagging) data environment. To meet this need, we envision a collaborative tagging system called AstroDAS (Astronomy Distributed Annotation System). AstroDAS is similar to existing science knowledge bases, such as BioDAS (Biology Distributed Annotation System, www.biodas.org).

As you might expect, semantic diversity is going to be present with “collaborative annotation.”

Semantic Monotony (aka Semantic Web) has failed for machines alone.

No question it will fail for humans + machines.

Are you ready to step up to the semantic diversity of collaborative annotation (humans + machines)?

PLOS Text Mining Collection

Filed under: Data Mining,Text Mining — Patrick Durusau @ 3:43 pm

The PLOS Text Mining Collection has launched!

From the webpage:

Across all realms of the sciences and beyond, the rapid growth in the number of works published digitally presents new challenges and opportunities for making sense of this wealth of textual information. The maturing field of Text Mining aims to solve problems concerning the retrieval, extraction and analysis of unstructured information in digital text, and to revolutionize how scientists access and interpret data that might otherwise remain buried in the literature.

Here PLOS acknowledges the growing body of work in the area of Text Mining by bringing together major reviews and new research studies published in PLOS journals to create the PLOS Text Mining Collection. It is no coincidence that research in Text Mining in PLOS journals is burgeoning: the widespread uptake of the Open Access publishing model developed by PLOS and other publishers now makes it easier than ever to obtain, mine and redistribute data from published texts. The launch of the PLOS Text Mining Collection complements related PLOS Collections on Open Access and Altmetrics, and further underscores the importance of the PLOS Application Programming Interface, which provides an open source interface with which to mine PLOS journal content.

The Collection is now open across the PLOS journals to all authors who wish to submit research or reviews in this area. Articles are presented below in order of publication date and new articles will be added to the Collection as they are published.

An impressive start to what promises to be a very rich resource!

I first saw this at: New: PLOS Text Mining.

April 17, 2013

Practical tools for exploring data and models

Filed under: Data,Data Mining,Data Models,Exploratory Data Analysis — Patrick Durusau @ 2:37 pm

Practical tools for exploring data and models by Hadley Alexander Wickham. (PDF)

From the introduction:

This thesis describes three families of tools for exploring data and models. It is organised in roughly the same way that you perform a data analysis. First, you get the data in a form that you can work with; Section 1.1 introduces the reshape framework for restructuring data, described fully in Chapter 2. Second, you plot the data to get a feel for what is going on; Section 1.2 introduces the layered grammar of graphics, described in Chapter 3. Third, you iterate between graphics and models to build a succinct quantitative summary of the data; Section 1.3 introduces strategies for visualising models, discussed in Chapter 4. Finally, you look back at what you have done, and contemplate what tools you need to do better in the future; Chapter 5 summarises the impact of my work and my plans for the future.

The tools developed in this thesis are firmly based in the philosophy of exploratory data analysis (Tukey, 1977). With every view of the data, we strive to be both curious and sceptical. We keep an open mind towards alternative explanations, never believing we have found the best model. Due to space limitations, the following papers only give a glimpse at this philosophy of data analysis, but it underlies all of the tools and strategies that are developed. A fuller data analysis, using many of the tools developed in this thesis, is available in Hobbs et al. (To appear).

Has a focus on R tools, including ggplot2, which builds on Wilkinson’s The Grammar of Graphics.
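The thesis and packages are R, but the reshape idea travels well. For readers who live in Python, pandas borrowed the same melt() verb; a tiny wide-to-long example (toy data, assuming pandas is installed):

```python
import pandas as pd

# One row per subject, one column per measurement: the "wide" layout
wide = pd.DataFrame({
    "subject": ["a", "b"],
    "pre":  [4.2, 3.9],
    "post": [5.1, 4.4],
})

# melt() restructures wide -> long, the basic move in Wickham's reshape framework
long_form = wide.melt(id_vars="subject", var_name="time", value_name="score")
print(long_form)
```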

The “…never believing we have found the best model” approach works for me!

You?

I first saw this at Data Scholars.

April 16, 2013

The non-negative matrix factorization toolbox for biological data mining

Filed under: Bioinformatics,Data Mining,Matrix — Patrick Durusau @ 4:20 pm

The non-negative matrix factorization toolbox for biological data mining by Yifeng Li and Alioune Ngom. (Source Code for Biology and Medicine 2013, 8:10 doi:10.1186/1751-0473-8-10)

From the post:

Background: Non-negative matrix factorization (NMF) has been introduced as an important method for mining biological data. Though there currently exist packages implemented in R and other programming languages, they either provide only a few optimization algorithms or focus on a specific application field. There does not exist a complete NMF package for the bioinformatics community that can be used to perform various data mining tasks on biological data.

Results: We provide a convenient MATLAB toolbox containing both the implementations of various NMF techniques and a variety of NMF-based data mining approaches for analyzing biological data. Data mining approaches implemented within the toolbox include data clustering and bi-clustering, feature extraction and selection, sample classification, missing values imputation, data visualization, and statistical comparison.

Conclusions: A series of analyses such as molecular pattern discovery, biological process identification, dimension reduction, disease prediction, visualization, and statistical comparison can be performed using this toolbox.

Written in a bioinformatics context but also used in text data mining (Enron emails), spectral analysis and other data mining fields. (See Non-negative matrix factorization)
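The toolbox itself is MATLAB, but if you just want a feel for what NMF does, here is a scikit-learn sketch (random toy matrix; the interesting parts are the non-negative factors W and H):

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy non-negative data matrix: 6 samples x 4 features
# (think genes x conditions, or documents x terms)
rng = np.random.default_rng(0)
X = rng.random((6, 4))

# Factor X ~ W @ H with W, H >= 0; n_components is the number of "parts"
model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
W = model.fit_transform(X)     # sample loadings (6 x 2)
H = model.components_          # basis matrix (2 x 4)

print(np.round(W @ H, 2))      # low-rank, non-negative reconstruction of X
```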

April 15, 2013

2ND International Workshop on Mining Scientific Publications

Filed under: Conferences,Data Mining,Searching,Semantic Search,Semantics — Patrick Durusau @ 2:49 pm

2ND International Workshop on Mining Scientific Publications

May 26, 2013 – Submission deadline
June 23, 2013 – Notification of acceptance
July 7, 2013 – Camera-ready
July 26, 2013 – Workshop

From the CFP:

Digital libraries that store scientific publications are becoming increasingly important in research. They are used not only for traditional tasks such as finding and storing research outputs, but also as sources for mining this information, discovering new research trends and evaluating research excellence. The rapid growth in the number of scientific publications being deposited in digital libraries makes it no longer sufficient to provide access to content to human readers only. It is equally important to allow machines to analyse this information and by doing so facilitate the processes by which research is being accomplished. Recent developments in natural language processing, information retrieval, the semantic web and other disciplines make it possible to transform the way we work with scientific publications. However, in order to make this happen, researchers first need to be able to easily access and use large databases of scientific publications and research data, to carry out experiments.

This workshop aims to bring together people from different backgrounds who:
(a) are interested in analysing and mining databases of scientific publications,
(b) develop systems, infrastructures or datasets that enable such analysis and mining,
(c) design novel technologies that improve the way research is being accomplished or
(d) support the openness and free availability of publications and research data.

2. TOPICS

The topics of the workshop will be organised around the following three themes:

  1. Infrastructures, systems, open datasets or APIs that enable analysis of large volumes of scientific publications.
  2. Semantic enrichment of scientific publications by means of text-mining, crowdsourcing or other methods.
  3. Analysis of large databases of scientific publications to identify research trends, high impact, cross-fertilisation between disciplines, research excellence and to aid content exploration.

Of particular interest for topic mappers:

Topics of interest relevant to theme 2 include, but are not limited to:

  • Novel information extraction and text-mining approaches to semantic enrichment of publications. This might range from mining publication structure, such as title, abstract, authors, citation information etc. to more challenging tasks, such as extracting names of applied methods, research questions (or scientific gaps), identifying parts of the scholarly discourse structure etc.
  • Automatic categorization and clustering of scientific publications. Methods that can automatically categorize publications according to an established subject-based classification/taxonomy (such as Library of Congress classification, UNESCO thesaurus, DOAJ subject classification, Library of Congress Subject Headings) are of particular interest. Other approaches might involve automatic clustering or classification of research publications according to various criteria.
  • New methods and models for connecting and interlinking scientific publications. Scientific publications in digital libraries are not isolated islands. Connecting publications using explicitly defined citations is very restrictive and has many disadvantages. We are interested in innovative technologies that can automatically connect and interlink publications or parts of publications, according to various criteria, such as semantic similarity, contradiction, argument support or other relationship types.
  • Models for semantically representing and annotating publications. This topic is related to aspects of semantically modeling publications and scholarly discourse. Models that are practical with respect to the state-of-the-art in Natural Language Processing (NLP) technologies are of special interest.
  • Semantically enriching/annotating publications by crowdsourcing. Crowdsourcing can be used in innovative ways to annotate publications with richer metadata or to approve/disapprove annotations created using text-mining or other approaches. We welcome papers that address the following questions: (a) what incentives should be provided to motivate users in contributing metadata, (b) how to apply crowdsourcing in the specialized domains of scientific publications, (c) what tasks in the domain of organising scientific publications is crowdsourcing suitable for and where it might fail, (d) other relevant crowdsourcing topics relevant to the domain of scientific publications.

The other themes could be viewed through a topic map lens, but semantic enrichment seems like a natural fit.
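For a taste of the “automatic categorization and clustering of scientific publications” topic above, a minimal sketch with scikit-learn. The four abstracts are made up, and real work would start from full text and an established taxonomy, but the shape of the pipeline is the same:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy "abstracts"; in practice these would be full publication text
abstracts = [
    "Non-negative matrix factorization for gene expression data",
    "Clustering of microarray gene expression profiles",
    "HyperLogLog cardinality estimation for streaming data",
    "Approximate distinct counting over massive data streams",
]

# TF-IDF features, then k-means into two clusters
X = TfidfVectorizer(stop_words="english").fit_transform(abstracts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for label, text in zip(labels, abstracts):
    print(label, text)
```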

April 13, 2013

Office of the Director of National Intelligence: Data Mining 2012

Filed under: Data Mining,Intelligence — Patrick Durusau @ 6:57 pm

Office of the Director of National Intelligence: Data Mining 2012

Office of the Director of National Intelligence = ODNI

To cut directly to the chase:

II. ODNI Data Mining Activities

The ODNI did not engage in any activities to use or develop data mining functionality during the reporting period.

My source, KDNuggets, provides the legal loophole analysis.

Who watches the watchers?

Looks like it’s going to be you and me.

Every citizen who recognizes a government employee, agent, or official: tweet the name you know them by, along with your location.

Just that.

If enough of us do that, patterns will begin to appear in the data stream.

If enough patterns appear in the data stream, the identities of government employees, agents, and officials will slowly become known.

Transparency won’t happen overnight or easily.

But if you are waiting for the watchers to watch themselves, you are going to be severely disappointed.

April 10, 2013

Can Big Data From Cellphones Help Prevent Conflict? [Privacy?]

Filed under: BigData,Data Mining,Privacy — Patrick Durusau @ 10:54 am

Can Big Data From Cellphones Help Prevent Conflict? by Emmanuel Letouzé.

From the post:

Data from social media and Ushahidi-style crowdsourcing platforms have emerged as possible ways to leverage cellphones to prevent conflict. But in the world of Big Data, the amount of information generated from these is too small to use in advanced data-mining techniques and “machine-learning” techniques (where algorithms adjust themselves based on the data they receive).

But there is another way cellphones could be leveraged in conflict settings: through the various types of data passively generated every time a device is used. “Phones can know,” said Professor Alex “Sandy” Pentland, head of the Human Dynamics Laboratory and a prominent computational social scientist at MIT, in a Wall Street Journal article. He says data trails left behind by cellphone and credit card users—“digital breadcrumbs”—reflect actual behavior and can tell objective life stories, as opposed to what is found in social media data, where intents or feelings are obscured because they are “edited according to the standards of the day.”

The findings and implications of this, documented in several studies and press articles, are nothing short of mind-blowing. Take a few examples. It has been shown that it was possible to infer whether two people were talking about politics using cellphone data, with no knowledge of the actual content of their conversation. Changes in movement and communication patterns revealed in cellphone data were also found to be good predictors of getting the flu days before it was actually diagnosed, according to MIT research featured in the Wall Street Journal. Cellphone data were also used to reproduce census data, study human dynamics in slums, and for community-wide financial coping strategies in the aftermath of an earthquake or crisis.

Very interesting post on the potential uses for cell phone data.

You can imagine what I think could be correlated with cellphone data using a topic map so I won’t bother to enumerate those possibilities.

I did want to comment on the concern about privacy from cellphone data, or re-identification as Emmanuel calls it in his post.

Governments, who have declared they can execute any of us without notice or a hearing, are the guardians of that privacy.

That causes me to lack confidence in their guarantees.

Discussions of privacy should assume governments already have unfettered access to all data.

The useful questions become: How do we detect their misuse of such data? and How do we make them heartily sorry for that misuse?

For cell phone data, open access will give government officials more reason for pause than it gives the ordinary citizen.

Less privacy for individuals but also less privacy for access, bribery, contract padding, influence peddling, and other normal functions of government.

In the U.S.A., we have given up our rights to public trial, probable cause, habeas corpus, protections against unreasonable search and seizure, to be free from touching by strangers, and several others.

What’s the loss of the right to privacy for cellphone data compared to catching government officials abusing their offices?

R Cheatsheets

Filed under: Data Mining,R — Patrick Durusau @ 10:29 am

R Cheatsheets

I ran across this collection of cheatsheets for R today.

The R Reference Card for Data Mining is the one of most interest to me, but you will want to look at some of the others.

Enjoy!

Free Data Mining Tools [African Market?]

Filed under: Data Mining,jHepWork,Knime,Mahout,Marketing,Orange,PSPP,RapidMiner,Rattle,Weka — Patrick Durusau @ 10:17 am

The Best Data Mining Tools You Can Use for Free in Your Company by: Mawuna Remarque KOUTONIN.

Short descriptions of the usual suspects, plus a couple (jHepWork and PSPP) that were new to me.

  1. RapidMiner
  2. RapidAnalytics
  3. Weka
  4. PSPP
  5. KNIME
  6. Orange
  7. Apache Mahout
  8. jHepWork
  9. Rattle

An interesting site in general.

Consider the following pitch for business success in Africa:

Africa: Your Business Should be Profitable in 45 days or Die

And the reasons for that claim:

1. “It’s almost virgin here. There are lot of opportunities, but you have to fight!”

2. “Target the vanity class with vanity products. The “new rich” have lots of money. They are tough on everything except their big ego and social reputation”

3. “Target the lazy executives and middle managers. Do the job they are paid for as a consultant. Be good, and politically savvy, and the money is yours”

4. “You’ll make more money in selling food or opening a restaurant than working for the Bank”

5. “You can’t avoid politics, but learn to think like the people you are talking with. Always finish your sentence with something like “the most important is the country’s development, not power. We all have to work in that direction”

6. “It’s about hard work and passion, but you should first forget about managing time like in Europe.

Take time to visit people, go to the vanity parties, have the patience to let stupid people finish their long empty sentences, and make the politicians understand that your project could make them win elections and strengthen their positions”

7. “Speed is everything. Think fast, Act fast, Be everywhere through friends, family and informants”

With the exception of #1, all of these points are advice I would give to someone marketing topic maps on any continent.

It may be easier to market topic maps where there are few legacy IT systems that might feel threatened by a new technology.

April 9, 2013

Astrophysical data mining with GPU…

Filed under: Astroinformatics,BigData,Data Mining,Genetic Algorithms,GPU — Patrick Durusau @ 10:02 am

Astrophysical data mining with GPU. A case study: genetic classification of globular clusters by Stefano Cavuoti, Mauro Garofalo, Massimo Brescia, Maurizio Paolillo, Antonio Pescape’, Giuseppe Longo, Giorgio Ventre.

Abstract:

We present a multi-purpose genetic algorithm, designed and implemented with GPGPU / CUDA parallel computing technology. The model was derived from our CPU serial implementation, named GAME (Genetic Algorithm Model Experiment). It was successfully tested and validated on the detection of candidate Globular Clusters in deep, wide-field, single band HST images. The GPU version of GAME will be made available to the community by integrating it into the web application DAMEWARE (DAta Mining Web Application REsource), a public data mining service specialized on massive astrophysical data. Since genetic algorithms are inherently parallel, the GPGPU computing paradigm leads to a speedup of a factor of 200x in the training phase with respect to the CPU based version.

BTW, DAMEWARE (DAta Mining Web Application REsource): http://dame.dsf.unina.it/beta_info.html

In case you are curious about the application of genetic algorithms in a low signal/noise situation with really “big” data, this is a good starting point.
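If genetic algorithms are new to you, the serial skeleton is small. This toy sketch (nothing GPU- or astronomy-specific, just an invented fitness function over bit-strings) also shows why the approach parallelizes so well: every fitness evaluation within a generation is independent of the others.

```python
import random

def fitness(bits):
    # Decode the bit-string to a number in [0, 10) and score it;
    # the made-up objective peaks at x = 3.
    x = int("".join(map(str, bits)), 2) / 2 ** len(bits) * 10
    return -(x - 3.0) ** 2

def evolve(pop_size=50, n_bits=16, generations=100, mut_rate=0.01):
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        # These fitness calls are the embarrassingly parallel part.
        ranked = sorted(pop, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]            # truncation selection
        children = []
        while len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n_bits)        # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if random.random() < mut_rate else g for g in child]
            children.append(child)
        pop = children
    best = max(pop, key=fitness)
    return int("".join(map(str, best)), 2) / 2 ** n_bits * 10

print(round(evolve(), 3))   # should land near 3.0
```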

Makes me curious about the “noise” in other communications.

The “signal” is fairly easy to identify in astronomy, but what about in text or speech?

I suppose “background noise, music, automobiles” would count as “noise” on a tape recording of a conversation, but is there “noise” in a written text?

Or noise in a conversation that is clearly audible?

If we have 100% signal, how do we explain failing to understand a message in speech or writing?

If it is not “noise,” then what is the problem?

Scrapely

Filed under: Data Mining,Python — Patrick Durusau @ 9:36 am

Scrapely

From the webpage:

Scrapely is a library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely constructs a parser for all similar pages.

A tool for data mining similar HTML pages.

Supports a command line interface.
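Usage is pleasantly small. Roughly the train/scrape workflow from the project page (the PyPI URLs come from its own example and may no longer resolve):

```python
from scrapely import Scraper

s = Scraper()

# Train on one example page by supplying the values you want extracted
url1 = "http://pypi.python.org/pypi/w3lib/1.1"
data = {"name": "w3lib 1.1", "author": "Scrapy project",
        "description": "Library of web-related functions"}
s.train(url1, data)

# Then extract the same fields from a similarly structured page
url2 = "http://pypi.python.org/pypi/Django/1.3"
print(s.scrape(url2))
```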

April 6, 2013

K-Nearest Neighbors: dangerously simple

Filed under: Data Mining,K-Nearest-Neighbors,Marketing,Topic Maps — Patrick Durusau @ 10:31 am

K-Nearest Neighbors: dangerously simple by Cathy O’Neil.

From the post:

I spend my time at work nowadays thinking about how to start a company in data science. Since there are tons of companies now collecting tons of data, and they don’t know what do to do with it, nor who to ask, part of me wants to design (yet another) dumbed-down “analytics platform” so that business people can import their data onto the platform, and then perform simple algorithms themselves, without even having a data scientist to supervise.

After all, a good data scientist is hard to find. Sometimes you don’t even know if you want to invest in this whole big data thing, you’re not sure the data you’re collecting is all that great or whether the whole thing is just a bunch of hype. It’s tempting to bypass professional data scientists altogether and try to replace them with software.

I’m here to say, it’s not clear that’s possible. Even the simplest algorithm, like k-Nearest Neighbor (k-NN), can be naively misused by someone who doesn’t understand it well. Let me explain.

The devil is all in the detail of what you mean by close. And to make things trickier, as in easier to be deceptively easy, there are default choices you could make (and which you would make) which would probably be totally stupid. Namely, the raw numbers, and Euclidean distance.
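Cathy’s point about “close” is easy to reproduce with a few made-up numbers: which point is “nearest” depends entirely on whether you feed k-NN raw units or standardized ones. A minimal sketch (names and figures invented):

```python
import numpy as np

# Two features on wildly different scales: annual salary (dollars) and age (years)
X = np.array([
    [65_200.0, 65.0],   # A: nearly the same salary, 25 years older
    [80_000.0, 41.0],   # B: $15,000 more salary, nearly the same age
    [30_000.0, 22.0],   # C: far away on both
])
q = np.array([65_000.0, 40.0])   # query point

def nearest(points, query):
    d = np.sqrt(((points - query) ** 2).sum(axis=1))
    return "ABC"[d.argmin()], np.round(d, 2)

# Raw Euclidean distance: the salary axis dominates, so A wins.
print(nearest(X, q))

# Standardize each column (z-scores) and the "nearest" neighbor becomes B.
mu, sigma = X.mean(axis=0), X.std(axis=0)
print(nearest((X - mu) / sigma, (q - mu) / sigma))
```

Neither answer is “right” in the abstract; that is exactly the undocumented meaning the rest of this post is about.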

Read and think about Cathy’s post.

All those nice, clean, clear number values and a simple math equation, muddied by meaning.

Undocumented meaning.

And undocumented relationships between the variables the number values represent.

You could document your meaning and the relationships between variables and still make dumb decisions.

The hope is you or your successor will use documented meaning and relationships to make better decisions.

For documentation you can:

  • Try to remember the meaning of “close” and the relationships for all uses of K-Nearest Neighbors where you work.
  • Write meaning and relationships down on sticky notes collected in your desk drawer.
  • Write meaning and relationships on paper or in electronic files, the latter somewhere on the server.
  • Document meaning and relationships with a topic map, so you can leverage on information already known. Including identifiers for the VP who ordered you to use particular values, for example. (Along with digitally signed copies of the email(s) in question.)

Which one are you using?

PS: This link was forwarded to me by Sam Hunting.

A Programmer’s Guide to Data Mining

Filed under: Data Mining,Python — Patrick Durusau @ 8:56 am

A Programmer’s Guide to Data Mining – The Ancient Art of the Numerati by Ron Zacharski.

From the webpage:

Before you is a tool for learning basic data mining techniques. Most data mining textbooks focus on providing a theoretical foundation for data mining, and as a result, may seem notoriously difficult to understand. Don’t get me wrong, the information in those books is extremely important. However, if you are a programmer interested in learning a bit about data mining you might be interested in a beginner’s hands-on guide as a first step. That’s what this book provides.

This guide follows a learn-by-doing approach. Instead of passively reading the book, I encourage you to work through the exercises and experiment with the Python code I provide. I hope you will be actively involved in trying out and programming data mining techniques. The textbook is laid out as a series of small steps that build on each other until, by the time you complete the book, you have laid the foundation for understanding data mining techniques. This book is available for download for free under a Creative Commons license (see link in footer). You are free to share the book, and remix it. Someday I may offer a paper copy, but the online version will always be free.

If you are looking for explanations of data mining that fall between the “dummies” variety and arXiv.org papers, you are at the right place!

Not new information but well presented information, always a rare thing.

Take the time to read this book.

If not for the content, to get some ideas on how to improve your next book.

April 5, 2013

A Newspaper Clipping Service with Cascading

Filed under: Authoring Topic Maps,Cascading,Data Mining,News — Patrick Durusau @ 5:34 am

A Newspaper Clipping Service with Cascading by Sujit Pal.

From the post:

This post describes a possible implementation for an automated Newspaper Clipping Service. The end-user is a researcher (or team of researchers) in a particular discipline who registers an interest in a set of topics (or web-pages). An assistant (or team of assistants) then scour information sources to find more documents of interest to the researcher based on these topics identified. In this particular case, the information sources were limited to a set of “approved” newspapers, hence the name “Newspaper Clipping Service”. The goal is to replace the assistants with an automated system.

The solution I came up with was to analyze the original web pages and treat keywords extracted out of these pages as topics, then for each keyword, query a popular search engine and gather the top 10 results from each query. The search engine can be customized so the sites it looks at is restricted by the list of approved newspapers. Finally the URLs of the results are aggregated together, and only URLs which were returned by more than 1 keyword topic are given back to the user.

The entire flow can be thought of as a series of Hadoop Map-Reduce jobs, to first download, extract and count keywords from (web pages corresponding to) URLs, and then to extract and count search result URLs from the keywords. I’ve been wanting to play with Cascading for a while, and this seemed like a good candidate, so the solution is implemented with Cascading.
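Stripped of Hadoop and Cascading, the aggregation logic is a few lines of Python. Everything below is invented (the canned results stand in for calls to a search engine restricted to approved newspapers), but it shows the “keep only URLs that more than one keyword returned” step:

```python
from collections import Counter

def search_top10(keyword):
    # Stand-in for the real restricted search-engine call; canned results
    # so the sketch runs on its own.
    canned = {
        "drug trafficking": ["nyt.com/a", "nyt.com/b", "wapo.com/c"],
        "mekong river":     ["nyt.com/b", "wapo.com/d"],
        "laos":             ["nyt.com/b", "wapo.com/c", "nyt.com/e"],
    }
    return canned.get(keyword, [])

def clip(keywords):
    hits = Counter()
    for kw in keywords:
        for url in set(search_top10(kw)):       # de-dupe within one keyword
            hits[url] += 1
    # Only URLs that at least two keyword queries agree on survive
    return sorted(url for url, n in hits.items() if n > 1)

print(clip(["drug trafficking", "mekong river", "laos"]))
# ['nyt.com/b', 'wapo.com/c']
```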

Hmmm, but an “automated system” leaves the user to sort, create associations, etc., for themselves.

Assistants with such a “clipping service” could curate the clippings by creating associations with other materials and adding non-obvious but useful connections.

Think of the front page of the New York Times as an interface to curated content behind the stories that appear on it.

Where “home” is the article on the front page.

Not only more prose but a web of connections to material you might not even know existed.

For example, in Beijing Flaunts Cross-Border Clout in Search for Drug Lord by Jane Perlez and Bree Feng (NYT) we learn that:

Under Lao norms, law enforcement activity is not done after dark. (Liu Yuejin, leader of the antinarcotics bureau of the Ministry of Public Security)

Could be important information, depending upon your reasons for being in Laos.
