Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

August 19, 2011

MONK

Filed under: Data Mining,Digital Library,Semantics,Text Analytics — Patrick Durusau @ 8:32 pm

MONK

From the Introduction:

The MONK Project provides access to the digitized texts described above along with tools to enable literary research through the discovery, exploration, and visualization of patterns. Users typically start a project with one of the toolsets that has been predefined by the MONK team. Each toolset is made up of individual tools (e.g. a search tool, a browsing tool, a rating tool, and a visualization), and these tools are applied to worksets of texts selected by the user from the MONK datastore. Worksets and results can be saved for later use or modification, and results can be exported in some standard formats (e.g., CSV files).

The public data set:

This instance of the MONK Project includes approximately 525 works of American literature from the 18th and 19th centuries, and 37 plays and 5 works of poetry by William Shakespeare provided by the scholars and libraries at Northwestern University, Indiana University, the University of North Carolina at Chapel Hill, and the University of Virginia. These texts are available to all users, regardless of institutional affiliation.

Digging a bit further:

Each of these texts is normalized (using Abbot, a complex XSL stylesheet) to a TEI schema designed for analytic purposes (TEI-A), and each text has been “adorned” (using Morphadorner) with tokenization, sentence boundaries, standard spellings, parts of speech and lemmata, before being ingested (using Prior) into a database that provides Java access methods for extracting data for many purposes, including searching for objects; direct presentation in end-user applications as tables, lists, concordances, or visualizations; getting feature counts and frequencies for analysis by data-mining and other analytic procedures; and getting tokenized streams of text for working with n-gram and other colocation analyses, repetition analyses, and corpus query-language pattern-matching operations. Finally, MONK’s quantitative analytics (naive Bayesian analysis, support vector machines, Dunnings log likelihood, and raw frequency comparisons), are run through the SEASR environment.
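As a toy illustration of the kind of token-stream feature counting described above (not MONK's actual code; the sample text and function are mine), counting word bigrams in Python might look like this:

```python
from collections import Counter

def ngram_counts(tokens, n=2):
    """Count n-grams in a tokenized text stream."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "to be or not to be that is the question".split()
print(ngram_counts(tokens, n=2).most_common(3))
# ('to', 'be') occurs twice in this sample; everything else once
```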

Here’s my topic maps question: So, how do I reliably combine the results from a subfield that uses a different vocabulary than my own? For that matter, how do I discover it in the first place?

I think the MONK project is quite remarkable but lament the impending repetition of research across such a vast archive simply because it is unknown or expressed in a “foreign” tongue.

Tutorial on Dr. Who (and Neo4j)

Filed under: Graphs,Neo4j — Patrick Durusau @ 8:32 pm

Neo4j User Group: A Short Tutorial on Doctor Who (and Neo4j)

From the webpage:

With June’s Neo4j meeting we’re moving to our new slot of last Wednesday of the month (29 June). But more importantly, we’re going to be getting our hands on the code. We’ve packed a wealth of Doctor Who knowledge into a graph, ready for you to start querying. At the end of 90 minutes and a couple of Koans, you’ll be answering questions about the Doctor Who universe like a die-hard fan. You’ll need a laptop, your Java IDE of choice, and a copy of the Koans, which you can grab from http://bit.ly/neo4j-koan

The short tutorial is noted at: A Short Tutorial on Doctor Who (and Neo4j).

High Performance Computing (HPC)

Filed under: Cloud Computing — Patrick Durusau @ 8:31 pm

High Performance Computing (HPC) over at Amazon Web Services.

From the website:

Researchers and businesses alike have complex computational workloads such as tightly coupled parallel processes or demanding network-bound applications, from genome sequencing to financial modeling. Regardless of the application, one major issue affects them both: procuring and provisioning machines. In typical cluster environments, there is a long queue to access machines, and purchasing dedicated, purpose-built hardware takes time and considerable upfront investment.

With Amazon Web Services, businesses and researchers can easily fulfill their high performance computational requirements with the added benefit of ad-hoc provisioning and pay-as-you-go pricing.

I have a pretty full Fall but want to investigate AWS for topic map experiments and possibly even delivery of content.

Yes, AWS has crashed; on that, see Why the AWS Crash is a Good Thing by Chris Hawkins.

Anyone using it presently? War stories you want to share?

Predicate dispatching: A unified theory of dispatch

Filed under: Classifier,Predicate Dispatch — Patrick Durusau @ 8:31 pm

Predicate dispatching: A unified theory of dispatch

The term predicate dispatching was new to me and so I checked at Stackoverflow and found: What is predicate dispatch?

This paper was one of the answers; it is accompanied by slides, an implementation, and a manual.

Abstract:

Predicate dispatching generalizes previous method dispatch mechanisms by permitting arbitrary predicates to control method applicability and by using logical implication between predicates as the overriding relationship. The method selected to handle a message send can depend not just on the classes of the arguments, as in ordinary object-oriented dispatch, but also on the classes of subcomponents, on an argument’s state, and on relationships between objects. This simple mechanism subsumes and extends object-oriented single and multiple dispatch, ML-style pattern matching, predicate classes, and classifiers, which can all be regarded as syntactic sugar for predicate dispatching. This paper introduces predicate dispatching, gives motivating examples, and presents its static and dynamic semantics. An implementation of predicate dispatching is available.

Thought it might be interesting weekend reading.
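To get a feel for the idea before the weekend, here is a minimal Python sketch of predicate-guarded methods. It is purely illustrative: the paper's mechanism also orders methods by logical implication between predicates, which this toy registry does not attempt.

```python
# A toy predicate-dispatch registry: each method is guarded by an arbitrary
# predicate over the argument, not just by the argument's class.
_methods = []

def defmethod(predicate):
    def register(fn):
        _methods.append((predicate, fn))
        return fn
    return register

def dispatch(x):
    for predicate, fn in _methods:
        if predicate(x):
            return fn(x)
    raise TypeError("no applicable method for %r" % (x,))

@defmethod(lambda x: isinstance(x, int) and x < 0)
def negative(x):
    return "negative integer"

@defmethod(lambda x: isinstance(x, int))
def integer(x):
    return "integer"

print(dispatch(-3))  # "negative integer" -- the more specific predicate is registered first
print(dispatch(7))   # "integer"
```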

Pattern Matching & Predicate Dispatch

Filed under: Clojure,Pattern Matching,Predicate Dispatch — Patrick Durusau @ 8:30 pm

Pattern Matching & Predicate Dispatch by David Nolen, NYC Clojure Users Group 8/17/2011.

From the description:

From David:

“Pattern matching is a powerful tool for processing data based on type and structure. In this talk I’ll discuss a new library I’ve been working on that provides optimized, extensible pattern matching to Clojure. We’ll cover the theory, the implementation, and future directions for this library, in particular the true goal – predicate dispatch.”

David Nolen is the primary developer of the Clojure contrib library core.logic – an efficient Prolog-like logic engine. The pattern matching library continues his investigations into the relationships between object-oriented, functional, and logic programming.

Slides: http://www.scribd.com/doc/62571669/Patterns

Source: https://github.com/swannodette/match

August 18, 2011

Building data startups: Fast, big, and focused

Filed under: Analytics,BigData,Data,Data Analysis,Data Integration — Patrick Durusau @ 6:54 pm

Building data startups: Fast, big, and focused (O’Reilly original)

Republished by Forbes as:
Data powers a new breed of startup

Based on the talk Building data startups: Fast, Big, and Focused by Michael E. Driscoll.

From the post:

A new breed of startup is emerging, built to take advantage of the rising tides of data across a variety of verticals and the maturing ecosystem of tools for its large-scale analysis.

These are data startups, and they are the sumo wrestlers on the startup stage. The weight of data is a source of their competitive advantage. But like their sumo mentors, size alone is not enough. The most successful of data startups must be fast (with data), big (with analytics), and focused (with services).

Describes the emerging big data stack and says:

The competitive axes and representative technologies on the Big Data stack are illustrated here. At the bottom tier of data, free tools are shown in red (MySQL, Postgres, Hadoop), and we see how their commercial adaptations (InfoBright, Greenplum, MapR) compete principally along the axis of speed; offering faster processing and query times. Several of these players are pushing up towards the second tier of the data stack, analytics. At this layer, the primary competitive axis is scale: few offerings can address terabyte-scale data sets, and those that do are typically proprietary. Finally, at the top layer of the big data stack lies the services that touch consumers and businesses. Here, focus within a specific sector, combined with depth that reaches downward into the analytics tier, is the defining competitive advantage.

The future isn’t going to be about getting users to develop topic maps, but about your use of topic maps (and other tools) to create data products of interest to users.

Think of it as the difference between selling oil change equipment versus being the local Jiffy Lube. (For non-U.S. residents: Jiffy Lube is a chain offering oil changes and other services, with some 2,000 locations in North America.) I dare say that Jiffy Lube and its competitors perform more auto services than users of oil change equipment do.

Integration Imperatives Around Complex Big Data

Filed under: BigData,Data as Service (DaaS),Data Integration,Marketing — Patrick Durusau @ 6:52 pm

Integration Imperatives Around Complex Big Data

  • Informatica Corporation (NASDAQ: INFA), the world’s number one independent provider of data integration software, today announced the availability of a new research report from the Aberdeen Group that shows how organizations can get the most from their data integration assets in the face of rapidly growing data volumes and increasing data complexity.
  • Entitled: Future Integration Needs: Embracing Complex Data, the Aberdeen report reveals that:
    • Big Data is the new reality – In 2010, organizations experienced a staggering average data volume growth of 40 percent.
    • XML adoption has increased dramatically – XML is the most common semi-structured data source that organizations integrate. 74 percent of organizations are integrating XML from external sources. 66 percent of organizations are integrating XML from internal sources.
    • Data complexity is skyrocketing – In the next 12 months enterprises plan to introduce more complex unstructured data sources – including office productivity documents, email, web content and social media data – than any other data type.
    • External data sources are proliferating – On average, organizations are integrating 14 external data sources, up from 11 a year ago.
    • Integration costs are rising – As integration of external data rises, it continues to be a labor- and cost-intensive task, with organizations integrating external sources spending 25 percent of their total integration budget in this area.
  • For example, according to Aberdeen, organizations that have effectively integrated complex data are able to:
    • Use up to 50 percent larger data sets for business intelligence and analytics.
    • Integrate external unstructured data into business processes twice as successfully (40 percent vs. 19 percent).
    • Deliver critical information in the required time window 2.5 times more often via automated data refresh.
    • Slash the incidence of errors in their data almost in half compared to organizations relying on manual intervention when performing data updates and refreshes.
    • Spend an average of 43 percent less on integration software (based on 2010 spend).
    • Develop integration competence more quickly with significantly lower services and support expenditures, resulting in less costly business results.

I like the 25% of data integration budgets being spent on integrating external data. Imagine making that easier for enterprises with a topic map based service.

Maybe “Data as service (DaaS)” will evolve from simply being data delivery to dynamic integration of data from multiple sources, where currency, reliability, composition, and other features of the data are on a sliding scale of value.

Introduction to Logic Programming with Clojure

Filed under: Clojure,Logic — Patrick Durusau @ 6:51 pm

Introduction to Logic Programming with Clojure

Assumes no background in logic but some experience with Clojure. A logic beginner’s tutorial.

Everyone has to start somewhere. Enjoy.

Calling Mahout from Clojure

Filed under: Clojure,Mahout — Patrick Durusau @ 6:51 pm

Calling Mahout from Clojure

From the post:

Mahout is a set of libraries for running machine learning processes, such as recommendation, clustering and categorisation.

The libraries work against an abstract model that can be anything from a file to a full Hadoop cluster. This means you can start playing around with small data sets in files, a local database, a Hadoop cluster or a custom data store.

After a bit of research, it turned out not to be too complex to call via any JVM language. When you compile and install Mahout, the libraries are installed into your local Maven cache. This makes it very easy to include them into any JVM type project.

Concludes with two interesting references:

Visualizing Mahout’s output with Clojure and Incanter

Monte Carlo integration with Clojure and Mahout

Introduction to Databases

Filed under: CS Lectures,Database,SQL — Patrick Durusau @ 6:50 pm

Introduction to Databases by Jennifer Widom.

Course Description:

This course covers database design and the use of database management systems for applications. It includes extensive coverage of the relational model, relational algebra, and SQL. It also covers XML data including DTDs and XML Schema for validation, and the query and transformation languages XPath, XQuery, and XSLT. The course includes database design in UML, and relational design principles based on dependencies and normal forms. Many additional key database topics from the design and application-building perspective are also covered: indexes, views, transactions, authorization, integrity constraints, triggers, on-line analytical processing (OLAP), and emerging “NoSQL” systems.

The third free Stanford course being offered this Fall.

The others are: Introduction to Artificial Intelligence and Introduction to Machine Learning.

As of today, the AI course has a registration of 84,000 from 175 countries. I am sure the machine learning class with Ng and the database class will post similar numbers.

My only problem is that I lack the time to take all three while working full time. My best hope is for an annual repeat of these offerings.

BMC Bioinformatics

Filed under: Bioinformatics,Biomedical,Clustering — Patrick Durusau @ 6:49 pm

BMC Bioinformatics

From the webpage:

BMC Bioinformatics is an open access journal publishing original peer-reviewed research articles in all aspects of the development, testing and novel application of computational and statistical methods for the modeling and analysis of all kinds of biological data, as well as other areas of computational biology. BMC Bioinformatics (ISSN 1471-2105) is indexed/tracked/covered by PubMed, MEDLINE, BIOSIS, CAS, EMBASE, Scopus, ACM, CABI, Thomson Reuters (ISI) and Google Scholar.

Let me give you a sample of what you will find here:

MINE: Module Identification in Networks by Kahn Rhrissorrakrai and Kristin C Gunsalus. BMC Bioinformatics 2011, 12:192 doi:10.1186/1471-2105-12-192.

Abstract:

Graphical models of network associations are useful for both visualizing and integrating multiple types of association data. Identifying modules, or groups of functionally related gene products, is an important challenge in analyzing biological networks. However, existing tools to identify modules are insufficient when applied to dense networks of experimentally derived interaction data. To address this problem, we have developed an agglomerative clustering method that is able to identify highly modular sets of gene products within highly interconnected molecular interaction networks.

Medicine isn’t my field by profession (although I enjoy reading about it) but it doesn’t take much to see the applicability of an “agglomerative clustering method” to other highly interconnected networks.
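If you want to experiment with the general technique on your own data, here is a minimal sketch of agglomerative clustering with SciPy. This is not the MINE software, and the feature vectors are made up.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy feature vectors for six "gene products" (rows); replace with real data
rng = np.random.default_rng(42)
X = rng.random((6, 4))

# bottom-up (agglomerative) clustering with average linkage
Z = linkage(X, method="average", metric="euclidean")

# cut the dendrogram so that members of a module merge within distance 1.0
modules = fcluster(Z, t=1.0, criterion="distance")
print(modules)  # one cluster (module) id per row of X
```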

Reading across domain-specific IR publications can help keep you from re-inventing the wheel, or perhaps spark an idea for a better wheel of your own making.

Thinking Forth Project

Filed under: Forth — Patrick Durusau @ 6:47 pm

Thinking Forth Project

From the webpage:

Thinking Forth is a book about the philosophy of problem solving and programming style, applied to the unique programming language Forth. Published first in 1984, it could be among the timeless classics of computer books, such as Fred Brooks’ The Mythical Man-Month and Donald Knuth’s The Art of Computer Programming.

Many software engineering principles discussed here have been rediscovered in eXtreme Programming, including (re)factoring, modularity, bottom-up and incremental design. Here you’ll find all of those and more – such as the value of analysis and design – described in Leo Brodie’s down-to-earth, humorous style, with illustrations, code examples, practical real life applications, illustrative cartoons, and interviews with Forth’s inventor, Charles H. Moore as well as other Forth thinkers.

If you program in Forth, this is a must-read book. If you don’t, the fundamental concepts are universal: Thinking Forth is meant for anyone interested in writing software to solve problems. The concepts go beyond Forth, but the simple beauty of Forth throws those concepts into stark relief.

So flip open the book, and read all about the philosophy of Forth, analysis, decomposition, problem solving, style and conventions, factoring, handling data, and minimizing control structures. But be prepared: you may not be able to put it down.

A PDF version of “Thinking Forth” is available for free. Not to mention a revision project.

Many of the techniques in this book apply to data analysis/topic map design as well.

How You Should Go About Learning NoSQL

Filed under: Dynamo,MongoDB,NoSQL,Redis — Patrick Durusau @ 6:46 pm

How You Should Go About Learning NoSQL

Interesting post that expands on three rules for learning NoSQL:

1: Use MongoDB.
2: Take 20 minutes to learn Redis.
3: Watch this video to understand Dynamo.

Getting Started with Riak and .Net

Filed under: NoSQL,Riak — Patrick Durusau @ 6:46 pm

Getting Started with Riak and .Net by Adrian Hills.

Short “getting started” guide: he installs Riak on Ubuntu and then connects to the server with a .Net client.

I wondered about the statement that Riak would not run on Windows (there are no pre-compiled binaries for Windows). A Stackoverflow question, Riak on Windows, reports several options for running Riak on a Windows system: compile under Windows or Cygwin, or run VMware or VirtualBox and run Riak inside a Linux VM.

August 17, 2011

What’s New in MySQL 5.6 – Part 1: Overview – Webinar 18 August 2011

Filed under: MySQL,NoSQL,SQL — Patrick Durusau @ 6:54 pm

What’s New in MySQL 5.6 – Part 1: Overview

From the webpage:

MySQL 5.6 builds on Oracle’s investment in MySQL by adding improvements to Performance, InnoDB, Replication, Instrumentation and flexibility with NoSQL (Not Only SQL) access. In the first session of this 5-part Webinar series, we’ll cover the highlights of those enhancements to help you begin the development and testing efforts around the new features and improvements that are now available in the latest MySQL 5.6 Development Milestone and MySQL Labs releases.

OK, I’ll ‘fess up: I haven’t kept up with MySQL like I did when I was a sysadmin running it every day in a production environment. So, maybe it’s time to do some catching up.

Besides, when you read:

We will also explore how you can now use MySQL 5.6 as a “Not Only SQL” data source for high performance key-value operations by leveraging the new Memcached Plug-in to InnoDB, running simultaneously with SQL for more complex queries, all across the same data set.

“…SQL for more complex queries,…” you almost have to look. 😉

So, get up early tomorrow and throw a recent copy of MySQL on a box.

Recent Advances in Literature Based Discovery

Recent Advances in Literature Based Discovery

Abstract:

Literature Based Discovery (LBD) is a process that searches for hidden and important connections among information embedded in published literature. Employing techniques from Information Retrieval and Natural Language Processing, LBD has potential for widespread application yet is currently implemented primarily in the medical domain. This article examines several published LBD systems, comparing their descriptions of domain and input data, techniques to locate important concepts from text, models of discovery, experimental results, visualizations, and evaluation of the results. Since there is no comprehensive “gold standard, ” or consistent formal evaluation methodology for LBD systems, the development and usage of effective metrics for such systems is also discussed, providing several options. Also, since LBD is currently often time-intensive, requiring human input at one or more points, a fully-automated system will enhance the efficiency of the process. Therefore, this article considers methods for automated systems based on data mining.

Not “recent” now because the paper dates from 2006 but it is a good overview of Literature Based Discovery (LBD) at the time.

Mental Shortcuts and Relational Databases

Filed under: SQL — Patrick Durusau @ 6:51 pm

Mental Shortcuts and Relational Databases by Robert Pickering.

The premise is that relational databases evolved to solve a particular set of hardware constraints and problems. Not all that surprising if you think about it. How would software attempt to solve problems not yet known? That doesn’t make SQL any less valuable for the problems it solves well.

GeeCON 2011 – A programmatic introduction to Neo4j – Jim Webber

Filed under: Graphs,Neo4j — Patrick Durusau @ 6:51 pm

GeeCON 2011 – A programmatic introduction to Neo4j – Jim Webber

From the description:

There’s been substantial interest in recent years in exploring data storage technology that defies the relational model orthodoxy. Many so-called NoSQL databases have grown in this space, each of which tackles problems as diverse as scalability, availability, fault tolerance, and semantic richness. In this talk, I’ll provide a brief background on the NoSQL landscape, and a deeper introduction to my latest project Neo4j. Neo4j is an open source graph database which efficiently persists data in nodes and relationships and is optimised for extremely fast traversals, providing superior insight into data than is easily possible in traditional relational databases (or the semantically poorer category of NoSQL databases). The bulk of this talk will be in code, where we’ll see plenty of examples of how to write systems against Neo4j, starting with a simple social Web example.

Virtual Cell Software Repository

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 6:49 pm

Virtual Cell Software Repository

From the webpage:

Developing large volume multi-scale systems dynamics interpretation technology is very important source for making virtual cell application systems. Also, this technology is focused on the core research topics in a post-genome era in order to maintain the national competitive power. It is new analysis technology which can analyze multi-scale from nano level to physiological level in system level. Therefore, if using excellent information technology and super computing power in our nation, we can hold a dominant position in the large volume multi-scale systems dynamics interpretation technology. In order to take independent technology, we need to research a field of study which have been not well known in the bio system informatics technology like the large volume multi-scale systems dynamics interpretation technology.

The purpose of virtual cell application systems is developing the analysis technology and service which can model bio application circuits based on super computing technology. For success of virtual cell application systems based on super computing power, we have researched large volume multi-scale systems dynamics technology as a core sub technology.

  • Developing analysis and modeling technology of multi-scale convergence information from nano level to physiological level
  • Developing protein structure modeling algorithm using multi-scale bio information
  • Developing quality and quantity character analysis technology of multi-scale networks
  • Developing protein modification search algorithm
  • Developing large volume multi-scale systems dynamics interpretation technology interpreting possible circumstances in complex parameter spaces

Amazing set of resources available here:

PSExplorer: Parameter Space Explorer

Mathematical models of biological systems often have a large number of parameters whose combinational variations can yield distinct qualitative behaviors. Since it is intractable to examine all possible combinations of parameters for nontrivial biological pathways, it is required to have a systematic way to explore the parameter space in a computational way so that distinct dynamic behaviors of a given pathway are estimated.

We present PSExplorer, an efficient computational tool to explore high dimensional parameter space of computational models for identifying qualitative behaviors and key parameters. The software supports input models in SBML format. It provides a friendly graphical user interface allowing users to vary model parameters and perform time-course simulations at ease. Various graphical plotting features helps users analyze the model dynamics conveniently. Its output is a tree structure that encapsulates the parameter space partitioning results in a form that is easy to visualize and provide users with additional information about important parameters and sub-regions with robust behaviors.

MONET: MOdularized NETwork learning

Although gene expression data has been continuously accumulated and meta-analysis approaches have been developed to integrate independent expression profiles into larger datasets, the amount of information is still insufficient to infer large scale genetic networks. In addition, global optimization such as Bayesian network inference, one of the most representative techniques for genetic network inference, requires tremendous computational load far beyond the capacity of moderate workstations.

MONET is a Cytoscape plugin to infer genome-scale networks from gene expression profiles. It alleviates the shortage of information by incorporating pre-existing annotations. The current version of MONET utilizes thousands of parallel computational cores in the supercomputing center in KISTI, Korea, to cope with the computational requirement for large scale genetic network inference.

RBSDesigner

RBS Designer was developed to computationally design synthetic ribosome binding sites (RBS) to control gene expression levels. Generally transcription processes are the major target for gene expression control, however, without considering translation processes the control could lead to unexpected expression results since translation efficiency is highly affected by nucleotide sequences nearby RBS such as coding sequences leading to distortion of RBS secondary structure. Such problems obscure the intuitive design of RBS nucleotides with a desired level of protein expression. We developed RBSDesigner based on a mathematical model on translation initiation to design synthetic ribosome binding sites that yield a desired level of expression of user-specified coding sequences.

SBN simulator: Switching Boolean Networks Simulator

Switching Boolean Networks Simulator(SBNsimulator) was developed to simulate large-scale signaling network. Boolean Networks is widely used in modeling signaling networks because of its straightforwardness, robustness, and compatibility with qualitative data. Signaling networks are not completely known yet in Biology. Because of this, there are gaps between biological reality and modeling such as inhibitor-only or activator-only in signaling networks. Synchronous update algorithm in threshold Boolean network has limitation which cannot sample differences in the speed of signal propagation. To overcome these limitation which are modeling anomaly and Limitation of synchronous update algorithm, we developed SBNsimulator. It can simulate how each node effect to target node. Therefore, It can say which node is important for signaling network.

MKEM: Multi-level Knowledge Emergence Model

Since Swanson proposed the Undiscovered Public Knowledge (UPK) model, there have been many approaches to uncover UPK by mining the biomedical literature. These earlier works, however, required substantial manual intervention to reduce the number of possible connections and are mainly applied to disease-effect relation. With the advancement in biomedical science, it has become imperative to extract and combine information from multiple disjoint researches, studies and articles to infer new hypotheses and expand knowledge. We propose MKEM, a Multi-level Knowledge Emergence Model, to discover implicit relationships using Natural Language Processing techniques such as Link Grammar and Ontologies such as Unified Medical Language System (UMLS) MetaMap. The contribution of MKEM is as follows: First, we propose a flexible knowledge emergence model to extract implicit relationships across different levels such as molecular level for gene and protein and Phenomic level for disease and treatment. Second, we employ MetaMap for tagging biological concepts. Third, we provide an empirical and systematic approach to discover novel relationships.

The system constitutes of two parts, tagger and the extractor (may require compilation)

A sentence of interest is given to the tagger which then proceeds to the creation of rule sets. The tagger stores this in a folder by the name of “ruleList”. These rule sets are then given by copying this folder to the extractor directory.

I blogged about an article on this project at: MKEM: a Multi-level Knowledge Emergence Model for mining undiscovered public knowledge.

Machine Learning – Stanford Class

Filed under: CS Lectures,Machine Learning — Patrick Durusau @ 6:48 pm

Machine Learning – Stanford Class

From the course description:

This course provides a broad introduction to machine learning, datamining, and statistical pattern recognition. Topics include: (i) Supervised learning (parametric/non-parametric algorithms, support vector machines, kernels, neural networks). (ii) Unsupervised learning (clustering, dimensionality reduction, recommender systems, deep learning). (iii) Best practices in machine learning (bias/variance theory; innovation process in machine learning and AI). (iv) Reinforcement learning. The course will also draw from numerous case studies and applications, so that you’ll also learn how to apply learning algorithms to building smart robots (perception, control), text understanding (web search, anti-spam), computer vision, medical informatics, audio, database mining, and other areas.

A free Stanford class on machine learning being taught by Professor Andrew Ng!

Over 200,000 people have viewed Professor Ng’s machine learning lectures on YouTube. Now you can participate and even get a certificate of accomplishment.

I am already planning to take the free Introduction to Artificial Intelligence class at Stanford so I can only hope they repeat Machine Learning next year.

Embracing Uncertainty: Applied Machine Learning Comes of Age

Filed under: Machine Learning,Recognition — Patrick Durusau @ 6:48 pm

Embracing Uncertainty: Applied Machine Learning Comes of Age

Christopher Bishop, Microsoft Research Cambridge, ICML 2011 Keynote.

Christopher reports the discovery that solving the problem of gesture controls isn’t one of tracking location, say of your arm from position to position.

Rather, it is a question of recognition at every frame, which makes the computation tractable on older hardware.

Which makes me wonder: how many other problems have we “viewed” in the most difficult way possible? Where would viewing them as problems of recognition make previously intractable problems tractable? We won’t know unless we make the effort to ask.

Biodata Mining

Filed under: Bioinformatics,Biomedical,Data Mining — Patrick Durusau @ 6:47 pm

Biodata Mining

From the webpage:

BioData Mining is an open access, peer reviewed, online journal encompassing research on all aspects of data mining applied to high-dimensional biological and biomedical data, focusing on computational aspects of knowledge discovery from large-scale genetic, transcriptomic, genomic, proteomic, and metabolomic data.

What you would have seen since 1 July 2011:

An R Package Implementation of Multifactor Dimensionality Reduction

Hill-Climbing Search and Diversification within an Evolutionary Approach to Protein Structure Prediction

Detection of putative new mutacins by bioinformatic analysis using available web tools

Evolving hard problems: Generating human genetics datasets with a complex etiology

Taxon ordering in phylogenetic trees by means of evolutionary algorithms

Enjoy!

August 16, 2011

Semantic Vectors

Filed under: Implicit Associations,Indirect Inference,Random Indexing,Semantic Vectors — Patrick Durusau @ 7:07 pm

Semantic Vectors

From the webpage:

Semantic Vector indexes, created by applying a Random Projection algorithm to term-document matrices created using Apache Lucene. The package was created as part of a project by the University of Pittsburgh Office of Technology Management, and is now developed and maintained by contributors from the University of Texas, Queensland University of Technology, the Austrian Research Institute for Artificial Intelligence, Google Inc., and other institutions and individuals.

The package creates a WordSpace model, of the kind developed by Stanford University’s Infomap Project and other researchers during the 1990s and early 2000s. Such models are designed to represent words and documents in terms of underlying concepts, and as such can be used for many semantic (concept-aware) matching tasks such as automatic thesaurus generation, knowledge representation, and concept matching.

The Semantic Vectors package uses a Random Projection algorithm, a form of automatic semantic analysis. Other methods supported by the package include Latent Semantic Analysis (LSA) and Reflective Random Indexing. Unlike many other methods, Random Projection does not rely on the use of computationally intensive matrix decomposition algorithms like Singular Value Decomposition (SVD). This makes Random Projection a much more scalable technique in practice. Our application of Random Projection for Natural Language Processing (NLP) is descended from Pentti Kanerva’s work on Sparse Distributed Memory; in semantic analysis and text mining, this method has also been called Random Indexing. A growing number of researchers have applied Random Projection to NLP tasks, demonstrating:

  • Semantic performance comparable with other forms of Latent Semantic Analysis.
  • Significant computational performance advantages in creating and maintaining models.

So, after reading about random indexing, etc., you can take those techniques out for a spin. It doesn’t get any better than that!
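As a rough numpy sketch of the core trick, random projection of a term-document matrix (illustrative only, not the SemanticVectors package, and the matrix is made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy term-document count matrix: 1,000 terms x 200 documents
term_doc = rng.poisson(0.3, size=(1000, 200))

# random projection: multiply by a random matrix to drop to 50 dimensions,
# approximately preserving distances without an SVD
projection = rng.normal(size=(200, 50)) / np.sqrt(50)
term_vectors = term_doc @ projection   # 1,000 x 50 "semantic vectors"

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

print(cosine(term_vectors[0], term_vectors[1]))  # similarity of terms 0 and 1
```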

Distributional Semantics

Filed under: Distributional Semantics,Indexing,Indirect Inference,Random Indexing — Patrick Durusau @ 7:06 pm

Distributional Semantics.

Trevor Cohen (co-author, with Roger Schvaneveldt and Dominic Widdows, of Reflective Random Indexing and indirect inference…) has a page on distributional semantics which starts with:

Empirical Distributional Semantics is an emerging discipline that is primarily concerned with the derivation of semantics (or meaning) from the distribution of terms in natural language text. My research in DS is concerned primarily with spatial models of meaning, in which terms are projected into high-dimensional semantic space, and an estimate of their semantic relatedness is derived from the distance between them in this space.

The relations derived by these models have many useful applications in biomedicine and beyond. A particularly interesting property of distributional semantics models is their capacity to recognize connections between terms that do not occur together in the same document, as this has implications for knowledge discovery. In many instances it is possible also to reveal a plausible pathway linking these terms by using the distances estimated by distributional semantic models to generate a network representation, and using Pathfinder networks (PFNETS) to reveal the most significant links in this network, as shown in the example below:

Links to projects, software and other cool stuff! Making a separate post on one of his software libraries.

Hyperdimensional Computing

Filed under: Random Indexing,von Neumann Architecture — Patrick Durusau @ 7:05 pm

Hyperdimensional Computing: An Introduction to Computing in Distributed Representation with High-Dimensional Random Vectors by Pentti Kanerva.

Reflective Random Indexing and indirect inference… cites Kanerva as follows:

Random Indexing (RI) [cites omitted] has recently emerged as a scalable alternative to LSA for the derivation of spatial models of semantic distance from large text corpora. For a thorough introduction to Random Indexing and hyper-dimensional computing in general, see [Kanerva, this paper] [cite omitted].

Kanerva’s abstract:

The 1990s saw the emergence of cognitive models that depend on very high dimensionality and randomness. They include Holographic Reduced Representations, Spatter Code, Semantic Vectors, Latent Semantic Analysis, Context-Dependent Thinning, and Vector-Symbolic Architecture. They represent things in high-dimensional vectors that are manipulated by operations that produce new high-dimensional vectors in the style of traditional computing, in what is called here hyperdimensional computing on account of the very high dimensionality. The paper presents the main ideas behind these models, written as a tutorial essay in hopes of making the ideas accessible and even provocative. A sketch of how we have arrived at these models, with references and pointers to further reading, is given at the end. The thesis of the paper is that hyperdimensional representation has much to offer to students of cognitive science, theoretical neuroscience, computer science and engineering, and mathematics.

This one will take a while to read and digest but I will be posting on it and the further reading it cites in the not too distant future.
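To make the reading a bit more concrete, here is a tiny numpy sketch in the spirit of these models: random bipolar vectors, binding by elementwise multiplication, bundling by majority sign. The operations chosen here are one common variant, not a reproduction of any single model in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 10_000  # very high dimensionality is the point

def rand_hv():
    return rng.choice([-1, 1], size=D)

def bind(a, b):      # elementwise multiply: reversible, output dissimilar to both inputs
    return a * b

def bundle(*vs):     # majority sign: output similar to each input
    return np.sign(np.sum(vs, axis=0))

def sim(a, b):       # normalized dot product (cosine for bipolar vectors)
    return (a @ b) / D

color, shape, red, square = rand_hv(), rand_hv(), rand_hv(), rand_hv()
record = bundle(bind(color, red), bind(shape, square))  # encodes {color: red, shape: square}

print(sim(bind(record, color), red))     # high: unbinding with color recovers a noisy copy of red
print(sim(bind(record, color), square))  # near zero: unrelated
```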

Introduction to Random Indexing

Filed under: Indexing,Indirect Inference,Random Indexing — Patrick Durusau @ 7:04 pm

Introduction to Random Indexing by Magnus Sahlgren.

I thought this would be useful alongside Reflective Random Indexing and indirect inference….

Just a small sample of what you will find:

Note that this methodology constitutes a radically different way of conceptualizing how context vectors are constructed. In the “traditional” view, we first construct the co-occurrence matrix and then extract context vectors. In the Random Indexing approach, on the other hand, we view the process backwards, and first accumulate the context vectors. We may then construct a cooccurrence matrix by collecting the context vectors as rows of the matrix.
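A minimal sketch of that “context vectors first” order of operations, using sparse ternary index vectors (toy data and parameters, not production code):

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
DIM, NONZERO = 2000, 10  # dimensionality and number of non-zero entries

def index_vector():
    """Sparse ternary random index vector: a few +1/-1 entries, rest zero."""
    v = np.zeros(DIM)
    pos = rng.choice(DIM, size=NONZERO, replace=False)
    v[pos] = rng.choice([-1, 1], size=NONZERO)
    return v

index_vectors = defaultdict(index_vector)            # fixed random signature per term
context_vectors = defaultdict(lambda: np.zeros(DIM)) # accumulated as text is read

def train(sentence, window=2):
    tokens = sentence.lower().split()
    for i, term in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                # accumulate the neighbours' index vectors -- no co-occurrence matrix built first
                context_vectors[term] += index_vectors[tokens[j]]

train("random indexing accumulates context vectors incrementally")
train("random projection preserves distances approximately")

cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
print(cos(context_vectors["random"], context_vectors["indexing"]))
```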

I like non-traditional approaches. Some work (like random indexing) and some don’t.

What new/non-traditional approaches have you tried in the last week? We learn as much (if not more) from failure as success.

Reflective Random Indexing and indirect inference…

Reflective Random Indexing and indirect inference: A scalable method for discovery of implicit connections by Trevor Cohen, Roger Schvaneveldt, Dominic Widdows.

Abstract:

The discovery of implicit connections between terms that do not occur together in any scientific document underlies the model of literature-based knowledge discovery first proposed by Swanson. Corpus-derived statistical models of semantic distance such as Latent Semantic Analysis (LSA) have been evaluated previously as methods for the discovery of such implicit connections. However, LSA in particular is dependent on a computationally demanding method of dimension reduction as a means to obtain meaningful indirect inference, limiting its ability to scale to large text corpora. In this paper, we evaluate the ability of Random Indexing (RI), a scalable distributional model of word associations, to draw meaningful implicit relationships between terms in general and biomedical language. Proponents of this method have achieved comparable performance to LSA on several cognitive tasks while using a simpler and less computationally demanding method of dimension reduction than LSA employs. In this paper, we demonstrate that the original implementation of RI is ineffective at inferring meaningful indirect connections, and evaluate Reflective Random Indexing (RRI), an iterative variant of the method that is better able to perform indirect inference. RRI is shown to lead to more clearly related indirect connections and to outperform existing RI implementations in the prediction of future direct co-occurrence in the MEDLINE corpus.

The term “direct inference” is used for establishing a relationship between terms via a shared “bridging” term. That is, the terms don’t co-occur in the same text, but each co-occurs with a third term that appears with both. “Indirect inference,” that is, finding related terms with no shared “bridging” term, is the focus of this paper.
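To make the bridging-term idea concrete, here is a toy sketch over made-up co-occurrence sets: candidate C terms share a bridge B with A but never co-occur with A directly. This illustrates the A-B-C pattern, not the vector-space method the paper evaluates.

```python
# term -> set of terms it directly co-occurs with (toy data)
cooccurs = {
    "fish_oil":  {"blood_viscosity", "platelet_aggregation"},
    "raynauds":  {"blood_viscosity", "vasoconstriction"},
    "aspirin":   {"platelet_aggregation"},
}

def indirect_candidates(a, cooccurs):
    """C terms linked to A only through a shared bridging term B."""
    direct = cooccurs.get(a, set())
    candidates = {}
    for c, c_neighbors in cooccurs.items():
        if c == a or c in direct or a in c_neighbors:
            continue  # skip A itself and anything already directly connected
        bridges = direct & c_neighbors
        if bridges:
            candidates[c] = bridges
    return candidates

print(indirect_candidates("fish_oil", cooccurs))
# {'raynauds': {'blood_viscosity'}, 'aspirin': {'platelet_aggregation'}}
```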

BTW, if you don’t have access to the Journal of Biomedical Informatics version, try the draft: Reflective Random Indexing and indirect inference: A scalable method for discovery of implicit connections

MKEM: a Multi-level Knowledge Emergence Model for mining undiscovered public knowledge

Filed under: Associations,Data Mining — Patrick Durusau @ 7:03 pm

MKEM: a Multi-level Knowledge Emergence Model for mining undiscovered public knowledge

Abstract:

Background

Since Swanson proposed the Undiscovered Public Knowledge (UPK) model, there have been many approaches to uncover UPK by mining the biomedical literature. These earlier works, however, required substantial manual intervention to reduce the number of possible connections and are mainly applied to disease-effect relation. With the advancement in biomedical science, it has become imperative to extract and combine information from multiple disjoint researches, studies and articles to infer new hypotheses and expand knowledge.

Methods

We propose MKEM, a Multi-level Knowledge Emergence Model, to discover implicit relationships using Natural Language Processing techniques such as Link Grammar and Ontologies such as Unified Medical Language System (UMLS) MetaMap. The contribution of MKEM is as follows: First, we propose a flexible knowledge emergence model to extract implicit relationships across different levels such as molecular level for gene and protein and Phenomic level for disease and treatment. Second, we employ MetaMap for tagging biological concepts. Third, we provide an empirical and systematic approach to discover novel relationships.

Results

We applied our system on 5000 abstracts downloaded from PubMed database. We performed the performance evaluation as a gold standard is not yet available. Our system performed with a good precision and recall and we generated 24 hypotheses.

Conclusions

Our experiments show that MKEM is a powerful tool to discover hidden relationships residing in extracted entities that were represented by our Substance-Effect-Process-Disease-Body Part (SEPDB) model.

From the article:

Swanson defined UPK as knowledge that is public and yet undiscovered: two complementary and non-interactive literature sets of articles (independently created fragments of knowledge), when considered together, can reveal useful information of scientific interest not apparent in either of the two sets alone [cites omitted].

Basis of UPK:

The underlying discovery method is based on the following principle: some links between two complementary passages of natural language texts can be largely a matter of form: “A causes B” (association AB) and “B causes C” (association BC) (See Figure 1). From this, it can be seen that they are linked by B irrespective of the meaning of A, B, or C. However, perhaps nothing at all has been published concerning a possible connection between A and C, even though such a link, if validated, would be of scientific interest. This allowed for the generation of several hypotheses such as “Fish’s oil can be used for treatment of Raynaud’s Disease” [cite omitted].

Fairly easy reading and interesting as well.

If you recognize TF*IDF, the primary basis for Lucene, you will be interested to learn it has some weaknesses for UPK. If I understand the authors correctly, ranking terms statistically is insufficient to mine implied relationships. Related terms aren’t ranked high enough. I don’t think “boosting” would help because the terms are not known ahead of time. I say that, although I suppose you could “boost” on the basis of implied relationships. Will have to think about that some more.
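For reference, a bare-bones version of the TF*IDF weighting under discussion (illustrative only; Lucene's actual scoring adds normalization and other factors):

```python
import math
from collections import Counter

docs = [
    "fish oil reduces blood viscosity",
    "raynauds disease involves blood viscosity changes",
    "aspirin affects platelet aggregation",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)
df = Counter(term for doc in tokenized for term in set(doc))  # document frequency

def tf_idf(term, doc_tokens):
    tf = doc_tokens.count(term)
    idf = math.log(N / df[term]) if df[term] else 0.0
    return tf * idf

print(tf_idf("viscosity", tokenized[0]))  # discounted: appears in 2 of 3 documents
print(tf_idf("aspirin", tokenized[2]))    # boosted: appears in only 1 document
```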

You will find “non-interactive literature sets of articles” in computer science, library science, mathematics, law, just about any field you can name. Although you can mine across those “literature sets,” it would be interesting to identify those sets, perhaps with a view towards refining UPK mining. Can you suggest ways to distinguish such “literature sets?”

Oh, link to the software: MKEM (Note to authors: Always include a link to your software, assuming it is available. Make it easy on readers to find and use your hard work!)

August 15, 2011

Index With Performance Close to Linear (was: Index With Linear Performance Close to Constant)

Filed under: NoSQL,OrientDB — Patrick Durusau @ 7:32 pm

I don’t have a link (yet) but @lgarulli reports that OrientDB’s new index has a measured growth factor of 0.000006 per entry stored.

Will update when more information becomes available.

See OrientDB.

Although I suspect that for the new index you will need the sources: OrientDB sources.


Lars suggested the correction when this post appeared but I never quite got around to changing it. Preserved the original as I dislike content that changes under foot.
