Apache Hadoop and HBase by Todd Lipcon.
Another introductory slide deck. Don’t know which one is going to click for any individual so including it here.
Anja Jentsch posted the following call on the public-lod@w3.org list:
we would like to thank you for putting so much effort in curating the CKAN packages for Linked Data sets since our last call.
We have compiled statistics for the 256 data sets[1] on CKAN that will be included in the next LOD Cloud: http://lod-cloud.net/state
Altogether 446 data sets are currently tagged on CKAN as LOD [2]. But the description of many of these data sets is still incomplete so that we cannot find out whether they fulfil the minimal requirements for being included into the LOD cloud diagram (dereferenceable URIs and RDF links to or from other data sources).
A list of data sets that we could not include yet and an explanation of what is missing can be found here: http://www4.wiwiss.fu-berlin.de/lodcloud/ckan/validator/
Starting next week we will generate the next LOD cloud diagram [3].
Therefore we would like to invite those of you who publish data sets that we could not include yet to please review and update your entries. Please finalize your dataset descriptions by August 15th to ensure that your data set will be part of the LOD Cloud.
In order to aid you in this quest, we have provided a validation page for your CKAN entry with step-by-step guidance for the information needed:
http://www4.wiwiss.fu-berlin.de/lodcloud/ckan/validator/
You can use the CKAN entry for DBpedia as an example:
http://ckan.net/package/dbpedia
Thank you for helping!
Cheers,
Anja, Chris and Richard
[1] http://ckan.net/package/search?q=groups:lodcloud+AND+-tags:lodcloud.unconnected+AND+-tags:lodcloud.needsfixing
[2] http://ckan.net/tag/lod
[3] http://lod-cloud.net/
Just a reminder, today is the 10th of August so don’t wait to review your entry.
Whatever your approach, we all benefit from cleaner data.
LevelDB – Fast and Lightweight Key/Value Database From the Authors of MapReduce and BigTable
From the post:
LevelDB is an exciting new entrant into the pantheon of embedded databases, notable both for its pedigree, being authored by the makers of the now mythical Google MapReduce and BigTable products, and for its emphasis on efficient disk based random access using log-structured-merge (LSM) trees.
The plan is to keep LevelDB fairly low-level. The intention is that it will be a useful building block for higher-level storage systems. Basho is already investigating using LevelDB as one of its storage engines.
Includes a great summary of information from the LevelDB mailing list.
A must read if you are interested in LevelDB.
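For readers new to the storage model the post mentions, here is a toy Python sketch of the log-structured-merge idea: writes land in an in-memory table that is periodically frozen into an immutable sorted run, and reads check the memtable first and then the runs from newest to oldest. This is an illustration of the concept only, not LevelDB's implementation (which adds a write-ahead log, on-disk SSTables, compaction, bloom filters and much more).

```python
import bisect

class ToyLSM:
    """Toy log-structured-merge store: concept illustration only, not LevelDB."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}            # mutable in-memory table
        self.runs = []                # immutable sorted runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def get(self, key):
        if key in self.memtable:                 # newest data first
            return self.memtable[key]
        for run in reversed(self.runs):          # then runs, newest to oldest
            keys = [k for k, _ in run]
            i = bisect.bisect_left(keys, key)    # binary search within a sorted run
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def _flush(self):
        # Freeze the memtable as a sorted, immutable run (on disk in a real LSM).
        self.runs.append(sorted(self.memtable.items()))
        self.memtable = {}

db = ToyLSM()
for i in range(10):
    db.put(f"key{i}", f"value{i}")
print(db.get("key3"))   # -> "value3"
```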
From the post:
Even though the term, NoSQL, has issues, it’s become important.
Recently, leaders from several NoSQL projects (Riak, HBase, CouchDB, Neo4j) came together for a session at Gluecon. And while they came from divergent perspectives, they all basically agreed that the term had been very helpful to developers and architects in identifying their systems as new database and/or database-alternative technologies.
There have been numerous NoSQL taxonomies, discussions about them, and calls to move beyond them. And while it’s clear to us, as well as our friends and customers, that MarkLogic Server sits among these technologies, we haven’t yet fully described why NoSQL folks should pay attention. To that end, this post is a first step at explaining why and how we’re more than “yet another NoSQL system”. And I’ll start with some context for NoSQL folks.
You should read the post for yourself but suffice for me to say that MarkLogic is an XML database that sports a universal index of the elements, attributes, hierarchy of documents as well as their content.
If that doesn’t sound interesting, see: MarkMail, which is powered by a MarkLogic server.
Interested now?
From the wiki page:
A Neo4J graph consists of the following element types:
- Node
- Relationship
- RelationshipType
- Property name
- Property value
These five types of elements don’t share a common interface, except for Node and Relationship, which both extend the PropertyContainer interface.
The Enhanced API unifies all Neo4j elements under the common interface Vertex.
Which has the result:
Generalizations of database elements
The Vertex interface supports methods for the manipulation of Properties and Edges, thereby providing all methods normally associated with Nodes. Properties and Edges (including their Types) are Vertices too. This allows for the generalization of all Neo4j database elements as if they were Nodes.
Due to generalization it is possible to create Edges involving regular Vertices, Edges of any kind, including BinaryEdges and Properties.
The generalization also makes it possible to set Properties on all Vertices. So it even becomes possible to set a Property on a Property.
Hmmm, properties on properties, where have I heard that? 😉
Properties on properties and properties on values are what we need for robust data preservation, migration or even re-use.
I was reminded recently that SNOBOL turns 50 years old in 2012. Care to guess how many formats, schemas, and data structures we have been through during just that time period? Some of them were intended to be “legacy” formats, forever readable by those who follow. Except that people forget the meaning of the “properties” and their “values.”
If we had properties on properties and properties on values, we could at least record our present understandings of those items. And others could do the same to our properties and values.
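A minimal Python sketch of that idea (not the Neo4j Enhanced API, just the data structure): if a property is itself a property container, we can attach our present understanding, provenance, or mappings to the property or to its value.

```python
class PropertyContainer:
    """Anything that can carry properties -- including properties themselves."""
    def __init__(self):
        self.properties = {}

    def set_property(self, name, value):
        prop = Property(name, value)
        self.properties[name] = prop
        return prop              # return the Property so callers can annotate it

class Property(PropertyContainer):
    def __init__(self, name, value):
        super().__init__()
        self.name = name
        self.value = value

class Node(PropertyContainer):
    pass

# A node with a property...
author = Node()
born = author.set_property("birth_date", "1918-05-11")

# ...and properties on that property, recording our understanding of it.
born.set_property("calendar", "Gregorian")
born.set_property("source", "interview transcript, 1985")
```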
Those mappings would not be universally useful to everyone. But if present, we would have the options to follow those mappings or not.
Perhaps that’s the key, topic maps are about transparent choice in the reuse of data.
It leaves the exercise of that choice, or not, up to the user.
This is a step in that direction.
While we are talking about MapReduce, may as well mention a Riak project, Pipe, that went out in beta in mid-June of this year.
From Bryan Fink’s announcement:
I’m excited to announce the opening of a new beta-status Basho project today: Riak Pipe.
http://github.com/basho/riak_pipe
Riak Pipe is a new way to distribute work around a Riak cluster.
The README explains much more than I can here, but essentially Riak Pipe allows you to specify work in the form of a chain of function pairs. One function of that pair describes how to produce output from input, and the other describes where in the cluster an input should be processed. Riak Pipe handles the details of ferrying data between workers by building atop Riak Core’s distribution power.
At this point in time Riak Pipe is BETA-status software. We’d like anyone who is interested in it to take a look and send us feedback. Please do not put it into production. We will be continuing to improve Riak Pipe toward a future release date.
We have two plans for Riak Pipe. The first is to power Riak’s MapReduce system with it. We think Riak Pipe provides a cleaner, more manageable subsystem that will provide much easier monitoring, debugging, and general use of MapReduce in Riak. You can see our work toward that goal in the “pipe” branch of Riak KV (start at src/riak_kv_mrc_pipe.erl):
https://github.com/basho/riak_kv/tree/pipe
Our second plan for Riak Pipe is to expand Riak’s MapReduce system with more abilities (imagine a keyed-reduce phase, or additional processing languages), possibly to the extent of providing an entirely separate interface (new query syntax? offline/asynchronous processing?). But for this part, we need your help.
We have some ideas about what external client interfaces might look like. We also have some ideas about what an external processing interface might look like. We’re still in the early phases of creating these, though, so if exploring the riak_pipe repository gives you ideas, please don’t hesitate to get in touch.
And, again, Riak Pipe is BETA software. Basho does not support running it in production at this time.
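As a rough sketch of the model (Riak Pipe itself is Erlang, built on Riak Core; this is only an analogy in Python), a pipe is a chain of function pairs: one function says where in the cluster an input should be processed, the other says how to turn that input into output.

```python
# Conceptual sketch only -- Riak Pipe is an Erlang library; this just
# illustrates the "chain of (where, how) function pairs" idea from the post.

def run_pipe(pipe, inputs, n_vnodes=8):
    """pipe is a list of (partition_fn, work_fn) pairs."""
    for partition_fn, work_fn in pipe:
        outputs = []
        for item in inputs:
            vnode = partition_fn(item) % n_vnodes   # where to process it
            outputs.extend(work_fn(vnode, item))    # how to process it
        inputs = outputs
    return inputs

# A trivial two-stage pipe: tokenize, then tag each word with a count.
pipe = [
    (lambda doc: hash(doc),   lambda vnode, doc: doc.split()),
    (lambda word: hash(word), lambda vnode, word: [(word, 1)]),
]
print(run_pipe(pipe, ["riak pipe moves work around the cluster"]))
```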
From the post:
agamemnon is a Python-based graph database built on pycassa, the Python client library for Apache Cassandra. In short, it enables you to use Cassandra as a graph database. The API is inspired by the Python wrapper for Neo4j, neo4j.py.
The article also has a pointer to a nice summary of graph databases by Adam Wiggins.
From the post:
Imagine getting a friend’s advice on a personal problem and being safe in the knowledge that it would be impossible for your friend to divulge the question, or even his own reply.
Researchers at Microsoft have taken a step toward making something similar possible for cloud computing, so that data sent to an Internet server can be used without ever being revealed. Their prototype can perform statistical analyses on encrypted data despite never decrypting it. The results worked out by the software emerge fully encrypted, too, and can only be interpreted using the key in the possession of the data’s owner.
Uses a technique called homomorphic encryption.
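As a toy illustration of what “computing on encrypted data” means (not Microsoft’s scheme, and emphatically not secure), textbook RSA happens to be multiplicatively homomorphic: multiplying two ciphertexts yields a ciphertext of the product of the plaintexts.

```python
# Toy demo of a homomorphic property using textbook RSA.
# Unpadded RSA is NOT secure -- this only shows computing on ciphertexts.

p, q = 61, 53                        # tiny primes for readability
n = p * q                            # 3233
e = 17                               # public exponent
d = pow(e, -1, (p - 1) * (q - 1))    # private exponent (Python 3.8+)

def encrypt(m): return pow(m, e, n)
def decrypt(c): return pow(c, d, n)

m1, m2 = 12, 7
c1, c2 = encrypt(m1), encrypt(m2)

# Multiply the ciphertexts without ever decrypting them...
c_product = (c1 * c2) % n

# ...and the decrypted result is the product of the plaintexts.
assert decrypt(c_product) == (m1 * m2) % n
print(decrypt(c_product))            # -> 84
```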
The article says 5 to 10 years before practical application, but it was 30 years between its proposal and a formal proof it was even possible. In the 2 or 3 years since that proof, a number of almost practical demonstrations have emerged. Would not bet on the 5 to 10 year time frame.
Parallel Processing Using the Map Reduce Programming Model
Demonstrates parallel processing using map reduce on the IMDB. It starts with a Perl script (Robert should like that) but then moves to Java to use Hadoop. The code is listed in the article and is available on GitHub.
Nothing groundbreaking, but it will help some users gain confidence with Hadoop/Map Reduce using a familiar data set.
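For anyone who wants the model without standing up a cluster first, here is a small local simulation of the map/shuffle/reduce phases in Python, over hypothetical IMDB-like "title<TAB>year" lines (not the article's code).

```python
# A minimal, local simulation of the MapReduce model (not Hadoop itself):
# map emits (key, value) pairs, the framework groups by key, reduce folds
# each group. Here: counting movies per release year.
from itertools import groupby
from operator import itemgetter

def mapper(line):
    title, year = line.rstrip("\n").split("\t")
    yield (year, 1)

def reducer(year, counts):
    yield (year, sum(counts))

def run(lines):
    # Map phase
    pairs = [kv for line in lines for kv in mapper(line)]
    # Shuffle/sort phase: group pairs by key (Hadoop does this between phases)
    pairs.sort(key=itemgetter(0))
    # Reduce phase
    results = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        results.extend(reducer(key, (v for _, v in group)))
    return results

data = ["Metropolis\t1927", "M\t1931", "Freaks\t1932",
        "King Kong\t1933", "Frankenstein\t1931"]
print(run(data))   # -> [('1927', 1), ('1931', 2), ('1932', 1), ('1933', 1)]
```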
The first in a series of posts on Solr and the ISFDB. (Try Solr-ISFDB for all the posts.)
ISFDB = Internet Speculative Fiction Database.
A bit over 650,000 documents when this series started last January, so we aren’t talking “big data,” but it’s a fun data set. And the lessons to be learned here will stand us in good stead with much larger data sets.
I haven’t read all the posts yet but did notice comments about modeling relationships. As I work through the posts, will see how close/far away that modeling comes to a topic maps approach.
Working through something like this won’t hurt in terms of preparing for Lucene/Solr certification either. Haven’t decided on that, but until we have a topic map certification it is worth considering.
From the website:
Iris Couch is a free hosting service—a Couch in the cloud. Sign up to run Couchbase Server, a standard, orthodox Apache CouchDB platform, plus Volker Mische’s GeoCouch geospatial index.
If you can spell your name, you can run CouchDB.
I don’t know…, it’s pretty early in the morning here for spelling tests. 😉
Looks like a good way to promote experimentation with CouchDB.
Anyone who can spell their name care to comment on their experience here? (good spelling suggested but not required)
A New Best Friend: Gephi for Large-scale Networks
Though I never intended it, some posts of mine from a few years back dealing with 26 tools for large-scale graph visualization have been some of the most popular on this site. Indeed, my recommendation for Cytoscape for viewing large-scale graphs ranks within the top 5 posts all time on this site.
When that analysis was done in January 2008 my company was in the midst of needing to process the large UMBEL vocabulary, which now consists of 28,000 concepts. Like anything else, need drives research and demand, and after reviewing many graphing programs, we chose Cytoscape, then provided some ongoing guidelines in its use for semantic Web purposes. We have continued to use it productively in the intervening years.
Like for any tool, one reviews and picks the best at the time of need. Most recently, however, with growing customer usage of large ontologies and the development of our own structOntology editing and managing framework, we have begun to butt up against the limitations of large-scale graph and network analysis. With this post, we announce our new favorite tool for semantic Web network and graph analysis — Gephi — and explain its use and showcase a current example.
Times change and sometimes software choices do as well.
This is a case in point that reviews the current limitations of Cytoscape, the good points of Gephi, its needed improvements and pointers to more resources on Gephi. Can’t ask for much more.
Modifying a Lucene Snowball Stemmer
From the post:
This post is written for advanced users. If you do not know what SVN (Subversion) is or if you’re not ready to get your hands dirty, there might be something more interesting to read on Wikipedia. As usual. This is an introduction to how to get a Lucene development environment running, a Solr environment and lastly, to create your own Snowball stemmer. Read on if that seems interesting. The recipe for regenerating the Snowball stemmer (I’ll get back to that…) assumes that you’re running Linux. Please leave a comment if you’ve generated the stemmer class under another operating system.
When indexing data in Lucene (a fulltext document search library) and Solr (which uses Lucene), you may provide a stemmer (a piece of code responsible for “normalizing” words to their common form (horses => horse, indexing => index, etc)) to give your users better and more relevant results when they search. The default stemmer in Lucene and Solr uses a library named Snowball which was created to do just this kind of thing. Snowball uses a small definition language of its own to generate parsers that other applications can embed to provide proper stemming.
Really advanced users will want to check out Snowball’s namesake, SNOBOL, a powerful pattern matching language that was invented in 1962. (That’s not a typo, 1962.) See SNOBOL4.org for more information and resources.
The post outlines how to change the default stemmer for Lucene and Solr to improve its stemming of Norwegian words. Useful in case you want to write/improve a stemmer for another language.
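If you just want to see what Snowball stemming does before touching the Lucene sources, NLTK ships Snowball stemmers, including a Norwegian one. This is an illustration of the behaviour only; the post modifies the stemmer inside Lucene/Solr itself.

```python
# Quick look at Snowball stemming via NLTK (illustration only).
from nltk.stem.snowball import SnowballStemmer

print(SnowballStemmer.languages)            # 'norwegian' is among them

no = SnowballStemmer("norwegian")
en = SnowballStemmer("english")

for word in ["hester", "hesten", "hestene"]:    # forms of "hest" (horse)
    print(word, "->", no.stem(word))

for word in ["indexing", "indexes", "indexed"]:
    print(word, "->", en.stem(word))
```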
Tottenham riots: Data journalists and social scientists should join forces
Interpretations of riots reek with racial and class prejudice. I remember the riots in the 1960s as well as more recent ones. The interpretations that followed could be predicted based on what channel or commentator was on the TV.
Topic maps are well suited to bring up parallel events from history, along with calmer analysis.
I wonder if anyone would bother to read such a topic map? Or, like the various economic bubbles that keep repeating themselves, would they say “this time is different.”
Parallel Data Warehouse News and Hadoop Interoperability Plans
From the MS SQL Server Team Blog:
In the data deluge faced by businesses, there is also an increasing need to store and analyze vast amounts of unstructured data including data from sensors, devices, bots and crawlers. By many accounts, almost 80% of what businesses store is unstructured data – and this volume is predicted to grow exponentially over the next decade. We have entered the age of Big Data. Our customers have been asking us to help store, manage, and analyze both structured and unstructured data – in particular, data stored in Hadoop environments. As a first step, we will soon release a Community Technology Preview (CTP) of two new Hadoop connectors – one for SQL Server and one for PDW. The connectors provide interoperability between SQL Server/PDW and Hadoop environments, enabling customers to transfer data between Hadoop and SQL Server/PDW. With these connectors, customers can more easily integrate Hadoop with their Microsoft Enterprise Data Warehouses and Business Intelligence solutions to gain deeper business insights from both structured and unstructured data.
I don’t have a SQL Server or a Parallel Data Warehouse so someone will need to contribute comments on the new Hadoop connectors when they appear.
I will note that the more seamless data interchange becomes, the greater users’ satisfaction with the tools they are called upon to use. Which is a good thing for long-term market share.
About as hard to answer as: “What is a topic map designer?”
From the post:
- Obtain: pointing and clicking does not scale.
- …
- Scrub: the world is a messy place.
- …
- Explore: You can see a lot by looking.
- …
- Models: always bad, sometimes ugly.
- …
- iNterpret: “The purpose of computing is insight, not numbers.”
- …
See the post for the full details and suggest any other qualifications in your comments below. Thanks!
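As a very rough sketch of how those steps line up in code (hypothetical file name and columns, standard-library Python only, not the post's example):

```python
# A toy walk through the Obtain / Scrub / Explore / Model / iNterpret steps.
# The URL and column names are hypothetical; this is a sketch, not the post's code.
import csv, statistics, urllib.request

# Obtain: pointing and clicking does not scale -- script the download.
urllib.request.urlretrieve("http://example.org/measurements.csv", "data.csv")

# Scrub: the world is a messy place -- drop rows with missing or bad values.
rows = []
with open("data.csv", newline="") as f:
    for row in csv.DictReader(f):
        try:
            rows.append((float(row["x"]), float(row["y"])))
        except (KeyError, ValueError):
            continue

# Explore: you can see a lot by looking.
xs, ys = zip(*rows)
print(len(rows), "rows; x mean", statistics.mean(xs), "y mean", statistics.mean(ys))

# Model: always bad, sometimes ugly -- a one-variable least-squares fit (Python 3.10+).
slope, intercept = statistics.linear_regression(xs, ys)

# iNterpret: "The purpose of computing is insight, not numbers."
print(f"each unit of x is associated with {slope:.2f} units of y")
```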
I saw a post on a mailing list from Adam Retter with the following news:
I would just like to let you all know that the Release Candidate for eXist-db 1.4.1 is out, this is the culmination of two years of hard work. We take our releases very seriously!
eXist-db is all about XML, indexing and querying. We provide various indexes including full-text indexing of structured, semi-structured and un-structured content. Our unit of storage is the Document, XML or Binary. We can also extract and make searchable content from Binary Documents.
Not only is eXist-db an OpenSource XML Native Database, it’s also a fully-fledged web application platform for XRX applications, so don’t let the ‘-db’ bit fool you.
Had not meant to neglect the XML databases. You are going to encounter them in a number of contexts, either as storing data you need or as repositories you address from within a topic map.
Suicide Note Classification Using Natural Language Processing: A Content Analysis
Punch line (for the impatient):
…trainees accurately classified notes 49% of the time, mental health professionals accurately classified notes 63% of the time, and the best machine learning algorithm accurately classified the notes 78% of the time.
Abstract:
Suicide is the second leading cause of death among 25–34 year olds and the third leading cause of death among 15–25 year olds in the United States. In the Emergency Department, where suicidal patients often present, estimating the risk of repeated attempts is generally left to clinical judgment. This paper presents our second attempt to determine the role of computational algorithms in understanding a suicidal patient’s thoughts, as represented by suicide notes. We focus on developing methods of natural language processing that distinguish between genuine and elicited suicide notes. We hypothesize that machine learning algorithms can categorize suicide notes as well as mental health professionals and psychiatric physician trainees do. The data used are comprised of suicide notes from 33 suicide completers and matched to 33 elicited notes from healthy control group members. Eleven mental health professionals and 31 psychiatric trainees were asked to decide if a note was genuine or elicited. Their decisions were compared to nine different machine-learning algorithms. The results indicate that trainees accurately classified notes 49% of the time, mental health professionals accurately classified notes 63% of the time, and the best machine learning algorithm accurately classified the notes 78% of the time.
The researchers concede that the data set is small but apparently it is the only one of its kind.
I mention the study here as a reason to consider using ML techniques in your next topic map project.
Merging the results from different ML algorithms re-creates the original topic maps use case (how do you merge indexes made by different indexers?), but that can’t be helped. More patterns to discover to use as the basis for merging rules!*
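For the merging angle, here is a sketch of combining several classifiers with a hard-voting ensemble in scikit-learn. The toy texts and labels below are made up and only stand in for the genuine/elicited distinction in the study; this is not the study's code or data.

```python
# Sketch of "merging" several text classifiers via a hard-voting ensemble.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

texts = ["please forgive me ...", "tell the children I love them ...",
         "I never meant for things to go this far ...", "goodbye and thank you ..."]
labels = ["genuine", "elicited", "genuine", "elicited"]   # made-up toy labels

ensemble = make_pipeline(
    TfidfVectorizer(),
    VotingClassifier(
        estimators=[("lr", LogisticRegression()),
                    ("nb", MultinomialNB()),
                    ("svm", LinearSVC())],
        voting="hard",            # each classifier gets one vote per note
    ),
)
ensemble.fit(texts, labels)
print(ensemble.predict(["thank you for everything ..."]))
```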
PS: I spotted this at Improbable Results: Machines vs. Professionals: Recognizing Suicide Notes.
Another part of me wants to say that no, the results of classifiers, whether programmed by the same group or different groups, should not make a difference. Well, other than having to “merge” the results of the classifiers, which happens with an ensemble anyway. In that case you might have to think about it more.
Hard to say. Will have to investigate further.
Probability Primer – Mathematicalmonk
From the description:
A series of videos giving an introduction to the definitions, notation, and basic concepts one would encounter in a 1st year graduate probability course.
More math videos from the Mathematicalmonk.
I don’t think a presentation style can be “copied” but surely there are lessons we can take from it to apply in other venues.
Suggestions?
Creating an Elasticsearch Plugin
From the post:
Elasticsearch is a great search engine built on top of Apache Lucene. We came across the need to add new functionality and did not want to fork Elasticsearch for this. Luckily Elasticsearch comes with a plugin framework. We already leverage this framework to use the Apache Thrift transport. There was no documentation on how to create a plugin so after digging around in the code a little we were able to create our own plugin.
Here is a tutorial on creating a plugin and installing it into Elasticsearch.
Just in case you are using Elasticsearch and need to extend it.
This post started with my finding the data mining slides at Slideshare (about 4 years old) and, after organizing those, deciding to check Professor Pier Luca Lanzi’s homepage for more recent material. I think you will find the material useful.
The professor is obviously interested in video games, a rapidly growing area of development and research.
Combining video games with data mining, that would be a real coup.
Data Mining Course page
Includes prior exams, video (2009 course), transparencies from all lectures.
Lecture slides on Data Mining and Machine Learning at Slideshare.
Not being a lemming, I don’t find “most viewed” a helpful sorting criterion.
I organized the data mining slides in course order (as nearly as I could determine, there are two #6 presentations and no #7 or #17 presentations):
05 Association rules: advanced topics
06 Clustering: Partitioning Methods
09 Density-based, Grid-based, and Model-based Clustering
10 Introduction to Classification
13 Nearest Neighbor and Bayesian Classifiers
15 Data Exploration and Preparation
Genetic Algorithms
Bayesian Reasoning and Machine Learning by David Barber.
Whom this book is for
The book is designed to appeal to students with only a modest mathematical background in undergraduate calculus and linear algebra. No formal computer science or statistical background is required to follow the book, although a basic familiarity with probability, calculus and linear algebra would be useful. The book should appeal to students from a variety of backgrounds, including Computer Science, Engineering, applied Statistics, Physics, and Bioinformatics that wish to gain an entry to probabilistic approaches in Machine Learning. In order to engage with students, the book introduces fundamental concepts in inference using only minimal reference to algebra and calculus. More mathematical techniques are postponed until as and when required, always with the concept as primary and the mathematics secondary.
The concepts and algorithms are described with the aid of many worked examples. The exercises and demonstrations, together with an accompanying MATLAB toolbox, enable the reader to experiment and more deeply understand the material. The ultimate aim of the book is to enable the reader to construct novel algorithms. The book therefore places an emphasis on skill learning, rather than being a collection of recipes. This is a key aspect since modern applications are often so specialised as to require novel methods. The approach taken throughout is to first describe the problem as a graphical model, which is then translated into a mathematical framework, ultimately leading to an algorithmic implementation in the BRMLtoolbox.
The book is primarily aimed at final year undergraduates and graduates without significant experience in mathematics. On completion, the reader should have a good understanding of the techniques, practicalities and philosophies of probabilistic aspects of Machine Learning and be well equipped to understand more advanced research level material.
The main page for the book and link to software.
David Barber’s homepage.
The book is due to be published by Cambridge University Press in the summer of 2011.
Machine Learning – Mathematicalmonk
Engaging series of videos on machine learning. (159 as of 8 August 2011)
I have only watched the first five videos, but the videos are helpfully broken down into small segments (8 to 15 minutes), so you don’t have to commit to watching 30, 40 or 60 minutes of lecture at one time.
The lecturer has a very engaging style.
A style I would like to imitate for similar material on topic maps.
Introduction to Algorithms by Prof. Erik Demaine and Prof. Charles Leiserson
From the description:
This course teaches techniques for the design and analysis of efficient algorithms, emphasizing methods useful in practice. Topics covered include: sorting; search trees, heaps, and hashing; divide-and-conquer; dynamic programming; amortized analysis; graph algorithms; shortest paths; network flow; computational geometry; number-theoretic algorithms; polynomial and matrix calculations; caching; and parallel computing.
The iTunes free version.
If you go to MIT: 6.006 Introduction to Algorithms, you can pick up lecture notes, exams and solutions, and assignments (no solutions).
Must be the coming of Fall. A young person’s mind turns to CS lectures. 😉
Mahout: Scaleable Data Mining for Everybody by Ted Dunning.
Has to be the most entertaining and accessible presentation on classification I have seen to date.
Ted is a co-author of Mahout in Action with Sean Owen, Robin Anil, and Ellen Friedman.
If they had more of this sort of thing during the pledge drives to support public television I would bet that their numbers would be better. At least among a certain crowd! 😉
Computational Fairy Tales by Jeremy Kubica.
From the post:
Computer science concepts as told through fairy tales.
Quite an interesting collection of stories.
U.S. to Fund Hacking Projects That Thwart Cyber-Threats
From the post:
LAS VEGAS—Former L0pht hacker known as “Mudge” discussed a new government initiative to fund hacking projects designed to help block cyber-threats at the Black Hat security conference.
The Defense Advanced Research Projects Agency will fund new cyber-security proposals under the new Cyber-Fast Track project, Peiter Zatko, currently a program manager for the agency’s information innovation office, said in his Aug. 4 keynote speech at Black Hat. The project, originally announced at ShmooCon cyber-security conference back in January, will bridge the gap between hacker groups and government agencies, he said.
Under the Cyber-Fast Track initiative, DARPA will fund between 20 to 100 projects annually. Open to anybody, researchers can pitch DARPA with ideas and have a project approved and funded within 14 days of the application, Zatko said. Developers will retain intellectual property rights while DARPA will operate under government use rights, Zatko said.
That sounds awfully attractive.
Suspect the more specific the proposal, the better the chance of getting it funded, so will be omitting the universality arguments about topic maps and the coming data singularity. 😉
I don’t hang out in hacker circles (oversight on my part) so I guess the first step is to look at some of the conferences to see what threats are being discussed along with current remedies. To get a feel for where topic maps could make a difference.
If you do hang out in hacker circles (don’t tell me) and you are interested in working on a topic maps proposal for DARPA (I won’t ask where you get all your brilliant hacker ideas from, you must just read a lot), drop me a post.
On the Nature of Pipes by Marko Rodriguez.
From the post:
Pipes is a data flow framework developed by TinkerPop. The graph traversal language Gremlin is a Groovy-based domain-specific language for processing Blueprints-enabled graph databases with Pipes. Since the release of Pipes 0.7 on August 1, 2011, much of the functionality in Gremlin has been generalized and made available through Pipes. This has opened up the door for other JVM languages (e.g. JRuby, Jython, Clojure, etc.) to serve as host languages for graph traversal DSLs. In order to promote this direction, this post will explain Pipes from the vantage point of Gremlin.
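For a feel of the data-flow idea outside the JVM, here is a minimal Python analogy in which each “pipe” is a lazy transformation of an iterator and a traversal is pipes chained end to end. This is not the Pipes/Gremlin API, just the shape of it.

```python
# Analogy only: each "pipe" lazily transforms an iterator of graph elements,
# and a traversal is a chain of pipes -- roughly what a Gremlin expression
# compiles down to.

graph = {                     # tiny adjacency structure: name -> knows
    "marko": ["vadas", "josh"],
    "josh":  ["lop", "ripple"],
    "vadas": [], "lop": [], "ripple": [],
}

def out_pipe(vertices):       # vertex -> its outgoing neighbours
    for v in vertices:
        yield from graph.get(v, [])

def dedup_pipe(items):        # drop duplicates while staying lazy
    seen = set()
    for x in items:
        if x not in seen:
            seen.add(x)
            yield x

def chain(start, *pipes):     # compose pipes left to right
    flow = iter(start)
    for pipe in pipes:
        flow = pipe(flow)
    return flow

# Roughly: g.v('marko').out.out.dedup in Gremlin terms.
print(list(chain(["marko"], out_pipe, out_pipe, dedup_pipe)))  # -> ['lop', 'ripple']
```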
You may not be a graph database enthusiast after reading Marko’s post but you will increase your understanding of them.
That you are not then a graph database enthusiast will be your own fault. 😉
The quiet rise of Gaussian Belief Propagation (GaBP) by Danny Bickson.
From the post:
Gaussian Belief Propagation is an inference method on a Gaussian graphical model which is related to solving a linear system of equations, one of the fundamental problems in computer science and engineering. I have published my PhD thesis on applications of GaBP in 2008.
When I started working on GaBP, it was an absolutely useless algorithm with no documented applications.
Recently, I am getting a lot of inquiries from people who are applying GaBP to real-world problems. Some examples:
- Carnegie Mellon graduate student Kyung-Ah Sohn, working with Eric Xing, is working on a regression problem for finding causal genetic variants of gene expressions, and considered using GaBP for computing matrix inverses.
- UCSC researcher Daniel Zerbino is using GaBP for smoothing genomic sequencing measurements with constraints.
- UCSB graduate student Yun Teng is working on implementing GaBP as part of the KDT (knowledge discovery toolbox package).
Furthermore, I was very excited to find out today from Noam Koenigstein, a Tel Aviv University graduate, about the Microsoft Research Cambridge project called MatchBox, which is using Gaussian BP for collaborative filtering and is actually deployed in MS. Some examples of other conversations I had are:
- An undisclosed Wall Street company (that asked to remain private) is using GaBP for parallel computation of linear regression on online stock market data.
- A gas and oil company was considering using GaBP for computing the main diagonal of the inverse of a sparse matrix.
The MatchBox project is a recommender system that takes user choices into account, even ones in a current “session.”
Curious, to what extent are user preferences the same as or different from the way they identify subjects and the subjects they would identify?
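To make the linear-system connection in the quoted opening concrete, here is a compact, unoptimized sketch of the GaBP updates for solving Ax = b, following the message forms in Bickson's thesis. It is an illustration only; convergence is only assured for suitable matrices (e.g. diagonally dominant ones).

```python
import numpy as np

def gabp_solve(A, b, iters=50):
    """Sketch of Gaussian Belief Propagation for Ax = b (A symmetric).

    Messages carry a precision P[i, j] and mean mu[i, j] from node i to j.
    Illustration only; convergence needs e.g. a diagonally dominant A.
    """
    n = len(b)
    P = np.zeros((n, n))                  # message precisions
    mu = np.zeros((n, n))                 # message means
    Pii = np.diag(A).copy()
    mii = b / Pii

    for _ in range(iters):
        for i in range(n):
            for j in range(n):
                if i == j or A[i, j] == 0:
                    continue
                # Aggregate all incoming messages to i except the one from j.
                others = [k for k in range(n) if k != i and k != j and A[k, i] != 0]
                P_excl = Pii[i] + sum(P[k, i] for k in others)
                mu_excl = (Pii[i] * mii[i]
                           + sum(P[k, i] * mu[k, i] for k in others)) / P_excl
                P[i, j] = -A[i, j] ** 2 / P_excl
                mu[i, j] = (P_excl * mu_excl) / A[i, j]

    # Marginal means are the solution estimate.
    x = np.empty(n)
    for i in range(n):
        nbrs = [k for k in range(n) if k != i and A[k, i] != 0]
        Pi = Pii[i] + sum(P[k, i] for k in nbrs)
        x[i] = (Pii[i] * mii[i] + sum(P[k, i] * mu[k, i] for k in nbrs)) / Pi
    return x

A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([5.0, 10.0])
print(gabp_solve(A, b))   # close to np.linalg.solve(A, b) -> [1.0, 3.0]
```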
The joy of algorithms and NoSQL: a MongoDB example
From the post:
In one of my previous blog posts, I debated the superficial idea that you should own billions of data records before you are eligible to use NoSQL/Big Data technologies. In this article, I try to illustrate my point, by employing NoSQL, and more specifically MongoDB, to solve a specific Chemoinformatics problem in a truly elegant and efficient way. The complete source code can be found on the Datablend public GitHub repository.
1. Molecular similarity theory
Molecular similarity refers to the similarity of chemical compounds with respect to their structural and/or functional qualities. By calculating molecular similarities, Chemoinformatics is able to help in the design of new drugs by screening large databases for potentially interesting chemical compounds. (This by applying the hypothesis that similar compounds generally exhibit similar biological activities.) Unfortunately, finding substructures in chemical compounds is an NP-complete problem. Hence, calculating similarities for a particular target compound can take a very long time when considering millions of input compounds. Scientists solved this problem by introducing the notion of structural keys and fingerprints.
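The fingerprint trick boils down to comparing feature sets instead of substructure graphs, typically with the Tanimoto coefficient. A minimal sketch with made-up fingerprints (the post does this at scale inside MongoDB; this only shows the measure):

```python
# Tanimoto similarity over made-up structural fingerprints (sets of feature keys).

fingerprints = {
    "aspirin":   {"ring:benzene", "group:carboxyl", "group:ester"},
    "salicylic": {"ring:benzene", "group:carboxyl", "group:hydroxyl"},
    "ethanol":   {"group:hydroxyl", "chain:ethyl"},
}

def tanimoto(a, b):
    """|A ∩ B| / |A ∪ B| -- 1.0 for identical fingerprints, 0.0 for disjoint ones."""
    return len(a & b) / len(a | b)

target = fingerprints["aspirin"]
for name, fp in fingerprints.items():
    print(f"{name:10s} {tanimoto(target, fp):.2f}")
```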
If similarity is domain specific, what are the similarity measures in your favorite domain?