Archive for the ‘Pattern Recognition’ Category

Data Mining Patterns in Crossword Puzzles [Patterns in Redaction?]

Saturday, March 5th, 2016

A Plagiarism Scandal Is Unfolding In The Crossword World by Oliver Roeder.

From the post:

A group of eagle-eyed puzzlers, using digital tools, has uncovered a pattern of copying in the professional crossword-puzzle world that has led to accusations of plagiarism and false identity.

Since 1999, Timothy Parker, editor of one of the nation’s most widely syndicated crosswords, has edited more than 60 individual puzzles that copy elements from New York Times puzzles, often with pseudonyms for bylines, a new database has helped reveal. The puzzles in question repeated themes, answers, grids and clues from Times puzzles published years earlier. Hundreds more of the puzzles edited by Parker are nearly verbatim copies of previous puzzles that Parker also edited. Most of those have been republished under fake author names.

Nearly all this replication was found in two crosswords series edited by Parker: the USA Today Crossword and the syndicated Universal Crossword. (The copyright to both puzzles is held by Universal Uclick, which grew out of the former Universal Press Syndicate and calls itself “the leading distributor of daily puzzle and word games.”) USA Today is one of the country’s highest-circulation newspapers, and the Universal Crossword is syndicated to hundreds of newspapers and websites.

On Friday, a publicity coordinator for Universal Uclick, Julie Halper, said the company declined to comment on the allegations. FiveThirtyEight reached out to USA Today for comment several times but received no response.

Oliver does a great job setting up the background on crossword puzzles and exploring the data that underlies this story. A must read if you are interested in crossword puzzles or know someone who is.

I was more taken with “how” the patterns were mined, which Oliver also covers:

Tausig discovered this with the help of the newly assembled database of crossword puzzles created by Saul Pwanson [Pwanson changed his legal name from Paul Swanson], a software engineer. Pwanson wrote the code that identified the similar puzzles and published a list of them on his website, along with code for the project on GitHub. The puzzle database is the result of Pwanson’s own Web-scraping of about 30,000 puzzles and the addition of a separate digital collection of puzzles that has been maintained by solver Barry Haldiman since 1999. Pwanson’s database now holds nearly 52,000 crossword puzzles, and Pwanson’s website lists all the puzzle pairs that have a similarity score of at least 25 percent.

The .xd futureproof crossword format page reads in part:

.xd is a corpus-oriented format, modeled after the simplicity and intuitiveness of the markdown format. It supports 99.99% of published crosswords, and is intended to be convenient for bulk analysis of crosswords by both humans and machines, from the present and into the future.

My first thought was of mining patterns in government redacted reports.

My second thought was that an ASCII format that specifies line length (to allow for varying font sizes) in characters, plus line breaks and lines composed of characters, whitespace and markouts as single characters should fit the bill. Yes?
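A minimal sketch of what such a format might look like, in Python. Everything specific here is an assumption for illustration and not an existing standard: the `width=N` header line, the use of `█` as the single-character markout, and the function names `parse_redacted` and `redaction_spans`.

```python
import re

MARKOUT = "█"  # assumed single-character markout; any reserved character would do


def parse_redacted(text):
    """Parse a page whose first line is 'width=N' and whose other lines are page lines."""
    header, *lines = text.split("\n")
    width = int(header.split("=", 1)[1])
    # Pad (or trim) every line to the declared width so column positions are comparable.
    return width, [line.ljust(width)[:width] for line in lines]


def redaction_spans(lines):
    """Return (line_number, start_column, length) for each run of markouts."""
    spans = []
    for number, line in enumerate(lines):
        for match in re.finditer(re.escape(MARKOUT) + "+", line):
            spans.append((number, match.start(), match.end() - match.start()))
    return spans


# Made-up sample page, for illustration only.
page = "width=40\nThe source, ██████████, met with\n████ on 12 March in ███████."
width, lines = parse_redacted(page)
spans = redaction_spans(lines)
```

With positions and lengths of markouts in hand, comparing the same passage across differently redacted releases becomes a matter of comparing spans.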

Surely such a format exists now, yes? Pointers please!

There are those who merit protection by redacted documents, but children are more often victimized by spy agencies than employed by them.

CVPR 2015 Papers

Sunday, June 14th, 2015

CVPR [Computer Vision and Pattern Recognition] 2015 Papers by @karpathy.

This is very cool!

From the webpage:

Below every paper are TOP 100 most-occurring words in that paper and their color is based on LDA topic model with k = 7.
(It looks like 0 = datasets?, 1 = deep learning, 2 = videos, 3 = 3D Computer Vision, 4 = optimization?, 5 = low-level Computer Vision?, 6 = descriptors?)

You can sort by LDA topics, view the PDFs, rank the other papers by tf-idf similarity to a particular paper.
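For anyone who has not met tf-idf ranking before, here is a small sketch of the general idea in plain Python. It is not the code behind the CVPR page; the weighting scheme (term frequency times log inverse document frequency), the function names and the three toy "papers" are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized document to a {term: tf-idf weight} dictionary."""
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dictionaries."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def rank_by_similarity(docs, query_index):
    """Return the indices of the other documents, most similar first."""
    vectors = tfidf_vectors(docs)
    scores = [(cosine(vectors[query_index], v), i)
              for i, v in enumerate(vectors) if i != query_index]
    return [i for _, i in sorted(scores, reverse=True)]


# Toy stand-ins for paper texts (made up).
papers = [
    "deep convolutional network image classification".split(),
    "convolutional network object detection".split(),
    "camera pose estimation 3d reconstruction".split(),
]
ranking = rank_by_similarity(papers, 0)
```

Here the second paper shares "convolutional" and "network" with the first, so it ranks ahead of the third, which shares nothing.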

Very impressive and suggestive of other refinements for viewing a large number of papers in a given area.


Signatures, patterns and trends: Timeseries data mining at Etsy

Sunday, June 7th, 2015

From the description:

Etsy loves metrics. Everything that happens in our data centres gets recorded, graphed and stored. But with over a million metrics flowing in constantly, it’s hard for any team to keep on top of all that information. Graphing everything doesn’t scale, and traditional alerting methods based on thresholds become very prone to false positives.

That’s why we started Kale, an open-source software suite for pattern mining and anomaly detection in operational data streams. These are big topics with decades of research, but many of the methods in the literature are ineffective on terabytes of noisy data with unusual statistical characteristics, and techniques that require extensive manual analysis are unsuitable when your ops teams have service levels to maintain.

In this talk I’ll briefly cover the main challenges that traditional statistical methods face in this environment, and introduce some pragmatic alternatives that scale well and are easy to implement (and automate) on Elasticsearch and similar platforms. I’ll talk about the stumbling blocks we encountered with the first release of Kale, and the resulting architectural changes coming in version 2.0. And I’ll go into a little technical detail on the algorithms we use for fingerprinting and searching metrics, and detecting different kinds of unusual activity. These techniques have potential applications in clustering, outlier detection, similarity search and supervised learning, and they are not limited to the data centre but can be applied to any high-volume timeseries data.

Blog post:

Signatures, patterns and trends? Sounds relevant to monitoring network patterns. Yes?

Good focus on anomaly detection, pointing out that many explanations are overly simplistic.

Use case is one (1) million incoming metrics.

Looking forward to seeing this released as open source!

Breaking the Similarity Bottleneck

Saturday, May 9th, 2015

Ultra-Fast Data-Mining Hardware Architecture Based on Stochastic Computing by Antoni Morro, et al.

Abstract:

Minimal hardware implementations able to cope with the processing of large amounts of data in reasonable times are highly desired in our information-driven society. In this work we review the application of stochastic computing to probabilistic-based pattern-recognition analysis of huge database sets. The proposed technique consists in the hardware implementation of a parallel architecture implementing a similarity search of data with respect to different pre-stored categories. We design pulse-based stochastic-logic blocks to obtain an efficient pattern recognition system. The proposed architecture speeds up the screening process of huge databases by a factor of 7 when compared to a conventional digital implementation using the same hardware area.
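If stochastic computing is new to you, the textbook building block is easy to simulate: a value in [0, 1] is encoded as a random bitstream whose fraction of ones equals the value, and two independent streams are multiplied with nothing more than an AND gate. The sketch below shows only that building block in software. It is not the architecture proposed in the paper, and the function names and stream length are my own assumptions.

```python
import random


def encode(p, length, rng):
    """Encode a probability p in [0, 1] as a random bitstream with P(bit = 1) = p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def decode(bits):
    """Estimate the encoded value as the fraction of ones in the stream."""
    return sum(bits) / len(bits)


def stochastic_multiply(a_bits, b_bits):
    """Multiply two independently encoded values with a bitwise AND."""
    return [a & b for a, b in zip(a_bits, b_bits)]


rng = random.Random(0)  # fixed seed so the run is repeatable
a = encode(0.6, 100_000, rng)
b = encode(0.5, 100_000, rng)
# The estimate approaches 0.6 * 0.5 = 0.30, up to sampling noise.
estimate = decode(stochastic_multiply(a, b))
```

The trade is precision for hardware: one gate per multiplication, with accuracy that improves only as the streams get longer.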

I haven’t included the hyperlinks, but:

In this work we present a highly efficient methodology for data mining based on probabilistic processing. High dimensional data is inherently complex in clustering, classification and similarity search [15]. The proposed approach is evaluated showing its application to a similarity search over a huge database. Most data mining algorithms use similarity search as a subroutine core [16–18], and thus the time taken for this task is the bottleneck of virtually all data mining algorithms [19]. Similarity search plays a fundamental role in many data mining and machine learning problems, e.g. text categorization [20], collaborative filtering [21], time-series analysis [22,23], protein sequencing [24] or any application-specific task as petroglyphs comparison [25]. At the same time, the mining of huge datasets implies the use of large computer clusters [26,27]. The proposed approach based on the use of probabilistic processing shows large improvements in terms of hardware resources when compared with conventional solutions.

Sorry they omitted topic maps, but what is a merging criterion if it isn’t a type of “similarity”?

From the conclusion:

This implementation uses less hardware resources than conventional digital methodologies (based on binary and not probabilistic logic) and is able to process the order of 13GBytes of information per second (in contrast to the estimated 2GBytes/s of speed that could be achieved by the conventional implementation using the same hardware area). With the 12-dimensional space used to allocate each vector in the example shown in this paper we obtain the order of 1 billion of comparisons per second. A patent application has been done for this new mining methodology [32].

The patent was filed in Spanish but English and French auto-translations are available.

Hopefully the patent will be used in such a way as to promote widespread implementation of this technique.

I could stand 1 billion comparisons a second, quite easily. Interactive development of merging algorithms anyone?

I first saw this in a tweet by Stefano Bertolo.

Parsing Drug Dosages in text…

Sunday, March 30th, 2014

Parsing Drug Dosages in text using Finite State Machines by Sujit Pal.

From the post:

Someone recently pointed out an issue with the Drug Dosage FSM in Apache cTakes on the cTakes mailing list. Looking at the code for it revealed a fairly complex implementation based on a hierarchy of Finite State Machines (FSM). The intuition behind the implementation is that Drug Dosage text in doctor’s notes tend to follow a standard-ish format, and FSMs can be used to exploit this structure and pull out relevant entities out of this text. The paper Extracting Structured Medication Event Information from Discharge Summaries has more information about this problem. The authors provide their own solution, called the Merki Medication Parser. Here is a link to their Online Demo and source code (Perl).

I’ve never used FSMs myself, although I have seen it used to model (more structured) systems. So the idea of using FSMs for parsing semi-structured text such as this seemed interesting and I decided to try it out myself. The implementation I describe here is nowhere nearly as complex as the one in cTakes, but on the flip side, is neither as accurate, nor broad nor bulletproof either.

My solution uses drug dosage phrase data provided in this Pattern Matching article by Erin Rhode (which also comes with a Perl based solution), as well as its dictionaries (with additions by me), to model the phrases with the state diagram below. I built the diagram by eyeballing the outputs from Erin Rhode’s program. I then implement the state diagram with a home-grown FSM implementation based on ideas from Electric Monk’s post on FSMs in Python and the documentation for the Java library Tungsten FSM. I initially tried to use Tungsten-FSM, but ended up with extremely verbose Scala code because of Scala’s stricter generics system.

This caught my attention because I was looking at a data import handler recently that was harvesting information from a minimal XML wrapper around mediawiki markup. Works quite well but seems like a shame to miss all the data in wiki markup.

I say “miss all the data in wiki markup” and that’s not really fair. It is dumped into a single field for indexing. But that is a field that loses the context distinctions between a note, appendix, bibliography, or even the main text.

If you need distinctions that aren’t the defaults, you may be faced with rolling your own FSM. This post should help get you started.
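To give a feel for how little machinery a home-grown FSM needs, here is a minimal Python sketch for very simple dosage phrases. It is not Sujit Pal's implementation or the cTakes one; the states, the transition table and the tiny `UNITS` and `FREQUENCIES` dictionaries are hypothetical and far too small for real notes.

```python
# Hypothetical, tiny dictionaries; a real parser would need far larger ones.
UNITS = {"mg", "ml", "mcg", "tablet", "tablets"}
FREQUENCIES = {"daily", "bid", "tid", "qid", "prn"}


def classify(token):
    """Assign a token class that the state machine can branch on."""
    if token.replace(".", "", 1).isdigit():
        return "NUMBER"
    if token in UNITS:
        return "UNIT"
    if token in FREQUENCIES:
        return "FREQ"
    return "WORD"


# (current state, token class) -> next state
TRANSITIONS = {
    ("START", "WORD"): "DRUG",
    ("DRUG", "WORD"): "DRUG",
    ("DRUG", "NUMBER"): "DOSE",
    ("DOSE", "UNIT"): "UNIT",
    ("UNIT", "FREQ"): "FREQ",
}
ACCEPTING = {"UNIT", "FREQ"}


def parse_dosage(phrase):
    """Return the tokens collected in each state, or None if the phrase is rejected."""
    state, fields = "START", {}
    for token in phrase.lower().split():
        state = TRANSITIONS.get((state, classify(token)))
        if state is None:
            return None  # no transition defined: not a dosage phrase we recognize
        fields.setdefault(state, []).append(token)
    if state not in ACCEPTING:
        return None
    return {name: " ".join(tokens) for name, tokens in fields.items()}


result = parse_dosage("Aspirin 81 mg daily")
```

`result` holds the drug, dose, unit and frequency under the names of the states that consumed them. Swapping in other token classes and transitions is how you would recover distinctions such as note versus bibliography in wiki markup.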

Pattern recognition toolbox

Monday, December 30th, 2013

Pattern recognition toolbox by Thomas W. Rauber.

From the webpage:

TOOLDIAG is a collection of methods for statistical pattern recognition. The main area of application is classification. The application area is limited to multidimensional continuous features, without any missing values. No symbolic features (attributes) are allowed. The program is implemented in the ‘C’ programming language and was tested in several computing environments. The user interface is simple, command-line oriented, but the methods behind it are efficient and fast. You can customize your own methods on the application programming level with relatively little effort. If you wish a presentation of the theory behind the program at your university, feel free to contact me.

Command line classification. A steeper learning curve than some, but expect greater flexibility as well.

I thought the requirement of “no missing values” was curious.

If you have a data set with some legitimately missing values, how are you going to replace them in a neutral way?

Making Sense of Patterns in the Twitterverse

Sunday, June 9th, 2013

Making Sense of Patterns in the Twitterverse

From the post:

If you think keeping up with what’s happening via Twitter, Facebook and other social media is like drinking from a fire hose, multiply that by 7 billion — and you’ll have a sense of what Court Corley wakes up to every morning.

Corley, a data scientist at the Department of Energy’s Pacific Northwest National Laboratory, has created a powerful digital system capable of analyzing billions of tweets and other social media messages in just seconds, in an effort to discover patterns and make sense of all the information. His social media analysis tool, dubbed “SALSA” (SociAL Sensor Analytics), combined with extensive know-how — and a fair degree of chutzpah — allows someone like Corley to try to grasp it all.

“The world is equipped with human sensors — more than 7 billion and counting. It’s by far the most extensive sensor network on the planet. What can we learn by paying attention?” Corley said.

Among the payoffs Corley envisions are emergency responders who receive crucial early information about natural disasters such as tornadoes; a tool that public health advocates can use to better protect people’s health; and information about social unrest that could help nations protect their citizens. But finding those jewels amidst the effluent of digital minutia is a challenge.

“The task we all face is separating out the trivia, the useless information we all are blasted with every day, from the really good stuff that helps us live better lives. There’s a lot of noise, but there’s some very valuable information too.”

The work by Corley and colleagues Chase Dowling, Stuart Rose and Taylor McKenzie was named best paper given at the IEEE conference on Intelligence and Security Informatics in Seattle this week.

Another one of those “name” issues as the IEEE conference site reports:

Courtney Corley, Chase Dowling, Stuart Rose and Taylor McKenzie. SociAL Sensor Analytics: Measuring Phenomenology at Scale.

I am assuming from the other researchers matching that this is the “Court/Courtney” in question.

I was unable to find an online copy of the paper but suspect it will eventually appear in an IEEE archive.

From the news report, very interesting and useful work.

Course on Information Theory, Pattern Recognition, and Neural Networks

Friday, November 23rd, 2012

Course on Information Theory, Pattern Recognition, and Neural Networks by David MacKay.

From the description:

A series of sixteen lectures covering the core of the book “Information Theory, Inference, and Learning Algorithms (Cambridge University Press, 2003)” which can be bought at Amazon, and is available free online. A subset of these lectures used to constitute a Part III Physics course at the University of Cambridge. The high-resolution videos and all other course material can be downloaded from the Cambridge course website.

Excellent lectures on information theory, the probability that a message sent is the one received.

Makes me wonder if there is a similar probability theory for the semantics of a message sent being the semantics of the message as received?

Information Theory, Pattern Recognition, and Neural Networks

Friday, July 27th, 2012

Information Theory, Pattern Recognition, and Neural Networks by David MacKay.

David MacKay’s lectures with slides on information theory, inference and neural networks. Spring/Summer of 2012.

Just in time for the weekend!

I saw this in Christophe Lalanne’s Bag of Tweets for July 2012.

ICDM 2012

Monday, April 23rd, 2012

ICDM 2012 Brussels, Belgium | December 10 – 13, 2012

From the webpage:

The IEEE International Conference on Data Mining series (ICDM) has established itself as the world’s premier research conference in data mining. It provides an international forum for presentation of original research results, as well as exchange and dissemination of innovative, practical development experiences. The conference covers all aspects of data mining, including algorithms, software and systems, and applications.

ICDM draws researchers and application developers from a wide range of data mining related areas such as statistics, machine learning, pattern recognition, databases and data warehousing, data visualization, knowledge-based systems, and high performance computing. By promoting novel, high quality research findings, and innovative solutions to challenging data mining problems, the conference seeks to continuously advance the state-of-the-art in data mining. Besides the technical program, the conference features workshops, tutorials, panels and, since 2007, the ICDM data mining contest.

Important Dates:

ICDM contest proposals: April 30
Conference full paper submissions: June 18
Demo and tutorial proposals: August 10
Workshop paper submissions: August 10
PhD Forum paper submissions: August 10
Conference paper, tutorial, demo notifications: September 18
Workshop paper notifications: October 1
PhD Forum paper notifications: October 1
Camera-ready copies and copyright forms: October 15

Fast Deep/Recurrent Nets for AGI Vision

Monday, October 24th, 2011

Fast Deep/Recurrent Nets for AGI Vision

Jürgen Schmidhuber at AGI-2011 delivers a deeply amusing presentation promoting neural networks, particularly deep/recurrent networks pioneered by his lab.

The jargon falls fast and furious so you probably want to visit his homepage for pointers to more information.

A wealth of information awaits! Suggestions on what looks the most promising for assisted topic map authoring welcome!

Mining Associations and Patterns from Semantic Data

Friday, September 2nd, 2011

The editors of a special issue of the International Journal on Semantic Web and Information Systems on Mining Associations and Patterns from Semantic Data have issued the following call for papers:

Guest editors: Kemafor Anyanwu, Ying Ding, Jie Tang, and Philip Yu

Large amounts of Semantic Data is being generated through semantic extractions from and annotation of traditional Web, social and sensor data. Linked Open Data has provided excellent vehicle for representation and sharing of such data. Primary vehicle to get semantics useful for better integration, search and decision making is to find interesting relationships or associations, expressed as meaningful paths, subgraphs and patterns. This special issue seeks theories, algorithms and applications of extracting such semantic relationships from large amount of semantic data. Example topics include:

  • Theories to ground associations and patterns with social, socioeconomic, biological semantics
  • Representation (e.g. language extensions) to express meaningful relationships and patterns
  • Algorithms to efficiently compute and mine semantic associations and patterns
  • Techniques for filtering, ranking and/or visualization of semantic associations and patterns
  • Application of semantic associations and patterns in a domain with significant social or society impact

IJSWIS is included in most major indices including CSI, with Thomson Scientific impact factor 2.345. We seek high quality manuscripts suitable for an archival journal based on original research. If the manuscript is based on a prior workshop or conference submission, submissions should reflect significant novel contribution/extension in conceptual terms and/or scale of implementation and evaluation (authors are highly encouraged to clarify new contributions in a cover letter or within the submission).

Important Dates:
Submission of full papers: Feb 29, 2012
Notification of paper acceptance: May 30, 2012
Publication target: 3Q 2012

Details of the journal, manuscript preparation, and recent articles are available on the journal website.

Guest Editors:
Prof. Kemafor Anyanwu, North Carolina State University
Prof. Ying Ding, Indiana University
Prof. Jie Tang, Tsinghua University
Prof. Philip Yu, University of Illinois, Chicago
Contact Guest Editor: Ying Ding

Pattern recognition and machine learning

Saturday, February 12th, 2011

Pattern recognition and machine learning by Christopher M. Bishop was mentioned in Which Automatic Differentiation Tool for C/C++?, a post by Bob Carpenter.

I ran across another reference to it today that took me to a page with exercise solutions, corrections and other materials that will be of interest if you are using the book for a class or self-study.

See: PRML: Pattern Recognition and Machine Learning

I was impressed enough by the materials to go ahead and order a copy of it.

It is fairly long and I have to start up a blog on ODF (Open Document Format), so don’t expect a detailed summary any time soon.

Which Automatic Differentiation Tool for C/C++?

Tuesday, February 8th, 2011

Which Automatic Differentiation Tool for C/C++?

OK, not immediately obvious why this is relevant to topic maps.

Nor are Bob Carpenter’s references:

I’ve been playing with all sorts of fun new toys at the new job at Columbia and learning lots of new algorithms. In particular, I’m coming to grips with Hamiltonian (or hybrid) Monte Carlo, which isn’t as complicated as the physics-based motivations may suggest (see the discussion in David MacKay’s book and then move to the more detailed explanation in Christopher Bishop’s book).

particularly useful.

I suspect the two book references are:

  • Information Theory, Inference, and Learning Algorithms by David MacKay
  • Pattern Recognition and Machine Learning by Christopher M. Bishop

but I haven’t asked. In part to illustrate the problem of resolving any entity reference. Both authors have authored other books touching on the same subjects so my guesses may or may not be correct.

Oh, relevance to topic maps. The technique automatic differentiation is used in Hamiltonian Monte Carlo methods for the generation of gradients. Still not helpful? Isn’t to me either.
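For the curious, automatic differentiation is less exotic than it sounds. The sketch below is a minimal forward-mode version using dual numbers, written in Python. It is not one of the C/C++ tools the post compares, and the class and function names (`Dual`, `derivative`) are my own choices; a real Hamiltonian Monte Carlo setup would need gradients over many variables.

```python
import math
from dataclasses import dataclass


@dataclass
class Dual:
    """A number carrying its value and the derivative of that value."""
    value: float
    deriv: float = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def sin(x):
    """Sine for dual numbers, using the chain rule for the derivative part."""
    return Dual(math.sin(x.value), math.cos(x.value) * x.deriv)


def derivative(f, x):
    """Evaluate f'(x) by seeding the input with derivative 1."""
    return f(Dual(x, 1.0)).deriv


def f(x):
    return x * x + 3 * x + sin(x)


slope = derivative(f, 0.0)  # analytically 2*0 + 3 + cos(0) = 4
```

No symbolic algebra and no finite differences: the derivative is carried along with the value through every arithmetic operation.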

Ah, what about Bayesian models in IR? That made the light go on!

I will be discussing ways to show more immediate relevance to topic maps, at least for some posts, in post #1000.

It isn’t as far away as you might think.

A Brief Survey on Sequence Classification

Monday, December 6th, 2010

A Brief Survey on Sequence Classification
Authors: Zhengzheng Xing, Jian Pei, Eamonn Keogh

Abstract:

Sequence classification has a broad range of applications such as genomic analysis, information retrieval, health informatics, finance, and abnormal detection. Different from the classification task on feature vectors, sequences do not have explicit features. Even with sophisticated feature selection techniques, the dimensionality of potential features may still be very high and the sequential nature of features is difficult to capture. This makes sequence classification a more challenging task than classification on feature vectors. In this paper, we present a brief review of the existing work on sequence classification. We summarize the sequence classification in terms of methodologies and application domains. We also provide a review on several extensions of the sequence classification problem, such as early classification on sequences and semi-supervised learning on sequences.

Excellent survey article on sequence classification, which, as the authors note, is a rapidly developing field of research.

This article was published in the “newsletter” of the ACM Special Interest Group on Knowledge Discovery and Data Mining. Far more substantive material than I am accustomed to seeing in any “newsletter.”

The ACM has very attractive student discounts and if you are serious about being an information professional, it is one of the organizations that I would recommend in addition to the usual library suspects.

Apache Mahout – Website

Tuesday, November 30th, 2010

Apache Mahout

From the website:

Apache Mahout’s goal is to build scalable machine learning libraries. With scalable we mean:

Scalable to reasonably large data sets. Our core algorithms for clustering, classification and batch based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm. However we do not restrict contributions to Hadoop based implementations: Contributions that run on a single node or on a non-Hadoop cluster are welcome as well. The core libraries are highly optimized to allow for good performance also for non-distributed algorithms.

Current capabilities include:

  • Collaborative Filtering
  • User and Item based recommenders
  • K-Means, Fuzzy K-Means clustering
  • Mean Shift clustering
  • Dirichlet process clustering
  • Latent Dirichlet Allocation
  • Singular value decomposition
  • Parallel Frequent Pattern mining
  • Complementary Naive Bayes classifier
  • Random forest decision tree based classifier
  • High performance java collections (previously colt collections)

A topic maps class will only have enough time to show some examples of using Mahout. Perhaps an informal group?

Pattern Recognition

Saturday, November 27th, 2010

Pattern Recognition by Robi Polikar.

Survey of pattern recognition.

Any method that augments your “recognition” of subjects in texts relies on some form of “pattern recognition.”

The suggested reading at the end of the article is very helpful.


  1. Reports of use of any of the pattern recognition techniques in library research? (2-3 pages, citations)
  2. Pick one of the reported techniques. What type of topic map would it be used with? Why? (3-5 pages, citations)
  3. Demonstrate the use of one of the reported techniques on a data set. (project/class presentation)

Classification and Pattern Discovery of Mood in Weblogs

Saturday, November 20th, 2010

Classification and Pattern Discovery of Mood in Weblogs
Authors: Thin Nguyen, Dinh Phung, Brett Adams, Truyen Tran, Svetha Venkatesh

Abstract:

Automatic data-driven analysis of mood from text is an emerging problem with many potential applications. Unlike generic text categorization, mood classification based on textual features is complicated by various factors, including its context- and user-sensitive nature. We present a comprehensive study of different feature selection schemes in machine learning for the problem of mood classification in weblogs. Notably, we introduce the novel use of a feature set based on the affective norms for English words (ANEW) lexicon studied in psychology. This feature set has the advantage of being computationally efficient while maintaining accuracy comparable to other state-of-the-art feature sets experimented with. In addition, we present results of data-driven clustering on a dataset of over 17 million blog posts with mood groundtruth. Our analysis reveals an interesting, and readily interpreted, structure to the linguistic expression of emotion, one that comprises valuable empirical evidence in support of existing psychological models of emotion, and in particular the dipoles pleasure-displeasure and activation-deactivation.

The classification and pattern discovery of sentiment in weblogs will be a high priority for some topic maps.

Detection of teenagers who post to MySpace about violence, for example.


  1. How would you use this technique for research on weblogs? (3-5 pages, no citations)
  2. What other word lists could be applied to research on weblogs? Thoughts on how they could be applied? (3-5 pages, citations)
  3. Does the “mood” of a text impact its classification in traditional schemes? How would you test that question? (3-5 pages, no citations)

Additional resources:

Affective Norms for English Words (ANEW) Instruction Manual and Affective Ratings

ANEW Message: Request form for ANEW word list.

The Complexity and Application of Syntactic Pattern Recognition Using Finite Inductive Strings

Thursday, November 4th, 2010

The Complexity and Application of Syntactic Pattern Recognition Using Finite Inductive Strings
Authors: Elijah Myers, Paul S. Fisher, Keith Irwin, Jinsuk Baek, Joao Setubal
Keywords: Pattern Recognition, finite induction, syntactic pattern recognition, algorithm complexity

Abstract:

We describe herein the results of implementing an algorithm for syntactic pattern recognition using the concept of Finite Inductive Sequences (FI). We discuss this idea, and then provide a big O estimate of the time to execute for the algorithms. We then provide some empirical data to support the analysis of the timing. This timing is critical if one wants to process millions of symbols from multiple sequences simultaneously. Lastly, we provide an example of the two FI algorithms applied to actual data taken from a gene and then describe some results as well as the associated data derived from this example.

Pattern matching is of obvious importance for bioinformatics and, in topic map terms, for recognizing subjects.


  1. What “new problems continue to emerge” that you would use pattern matching to solve? (discussion)
  2. What about those problems makes them suitable for the application of pattern matching? (3-5 pages, no citations)
  3. What about those problems makes them suitable for the particular techniques described in this paper? (3-5 pages, no citations)

SLiMSearch: A Webserver for Finding Novel Occurrences of Short Linear Motifs in Proteins, Incorporating Sequence Context

Saturday, October 23rd, 2010

SLiMSearch: A Webserver for Finding Novel Occurrences of Short Linear Motifs in Proteins, Incorporating Sequence Context
Authors: Norman E. Davey, Niall J. Haslam, Denis C. Shields, Richard J. Edwards
Keywords: short linear motif, motif discovery, minimotif, elm

Short, linear motifs (SLiMs) play a critical role in many biological processes. The SLiMSearch (Short, Linear Motif Search) webserver is a flexible tool that enables researchers to identify novel occurrences of pre-defined SLiMs in sets of proteins. Numerous masking options give the user great control over the contextual information to be included in the analyses, including evolutionary filtering and protein structural disorder. User-friendly output and visualizations of motif context allow the user to quickly gain insight into the validity of a putatively functional motif occurrence. Users can search motifs against the human proteome, or submit their own datasets of UniProt proteins, in which case motif support within the dataset is statistically assessed for over- and under-representation, accounting for evolutionary relationships between input proteins. SLiMSearch is freely available as open source Python modules and all webserver results are available for download. The SLiMSearch server is available at: .
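Short linear motifs are commonly written as regular-expression-like patterns, so the core search step can be sketched with Python's `re` module. This is only the bare matching step, not SLiMSearch itself: there is no masking, no evolutionary filtering and no statistics. The motif `P..P` and both sequences are made up for illustration, and `find_motif` is a hypothetical helper name.

```python
import re


def find_motif(pattern, proteins):
    """Return (protein_id, start, matched_text) for every occurrence, overlaps included."""
    # Wrapping the pattern in a lookahead lets the regex engine report
    # occurrences that overlap one another.
    regex = re.compile("(?=(" + pattern + "))")
    hits = []
    for protein_id, sequence in proteins.items():
        for match in regex.finditer(sequence):
            hits.append((protein_id, match.start(), match.group(1)))
    return hits


# Hypothetical motif and made-up sequences, for illustration only.
motif = "P..P"
proteins = {"protA": "MKPLAPSAPQ", "protB": "MSTNQ"}
hits = find_motif(motif, proteins)
```

Both occurrences in `protA` are reported even though they share the proline at position 5, which a plain `finditer` on the bare pattern would miss.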


Seemed like an appropriate resource to follow on today’s earlier posting.

Note in the keywords, “elm.”

Care to guess what that means? If you are a bioinformatics or biology person you may get it correct.

What do you think the odds are that any person, much less a general search engine, will get it correct?

Topic maps are about making sure you find: Eukaryotic Linear Motif Resource without wading through what a search of any common search engine returns for “elm.”


  1. What other terms in this paper represent other subjects?
  2. What properties would you use to identify those subjects?
  3. How would you communicate those subjects to someone else?

An Algorithm to Find All Identical Motifs in Multiple Biological Sequences

Saturday, October 23rd, 2010

An Algorithm to Find All Identical Motifs in Multiple Biological Sequences
Authors: Ashish Kishor Bindal, R. Sabarinathan, J. Sridhar, D. Sherlin, K. Sekar
Keywords: Sequence motifs, nucleotide and protein sequences, identical motifs, dynamic programming, direct repeat and phylogenetic relationships

Sequence motifs are of greater biological importance in nucleotide and protein sequences. The conserved occurrence of identical motifs represents the functional significance and helps to classify the biological sequences. In this paper, a new algorithm is proposed to find all identical motifs in multiple nucleotide or protein sequences. The proposed algorithm uses the concept of dynamic programming. The application of this algorithm includes the identification of (a) conserved identical sequence motifs and (b) identical or direct repeat sequence motifs across multiple biological sequences (nucleotide or protein sequences). Further, the proposed algorithm facilitates the analysis of comparative internal sequence repeats for the evolutionary studies which helps to derive the phylogenetic relationships from the distribution of repeats.
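The abstract only says the algorithm "uses the concept of dynamic programming," so the following is a textbook stand-in rather than the authors' method: the common-suffix table used for longest-common-substring problems, adapted to report every maximal identical run shared by two sequences. The names `shared_motifs` and `motifs_across` and the `min_len` parameter are my own, and the multi-sequence step is simply all pairwise comparisons.

```python
from itertools import combinations


def shared_motifs(seq_a, seq_b, min_len=3):
    """Return the set of maximal identical substrings shared by two sequences.

    curr[j] holds the length of the longest common suffix of seq_a[:i]
    and seq_b[:j]; only the previous row of the table is kept in memory.
    """
    rows, cols = len(seq_a), len(seq_b)
    prev = [0] * (cols + 1)
    motifs = set()
    for i in range(1, rows + 1):
        curr = [0] * (cols + 1)
        for j in range(1, cols + 1):
            if seq_a[i - 1] == seq_b[j - 1]:
                curr[j] = prev[j - 1] + 1
                # A run is maximal when it cannot be extended to the right.
                at_end = i == rows or j == cols
                if (at_end or seq_a[i] != seq_b[j]) and curr[j] >= min_len:
                    motifs.add(seq_a[i - curr[j]:i])
        prev = curr
    return motifs


def motifs_across(sequences, min_len=3):
    """Map each pair of sequence names to the motifs that pair shares."""
    return {
        (name_a, name_b): shared_motifs(sequences[name_a], sequences[name_b], min_len)
        for name_a, name_b in combinations(sequences, 2)
    }


# Hypothetical toy nucleotide sequences.
print(motifs_across({"s1": "ACGTACGG", "s2": "TTACGTCC", "s3": "GGGG"}))
```

The table costs time proportional to the product of the two sequence lengths, which is fine for a toy but is one reason real tools use suffix structures or indexing instead.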

Good illustration that subject identification, here sequence motifs in nucleotide and protein sequences, varies by domain.

Subject matching in this type of data on the basis of assigned URL identifiers for sequence motifs would be silly.

But that’s the question, isn’t it? What is the appropriate basis for subject matching in a particular domain?


  1. Identify and describe one (1) domain where URL matching for subjects would be unnecessary overhead. (3 pages, no citations)
  2. Identify and describe one (1) domain where URL matching for subjects would be useful. (3 pages, no citations)
  3. What are the advantages of URLs as a lingua franca? (3 pages, no citations)
  4. What are the disadvantages of URLs as a lingua franca? (3 pages, no citations)

BTW, when you see “no citations” that does not mean you should not be reading the relevant literature. What it means is that I want your analysis of the issues and not your channeling of the latest literature.

machine learning open source software

Thursday, October 21st, 2010

machine learning open source software

Open source repository of machine learning software.

Not only are subjects being recognized by these software packages but their processes and choices are subjects as well. Not to mention their description in the literature.

Fruitful grounds for adaptation to topic maps as well as being the subject of topic maps.

There are literally hundreds of software packages here so I welcome suggestions, comments, etc. on any and all of them.


  1. Examples of vocabulary mismatch in machine learning literature?
  2. Using one sample data set, how would you integrate results from different packages? Assume you are not merging classifiers.
  3. What if the classifiers are unknown? That is all you have are the final results. Is your result different? Reliable?
  4. Describe a (singular) merging of classifiers in subject identity terms.

Shogun – A Large Scale Machine Learning Toolbox

Thursday, October 21st, 2010

Shogun – A Large Scale Machine Learning Toolbox

Not for the faint of heart but an excellent resource for those interested in large scale kernel methods.

Offers several Support Vector Machine (SVM) implementations and implementations of the latest kernels. Has interfaces to MATLAB(tm), R, Octave and Python.


  1. Pick any one of the methods. How would you integrate it into augmented authoring for a topic map?
  2. What aspect(s) of this site would you change using topic maps?
  3. What augmented authoring techniques would help you apply topic maps to this site?
  4. Apply topic maps to this site. (project)

GPM: A Graph Pattern Matching Kernel with Diffusion for Chemical Compound Classification

Wednesday, October 20th, 2010

GPM: A Graph Pattern Matching Kernel with Diffusion for Chemical Compound Classification
Authors: Aaron Smalter, Jun Huan and Gerald Lushington


Classifying chemical compounds is an active topic in drug design and other cheminformatics applications. Graphs are general tools for organizing information from heterogeneous sources and have been applied in modeling many kinds of biological data. With the fast accumulation of chemical structure data, building highly accurate predictive models for chemical graphs emerges as a new challenge.

In this paper, we demonstrate a novel technique called Graph Pattern Matching kernel (GPM). Our idea is to leverage existing frequent pattern discovery methods and explore their application to kernel classifiers (e.g. support vector machine) for graph classification. In our method, we first identify all frequent patterns from a graph database. We then map subgraphs to graphs in the database and use a diffusion process to label nodes in the graphs. Finally the kernel is computed using a set matching algorithm. We performed experiments on 16 chemical structure data sets and have compared our methods to other major graph kernels. The experimental results demonstrate excellent performance of our method.

The authors also note:

Publicly-available large-scale chemical compound databases have offered tremendous opportunities for creating highly efficient in silico drug design methods. Many machine learning and data mining algorithms have been applied to study the structure-activity relationship of chemicals with the goal of building classifiers for graph-structured data.

In other words, with a desktop machine, public data and a little imagination, you can make a fundamental contribution to drug design methods. (FYI, the pharmaceutical companies are making money hand over fist.)

Integrating your contribution or its results into existing information, such as with topic maps, will only increase its value.

Integrating Biological Data – Not A URL In Sight!

Wednesday, October 20th, 2010

Actual title: Kernel methods for integrating biological data by Dick de Ridder, The Delft Bioinformatics Lab, Delft University of Technology.

Biological data integration to improve protein expression (read: hugely profitable industrial processes based on biology).

Need to integrate biological data, including “prior knowledge.”

In case kernel methods aren’t your “thing,” one important point:

There are vast seas of economically important data unsullied by URLs.

Kernel methods are one method to integrate some of that data.
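One standard way kernels integrate heterogeneous data, sketched here under my own assumptions rather than taken from the presentation: compute one Gram matrix per data source over the same objects, then take a non-negative weighted sum, which is again a valid kernel. The feature vectors, weights and function names below are all hypothetical.

```python
import math


def linear_kernel(x, y):
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel; gamma is an arbitrary illustrative value."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def gram_matrix(items, kernel):
    """Pairwise kernel values for a list of feature vectors."""
    return [[kernel(p, q) for q in items] for p in items]


def combine(gram_a, gram_b, weight_a=0.5, weight_b=0.5):
    """Weighted sum of two Gram matrices computed over the same objects."""
    return [
        [weight_a * a + weight_b * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(gram_a, gram_b)
    ]


# Two hypothetical views of the same three proteins.
expression = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sequence_features = [[0.0], [1.0], [1.0]]

combined = combine(
    gram_matrix(expression, linear_kernel),
    gram_matrix(sequence_features, rbf_kernel),
)
for row in combined:
    print(row)
```

Note that the objects are aligned by position in each list. Nothing here says how you decided that row 2 in one source and row 2 in the other are the same protein, which is exactly the subject identity question.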


  1. How to integrate kernel methods into topic maps? (research project)
  2. Subjects in a kernel method? (research paper, limit to one method)
  3. Modeling specific uses of kernels in topic maps. (research project)
  4. Edges of kernels? Are there subject limits to kernels? (research project)

EigenSpokes: Surprising Patterns and Scalable Community Chipping in Large Graphs

Friday, October 15th, 2010

EigenSpokes: Surprising Patterns and Scalable Community Chipping in Large Graphs. Authors: B. Aditya Prakash, Ashwin Sridharan, Mukund Seshadri, Sridhar Machiraju, and Christos Faloutsos Keywords: EigenSpokes – Communities – Graphs


We report a surprising, persistent pattern in large sparse social graphs, which we term EigenSpokes. We focus on large Mobile Call graphs, spanning about 186K nodes and millions of calls, and find that the singular vectors of these graphs exhibit a striking EigenSpokes pattern wherein, when plotted against each other, they have clear, separate lines that often neatly align along specific axes (hence the term “spokes”). Furthermore, analysis of several other real-world datasets e.g., Patent Citations, Internet, etc. reveals similar phenomena indicating this to be a more fundamental attribute of large sparse graphs that is related to their community structure.

This is the first contribution of this paper. Additional ones include (a) study of the conditions that lead to such EigenSpokes, and (b) a fast algorithm for spotting and extracting tightly-knit communities, called SpokEn, that exploits our findings about the EigenSpokes pattern.

The notion of “chipping” off communities for further study from a large graph is quite intriguing.

In part because those communities (need I say subjects?) are found as the result of a process of exploration rather than declaration.

To be sure, those subjects can be “declared” in a topic map, but finding them, identifying them and deciding on their subject identity properties is a lot more fun.

DPSP: Distributed Progressive Sequential Pattern Mining on the Cloud

Sunday, October 10th, 2010

DPSP: Distributed Progressive Sequential Pattern Mining on the Cloud Authors: Jen-Wei Huang, Su-Chen Lin, Ming-Syan Chen Keywords: sequential pattern mining, period of interest (POI), customer transactions

The progressive sequential pattern mining problem has been discussed in previous research works. With the increasing amount of data, single processors struggle to scale up. Traditional algorithms running on a single machine may have scalability troubles. Therefore, mining progressive sequential patterns intrinsically suffers from the scalability problem. In view of this, we design a distributed mining algorithm to address the scalability problem of mining progressive sequential patterns. The proposed algorithm DPSP, standing for Distributed Progressive Sequential Pattern mining algorithm, is implemented on top of Hadoop platform, which realizes the cloud computing environment. We propose Map/Reduce jobs in DPSP to delete obsolete itemsets, update current candidate sequential patterns and report up-to-date frequent sequential patterns within each POI. The experimental results show that DPSP possesses great scalability and consequently increases the performance and the practicability of mining algorithms.
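A rough, single-process imitation of the map and reduce steps the abstract describes, to show their shape only. This is not DPSP and involves no Hadoop: it handles just two-item sequences, and the function names, the `poi` window convention and the toy transactions are all my own assumptions.

```python
from collections import defaultdict
from itertools import combinations


def map_customer(transactions, now, poi):
    """Map step: emit (pattern, 1) for each ordered item pair inside the POI.

    `transactions` is a list of (timestamp, item) tuples for one customer.
    Anything outside the window (now - poi, now] is treated as obsolete.
    """
    recent = sorted(t for t in transactions if now - poi < t[0] <= now)
    patterns = set()
    for (time_a, item_a), (time_b, item_b) in combinations(recent, 2):
        if time_a < time_b:  # items bought at the same time form no sequence
            patterns.add((item_a, item_b))
    return [(pattern, 1) for pattern in patterns]


def reduce_counts(emitted, min_support):
    """Reduce step: sum counts per pattern and keep the frequent ones."""
    support = defaultdict(int)
    for pattern, count in emitted:
        support[pattern] += count
    return {p: c for p, c in support.items() if c >= min_support}


def frequent_patterns(customers, now, poi, min_support):
    """Run the map step per customer, then a single reduce."""
    emitted = []
    for transactions in customers.values():
        emitted.extend(map_customer(transactions, now, poi))
    return reduce_counts(emitted, min_support)


# Hypothetical customer histories: (timestamp, item).
customers = {
    "c1": [(1, "A"), (2, "B"), (5, "C")],
    "c2": [(3, "A"), (4, "B")],
    "c3": [(1, "A"), (5, "B")],
}
print(frequent_patterns(customers, now=5, poi=5, min_support=2))
```

Shrinking `poi` makes older purchases obsolete, so the same data can yield different frequent patterns as the window moves.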

The phrase “mining sequential patterns” was coined in Mining Sequential Patterns, a paper by Rakesh Agrawal and Ramakrishnan Srikant that the authors of this paper cite.

The original research was to find patterns in customer transactions, which I suspect are important “subjects” for discovery and representation in commerce topic maps.

Natural Language Toolkit

Wednesday, September 29th, 2010

Natural Language Toolkit is a set of Python modules for natural language processing and text analytics. Brought to my attention by Kirk Lowery.

Two near term tasks come to mind:

  • Feature comparison to LingPipe
  • Finding linguistic software useful for topic maps

Suggestions of other toolkits welcome!

76 Binary Similarity and Distance Measures

Saturday, September 11th, 2010

A Survey of Binary Similarity and Distance Measures Authors: Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert Keywords: binary similarity measure, binary distance measure, hierarchical clustering, classification, operational taxonomic unit. (Journal of Systemics, Cybernetics and Informatics, Vol. 8, No. 1, pp. 43-48, 2010)
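For a feel of what such measures look like, here is a small sketch of four well-known ones (Jaccard, Dice, simple matching and Hamming distance) computed from the usual four counts over a pair of binary vectors. The labeling of the counts, the function names and the zero-denominator convention are my own choices, not taken from the survey.

```python
def otu_counts(x, y):
    """Return (a, b, c, d) for two equal-length 0/1 vectors.

    a: both 1, b: only x is 1, c: only y is 1, d: both 0.
    (This labeling is a convention chosen for this sketch.)
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)
    b = sum(1 for p, q in zip(x, y) if p == 1 and q == 0)
    c = sum(1 for p, q in zip(x, y) if p == 0 and q == 1)
    d = sum(1 for p, q in zip(x, y) if p == 0 and q == 0)
    return a, b, c, d


def _ratio(numerator, denominator):
    # Convention chosen here: report 0.0 when the measure is undefined.
    return numerator / denominator if denominator else 0.0


def jaccard(x, y):
    a, b, c, _ = otu_counts(x, y)
    return _ratio(a, a + b + c)


def dice(x, y):
    a, b, c, _ = otu_counts(x, y)
    return _ratio(2 * a, 2 * a + b + c)


def simple_matching(x, y):
    a, b, c, d = otu_counts(x, y)
    return _ratio(a + d, a + b + c + d)


def hamming_distance(x, y):
    _, b, c, _ = otu_counts(x, y)
    return b + c


x = [1, 1, 0, 0, 1, 0]
y = [1, 0, 0, 1, 1, 0]
print(jaccard(x, y), dice(x, y), simple_matching(x, y), hamming_distance(x, y))
```

Jaccard and Dice ignore joint absences (d) while simple matching counts them, so whether two subjects look "similar" depends on which measure you pick.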

High-Performance Dynamic Pattern Matching over Disordered Streams

Thursday, September 9th, 2010

High-Performance Dynamic Pattern Matching over Disordered Streams by Badrish Chandramouli, Jonathan Goldstein, and David Maier came to me by way of Jack Park.

From the abstract:

Current pattern-detection proposals for streaming data recognize the need to move beyond a simple regular-expression model over strictly ordered input. We continue in this direction, relaxing restrictions present in some models, removing the requirement for ordered input, and permitting stream revisions (modification of prior events). Further, recognizing that patterns of interest in modern applications may change frequently over the lifetime of a query, we support updating of a pattern specification without blocking input or restarting the operator.
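To see why dropping the ordered-input requirement matters, consider a toy matcher for "an A followed by a B within some window" that accepts events in any arrival order. It is a deliberately naive sketch and not the operator from the paper: it does not handle stream revisions or pattern updates, it never discards state, and the name `match_a_then_b` and the event format are assumptions of mine.

```python
def match_a_then_b(events, window):
    """Yield (time_a, time_b) pairs with time_a < time_b <= time_a + window.

    `events` is an iterable of (timestamp, kind) tuples that may arrive in
    any order. Every event seen so far is remembered, so a late-arriving A
    can still be paired with a B that was delivered before it.
    """
    seen_a, seen_b = [], []
    for timestamp, kind in events:
        if kind == "A":
            for time_b in seen_b:
                if timestamp < time_b <= timestamp + window:
                    yield (timestamp, time_b)
            seen_a.append(timestamp)
        elif kind == "B":
            for time_a in seen_a:
                if time_a < timestamp <= time_a + window:
                    yield (time_a, timestamp)
            seen_b.append(timestamp)


# The B stamped 5 is delivered before the A stamped 3.
arrivals = [(5, "B"), (3, "A"), (10, "A"), (20, "B")]
print(list(match_a_then_b(arrivals, window=5)))
```

A matcher that assumed ordered input would have thrown away the early B and missed the pair. The price here is unbounded memory, which is one of the problems a real streaming operator has to solve.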

In case you missed it, this is related to: Experience in Extending Query Engine for Continuous Analytics.

The algorithmic trading use case in this article made me think of Nikita Ogievetsky. For those of you who do not know Nikita, he is an XSLT/topic map maven, currently working in the finance industry.

Do trading interfaces allow user definition of subjects to be identified in data streams? And/or merged with subjects identified in other data streams? Or is that an upgrade from the basic service?