Archive for the ‘Pattern Matching’ Category

Data Mining Patterns in Crossword Puzzles [Patterns in Redaction?]

Saturday, March 5th, 2016

A Plagiarism Scandal Is Unfolding In The Crossword World by Oliver Roeder.

From the post:

A group of eagle-eyed puzzlers, using digital tools, has uncovered a pattern of copying in the professional crossword-puzzle world that has led to accusations of plagiarism and false identity.

Since 1999, Timothy Parker, editor of one of the nation’s most widely syndicated crosswords, has edited more than 60 individual puzzles that copy elements from New York Times puzzles, often with pseudonyms for bylines, a new database has helped reveal. The puzzles in question repeated themes, answers, grids and clues from Times puzzles published years earlier. Hundreds more of the puzzles edited by Parker are nearly verbatim copies of previous puzzles that Parker also edited. Most of those have been republished under fake author names.

Nearly all this replication was found in two crosswords series edited by Parker: the USA Today Crossword and the syndicated Universal Crossword. (The copyright to both puzzles is held by Universal Uclick, which grew out of the former Universal Press Syndicate and calls itself “the leading distributor of daily puzzle and word games.”) USA Today is one of the country’s highest-circulation newspapers, and the Universal Crossword is syndicated to hundreds of newspapers and websites.

On Friday, a publicity coordinator for Universal Uclick, Julie Halper, said the company declined to comment on the allegations. FiveThirtyEight reached out to USA Today for comment several times but received no response.

Oliver does a great job setting up the background on crossword puzzles and exploring the data that underlies this story. A must read if you are interested in crossword puzzles or know someone who is.

I was more taken with “how” the patterns were mined, which Oliver also covers:


Tausig discovered this with the help of the newly assembled database of crossword puzzles created by Saul Pwanson [1. Pwanson changed his legal name from Paul Swanson], a software engineer. Pwanson wrote the code that identified the similar puzzles and published a list of them on his website, along with code for the project on GitHub. The puzzle database is the result of Pwanson’s own Web-scraping of about 30,000 puzzles and the addition of a separate digital collection of puzzles that has been maintained by solver Barry Haldiman since 1999. Pwanson’s database now holds nearly 52,000 crossword puzzles, and Pwanson’s website lists all the puzzle pairs that have a similarity score of at least 25 percent.
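Pwanson’s actual scoring code is in his GitHub repo; as a rough illustration of the idea only (a hypothetical metric, not his), a similarity score between two puzzles could be the fraction of clue/answer pairs they share:

```python
# Rough sketch of a puzzle-similarity score: the fraction of
# clue/answer pairs two puzzles share. Illustrates the idea only;
# this is NOT Saul Pwanson's actual metric (see his GitHub repo).

def similarity(puzzle_a, puzzle_b):
    """Each puzzle is a set of (clue, answer) pairs; return percent overlap."""
    shared = puzzle_a & puzzle_b
    smaller = min(len(puzzle_a), len(puzzle_b))
    return 100.0 * len(shared) / smaller if smaller else 0.0

times_1998 = {("Body of water", "SEA"), ("Feline", "CAT"),
              ("Not odd", "EVEN"), ("Exist", "ARE")}
universal_2005 = {("Body of water", "SEA"), ("Feline", "CAT"),
                  ("Not odd", "EVEN"), ("Large", "BIG")}

print(similarity(times_1998, universal_2005))  # 75.0
```

Any pair scoring above a threshold (25 percent on Pwanson’s site) is then flagged for human review.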

The .xd futureproof crossword format page reads in part:

.xd is a corpus-oriented format, modeled after the simplicity and intuitiveness of the markdown format. It supports 99.99% of published crosswords, and is intended to be convenient for bulk analysis of crosswords by both humans and machines, from the present and into the future.

My first thought was of mining patterns in government-redacted reports.

My second thought was that a suitable ASCII format should fit the bill: one that specifies line length in characters (to allow for varying font sizes), preserves line breaks, and represents each line as a sequence of characters, whitespace, and markouts, with every markout rendered as a single character. Yes?

Surely such a format exists now, yes? Pointers please!
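To make the idea concrete, here is a minimal sketch of such an encoding (entirely hypothetical: '#' as the markout character is my assumption) that extracts the redaction-run lengths from a line, the sort of pattern one might mine across a corpus of redacted reports:

```python
# Hypothetical sketch of the format described above: each line is a
# fixed-width run of characters, whitespace, and markouts, with every
# markout rendered as a single '#'. The redaction-run lengths are the
# kind of pattern one could then mine across a corpus.
import re

def markout_pattern(line):
    """Return the lengths of the redaction runs in a line."""
    return [len(run) for run in re.findall(r"#+", line)]

line = "The agent ###### met the source in ######## on May 4."
print(markout_pattern(line))  # [6, 8]
```

Matching run-length signatures across differently redacted releases of the same document is exactly the kind of bulk analysis the .xd format enables for crosswords.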

There are those who merit protection by redacted documents, but children are more often victimized by spy agencies than employed by them.

CVPR 2015 Papers

Sunday, June 14th, 2015

CVPR [Computer Vision and Pattern Recognition] 2015 Papers by @karpathy.

This is very cool!

From the webpage:

Below every paper are the top 100 most-occurring words in that paper; their color is based on an LDA topic model with k = 7.
(It looks like 0 = datasets?, 1 = deep learning, 2 = videos , 3 = 3D Computer Vision , 4 = optimization?, 5 = low-level Computer Vision?, 6 = descriptors?)

You can sort by LDA topics, view the PDFs, and rank the other papers by tf-idf similarity to a particular paper.

Very impressive and suggestive of other refinements for viewing a large number of papers in a given area.
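The tf-idf ranking can be sketched with nothing but the standard library (toy “papers” of my own invention, not karpathy’s pipeline, which additionally colors words by the LDA topic model):

```python
# Minimal tf-idf + cosine-similarity ranking, in the spirit of the
# "rank other papers by tf-idf similarity" feature. Toy documents;
# the real page also fits an LDA topic model with k = 7.
import math
from collections import Counter

docs = {
    "A": "deep learning for video recognition",
    "B": "deep convolutional learning networks",
    "C": "3d reconstruction from stereo images",
}

# Document frequency of each word, then a tf-idf vector per document.
df = Counter(w for d in docs.values() for w in set(d.split()))

def tfidf(doc):
    tf = Counter(doc.split())
    return {w: c * math.log(len(docs) / df[w]) for w, c in tf.items()}

vecs = {k: tfidf(d) for k, d in docs.items()}

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

# Rank the other papers by similarity to paper "A".
ranked = sorted((k for k in docs if k != "A"),
                key=lambda k: cosine(vecs["A"], vecs[k]), reverse=True)
print(ranked)  # ['B', 'C']
```

Paper "B" shares "deep learning" with "A", so it outranks "C", which shares nothing.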

Enjoy!

Making Sense of Patterns in the Twitterverse

Sunday, June 9th, 2013

Making Sense of Patterns in the Twitterverse

From the post:

If you think keeping up with what’s happening via Twitter, Facebook and other social media is like drinking from a fire hose, multiply that by 7 billion — and you’ll have a sense of what Court Corley wakes up to every morning.

Corley, a data scientist at the Department of Energy’s Pacific Northwest National Laboratory, has created a powerful digital system capable of analyzing billions of tweets and other social media messages in just seconds, in an effort to discover patterns and make sense of all the information. His social media analysis tool, dubbed “SALSA” (SociAL Sensor Analytics), combined with extensive know-how — and a fair degree of chutzpah — allows someone like Corley to try to grasp it all.

“The world is equipped with human sensors — more than 7 billion and counting. It’s by far the most extensive sensor network on the planet. What can we learn by paying attention?” Corley said.

Among the payoffs Corley envisions are emergency responders who receive crucial early information about natural disasters such as tornadoes; a tool that public health advocates can use to better protect people’s health; and information about social unrest that could help nations protect their citizens. But finding those jewels amidst the effluent of digital minutia is a challenge.

“The task we all face is separating out the trivia, the useless information we all are blasted with every day, from the really good stuff that helps us live better lives. There’s a lot of noise, but there’s some very valuable information too.”

The work by Corley and colleagues Chase Dowling, Stuart Rose and Taylor McKenzie was named best paper at the IEEE Conference on Intelligence and Security Informatics in Seattle this week.

Another one of those “name” issues, as the IEEE conference site reports:

Courtney Corley, Chase Dowling, Stuart Rose and Taylor McKenzie. SociAL Sensor Analytics: Measuring Phenomenology at Scale.

I am assuming, since the other researchers match, that this is the “Court/Courtney” in question.

I was unable to find an online copy of the paper but suspect it will eventually appear in an IEEE archive.

From the news report, very interesting and useful work.

SnoPy – SNOBOL Pattern Matching for Python

Friday, March 22nd, 2013

SnoPy – SNOBOL Pattern Matching for Python

Description:

SnoPy – A Python alternative to regular expressions. Borrowed from SNOBOL, this alternative is both easier to use and more powerful than regular expressions. NO backslashes to count.

See also: SnoPy – SNOBOL Pattern Matching for Python Web Site.

For cross-disciplinary data mining, what could be more useful than SNOBOL pattern matching?

I first saw this in a Facebook link posted by Sam Hunting.
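I can’t vouch for SnoPy’s exact API, but the SNOBOL flavor — composing named pattern objects instead of counting backslashes in a regex string — can be sketched with a few hypothetical combinators (my invention, not SnoPy’s interface):

```python
# Hypothetical sketch of SNOBOL-style pattern combinators (NOT
# SnoPy's actual API): patterns are objects composed with +, and
# matching scans the subject string, as SNOBOL does by default.

class Pattern:
    def __add__(self, other):               # concatenation, like SNOBOL juxtaposition
        return Seq(self, other)
    def match(self, subject):
        for start in range(len(subject) + 1):   # unanchored scan
            end = self.at(subject, start)
            if end is not None:
                return subject[start:end]
        return None

class Lit(Pattern):                         # a literal string
    def __init__(self, text): self.text = text
    def at(self, s, i):
        return i + len(self.text) if s.startswith(self.text, i) else None

class Span(Pattern):                        # SNOBOL SPAN(): 1+ chars from a set
    def __init__(self, chars): self.chars = set(chars)
    def at(self, s, i):
        j = i
        while j < len(s) and s[j] in self.chars:
            j += 1
        return j if j > i else None

class Seq(Pattern):                         # one pattern followed by another
    def __init__(self, a, b): self.a, self.b = a, b
    def at(self, s, i):
        mid = self.a.at(s, i)
        return self.b.at(s, mid) if mid is not None else None

digits = Span("0123456789")
pattern = Lit("page ") + digits
print(pattern.match("see page 42 for details"))  # 'page 42'
```

The payoff is readability: `Lit("page ") + Span("0123456789")` says what it matches, with no escaping at all.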

TXR: a Pattern Matching Language (Not Just)….

Sunday, January 22nd, 2012

TXR: a Pattern Matching Language (Not Just) for Convenient Text Extraction

From the webpage:

TXR (“texer” or “tee ex ar”) is a new and growing language oriented toward processing text, packaged as a utility (the txr command) that runs in POSIX environments and on Microsoft Windows.

Working with TXR is different from most text processing programming languages. Constructs in TXR aren’t imperative statements, but rather pattern-matching directives: each construct terminates by matching, failing, or throwing an exception. Searching and backtracking behaviors are implicit.

The development of TXR began when I needed a utility to be used in shell programming which would reverse the action of a “here-document”. Here-documents are a programming language feature for generating multi-line text from a boiler-plate template which contains variables to be substituted, and possibly other constructs such as various functions that generate text. Here-documents appeared in the Unix shell decades ago, but most of today’s web is basically a form of here-document, because all non-trivial websites generate HTML dynamically, substituting variable material into templates on the fly. Well, in the given situation I was programming in, I didn’t want here documents as much as “there documents”: I wanted to write a template of text containing variables, but not to generate text but to do the reverse: match the template against existing text which is similar to it, and capture pieces of it into variables. So I developed a utility to do just that: capture these variables from a template, and then generate a set of variable assignments that could be eval-ed in a shell script.

That was sometime in the middle of 2009. Since then TXR has become a lot more powerful. It has features like structured named blocks with nonlocal exits, structured exception handling, pattern matching functions, and numerous other features. TXR is powerful enough to parse grammars, yet simple to use on trivial tasks.

For things that can’t be easily done in the pattern matching language, TXR has a built-in Lisp dialect, which supports goodies like first class functions with closures over lexical environments, I/O (including string and list streams), hash tables with optional weak semantics, and arithmetic with arbitrary precision (“bignum”) integers.

A powerful tool for text extraction/manipulation.
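The “there document” idea — matching a variable-bearing template against text and capturing the variables — can be sketched in a few lines of Python (this mimics the concept only, not TXR’s actual syntax or engine):

```python
# A toy "there document": turn a template with @variables into a
# regex and capture the variable bindings from matching text. This
# illustrates the concept only; TXR's real language is far more
# powerful (backtracking, exceptions, functions, a Lisp dialect).
# Assumes Python 3.7+, where re.escape leaves '@' unescaped.
import re

def there_doc(template, text):
    # Escape regex metacharacters in the literal parts, then turn
    # each @name placeholder into a named capture group.
    pattern = re.sub(r"@(\w+)", r"(?P<\1>.+?)", re.escape(template))
    m = re.fullmatch(pattern, text)
    return m.groupdict() if m else None

template = "User @user logged in from @host"
print(there_doc(template, "User alice logged in from 10.0.0.7"))
# {'user': 'alice', 'host': '10.0.0.7'}
```

The output is exactly the set of shell-style variable assignments the quoted passage describes generating for eval.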

Tech Survivors: Geek Technologies…

Thursday, October 20th, 2011

Tech Survivors: Geek Technologies That Still Survive 25 to 50 Years Later

Simply awesome!

Useful review for a couple of reasons:

First: New languages, formats, etc., will emerge, but legacy systems “….will be with you always.” (Or at least it will feel that way, so being able to interface with legacy systems (understanding their semantics) is going to be important for a very long time.)

Second: What was it about these technologies that made them succeed? (I don’t have the answer or I would be at the USPTO filing every patent and variant of patent that I could think of. 😉 It is clearly non-obvious because no one else is there either.) Equally long-lived technologies are with us today; we just don’t know which ones.

Would not hurt to put this on your calendar to review every year or so. The more you know about new technologies, the more likely you are to spot a resemblance to, or a pattern that matches, one of these technologies. Maybe.

Tying up the loose ends in fully LZW-compressed pattern matching

Tuesday, September 27th, 2011

Tying up the loose ends in fully LZW-compressed pattern matching by Pawel Gawrychowski.

Abstract:

We consider a natural generalization of the classical pattern matching problem: given compressed representations of a pattern p[1..M] and a text t[1..N] of sizes m and n, respectively, does p occur in t? We develop an optimal linear time solution for the case when both p and t are compressed using the LZW method. This improves the previously known O((n+m)log(n+m)) time solution of Gasieniec and Rytter, and essentially closes the line of research devoted to studying LZW-compressed exact pattern matching.

I don’t know of any topic maps yet large enough that queries against them need to take advantage of compressed queries against compressed data, but this paper outlines “an optimal linear time solution … for fully LZW-compressed pattern matching.” (page 2)

I suspect it may be more relevant to data mining prior to the construction of a topic map. But in either case, when needed it will be a welcome solution.
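For intuition about what “fully LZW-compressed” means, here is a plain textbook LZW compressor (the classical algorithm, not the paper’s matching method, which runs directly on such codes without decompressing):

```python
# Classical LZW compression (textbook version, not the paper's
# algorithm): the dictionary starts with all single bytes and grows
# by one phrase per output code. The paper's contribution is that
# exact pattern matching can run on these codes in linear time.

def lzw_compress(text):
    dictionary = {chr(i): i for i in range(256)}
    phrase, codes = "", []
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch                        # extend the current phrase
        else:
            codes.append(dictionary[phrase])    # emit the longest known phrase
            dictionary[phrase + ch] = len(dictionary)
            phrase = ch
    if phrase:
        codes.append(dictionary[phrase])
    return codes

print(lzw_compress("abababab"))  # [97, 98, 256, 258, 98]
```

Repetitive input collapses quickly — code 256 is "ab", 258 is "aba" — which is why both texts and patterns compress well enough to make compressed-domain matching worthwhile.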

Faster Approximate Pattern Matching in Compressed Repetitive Texts

Saturday, September 17th, 2011

Faster Approximate Pattern Matching in Compressed Repetitive Texts by Travis Gagie, Pawel Gawrychowski, and Simon J. Puglisi.

Abstract:

Motivated by the imminent growth of massive, highly redundant genomic databases, we study the problem of compressing a string database while simultaneously supporting fast random access, substring extraction and pattern matching to the underlying string(s). Bille et al. (2011) recently showed how, given a straight-line program with r rules for a string s of length n, we can build an O(r)-word data structure that allows us to extract any substring s[i..j] in O(log n + j − i) time. They also showed how, given a pattern p of length m and an edit distance k ≤ m, their data structure supports finding all occ approximate matches to p in s in O(r(min(mk, k^4 + m) + log n) + occ) time. Rytter (2003) and Charikar et al. (2005) showed that r is always at least the number z of phrases in the LZ77 parse of s, and gave algorithms for building straight-line programs with O(z log n) rules. In this paper we give a simple O(z log n)-word data structure that takes the same time for substring extraction but only O(z(min(mk, k^4 + m)) + occ) time for approximate pattern matching.

It occurs to me that this could be useful for bioinformatic applications of topic maps that map between data sets and literature.

Interesting that the authors mention redundancy in web crawls. I suspect there are a number of domains with highly repetitive data, if we choose to look at them that way. Markup documents, for example.
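As a baseline for what the paper speeds up: finding all approximate occurrences of p within edit distance k can be done naively in O(mn) time with dynamic programming on the uncompressed text (a sketch of the classical algorithm; the paper answers the same query on the compressed representation):

```python
# Naive O(m*n) approximate matching on *uncompressed* text, for
# contrast with the paper's compressed-domain result. Returns the
# 1-based end positions in s where some substring matches p within
# edit distance k (classical Sellers-style dynamic programming).

def approx_matches(p, s, k):
    m = len(p)
    prev = list(range(m + 1))        # column for the empty prefix of s
    ends = []
    for j, ch in enumerate(s, 1):
        curr = [0]                   # a match may start anywhere, so row 0 costs 0
        for i in range(1, m + 1):
            cost = 0 if p[i - 1] == ch else 1
            curr.append(min(prev[i] + 1,          # deletion
                            curr[i - 1] + 1,      # insertion
                            prev[i - 1] + cost))  # substitution / match
        if curr[m] <= k:
            ends.append(j)
        prev = curr
    return ends

print(approx_matches("abc", "xaabcx", 1))  # [4, 5, 6]
```

Positions 4, 5 and 6 correspond to the substrings "ab", "abc" and "abcx", each within one edit of the pattern.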

Mining Associations and Patterns from Semantic Data

Friday, September 2nd, 2011

The editors of a special issue of the International Journal on Semantic Web and Information Systems on Mining Associations and Patterns from Semantic Data have issued the following call for papers:

Guest editors: Kemafor Anyanwu, Ying Ding, Jie Tang, and Philip Yu

Large amounts of semantic data are being generated through semantic extraction from, and annotation of, traditional Web, social, and sensor data. Linked Open Data has provided an excellent vehicle for representing and sharing such data. The primary way to make semantics useful for better integration, search, and decision making is to find interesting relationships or associations, expressed as meaningful paths, subgraphs, and patterns. This special issue seeks theories, algorithms, and applications for extracting such semantic relationships from large amounts of semantic data. Example topics include:

  • Theories to ground associations and patterns with social, socioeconomic, biological semantics
  • Representation (e.g. language extensions) to express meaningful relationships and patterns
  • Algorithms to efficiently compute and mine semantic associations and patterns
  • Techniques for filtering, ranking and/or visualization of semantic associations and patterns
  • Application of semantic associations and patterns in a domain with significant social or society impact

IJSWIS is included in most major indices including CSI, with Thomson Scientific impact factor 2.345. We seek high quality manuscripts suitable for an archival journal based on original research. If the manuscript is based on a prior workshop or conference submission, submissions should reflect significant novel contribution/extension in conceptual terms and/or scale of implementation and evaluation (authors are highly encouraged to clarify new contributions in a cover letter or within the submission).

Important Dates:
Submission of full papers: Feb 29, 2012
Notification of paper acceptance: May 30, 2012
Publication target: 3Q 2012

Details of the journal, manuscript preparation, and recent articles are available on the website:
http://www.igi-global.com/bookstore/titledetails.aspx?titleid=1092 or http://ijswis.org

Guest Editors: Prof. Kemafor Anyanwu, North Carolina State University
Prof. Ying Ding, Indiana University
Prof. Jie Tang, Tsinghua University
Prof. Philip Yu, University of Illinois, Chicago
Contact Guest Editor: Ying Ding <dingying@indiana.edu>

Pattern Matching & Predicate Dispatch

Friday, August 19th, 2011

Pattern Matching & Predicate Dispatch by David Nolen, NYC Clojure Users Group 8/17/2011.

From the description:

From David:

“Pattern matching is a powerful tool for processing data based on type and structure. In this talk I’ll discuss a new library I’ve been working on that provides optimized, extensible pattern matching to Clojure. We’ll cover the theory, the implementation, and future directions for this library, in particular the true goal – predicate dispatch.”

David Nolen is the primary developer of the Clojure contrib library core.logic – an efficient Prolog-like logic engine. The pattern matching library continues his investigations into the relationships between object-oriented, functional, and logic programming.

Slides: http://www.scribd.com/doc/62571669/Patterns

Source: https://github.com/swannodette/match
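Predicate dispatch — selecting an implementation by arbitrary predicates over the arguments, generalizing both type-based and pattern-based dispatch — can be sketched as a toy Python decorator (illustrative only; nothing to do with the internals of Nolen’s library):

```python
# Toy predicate dispatch (illustrative only, not David Nolen's
# library): each implementation registers a guard predicate, and
# the first implementation whose predicate accepts the argument
# runs; otherwise the default body is used.

def predicate_dispatch(default):
    registry = []
    def dispatcher(arg):
        for pred, impl in registry:
            if pred(arg):
                return impl(arg)
        return default(arg)
    def when(pred):
        def register(impl):
            registry.append((pred, impl))
            return impl
        return register
    dispatcher.when = when
    return dispatcher

@predicate_dispatch
def describe(x):
    return "something else"

@describe.when(lambda x: isinstance(x, int) and x % 2 == 0)
def _even(x):
    return f"even number {x}"

@describe.when(lambda x: isinstance(x, str))
def _string(x):
    return f"string of length {len(x)}"

print(describe(4))      # 'even number 4'
print(describe("abc"))  # 'string of length 3'
print(describe(3.5))    # 'something else'
```

Type dispatch and structural pattern matching both fall out as special cases: a type test and a destructuring test are just particular predicates.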

Playing with Scala’s pattern matching – Post

Wednesday, February 16th, 2011

Playing with Scala’s pattern matching

François Sarradin writes:

How many times have you been stuck in frustration because you were unable to use strings as entries in switch-case statements? Such an ability would be really useful, for example, to analyze the arguments of your application, to parse a file, or to process any content of a string. Meanwhile, you have to write a series of if-else-if statements (and this is annoying). Another solution is to use a hash map, where the keys are those strings and the values are the associated reified processes, for example a Runnable or a Callable in Java (but this is not really natural, long to develop, and boring too).

If a switch-case statement that accepts strings as entries would be a revolution for you, Scala’s pattern matching says that this is not enough! Indeed, there are other cases where a series of if-else-if statements can be elegantly transformed into a look-alike switch-case statement. For example, it would be really nice to simplify a series of instanceof checks and casts inside if-else-if to execute the right process according to the type of a parameter.

In this post, we see the power of Scala’s pattern matching in different use cases.

Which language you choose for topic map development will depend in part on its pattern matching abilities.

Here’s a chance to evaluate Scala in that regard.