Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

March 5, 2016

Data Mining Patterns in Crossword Puzzles [Patterns in Redaction?]

Filed under: Crossword Puzzle,Data Mining,Pattern Matching,Pattern Recognition,Security — Patrick Durusau @ 12:06 pm

A Plagiarism Scandal Is Unfolding In The Crossword World by Oliver Roeder.

From the post:

A group of eagle-eyed puzzlers, using digital tools, has uncovered a pattern of copying in the professional crossword-puzzle world that has led to accusations of plagiarism and false identity.

Since 1999, Timothy Parker, editor of one of the nation’s most widely syndicated crosswords, has edited more than 60 individual puzzles that copy elements from New York Times puzzles, often with pseudonyms for bylines, a new database has helped reveal. The puzzles in question repeated themes, answers, grids and clues from Times puzzles published years earlier. Hundreds more of the puzzles edited by Parker are nearly verbatim copies of previous puzzles that Parker also edited. Most of those have been republished under fake author names.

Nearly all this replication was found in two crosswords series edited by Parker: the USA Today Crossword and the syndicated Universal Crossword. (The copyright to both puzzles is held by Universal Uclick, which grew out of the former Universal Press Syndicate and calls itself “the leading distributor of daily puzzle and word games.”) USA Today is one of the country’s highest-circulation newspapers, and the Universal Crossword is syndicated to hundreds of newspapers and websites.

On Friday, a publicity coordinator for Universal Uclick, Julie Halper, said the company declined to comment on the allegations. FiveThirtyEight reached out to USA Today for comment several times but received no response.

Oliver does a great job setting up the background on crossword puzzles and exploring the data that underlies this story. A must-read if you are interested in crossword puzzles or know someone who is.

I was more taken with “how” the patterns were mined, which Oliver also covers:


Tausig discovered this with the help of the newly assembled database of crossword puzzles created by Saul Pwanson [Pwanson changed his legal name from Paul Swanson], a software engineer. Pwanson wrote the code that identified the similar puzzles and published a list of them on his website, along with code for the project on GitHub. The puzzle database is the result of Pwanson’s own Web-scraping of about 30,000 puzzles and the addition of a separate digital collection of puzzles that has been maintained by solver Barry Haldiman since 1999. Pwanson’s database now holds nearly 52,000 crossword puzzles, and Pwanson’s website lists all the puzzle pairs that have a similarity score of at least 25 percent.
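The excerpt does not say how the similarity score is computed, so the following is only a rough sketch of the general idea, not Pwanson's method. It assumes each grid is a list of equal-length strings with `#` for black squares, and it scores a pair by the fraction of cells that hold the same letter. The names `grid_similarity` and `similar_pairs` are hypothetical; only the 25 percent threshold comes from the article.

```python
from itertools import combinations


def grid_similarity(a, b):
    """Fraction of cells with the same letter in both grids.

    Grids are lists of strings; '#' marks a black square (an assumption).
    Grids of different shapes score 0.0.
    """
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        return 0.0
    total = same = 0
    for row_a, row_b in zip(a, b):
        for cell_a, cell_b in zip(row_a, row_b):
            if cell_a == "#" and cell_b == "#":
                continue  # shared black squares carry no letter to compare
            total += 1
            if cell_a == cell_b:
                same += 1
    return same / total if total else 0.0


def similar_pairs(puzzles, threshold=0.25):
    """Return (id_a, id_b, score) for every pair scoring at least threshold."""
    hits = []
    for (id_a, grid_a), (id_b, grid_b) in combinations(puzzles.items(), 2):
        score = grid_similarity(grid_a, grid_b)
        if score >= threshold:
            hits.append((id_a, id_b, score))
    return hits


if __name__ == "__main__":
    puzzles = {
        "p1": ["CAT#", "ODOR"],
        "p2": ["CAT#", "ODES"],
        "p3": ["XXXX", "YYYY"],
    }
    for id_a, id_b, score in similar_pairs(puzzles):
        print(id_a, id_b, round(score, 3))
```

This compares every pair directly, which is quadratic in the number of puzzles. A corpus of tens of thousands of puzzles would likely need some indexing or pre-filtering before a cell-by-cell comparison.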

The .xd futureproof crossword format page reads in part:

.xd is a corpus-oriented format, modeled after the simplicity and intuitiveness of the markdown format. It supports 99.99% of published crosswords, and is intended to be convenient for bulk analysis of crosswords by both humans and machines, from the present and into the future.

My first thought was of mining patterns in government redacted reports.

My second thought was that an ASCII format should fit the bill: one that specifies line length in characters (to allow for varying font sizes), plus line breaks, with each line composed of characters, whitespace and markouts, each markout counted as a single character. Yes?

Surely such a format exists now, yes? Pointers please!
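To make the idea concrete, here is a minimal sketch of what reading such a format might look like. Everything about the layout is an assumption made for illustration: a first line of the form `width=N` declaring the line length in characters, followed by body lines in which a single `#` stands for one redacted character. It is not an existing format, and the function names are hypothetical.

```python
import re

MARKOUT = "#"  # hypothetical: one '#' stands for one redacted character


def parse_redacted(text):
    """Parse a 'width=N' header followed by body lines; return (width, lines)."""
    header, *lines = text.splitlines()
    if not header.startswith("width="):
        raise ValueError("missing width header")
    width = int(header[len("width="):])
    for number, line in enumerate(lines, start=1):
        if len(line) > width:
            raise ValueError(f"line {number} exceeds declared width {width}")
    return width, lines


def redaction_runs(lines):
    """Return (line_number, start_column, length) for each run of markouts."""
    pattern = re.compile(re.escape(MARKOUT) + "+")
    runs = []
    for number, line in enumerate(lines, start=1):
        for match in pattern.finditer(line):
            runs.append((number, match.start(), match.end() - match.start()))
    return runs


if __name__ == "__main__":
    sample = "width=30\nAgent ##### met with\n####### in Vienna."
    width, lines = parse_redacted(sample)
    for run in redaction_runs(lines):
        print(run)
```

With redactions reduced to position and length, the same kind of bulk comparison applied to crossword grids could in principle be run over a corpus of redacted reports, for example to look for recurring run lengths in similar contexts.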

There are those who merit protection by redacted documents, but children are more often victimized by spy agencies than employed by them.
