Sony Breach Result of Self Abuse

December 17th, 2014

In Sony Pictures Demands That News Agencies Delete ‘Stolen’ Data I wrote in part:

The bitching and catching by Sony are sure signs that something went terribly wrong internally. The current circus is an attempt to distract the public from that failure. Probably a member of management with highly inappropriate security clearance because “…they are important!”

Inappropriate security clearances for management to networks are a sign of poor systems administration. I wonder when that shoe is going to drop? (emphasis added)

The other shoe dropping did not take long! Later that same day, Sony employees filed a suit largely to the same effect: Sony employees file lawsuit, blame company over hacked data by Jeff John Roberts.

Jeff writes in part:

They accuse Sony of negligence for failing to secure its network, and not taking adequate steps to protect employees once the company knew the information was compromised.

The complaint also cites various security and news reports to say that Sony lost the cryptographic “keys to the kingdom,” which allowed the hackers to root around in its system undetected for as long as a year.

That is the other reason for the obsession with secrecy in the computer security business. The management that signs the checks for security contractors is the same management that is responsible for the security breaches.

Honest security reporting (which does happen) bites the hand that feeds it.

Tracking Government/Terrorist Financing

December 17th, 2014

Deep Learning Intelligence Platform – Addressing the KYC AML Terrorism Financing Challenge Dr. Jerry A. Smith.

From the post:

Terrorism impacts our lives each and every day; whether directly through acts of violence by terrorists, reduced liberties from new anti-terrorism laws, or increased taxes to support counter terrorism activities. A vital component of terrorism is the means through which these activities are financed, through legal and illicit financial activities. Recognizing the necessity to limit these financial activities in order to reduce terrorism, many nation states have agreed to a framework of global regulations, some of which have been realized through regulatory programs such as the Bank Secrecy Act (BSA).

As part of the BSA (and other similar regulations), governed financial services institutions are required to determine if the financial transactions of a person or entity are related to financing terrorism. This is a specific report requirement found in Response 30, of Section 2, in the FinCEN Suspicious Activity Report (SAR). For every financial transaction moving through a given banking system, the institution needs to determine if it is suspicious and, if so, whether it is part of a larger terrorist activity. In the event that it is, the financial services institution is required to immediately file a SAR and call FinCEN.

The process of determining if a financial transaction is terrorism related is not merely a compliance issue, but a national security imperative. No solution exists today that adequately addresses this requirement. As such, I was asked to speak on the issue as a data scientist practicing in the private intelligence community. These are some of the relevant points from that discussion.

Jerry has a great outline of the capabilities you will need for tracking government/terrorist financing. Depending upon your client’s interest, you may be required to monitor data flows in order to trigger the filing of a SAR and calling FinCEN or to avoid triggering the filing of a SAR and calling FinCEN. For either goal the tools and techniques are largely the same.

Or for monitoring government funding for torture or groups to carry out atrocities on its behalf. Same data mining techniques apply.

Have you ever noticed that government data leaks rarely involve financial records? Think of the consequences of a leaked accounts payable ledger listing all the organizations and people paid by the Bush administration, sans all the Social Security and retirement recipients.

That would be near the top of my most wanted data leaks list.


Apache Spark I & II [Pacific Northwest Scala 2014]

December 16th, 2014

Apache Spark I: From Scala Collections to Fast Interactive Big Data with Spark by Evan Chan.


This session introduces you to Spark by starting with something basic: Scala collections and functional data transforms. We then look at how Spark expands the functional collection concept to enable massively distributed, fast computations. The second half of the talk is for those of you who want to know the secrets to make Spark really fly for querying tabular datasets. We will dive into row vs columnar datastores and the facilities that Spark has for enabling interactive data analysis, including Spark SQL and the in-memory columnar cache. Learn why Scala’s functional collections are the best foundation for working with data!

Apache Spark II: Streaming Big Data Analytics with Team Apache, Scala & Akka by Helena Edelson.


In this talk we will step into Spark over Cassandra with Spark Streaming and Kafka. Then put it in the context of an event-driven Akka application for real-time delivery of meaning at high velocity. We will do this by showing how to easily integrate Apache Spark and Spark Streaming with Apache Cassandra and Apache Kafka using the Spark Cassandra Connector. All within a common use case: working with time-series data, which Cassandra excels at for data locality and speed.

Back to back excellent presentations on Spark!

I need to replace my second monitor (died last week) so I can run the video at full screen with a REPL open!


Cartography with complex survey data

December 16th, 2014

Cartography with complex survey data by David Smith.

From the post:

Visualizing complex survey data is something of an art. If the data has been collected and aggregated to geographic units (say, counties or states), a choropleth is one option. But if the data aren't so neatly arranged, making visual sense often requires some form of smoothing to represent it on a map. 

R, of course, has a number of features and packages to help you, not least the survey package and the various mapping tools. Swmap (short for "survey-weighted maps") is a collection of R scripts that visualize some public data sets, for example this cartogram of transportation share of household spending based on data from the 2012-2013 Consumer Expenditure Survey.


In addition to finding data, there is also the problem of finding tools to process found data.

Ideally, when I follow a link to a resource, that link would also be submitted to a repository of other things associated with the data set I am requesting, such as the current locations of its authors, tools for processing the data, articles written using the data, etc.

That’s a long ways off but at least today you can record having found one more cache of tools for data processing.

Type systems and logic

December 16th, 2014

Type systems and logic by Alyssa Carter (From Code Word – Hacker School)

From the post:

An important result in computer science and type theory is that a type system corresponds to a particular logic system.

How does this work? The basic idea is that of the Curry-Howard Correspondence. A type is interpreted as a proposition, and a value is interpreted as a proof of the proposition corresponding to its type. Most standard logical connectives can be derived from this idea: for example, the values of the pair type (A, B) are pairs of values of types A and B, meaning they’re pairs of proofs of A and B, which means that (A, B) represents the logical conjunction “A && B”. Similarly, logical disjunction (“A || B”) corresponds to what’s called a “tagged union” type: a value (proof) of Either A B is either a value (proof) of A or a value (proof) of B.

This might be a lot to take in, so let’s take a few moments for concrete perspective.

Types like Int and String are propositions – you can think of simple types like these as just stating that “an Int exists” or “a String exists”. 1 is a proof of Int, and "hands" is a proof of String. (Int, String) is a simple tuple type, stating that “there exists an Int and there exists a String”. (1, "hands") is a proof of (Int, String). Finally, the Either type is a bit more mysterious if you aren’t familiar with Haskell, but the type Either a b can contain values of type a tagged as the “left” side of an Either or values of type b tagged as the “right” side of an Either. So Either Int String means “either there exists an Int or there exists a String”, and it can be proved by either Left 1 or Right "hands". The tags ensure that you don’t lose any information if the two types are the same: Either Int Int can be proved by Left 1 or Right 1, which can be distinguished from each other by their tags.
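For readers without Haskell at hand, the pair and tagged-union readings above can be sketched in Python. Treat this as an illustration under assumptions, not as anything from Alyssa's post: Left, Right, first and describe are hypothetical names standing in for Haskell's Either constructors and pattern matching, and Python type hints are not enforced at runtime, so these "proofs" are only as strong as whatever external type checker you run.

```python
from dataclasses import dataclass
from typing import Generic, Tuple, TypeVar, Union

A = TypeVar("A")
B = TypeVar("B")


@dataclass(frozen=True)
class Left(Generic[A]):
    """A value tagged as the left side: a proof of the first proposition."""
    value: A


@dataclass(frozen=True)
class Right(Generic[B]):
    """A value tagged as the right side: a proof of the second proposition."""
    value: B


# Either[A, B] plays the role of "A or B"; Tuple[A, B] plays the role of "A and B".
Either = Union[Left[A], Right[B]]

# A proof of (Int, String): a pair holding a proof of each component.
pair_proof: Tuple[int, str] = (1, "hands")

# Two different proofs of Either Int String.
left_proof: Either[int, str] = Left(1)
right_proof: Either[int, str] = Right("hands")


def first(pair: Tuple[A, B]) -> A:
    """From a proof of "A and B" we can always recover a proof of A."""
    return pair[0]


def describe(e) -> str:
    """Case analysis on a tagged union: the tag says which side was proved."""
    if isinstance(e, Left):
        return f"left proof: {e.value!r}"
    return f"right proof: {e.value!r}"
```

Because the tag is part of the value, Left(1) and Right(1) compare as different, which is the "you don't lose any information" point made about Either Int Int.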

Heavy sledding but should very much be on your reading list.

It has gems like:

truth is useless for computation and proofs are not

I would have far fewer objections to some logic/ontology discussions if they limited their claims to computation.

People are free to accept or reject any result of computation. Depends on their comparison of the result to their perception of the world.

Case in point, the five year old who could not board a plane because they shared a name with someone on the no-fly list.

One person, a dull TSA agent, could not see beyond the result of a calculation on the screen.

Everyone else could see a five year old who, while cranky, wasn’t on the no-fly list.

I first saw this in a tweet by Rahul Goma Phulore.


December 16th, 2014

Slooh I want to be an astronaut astronomer.

From the webpage:

Robotic control of Slooh’s three telescopes in the northern (Canary Islands) and southern hemispheres (Chile)

Schedule time and point the telescopes at any object in the night sky. You can make up to five reservations at a time in five or ten minute increments depending on the observatory. There are no limitations on the total number of reservations you can book in any quarter.

Capture, collect, and share images, including PNG and FITS files. You can view and take images from any of the 250+ “missions” per night, including those scheduled by other members.

Watch hundreds of hours of live and recorded space shows with expert narration featuring 10+ years of magical moments in the night sky including eclipses, transits, solar flares, NEA, comets, and more.

See and discuss highlights from the telescopes, featuring member research, discoveries, animations, and more.

Join groups with experts and fellow citizen astronomers to learn and discuss within areas of interest, from astrophotography and tracking asteroids to exoplanets and life in the Universe.

Access Slooh activities with step by step how-to instructions to master the art and science of astronomy.

A reminder that for all the grim data that is available for analysis/mining, there is an equal share of interesting and/or beautiful data as well.

There is a special on right now: for $1.00 you can obtain four (4) weeks of membership. The fine print says every yearly quarter of membership is $74.85. $74.85 / 3 = $24.95 per month, or $299.40 per year. Less than cable and/or cellphone service. It also has the advantage of not making you dumber. Surprised they didn’t mention that.

I first saw this in a tweet by Michael Peter Edson.

UX Newsletter

December 16th, 2014

Our New Ebook: The UX Reader

From the post:

This week, MailChimp published its first ebook, The UX Reader. I could just tell you that it features revised and updated pieces from our UX Newsletter, that you can download it here for $5, and that all proceeds go to RailsBridge. But instead, I’m hearing the voice of Mrs. McLogan, my high school physics teacher:

“Look, I know you’ve figured out the answer, but I want you to show your work.”

Just typing those words makes me sweat—I still get nervous when I’m asked to show how to solve a problem, even if I’m confident in the solution. But I always learn new things and get valuable feedback whenever I do.

So today I want to show you the work of putting together The UX Reader and talk more about the problem it helped us solve.

After you read this post, you too will be a subscriber to the UX Newsletter. Not to mention having a copy of the updated book, The UX Reader.

Worth the time to read and put into practice what it reports.

Or as I told an old friend earlier today:

The greatest technology/paradigm without use is only interesting, not compelling or game changing.

Melville House to Publish CIA Torture Report:… [Publishing Gone Awry?]

December 16th, 2014

Melville House to Publish CIA Torture Report: An Interview with Publisher Dennis Johnson by Jonathon Sturgeon.

From the post:

In what must be considered a watershed moment in contemporary publishing, Brooklyn-based independent publisher Melville House will release the Senate Intelligence Committee’s executive summary of a government report — “Study of the Central Intelligence Agency’s Detention and Interrogation Program” — that is said to detail the monstrous torture methods employed by the Central Intelligence Agency in its counter-terrorism efforts.

Melville House’s co-publisher and co-founder Dennis Johnson has called the report “probably the most important government document of our generation, even one of the most significant in the history of our democracy.”

Melville House’s press release confirms that they are releasing both print and digital editions on December 30, 2014.

As of December 30, 2014, I can read and mark my copy, print or digital and you can mark your copy, print or digital, but no collaboration on the torture report.

For the “…most significant [document] in the history of our democracy” that seems rather sad. Each of us is going to be limited to whatever we know or can find out when we are reading our separate copies of the same report.

If there was ever a report (and there have been others) that merited a collaborative reading/annotation, the CIA Torture Report would be one of them.

Given the large number of people who worked on this report and the diverse knowledge required to evaluate it, that sounds like a bad publishing choice. Or at least, there are better publishing choices available.

What about casting the entire report into the form of wiki pages, broken down by paragraphs? Once proofed, the original text can be locked and comments only allowed on the text. Free to view but $fee to comment.

What do you think? Viable way to present such a text? Other ways to host the text?

PS: Unlike other significant government reports, major publishing houses did not receive incentives to print the report. Johnson attributes that to Dianne Feinstein not wanting to favor any particular publisher. That’s one explanation. Another would be that if published in hard copy at all, a small press will mean it fades more quickly from public view. Your call.

Graph data from MySQL database in Python

December 16th, 2014

Graph data from MySQL database in Python

From the webpage:

All Python code for this tutorial is available online in this IPython notebook.

Thinking of using Plotly at your company? See Plotly’s on-premise, Plotly Enterprise options.

Note on operating systems: While this tutorial can be followed by Windows or Mac users, it assumes an Ubuntu operating system (Ubuntu Desktop or Ubuntu Server). If you don’t have an Ubuntu server, it’s possible to set up a cloud one with Amazon Web Services (follow the first half of this tutorial). If you’re using a Mac, we recommend purchasing and downloading VMware Fusion, then installing Ubuntu Desktop through that. You can also purchase an inexpensive laptop or physical server from Zareason, with Ubuntu Desktop or Ubuntu Server preinstalled.

Reading data from a MySQL database and graphing it in Python is straightforward, and all the tools that you need are free and online. This post shows you how. If you have questions or get stuck, email, write in the comments below, or tweet to @plotlygraphs.

Just in case you want to start on adding a job skill over the holidays!
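To give a feel for the "read rows, then graph them" flow without installing anything, here is a minimal sketch. It is not the tutorial's code. As an assumption, the standard library's sqlite3 module stands in for MySQL so the example runs without a server, and the table, columns and values are hypothetical. A MySQL driver follows the same connect / cursor / execute / fetchall pattern, though MySQL drivers generally use %s placeholders rather than the ? shown here.

```python
import sqlite3

# ASSUMPTION: sqlite3 stands in for a MySQL connection; the table name,
# column names and values below are made up for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE country (name TEXT, population INTEGER)")
cur.executemany(
    "INSERT INTO country VALUES (?, ?)",
    [("Alpha", 300), ("Beta", 100), ("Gamma", 200)],
)
conn.commit()

# Read the data back, largest first.
cur.execute("SELECT name, population FROM country ORDER BY population DESC")
rows = cur.fetchall()
conn.close()

# Split the rows into the two sequences a plotting library expects
# for the x axis and the y axis of a bar chart.
names = [row[0] for row in rows]
populations = [row[1] for row in rows]

for name, population in zip(names, populations):
    print(f"{name}: {population}")
```

The names and populations lists are what you would hand to the bar-chart call of whichever plotting library you choose.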

Whenever I see “graph” used in this sense, I wish it were some appropriate form of “visualize.” Unfortunately, “graphing” of data stuck too long ago to expect anyone to change now. To be fair, it is marking nodes on an edge, except that we treat all the space on one side or the other of the edge as significant.

Perhaps someone has treated the “curve” of a graph as a hyperedge? Connecting multiple nodes? I don’t know. You?

Whether they have or haven’t, I will continue to think of this type of “graphing” as visualization. Very useful but not the same thing as graphs with nodes/edges, etc.

Warning: Verizon Scam – Secure Cypher

December 16th, 2014

Scams during the holiday season are nothing new but this latest scam has a “…man bites dog” quality to it.

The scam in this case is being run by the vendor offering the service: Verizon.

Karl Bode writes in: Verizon Offers Encrypted Calling With NSA Backdoor At No Additional Charge:

Verizon’s marketing materials for the service feature young, hip, privacy-conscious users enjoying the “industry’s most secure voice communication” platform:


Verizon says it’s initially pitching the $45 per phone service to government agencies and corporations, but would ultimately love to offer it to consumers as a line item on your bill. Of course by “end-to-end encryption,” Verizon means that the new $45 per phone service includes an embedded NSA backdoor free of charge. Apparently, in Verizon-land, “end-to-end encryption” means something entirely different than it does in the real world:

“Cellcrypt and Verizon both say that law enforcement agencies will be able to access communications that take place over Voice Cypher, so long as they’re able to prove that there’s a legitimate law enforcement reason for doing so. Seth Polansky, Cellcrypt’s vice president for North America, disputes the idea that building technology to allow wiretapping is a security risk. “It’s only creating a weakness for government agencies,” he says. “Just because a government access option exists, it doesn’t mean other companies can access it.”


What do you think? Is the added * Includes Free NSA Backdoor sufficient notice to consumers?

I am more than willing to donate my rights to this image to Verizon for advertising purposes. Perhaps you should forward a copy to them and your friends on Verizon.


December 16th, 2014

LT-Accelerate: LT-Accelerate is a conference designed to help businesses, researchers and public administrations discover business value via Language Technology.

From the about page:

LT-Accelerate is a joint production of LT-Innovate, the European Association of the Language Technology Industry, and Alta Plana Corporation, a Washington DC based strategy consultancy headed by analyst Seth Grimes.

The conference was held December 4-5, 2014 in Brussels. The website reports seven (7) interviews with key speakers and slides from thirty-eight speakers.

Not as in depth as papers nor as useful as videos of the presentations but still capable of sparking new ideas as you review the slides.

For example, the slides from Multi-Dimensional Sentiment Analysis by Stephen Pulman made me wonder what sentiment detection design would be appropriate for the Michael Brown grand jury transcripts?

Sentiment detection has been successfully used with tweets (140 character limit) and I am reliably informed that most of the text strings in the Michael Brown grand jury transcript are far longer than one hundred and forty (140) characters. ;-)

Any sentiment detectives in the audience?

US Congress OKs ‘unprecedented’ codification of warrantless surveillance

December 16th, 2014

US Congress OKs ‘unprecedented’ codification of warrantless surveillance by Lisa Vaas.

From the post:

Congress last week quietly passed a bill to reauthorize funding for intelligence agencies, over objections that it gives the government “virtually unlimited access to the communications of every American”, without warrant, and allows for indefinite storage of some intercepted material, including anything that’s “enciphered”.

That’s how it was summed up by Rep. Justin Amash, a Republican from Michigan, who pitched and lost a last-minute battle to kill the bill.

The bill is titled the Intelligence Authorization Act for Fiscal Year 2015.

Amash said that the bill was “rushed to the floor” of the house for a vote, following the Senate having passed a version with a new section – Section 309 – that the House had never considered.

Lisa reports that the bill codifies Executive Order 12333, a Ronald Reagan remnant from an earlier attempt to dismantle the United States Constitution.

There is a petition underway to ask President Obama to veto the bill. Are you a large bank? Skip the petition and give the President a call.

From Lisa’s report, it sounds like Congress needs a DEW Line for legislation:

Rep. Zoe Lofgren, a California Democrat who voted against the bill, told the National Journal that the Senate’s unanimous passage of the bill was sneaky and ensured that the House would rubberstamp it without looking too closely:

If this hadn’t been snuck in, I doubt it would have passed. A lot of members were not even aware that this new provision had been inserted last-minute. Had we been given an additional day, we may have stopped it.

How do you “sneak in” legislation in a public body?

Suggestions on an early warning system for changes to legislation between the two houses of Congress?
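As a starting point for discussion, the core of such a warning system is a text comparison between the version one chamber passed and the version the other chamber sends back. The sketch below uses Python's standard difflib for that. Everything specific in it is an assumption: the function name new_lines is hypothetical, the section text is placeholder wording rather than actual bill text, and a real system would still need to fetch and normalize the official texts.

```python
import difflib
from typing import List


def new_lines(house_text: str, senate_text: str) -> List[str]:
    """Return lines that appear in the Senate text but not in the House text."""
    diff = difflib.ndiff(house_text.splitlines(), senate_text.splitlines())
    # ndiff prefixes lines unique to the second sequence with "+ ".
    return [line[2:] for line in diff if line.startswith("+ ")]


# ASSUMPTION: placeholder section headings, not the wording of any real bill.
house_version = "\n".join([
    "Sec. 307. (placeholder text)",
    "Sec. 308. (placeholder text)",
    "Sec. 310. (placeholder text)",
])
senate_version = "\n".join([
    "Sec. 307. (placeholder text)",
    "Sec. 308. (placeholder text)",
    "Sec. 309. (placeholder text added in the Senate version)",
    "Sec. 310. (placeholder text)",
])

for line in new_lines(house_version, senate_version):
    print("ALERT, new in Senate version:", line)
```

The comparison is the easy part. The harder parts are getting both texts promptly and getting the alert in front of members before the vote.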

More Missing Evidence In Ferguson (Michael Brown)

December 15th, 2014

Saturday’s data dump from St. Louis County Prosecutor Robert McCulloch is still short at least two critical pieces of evidence. There is no copy of the “documents that we gave you to help in your deliberation.” And, there is no copy of the police map to “…guide the grand jury.”

I. The “documents that we gave you to help in your deliberations:”

The prosecutors gave the grand jury written documents that supplemented their various oral misstatements of the law in this case.

From Volume 24 - November 21, 2014 - Page  138: 

2 You have all the information you need in 

3 those documents that we gave you to help in your 

4 deliberation. 

That follows a verbal misstatement of the law by Ms. Whirley:

Volume 24 - November 21, 2014 - Page  137


13 	    MS. WHIRLEY: Is that in order to vote 

14 true bill, you also must consider whether you 

15 believe Darren Wilson, you find probable cause, 

16 that's the standard to believe that Darren Wilson 

17 committed the offense and the offenses are what is 

18 in the indictment and you must find probable cause 

19 to believe that Darren Wilson did not act in lawful 

20 self-defense, and you've got the last sheet talks

21 about self-defense and talks about officer's use of

22 force, because then you must also have probable 

23 cause to believe that Darren Wilson did not use 

24 lawful force in making an arrest. So you are 

25 considering self-defense and use of force in making

Volume 24 - November 21, 2014 - Page  138 

Grand Jury — Ferguson Police Shooting Grand Jury 11/21/2014 

1 an arrest.

Where are the “documents that we gave you to help in your deliberation?”

Have you seen those documents? I haven’t.

And consider this additional misstatement of the law:

Volume 24 - November 21, 2014 - Page  139 

8 And the one thing that Sheila has 

9 explained as far as what you must find and as she 

10 said, it is kind of in Missouri it is kind of, the 

11 State has to prove in a criminal trial, the State 

12 has to prove that the person did not act in lawful 

13 self-defense or did not use lawful force in making,

14 it is kind of like we have to prove the negative. 

15 So in this case because we are talking 

16 about probable cause, as we've discussed, you must 

17 find probable cause to believe that he committed the 

18 offense that you're considering and you must find 

19 probable cause to believe that he did not act in 

20 lawful self-defense. Not that he did, but that he

21 did not and that you find probable cause to believe 

22 that he did not use lawful force in making the 

23 arrest. 

Just for emphasis:

the State has to prove that the person did not act in lawful self-defense or did not use lawful force in making, it is kind of like we have to prove the negative.

How hard is it to prove a negative? James Randi, in James Randi Lecture @ Caltech – Cant Prove a Negative, points out that proving a negative is a logical impossibility.

The grand jury was given a logically impossible task in order to indict Darren Wilson.

What choice did the grand jury have but to return a “no true bill?”

More Misguidance: The police map, Grand Jury 101

A police map was created to guide the jury in its deliberations, a map that reflected the police view of the location of witnesses.

Volume 24 - November 21, 2014 - Page  26 

Grand Jury — Ferguson Police Shooting Grand Jury 11/21/2014 


10	 Q (By Ms. Alizadeh) Extra, okay, that's 

11 right. And you indicated that you, along with other 

12 investigators prepared this, which is your 

13 interpretation based upon the statements made of 

14 witnesses as to where various eyewitnesses were 

15 during, when I say shooting, obviously, there was a 

16 time period that goes along, the beginning of the 

17 time of the beginning of the incident until after 

18 the shooting had been done. And do you still feel 

19 that this map accurately reflects where witnesses 

20 said they were? 

21 A I do. 

22	 Q And just for your instruction, this just, 

23 this map is for your purposes in your deliberations 

24 and if you disagree with anything that's on the map, 

25 these little sticky things come right off. So 

Volume 24 - November 21, 2014 - Page  27 

Grand Jury — Ferguson Police Shooting Grand Jury 11/21/2014 

1 supposedly they come right off. 

2 A They do. 

3	 Q If you feel that this witness is not in 

4 the right place, you can move any of these stickers 

5 that you want and put them in the places where you 

6 think they belong. 

7 This is just something that is 

8 representative of what this witness believes where 

9 people were. If you all do with this what you will. 

10 Also there was a legend that was 

11 provided for all of you regarding the numbers 

12 because the numbers that were assigned witnesses are 

13 not the same numbers as the witnesses testimony in 

14 this grand jury. 


Two critical statements:


11... And you indicated that you, along with other 

12 investigators prepared this, which is your 

13 interpretation based upon the statements made of 

14 witnesses as to where various eyewitnesses were 

15 during, when I say shooting,

So the map represents the detective’s opinion about other witnesses, and:

3	 Q If you feel that this witness is not in 

4 the right place, you can move any of these stickers 

5 that you want and put them in the places where you 

6 think they belong.

The witness gave the grand jury a map to guide its deliberations, but we will never know what map that was, because the stickers can be moved.

Pretty neat trick, giving the grand jury guidance that can never be disclosed to others.


You have seen the quote from the latest data dump from the prosecutor’s office:

McCulloch apologized in a written statement for any confusion that may have occurred by failing to initially release all of the interview transcripts. He said he believes he has now released all of the grand jury evidence, except for photos of Brown’s body and anything that could lead to witnesses being identified.

The written instructions to the grand jury and the now unknowable map (Grand Jury 101) aren’t pictures of Brown’s body or anything that could identify a witness. Where are they?

Please make a donation to support further research on the grand jury proceedings concerning Michael Brown. Future work will include:

  • A witness index to the grand jury transcripts
  • An exhibit index to the grand jury transcripts
  • Analysis of the grand jury transcript for patterns by the prosecuting attorneys, both expected and unexpected
  • A concordance of the grand jury transcripts
  • Suggestions?

Donations will enable continued analysis of the grand jury transcripts, which, along with other evidence, may establish a pattern of conduct that was not happenstance or coincidence but was, in fact, enemy action.

Thanks for your support!

Other Michael Brown Posts

Missing From Michael Brown Grand Jury Transcripts December 7, 2014. (The witness index I propose to replace.)

New recordings, documents released in Michael Brown case [LA Times Asks If There’s More?] Yes! December 9, 2014 (before the latest document dump on December 14, 2014).

Michael Brown Grand Jury – Presenting Evidence Before Knowing the Law December 10, 2014.

How to Indict Darren Wilson (Michael Brown Shooting) December 12, 2014.

More Missing Evidence In Ferguson (Michael Brown) December 15, 2014. (above)

Tweet Steganography?

December 15th, 2014

Hacking The Tweet Stream by Brett Lawrie.

Brett covers two popular methods for escaping the 140 character limit of Twitter, Tweetstorms and inline screen shots of text.

Brett comes down in favor of inline screen shots over Tweetstorms but see his post to get the full flavor of his comments.

What puzzled me was that Brett did not mention the potential for the use of steganography with inline screen shots. Whether they are of text or not. Could very well be screen shots of portions of the 1611 version of the King James Version (KJV) of the Bible with embedded information that some find offensive if not dangerous.

Or I suppose the sharper question is, How do you know that isn’t happening right now? On Flickr, Instagram, Twitter, one of many other photo sharing sites, blogs, etc.

Oh, I just remembered, I have an image for you. ;-)


(Image from a scan hosted at the Schoenberg Center for Electronic Text and Image (UPenn))

A downside to Twitter text images is that they won’t be easily indexed. Assuming you want your content to be findable. Sometimes you don’t.
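To make the steganography point concrete, here is a minimal sketch of least-significant-bit (LSB) embedding, one common technique among many. It is not something Brett's post describes. As assumptions: embed and extract are hypothetical helper names, and a plain byte string stands in for decoded pixel data. A real image would have to be decoded first, and any service that recompresses uploads would likely destroy naive LSB data.

```python
def embed(cover: bytes, message: bytes) -> bytes:
    """Hide message in the lowest bit of each cover byte, one bit per byte."""
    # Unpack the message into bits, most significant bit first.
    bits = [(byte >> shift) & 1 for byte in message for shift in range(7, -1, -1)]
    if len(bits) > len(cover):
        raise ValueError("cover is too small for this message")
    stego = bytearray(cover)
    for index, bit in enumerate(bits):
        # Clear the lowest bit, then set it to the message bit.
        stego[index] = (stego[index] & 0xFE) | bit
    return bytes(stego)


def extract(stego: bytes, length: int) -> bytes:
    """Recover length bytes by reading the lowest bit of each stego byte."""
    message = bytearray()
    for n in range(length):
        byte = 0
        for offset in range(8):
            byte = (byte << 1) | (stego[n * 8 + offset] & 1)
        message.append(byte)
    return bytes(message)


# ASSUMPTION: 256 arbitrary bytes standing in for the pixel data of an image.
cover = bytes(range(256))
secret = b"hi"
stego = embed(cover, secret)
print(extract(stego, len(secret)))
```

Each cover byte changes by at most one, which is why the altered image looks the same to a casual viewer, and why you cannot tell by looking whether it is happening.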

Some tools for lifting the patent data treasure

December 15th, 2014

Some tools for lifting the patent data treasure by Michele Peruzzi and Georg Zachmann.

From the post:

…Our work can be summarized as follows:

  1. We provide an algorithm that allows researchers to find the duplicates inside Patstat in an efficient way
  2. We provide an algorithm to connect Patstat to other kinds of information (CITL, Amadeus)
  3. We publish the results of our work in the form of source code and data for Patstat Oct. 2011.

More technically, we used or developed probabilistic supervised machine-learning algorithms that minimize the need for manual checks on the data, while keeping performance at a reasonably high level.

The post has links for source code and data for these three papers:

A flexible, scaleable approach to the international patent “name game” by Mark Huberty, Amma Serwaah, and Georg Zachmann

In this paper, we address the problem of having duplicated patent applicants’ names in the data. We use an algorithm that efficiently de-duplicates the data, needs minimal manual input and works well even on consumer-grade computers. Comparisons between entries are not limited to their names, and thus this algorithm is an improvement over earlier ones that required extensive manual work or overly cautious clean-up of the names.

A scaleable approach to emissions-innovation record linkage by Mark Huberty, Amma Serwaah, and Georg Zachmann

PATSTAT has patent applications as its focus. This means it lacks important information on the applicants and/or the inventors. In order to have more information on the applicants, we link PATSTAT to the CITL database. This way the patenting behaviour can be linked to climate policy. Because of the structure of the data, we can adapt the deduplication algorithm to use it as a matching tool, retaining all of its advantages.

Remerge: regression-based record linkage with an application to PATSTAT by Michele Peruzzi, Georg Zachmann, Reinhilde Veugelers

We further extend the information content in PATSTAT by linking it to Amadeus, a large database of companies that includes financial information. Patent microdata is now linked to financial performance data of companies. This algorithm compares records using multiple variables, learning their relative weights by asking the user to find the correct links in a small subset of the data. Since it is not limited to comparisons among names, it is an improvement over earlier efforts and is not overly dependent on the name-cleaning procedure in use. It is also relatively easy to adapt the algorithm to other databases, since it uses the familiar concept of regression analysis.
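To make the regression idea concrete, here is a minimal sketch. It is not the Remerge code. It compares pairs of records on several variables and learns the relative weights of those variables from a handful of hand-labelled pairs, using a plain logistic regression. The field names (name, country, city), the toy records, the similarity measures and the training settings are all my assumptions for illustration.

```python
import math

def similarity_features(a, b):
    """Compare two records on several variables, not just the name."""
    tokens_a = set(a["name"].lower().split())
    tokens_b = set(b["name"].lower().split())
    union = tokens_a | tokens_b
    name_overlap = len(tokens_a & tokens_b) / len(union) if union else 0.0
    same_country = 1.0 if a["country"] == b["country"] else 0.0
    same_city = 1.0 if a["city"] == b["city"] else 0.0
    return [name_overlap, same_country, same_city]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(pairs, labels, epochs=500, rate=0.5):
    """Learn one weight per variable (plus a bias) by gradient descent."""
    rows = [similarity_features(a, b) for a, b in pairs]
    weights, bias = [0.0] * len(rows[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            error = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            weights = [w - rate * error * v for w, v in zip(weights, x)]
            bias -= rate * error
    return weights, bias

def match_probability(a, b, weights, bias):
    x = similarity_features(a, b)
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))

def rec(name, country, city):
    return {"name": name, "country": country, "city": city}

# Hand-labelled pairs: 1 = same company, 0 = different companies (toy data).
labelled_pairs = [
    (rec("Siemens AG", "DE", "Munich"), rec("Siemens Aktiengesellschaft AG", "DE", "Munich")),
    (rec("Siemens AG", "DE", "Munich"), rec("Siemens Healthcare", "US", "Malvern")),
    (rec("Acme Ltd", "GB", "London"), rec("Acme Limited Ltd", "GB", "London")),
    (rec("Acme Ltd", "GB", "London"), rec("Apex Holdings", "GB", "Leeds")),
    (rec("Bosch GmbH", "DE", "Stuttgart"), rec("Robert Bosch GmbH", "DE", "Gerlingen")),
    (rec("Bosch GmbH", "DE", "Stuttgart"), rec("Basch GmbH", "FR", "Paris")),
]
labels = [1, 0, 1, 0, 1, 0]

weights, bias = train(labelled_pairs, labels)

# Score two unseen candidate links for one patent applicant.
applicant = rec("Philips NV", "NL", "Eindhoven")
p_same = match_probability(applicant, rec("Koninklijke Philips NV", "NL", "Eindhoven"), weights, bias)
p_other = match_probability(applicant, rec("Phillips Petroleum", "US", "Houston"), weights, bias)
```

The learned weights are ordinary regression coefficients, so you can inspect them to see how much each variable contributes to a link decision.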

Record linkage is a form of merging that originated in epidemiology in the late 1940s. To “link” (read merge) records across different formats, records were transposed into a uniform format and “linking” characteristics were chosen to gather matching records together. It is a very powerful technique that has been in continuous use and development ever since.

One major difference from topic maps is that record linkage has undisclosed subjects, that is, the subjects that make up the common format and the association of the original data sets with that format. I assume in many cases the mapping is documented but it doesn’t appear as part of the final work product, thereby rendering the merging process opaque and inaccessible to future researchers. All you can say is “…this is the data set that emerged from the record linkage.”

Sufficient for some purposes but if you want to reduce the 80% of your time that is spent munging data that has been munged before, it is better to have the mapping documented and to use disclosed subjects with identifying properties.
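To show what a documented mapping could look like, here is a minimal record linkage sketch in which the mapping to the common format and the linking characteristics are kept as data that can travel with the results. The source names, field names and sample records are hypothetical, and exact matching on normalized values stands in for the probabilistic matching the papers describe.

```python
from collections import defaultdict

# Hypothetical source-specific field names mapped onto a shared format.
# Keeping this mapping as data, not buried in code, is the "disclosed" part.
FIELD_MAPPINGS = {
    "patents": {"applicant_name": "name", "applicant_country": "country"},
    "companies": {"company": "name", "iso_country": "country"},
}

# The identifying properties used to decide that two records match.
LINKING_FIELDS = ("name", "country")

def to_common_format(source, record):
    """Transpose one source record into the uniform format."""
    mapping = FIELD_MAPPINGS[source]
    common = {target: record[field].strip().lower() for field, target in mapping.items()}
    common["_source"] = source  # remember where the record came from
    return common

def link(records_by_source):
    """Gather records that agree on every linking field."""
    groups = defaultdict(list)
    for source, records in records_by_source.items():
        for record in records:
            common = to_common_format(source, record)
            key = tuple(common[f] for f in LINKING_FIELDS)
            groups[key].append(common)
    return dict(groups)

linked = link({
    "patents": [{"applicant_name": "Acme Ltd ", "applicant_country": "GB"}],
    "companies": [
        {"company": "ACME LTD", "iso_country": "gb"},
        {"company": "Apex Ltd", "iso_country": "GB"},
    ],
})
acme_sources = sorted(r["_source"] for r in linked[("acme ltd", "gb")])
```

Publishing FIELD_MAPPINGS and LINKING_FIELDS alongside the linked data is what would let the next researcher see how the merge was done instead of repeating the munging.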

Having said all of that, these are tools you can use now on patents and/or extend to other data sets. The disambiguation problems addressed for patents are the same ones you have encountered with names for other entities.

If a topic map underlies your analysis, you will spend less time on the next analysis of the same information. Think of it as reducing your intellectual overhead in subsequent data sets.

Income – Less overhead = Greater revenue for you. ;-)

PS: Don’t be confused, you are looking for the EPO Worldwide Patent Statistical Database (PATSTAT). Naturally there is a US organization with a similar name, but it offers only patent litigation statistics.

PPS: Sam Hunting, the source of so many interesting resources, pointed me to this post.

Infinit.e Overview

December 15th, 2014

Infinit.e Overview by Alex Piggott.

From the webpage:

Infinit.e is a scalable framework for collecting, storing, processing, retrieving, analyzing, and visualizing unstructured documents and structured records.

[Image omitted. Too small in my theme to be useful.]

Let’s provide some clarification on each of the often overloaded terms used in that previous sentence:

  • It is a "framework" (or "platform") because it is configurable and extensible by configuration (DSLs) or by various plug-ins types – the default configuration is expected to be useful for a range of typical analysis applications but to get the most out of Infinit.e we anticipate it will usually be customized.
    • Another element of being a framework is being designed to integrate with existing infrastructures as well as run standalone.
  • By "scalable" we mean that new nodes (or even more granular: new components) can be added to meet increasing workload (either more users or more data), and that provision of new resources are near real-time.
    • Further, the use of fundamentally cloud-based components means that there are no bottlenecks at least to the ~100 node scale.
  • By "unstructured documents" we mean anything from a mostly-textual database record to a multi-page report – but Infinit.e’s "sweet spot" is in the range of database records that would correspond to a paragraph or more of text ("semi-structured records"), through web pages, to reports of 10 pages or less.
    • Smaller "structured records" are better handled by structured analysis tools (a very saturated space), though Infinit.e has the ability to do limited aggregation, processing and integration of such datasets. Larger reports can still be handled by Infinit.e, but will be most effective if broken up first.
  • By "processing" we mean the ability to apply complex logic to the data. Infinit.e provides some standard "enrichment", such as extraction of entities (people/places/organizations, etc.) and simple statistics; and also the ability to "plug in" domain specific processing modules using the Hadoop API.
  • By "retrieving" we mean the ability to search documents and return them in ranking order, but also to be able to retrieve "knowledge" aggregated over all documents matching the analyst’s query.
    • By "query"/"search" we mean the ability to form complex "questions about the data" using a DSL (Domain Specific Language).
  • By "analyzing" we mean the ability to apply domain-specific logic (visual/mathematical/heuristic/etc) to "knowledge" returned from a query.

We refer to the processing/retrieval/analysis/visualization chain as document-centric knowledge discovery:

  • "document-centric": means the basic unit of storage is a generically-formatted document (eg useful without knowledge of the specific data format in which it was encoded)
  • "knowledge discovery": means using statistical and text parsing algorithms to extract useful information from a set of documents that a human can interpret in order to understand the most important knowledge contained within that dataset.

One important aspect of Infinit.e is our generic data model. Data from all sources (from large unstructured documents to small structured records) is transformed into a single, simple data model that allows common queries, scoring algorithms, and analytics to be applied across the entire dataset. …

I saw this in a tweet by Gregory Piatetsky yesterday and so haven’t had time to download or test any of the features of Infinit.e.

The list of features is a very intriguing one.

Definitely worth the time to throw another VM on the box and try it out with a dataset of interest.

Would appreciate your doing the same and sending comments and/or pointers to posts with your experiences. Suspect we will have different favorite features and hit different limitations.


PS: Downloads.

Sony Pictures Demands That News Agencies Delete ‘Stolen’ Data

December 15th, 2014

Sony Pictures Demands That News Agencies Delete ‘Stolen’ Data by Michael Cieply and Brooks Barnes.

From the article:

Sony Pictures Entertainment warned media outlets on Sunday against using the mountains of corporate data revealed by hackers who raided the studio’s computer systems in an attack that became public last month.

In a sharply worded letter sent to news organizations, including The New York Times, David Boies, a prominent lawyer hired by Sony, characterized the documents as “stolen information” and demanded that they be avoided, and destroyed if they had already been downloaded or otherwise acquired.

The studio “does not consent to your possession, review, copying, dissemination, publication, uploading, downloading or making any use” of the information, Mr. Boies wrote in the three-page letter, which was distributed Sunday morning.

Since I wrote about the foolish accusations against North Korea by Sony, I thought it only fair to warn you that the idlers at Sony have decided to threaten everyone else.

A rather big leap from trash talking about North Korea to accusing the rest of the world of being interested in their incestuous bickering.

I certainly don’t want a copy of their movies, released or unreleased. Too much noise and too little signal for the space they would take. But, since Sony has gotten on its “let’s threaten everybody” hobby-horse, I do hope the location of the Sony documents suddenly appears in many more inboxes. ;-)

How would you display choice snippets and those who uttered them when a webpage loads?

The bitching and catching by Sony are sure signs that something went terribly wrong internally. The current circus is an attempt to distract the public from that failure. Probably a member of management with highly inappropriate security clearance because “…they are important!”

Inappropriate security clearances for management to networks is a sign of poor systems administration. I wonder when that shoe is going to drop?

American Institute of Physics: Oral Histories

December 15th, 2014

American Institute of Physics: Oral Histories

From the webpage:

The Niels Bohr Library & Archives holds a collection of over 1,500 oral history interviews. These range in date from the early 1960s to the present and cover the major areas and discoveries of physics from the past 100 years. The interviews are conducted by members of the staff of the AIP Center for History of Physics as well as other historians and offer unique insights into the lives, work, and personalities of modern physicists.

Read digitized oral history transcripts online

I don’t have a large audio data-set (see: Shining a light into the BBC Radio archives) but there are lots of other people who do.

If you are teaching or researching physics for the last 100 years, this is a resource you should not miss.

Integrating audio resources such as this one, at less than the full recording level (think of it as audio transclusion), into teaching materials would be a great step forward. To say nothing of being able to incorporate such granular resources into a library catalog.

I did not find an interview with Edward Teller but a search of the transcripts turned up three hundred and five (305) “hits” where he is mentioned in interviews. A search for J. Robert Oppenheimer netted four hundred and thirty-six (436) results.

If you know your atomic bomb history, you can guess between Teller and Oppenheimer which one would support the “necessity” defense for the use of torture. It would be an interesting study to see how the interviewees saw these two very different men.

Shining a light into the BBC Radio archives

December 15th, 2014

Shining a light into the BBC Radio archives by Yves Raimond, Matt Hynes, and Rob Cooper.

From the post:


One of the biggest challenges for the BBC Archive is how to open up our enormous collection of radio programmes. As we’ve been broadcasting since 1922 we’ve got an archive of almost 100 years of audio recordings, representing a unique cultural and historical resource.

But the big problem is how to make it searchable. Many of the programmes have little or no meta-data, and the whole collection is far too large to process through human efforts alone.

Help is at hand. Over the last five years or so, technologies such as automated speech recognition, speaker identification and automated tagging have reached a level of accuracy where we can start to get impressive results for the right type of audio. By automatically analysing sound files and making informed decisions about the content and speakers, these tools can effectively help to fill in the missing gaps in our archive’s meta-data.

The Kiwi set of speech processing algorithms

COMMA is built on a set of speech processing algorithms called Kiwi. Back in 2011, BBC R&D were given access to a very large speech radio archive, the BBC World Service archive, which at the time had very little meta-data. In order to build our prototype around this archive we developed a number of speech processing algorithms, reusing open-source building blocks where possible. We then built the following workflow out of these algorithms:

  • Speaker segmentation, identification and gender detection (using LIUM diarization toolkit, diarize-jruby and ruby-lsh). This process is also known as diarisation. Essentially an audio file is automatically divided into segments according to the identity of the speaker. The algorithm can show us who is speaking and at what point in the sound clip.
  • Speech-to-text for the detected speech segments (using CMU Sphinx). At this point the spoken audio is translated as accurately as possible into readable text. This algorithm uses models built from a wide range of BBC data.
  • Automated tagging with DBpedia identifiers. DBpedia is a large database holding structured data extracted from Wikipedia. The automatic tagging process creates the searchable meta-data that ultimately allows us to access the archives much more easily. This process uses a tool we developed called ‘Mango’.
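The three-stage workflow above can be sketched as a simple pipeline. This is only an outline of how the stages hand data to each other: every function here is a placeholder of my own, not the API of the LIUM toolkit, CMU Sphinx or Mango, and the "recording" is a canned structure instead of real audio. The DBpedia identifiers are ordinary resource URIs used as sample tags.

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    speaker: str
    start: float  # seconds
    end: float    # seconds
    text: str = ""
    tags: list = field(default_factory=list)

def diarise(recording):
    """Placeholder for diarisation: one Segment per speaker turn."""
    return [Segment(t["speaker"], t["start"], t["end"]) for t in recording["turns"]]

def transcribe(recording, segments):
    """Placeholder for speech-to-text: copies a canned transcript per turn."""
    for segment, turn in zip(segments, recording["turns"]):
        segment.text = turn["words"]
    return segments

# Toy lookup table standing in for an automated tagger.
TAG_TABLE = {
    "sanskrit": "http://dbpedia.org/resource/Sanskrit",
    "physics": "http://dbpedia.org/resource/Physics",
}

def tag(segments):
    """Attach an identifier to each segment whose text mentions a known term."""
    for segment in segments:
        words = segment.text.lower().split()
        segment.tags = [uri for term, uri in TAG_TABLE.items() if term in words]
    return segments

def process(recording):
    """Run the three stages in order: who spoke, what was said, what it is about."""
    return tag(transcribe(recording, diarise(recording)))

recording = {"turns": [
    {"speaker": "S1", "start": 0.0, "end": 4.2, "words": "Today we discuss Sanskrit poetry"},
    {"speaker": "S2", "start": 4.2, "end": 9.0, "words": "and a little physics"},
]}
segments = process(recording)
```

The point of the structure is that each segment ends up carrying speaker, time span, text and tags together, which is what makes a search hit addressable below the level of the whole recording.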


COMMA is due to launch some time in April 2015. If you’d like to be kept informed of our progress you can sign up for occasional email updates here. We’re also looking for early adopters to test the platform, so please contact us if you’re a cultural institution, media company or business that has large audio data-set you want to make searchable.

This article was written by Yves Raimond (lead engineer, BBC R&D), Matt Hynes (senior software engineer, BBC R&D) and Rob Cooper (development producer, BBC R&D)

I don’t have a large audio data-set but I am certainly going to be following this project. The results should be useful in and of themselves, to say nothing of being a good starting point for further tagging. I wonder if the BBC Sanskrit broadcasts are going to be available? I will have to check on that.

Without diminishing the achievements of other institutions, the efforts of the BBC, the British Library, and the British Museum are truly remarkable.

I first saw this in a tweet by Mike Jones.


December 15th, 2014


TweepsMap is a Twitter analysis service that:

  • Maps your followers by geographic location
  • Measures growth (or decline) of followers over time
  • Listens to what your followers are talking about
  • Produces action reports on how well you did yesterday
  • Analyzes anyone (competitors for example)
  • Assesses followers/following
  • Tracks hashtags/keywords (down to city level)

    You could do all of this for yourself but TweepsMap has the convenience of simply working. That makes it suitable for passing on to less CS-literate co-workers.

    Free account requires you to login with your Twitter account (of course) but the resulting mapping may surprise you.

    I didn’t see it offered but being able to analyze the people you follow would be a real plus. Not just geographically (to make sure you are getting a diverse world view) but by groupings of hashtags. That is, taking groups of hashtags to identify the groups of users who use them, so you can judge the groups that you are following.
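A minimal sketch of that hashtag grouping, assuming you already have the set of hashtags each followed account uses. The account names, the hashtags and the min_shared threshold are made up for illustration, and nothing here is a TweepsMap feature. Accounts that share at least min_shared hashtags are placed in the same group.

```python
from collections import defaultdict
from itertools import combinations

def hashtag_groups(user_hashtags, min_shared=2):
    """Group users who share at least `min_shared` hashtags, directly or via a chain."""
    parent = {user: user for user in user_hashtags}

    def find(user):
        # Follow parent links to the group representative (with path halving).
        while parent[user] != user:
            parent[user] = parent[parent[user]]
            user = parent[user]
        return user

    for a, b in combinations(sorted(user_hashtags), 2):
        if len(user_hashtags[a] & user_hashtags[b]) >= min_shared:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for user in user_hashtags:
        groups[find(user)].add(user)
    return sorted((sorted(members) for members in groups.values()), key=lambda g: g[0])

followed = {
    "alice": {"#clojure", "#lisp", "#fp"},
    "bob": {"#clojure", "#fp", "#scala"},
    "carol": {"#topicmaps", "#semweb"},
    "dave": {"#semweb", "#topicmaps", "#rdf"},
    "erin": {"#chess"},
}
groups = hashtag_groups(followed)
```

With groups in hand you can see at a glance whether most of the accounts you follow cluster around one set of hashtags.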

    I first saw this in a tweet from Alyona Medelyan.


    December 15th, 2014

    wonderland-clojure-katas by Carin Meier.

    From the webpage:

    These are a collection of Clojure katas inspired by Lewis Carroll and Alice and Wonderland

    Which of course makes me curious, is anyone working on Clojure katas based on The Hunting of the Snark?


    Other suggestions for kata inspiring works?

    Deep learning for… chess

    December 15th, 2014

    Deep learning for… chess by Erik Bernhardsson.

    From the post:

    I’ve been meaning to learn Theano for a while and I’ve also wanted to build a chess AI at some point. So why not combine the two? That’s what I thought, and I ended up spending way too much time on it. I actually built most of this back in September but not until Thanksgiving did I have the time to write a blog post about it.

    Chess sets are a common holiday gift so why not do something different this year?

    Pretty print a copy of this post and include a gift certificate from AWS for a GPU instance for say a week to ten days.

    I don’t think AWS sells gift certificates, but they certainly should. Great stocking stuffer, anniversary/birthday/graduation present, etc. Not so great for Valentine’s Day.

    If you ask AWS for a gift certificate, mention my name. They don’t know who I am so I could use the publicity. ;-)

    I first saw this in a tweet by Onepaperperday.

    Inheritance Patterns in Citation Networks Reveal Scientific Memes

    December 14th, 2014

    Inheritance Patterns in Citation Networks Reveal Scientific Memes by Tobias Kuhn, Matjaž Perc, and Dirk Helbing. (Phys. Rev. X 4, 041036 – Published 21 November 2014.)


    Memes are the cultural equivalent of genes that spread across human culture by means of imitation. What makes a meme and what distinguishes it from other forms of information, however, is still poorly understood. Our analysis of memes in the scientific literature reveals that they are governed by a surprisingly simple relationship between frequency of occurrence and the degree to which they propagate along the citation graph. We propose a simple formalization of this pattern and validate it with data from close to 50 million publication records from the Web of Science, PubMed Central, and the American Physical Society. Evaluations relying on human annotators, citation network randomizations, and comparisons with several alternative approaches confirm that our formula is accurate and effective, without a dependence on linguistic or ontological knowledge and without the application of arbitrary thresholds or filters.

    Popular Summary:

    It is widely known that certain cultural entities—known as “memes”—in a sense behave and evolve like genes, replicating by means of human imitation. A new scientific concept, for example, spreads and mutates when other scientists start using and refining the concept and cite it in their publications. Unlike genes, however, little is known about the characteristic properties of memes and their specific effects, despite their central importance in science and human culture in general. We show that memes in the form of words and phrases in scientific publications can be characterized and identified by a simple mathematical regularity.

    We define a scientific meme as a short unit of text that is replicated in citing publications (“graphene” and “self-organized criticality” are two examples). We employ nearly 50 million digital publication records from the American Physical Society, PubMed Central, and the Web of Science in our analysis. To identify and characterize scientific memes, we define a meme score that consists of a propagation score—quantifying the degree to which a meme aligns with the citation graph—multiplied by the frequency of occurrence of the word or phrase. Our method does not require arbitrary thresholds or filters and does not depend on any linguistic or ontological knowledge. We show that the results of the meme score are consistent with expert opinion and align well with the scientific concepts described on Wikipedia. The top-ranking memes, furthermore, have interesting bursty time dynamics, illustrating that memes are continuously developing, propagating, and, in a sense, fighting for the attention of scientists.

    Our results open up future research directions for studying memes in a comprehensive fashion, which could lead to new insights in fields as disparate as cultural evolution, innovation, information diffusion, and social media.

    You definitely should grab the PDF version of this article for printing and a slow read.

    From Section III Discussion:

    We show that the meme score can be calculated exactly and exhaustively without the introduction of arbitrary thresholds or filters and without relying on any kind of linguistic or ontological knowledge. The method is fast and reliable, and it can be applied to massive databases.
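As a rough illustration of the meme score described above (frequency multiplied by a propagation score), here is a minimal sketch on a toy citation graph. The form of the propagation score used here, the share of citing publications that repeat the meme divided by the share of non-citing publications that carry it anyway, is my reading of the description. The smoothing constant delta, the counting conventions and the six toy publications are assumptions, so check the paper before relying on it.

```python
def meme_score(publications, meme, delta=1.0):
    """Frequency of a term times how strongly it follows the citation graph."""
    carriers = {pid for pid, pub in publications.items() if meme in pub["terms"]}
    frequency = len(carriers) / len(publications)

    # Publications that cite at least one publication carrying the meme.
    cites_carrier = {
        pid for pid, pub in publications.items()
        if any(ref in carriers for ref in pub["cites"])
    }
    cites_none = set(publications) - cites_carrier

    # Of those citing a carrier, how many carry the meme themselves?
    sticking = len(carriers & cites_carrier) / (len(cites_carrier) + delta)
    # Of those citing no carrier, how many carry the meme anyway?
    sparking = (len(carriers & cites_none) + delta) / (len(cites_none) + delta)
    return frequency * sticking / sparking

publications = {
    "p1": {"terms": {"graphene"}, "cites": []},
    "p2": {"terms": {"graphene"}, "cites": ["p1"]},
    "p3": {"terms": {"graphene", "flow"}, "cites": ["p2"]},
    "p4": {"terms": {"flow"}, "cites": []},
    "p5": {"terms": {"flow"}, "cites": ["p1"]},
    "p6": {"terms": set(), "cites": ["p4"]},
}
graphene_score = meme_score(publications, "graphene")
flow_score = meme_score(publications, "flow")
```

Both terms appear in half of the toy publications, but only "graphene" is passed along citation links, so it scores higher than "flow".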

    Fair enough but “black,” “inflation,” and “traffic flow” all appear in the top fifty memes in physics. I don’t know that I would consider any of them to be “memes.”

    There is much left to be discovered about memes, such as who is good at propagating them. It would not hurt if your research paper were the origin of a very popular meme.

    I first saw this in a tweet by Max Fisher.

    North Korea As Bogeyman

    December 14th, 2014

    The Sony hack: how it happened, who is responsible, and what we’ve learned by Timothy B. Lee.

    From the post:

    However, North Korea has denied involvement in the attack, and on Wednesday the FBI said that it didn’t have evidence linking the attacks to the North Korean regime. And there are other reasons to doubt the North Koreans are responsible. As Kim Zetter has argued, “nation-state attacks don’t usually announce themselves with a showy image of a blazing skeleton posted to infected machines or use a catchy nom-de-hack like Guardians of Peace to identify themselves.”

    There’s some evidence that the hackers may have been aggrieved about last year’s big layoffs at Sony, which doesn’t seem like something the North Korean regime would care about. And the hackers demonstrated detailed knowledge of Sony’s network that could indicate they had help from inside the company.

    In the past, these kinds of attacks have often been carried out by young men with too much time on their hands. The 2011 LulzSec attacks, for example, were carried out by a loose-knit group from the United States, the United Kingdom, and Ireland with no obvious motive beyond wanting to make trouble for powerful institutions and generate publicity for themselves.

    I assume you have heard the bed wetters in the United States government decrying North Korea as the bogeyman responsible for hacking Sony Pictures (November 2014, just to distinguish it from other hacks of Sony).

    If you have ever seen a picture of North Korea at night (below), you will understand why I doubt North Korea is the technology badass imagined by US security “experts.”

    North Korea at night

    Not that you have to waste a lot of energy on outside lighting to have a competent computer hacker community but it is one indicator.

    A more likely explanation is that Sony forgot to reset a sysadmin password and it is a “hack” only because a non-current employee carried it out.

    Until some breach other than a valid login by a non-employee is confirmed by independent security experts, I would discard any talk of this being North Korea attacking Sony.

    The only reason to blame North Korea is to create a smokescreen to avoid accepting blame for lax internal security. Watch for Sony to make a film about its fight for freedom of speech against the axis of evil (includes North Korea, wait a couple of weeks to know who else).

    When Sony wants to say something, it is freedom of speech. When you want to repeat it, it is a criminal copyright violation. Funny how that works. Tell Sony to clean up its internal security and only then to worry about outsiders.


    December 14th, 2014

    GearPump (GitHub)

    From the wiki homepage:

    GearPump is a lightweight, real-time, big data streaming engine. It is inspired by recent advances in the Akka framework and a desire to improve on existing streaming frameworks. GearPump draws from a number of existing frameworks including MillWheel, Apache Storm, Spark Streaming, Apache Samza, Apache Tez, and Hadoop YARN while leveraging Akka actors throughout its architecture.

    What originally caught my attention was this passage on the GitHub page:

    Per initial benchmarks we are able to process 11 million messages/second (100 bytes per message) with a 17ms latency on a 4-node cluster.

    Think about that for a second.

    Per initial benchmarks we are able to process 11 million messages/second (100 bytes per message) with a 17ms latency on a 4-node cluster.
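To put those numbers in more familiar units, here is the arithmetic as a short sketch. The even split across the four nodes is my assumption; the benchmark quote does not say how the load was distributed.

```python
messages_per_second = 11_000_000
bytes_per_message = 100
nodes = 4

# Aggregate payload moved through the cluster each second.
total_bytes_per_second = messages_per_second * bytes_per_message

# Per-node share, assuming the load is spread evenly (an assumption).
per_node_messages = messages_per_second / nodes
per_node_megabytes = total_bytes_per_second / nodes / 1_000_000
```

That is 1.1 GB of payload per second in aggregate, or 2.75 million messages and 275 MB per second per node if the split is even.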

    The GitHub page features a word count example and pointers to the wiki with more examples.

    What if every topic “knew” the index value of every topic that should merge with it on display to a user?

    When a topic is added to a topic map, it broadcasts its merging property values and any topic with those values responds by transmitting its index value.

    When you retrieve a topic, it has all the IDs necessary to create a merged view of the topic on the fly and on the client side.
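A minimal sketch of that idea, to check that it hangs together. The TopicMap class, its method names and the example identifiers are all hypothetical, and the "broadcast" is just a loop over an in-memory list. I also assume that merging is transitive, so a new topic that matches one member of a group learns the index values of the whole group.

```python
class TopicMap:
    def __init__(self):
        self.topics = []  # a topic's index value is its position in this list

    def add(self, name, merging_values):
        """Add a topic, 'broadcasting' its merging property values."""
        index = len(self.topics)
        values = set(merging_values)
        # Every existing topic that shares a value responds with its index.
        responders = {i for i, t in enumerate(self.topics) if t["merging_values"] & values}
        # Responders also pass along the indexes they already know about.
        group = set(responders)
        for i in responders:
            group |= self.topics[i]["merge_with"]
        self.topics.append({"name": name, "merging_values": values, "merge_with": set()})
        group.add(index)
        # Redundantly store the full group on every member.
        for i in group:
            self.topics[i]["merge_with"] = group - {i}
        return index

    def merged_view(self, index):
        """Build the merged view on the fly from the stored index values."""
        topic = self.topics[index]
        members = [topic] + [self.topics[i] for i in sorted(topic["merge_with"])]
        return {
            "names": sorted({m["name"] for m in members}),
            "merging_values": sorted(set().union(*(m["merging_values"] for m in members))),
        }

tm = TopicMap()
a = tm.add("Sony Pictures", {"http://example.org/id/sony-pictures"})
b = tm.add("SPE", {"http://example.org/id/sony-pictures", "http://example.org/id/spe"})
c = tm.add("GearPump", {"http://example.org/id/gearpump"})
d = tm.add("Sony Pictures Entertainment", {"http://example.org/id/spe"})
view_a = tm.merged_view(a)
view_d = tm.merged_view(d)
```

Each member of a group stores the indexes of all the others, which is the redundancy mentioned next, and it means retrieving any one of them is enough to assemble the same merged view.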

    There would be redundancy in the map but de-duplication for storage space went out with preferences for 7-bit character values to save memory space. So long as every topic returns the same result, who cares?

    Well, it might make a difference when the CIA wants to give every contractor full access to its datastores 24×7 via their cellphones. But, until that is an actual requirement, I would not worry about the storage space overmuch.

    I first saw this in a tweet from Suneel Marthi.

    Everything You Need To Know About Social Media Search

    December 14th, 2014

    Everything You Need To Know About Social Media Search by Olsy Sorokina.

    From the post:

    For the past decade, social networks have been the most universally consistent way for us to document our lives. We travel, build relationships, accomplish new goals, discuss current events and welcome new lives—and all of these events can be traced on social media. We have created hashtags like #ThrowbackThursday and apps like Timehop to reminisce on all the past moments forever etched in the social web in form of status updates, photos, and 140-character phrases.

    Major networks demonstrate their awareness of the role they play in their users’ lives by creating year-end summaries such as Facebook’s Year in Review, and Twitter’s #YearOnTwitter. However, much of the emphasis on social media has been traditionally placed on real-time interactions, which often made it difficult to browse for past posts without scrolling down for hours on end.

    The bias towards real-time messaging has changed in a matter of a few days. Over the past month, three major social networks announced changes to their search functions, which made finding old posts as easy as a Google search. If you missed out on the news or need a refresher, here’s everything you need to know.

    I suppose Olsy means in addition to search in general sucking.

    Interesting tidbit on Facebook:

    This isn’t Facebook’s first attempt at building a search engine. The earlier version of Graph Search gave users search results in response to longer-form queries, such as “my friends who like Game of Thrones.” However, the semantic search never made it to the mobile platforms; many supposed that using complex phrases as search queries was too confusing for an average user.

    Does anyone have any user research on the ability of users to use complex phrases as search queries?

    I ask because if users have difficulty authoring “complex” semantics and difficulty querying with “complex” semantics, it stands to reason they may have difficulty interpreting “complex” semantic results. Yes?

    If all three of those are the case, then how do we impart the value-add of “complex” semantics without tripping over one of those limitations?

    Olsy also covers Instagram and Twitter. Twitter’s advanced search looks like the standard include/exclude, etc. type of “advanced” search. “Advanced” maybe forty years ago in the early OPACs but not really “advanced” now.

    Catch up on these new search features. They will provide at least a minimum of grist for your topic map mill.

    How Scientists Are Learning to Write

    December 14th, 2014

    How Scientists Are Learning to Write by Alexandra Ossola.

    From the post:

    The students tried not to look sheepish as their professor projected the article on the whiteboard, waiting for their work to be devoured by their classmates. It was the second class for the nine students, all of whom are Ph.D. candidates or post-doctoral fellows. Their assignment had been to distill their extensive research down to just three paragraphs so that the average person could understand it, and, as in any class, some showed more aptitude than others. The piece on the board was by one of the students, a Russian-born biologist.

    The professor, the journalist and author Stephen Hall (with whom I took a different writing workshop last year), pointed to the word “sequencing.” “That’s jargon-ish,” he said, circling it on the board. “Even some people in the sciences don’t have an intuitive understanding of what that means.” He turned to another student in the class, an Italian native working on his doctorate in economics, for confirmation. “Yes, I didn’t know what was going on,” he said, turning to the piece’s author. The biology student wrote something in her notebook.

    Why is better writing important?:

    But explaining science is just as valuable for the lay public as it is for the scientists themselves. “Science has become more complex, more specialized—every sub-discipline has its own vocabulary,” Hall said. Scientists at all levels have to work hard to explain niche research to the general public, he added, but it’s increasingly important for the average person to understand. That’s because their research has become central to many other elements of society, influencing realms that may have previously seemed far from scientific rigors.

    Olivia Wilkins, a post-doctoral fellow who studies plant genetics at New York University’s Center for Genomics and Systems Biology, recently took Hall’s four-session workshop. She wanted to be a better writer, she said, because she wanted her research to matter. “Science is a group effort. We may be in different labs at different universities, but ultimately, many of us are working towards the same goals. I want to get other people as excited about my work as I am, and I believe that one of the ways to do this is through better writing.”

    How about that? Communicating with other people who are just as bright as you, but who don’t share the same vocabulary? Does that sound like a plausible reason to you?

    I really like the closer:

    “…Writing takes a lot of practice like anything else—if you don’t do it, you don’t get better. (emphasis added)

    I review standards and even offer editing advice from time to time. If you think scientists aren’t born with the ability to write, you should check out standards drafts by editors unfamiliar with how to write standards.

    Expect citations in a variety of home-grown formats, to publications that may or may not exist or be suitable for normative citation; terminology that isn’t defined anywhere; contradictions between different parts; conformance clauses that are too vague for anyone to know what is required; and many things in between.

    If anything should be authored with clarity, considering that conformance should make applications interoperable, it is IT standards. Take the advice in Alexandra’s post to heart and seek out a writing course near you.

    I edit and review standards so ping me if you want an estimate on how to improve your latest standard draft. (References available on request.)

    I first saw this in a tweet by Gretchen Ritter.

    Instant Hosting of Open Source Projects with GitHub-style Ribbons

    December 14th, 2014

    Instant Hosting of Open Source Projects with GitHub-style Ribbons by Ryan Jarvinen.

    From the post:

    In this post I’ll show you how to create your own GitHub-style ribbons for launching open source projects on OpenShift.

    The popular “Fork me on GitHub” ribbons provide a great way to raise awareness for your favorite open source projects. Now, the same technique can be used to instantly launch clones of your application, helping to rapidly grow your community!

    Take advantage of [the following link is broken as of 12/14/2014] OpenShift’s web-based app creation workflow – streamlining installation, hosting, and management of instances – by crafting a workflow URL that contains information about your project.

    I thought this could be useful in the not too distant future.

    Better to blog about it here than to search for it in the nightmare of my bookmarks. ;-)

    What Is the Relationship Between HCI Research and UX Practice?

    December 14th, 2014

    What Is the Relationship Between HCI Research and UX Practice? by Stuart Reeves

    From the post:

    Human-computer interaction (HCI) is a rapidly expanding academic research domain. Academic institutions conduct most HCI research—in the US, UK, Europe, Australasia, and Japan, with growth in Southeast Asia and China. HCI research often occurs in Computer Science departments, but retains its historically strong relationship to Psychology and Human Factors. Plus, there are several large, prominent corporations that both conduct HCI research themselves and engage with the academic research community—for example, Microsoft Research, PARC, and Google.

    If you aren’t concerned with the relationship between HCI research and UX practice, you should be.

    I was in a meeting discussing the addition of RDFa to ODF when a W3C expert commented that the difficulty users have with RDFa syntax was a “user problem.”

    Not to pick on RDFa, I think many of us in the topic map camp felt that users weren’t putting enough effort into learning topic maps. (I will only confess that for myself. Others can speak for themselves.)

    Anytime an advocate and/or developer takes the view that syntax, interfaces or interaction with a program is a “user problem,” they are pointing the wrong way with the stick.

    They should be pointing at the developers, designers and advocates who have not made interaction with their program/software intuitive for the “targeted audience.”

    If your program is a LaTeX macro targeted at physicists who eat LaTeX for breakfast, lunch and dinner, that’s one audience.

    If your program is an editing application targeted at users crippled by the typical office suite menus, then you had best make different choices.

    That is assuming that use of your application is your measure of success.

    Otherwise you can strive to be the second longest running non-profitable software project in history (Xanadu, started in 1960, has first place).

    Rather than being right, or saving the world, or any of the other …ologies, I would prefer to have software that users find useful and do in fact use.

    Use is a pre-condition to any software or paradigm changing the world.


    PS: Don’t get me wrong, Xanadu is a great project but its adoption of web browsers as a means of delivery is a mistake. True, they are everywhere but they are also subject to the crippled design of web security, which prevents transclusion. That ties you to a server where the NSA can more conveniently scoop up your content.

    Better would be a document browser that uses web protocols and ignores web security rules, thus enabling client-side transclusion. Fork one of the open source browsers and be done with it. Only use digitally signed PDFs or PDFs from particular sources. Once utility is demonstrated in a PDF-only universe, the demand will grow for extending it to other sources as well.

    True, some EU/US trade delegates and others will get caught in phishing schemes but I consider that grounds for dismissal and forfeiture of all retirement benefits. (Yes, I retain a certain degree of “users be damned,” but not about UI/UX. ;-) )

    My method of avoiding phishing schemes is to never follow links in emails. If there is an offer I want to look at, I log directly into the site from my browser and not via email. Even for valid messages, which they rarely are.

    I first saw this in a tweet by Raffaele Boiano.

    Machine Learning: The High-Interest Credit Card of Technical Debt (and Merging)

    December 14th, 2014

    Machine Learning: The High-Interest Credit Card of Technical Debt by D. Sculley, et al.

    Abstract:

    Machine learning offers a fantastically powerful toolkit for building complex systems quickly. This paper argues that it is dangerous to think of these quick wins as coming for free. Using the framework of technical debt, we note that it is remarkably easy to incur massive ongoing maintenance costs at the system level when applying machine learning. The goal of this paper is highlight several machine learning specific risk factors and design patterns to be avoided or refactored where possible. These include boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, changes in the external world, and a variety of system-level anti-patterns.

    Under “entanglement” (referring to inputs) the authors announce the CACE principle:

    Changing Anything Changes Everything

    The net result of such changes is that prediction behavior may alter, either subtly or dramatically, on various slices of the distribution. The same principle applies to hyper-parameters. Changes in regularization strength, learning settings, sampling methods in training, convergence thresholds, and essentially every other possible tweak can have similarly wide ranging effects.
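
    One way to see entanglement of inputs in miniature: a toy sketch, not taken from the paper, in which adding a second, correlated input changes the weight learned for the first. The data, the function names and the choice of ordinary least squares without an intercept are all assumptions made for illustration.

```python
# Toy illustration of CACE: adding one input changes the weight learned
# for an input that was already in the model. All data are invented.

def fit_one(x, y):
    """Least-squares weight for y ~ w * x (no intercept)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def fit_two(x1, x2, y):
    """Least-squares weights for y ~ w1 * x1 + w2 * x2 (no intercept),
    found by solving the 2x2 normal equations with Cramer's rule."""
    s11 = sum(a * a for a in x1)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s22 = sum(b * b for b in x2)
    t1 = sum(a * c for a, c in zip(x1, y))
    t2 = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return (t1 * s22 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det


x1 = [1, 2, 3, 4]
x2 = [1, 1, 2, 2]                        # correlated with x1
y = [a + 2 * b for a, b in zip(x1, x2)]  # invented "true" relationship

w_alone = fit_one(x1, y)                 # model that only sees x1
w1, w2 = fit_two(x1, x2, y)              # model that sees x1 and x2

print(f"x1 alone:   w1 = {w_alone:.3f}")
print(f"x1 with x2: w1 = {w1:.3f}, w2 = {w2:.3f}")
```

    With x1 alone, the model loads the effect of the missing input onto x1 (a weight of roughly 2.13). Once x2 is supplied, the weight on x1 falls to 1. Nothing about x1 changed, yet its role in the model did.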

    Entanglement is a native condition in topic maps as a result of the merging process. Yet, I don’t recall there being much discussion of how to evaluate the potential for unwanted entanglement or how to avoid entanglement (if desired).

    You may have topics in a topic map where merging with later additions to the topic map is to be avoided. Perhaps to avoid the merging of spam topics that would otherwise overwhelm your content.

    One way to avoid that and yet allow users to use links reported as subjectIdentifiers and subjectLocators under the TMDM would be to not report those properties for some set of topics to the topic map engine. The only property they could merge on would be their topicID, which hopefully you have concealed from public users.

    This is not unlike the Unix tradition where ports below some number X (conventionally 1024) are unavailable to any users other than root. Topics with IDs below N are skipped by the topic map engine for merging purposes, unless the merging is invoked by the equivalent of root.

    No change in current syntax or modeling required, although a filter on topic IDs would need to be implemented to add this to current topic map applications.
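
    A minimal sketch of such a filter, under stated assumptions: topic IDs are integers, the threshold N is the hypothetical constant PROTECTED_BELOW, and the Topic class, the as_root flag and the single-pass merge are invented for illustration. It is not the API of any existing topic map engine, and it only considers subject identifiers and subject locators, not the full TMDM merging rules.

```python
from dataclasses import dataclass, field

# Hypothetical threshold N: topics with IDs below it are protected.
PROTECTED_BELOW = 1000


@dataclass
class Topic:
    topic_id: int
    names: set = field(default_factory=set)
    subject_identifiers: set = field(default_factory=set)
    subject_locators: set = field(default_factory=set)


def eligible(topic, as_root=False):
    """Protected topics take part in merging only for the root equivalent."""
    return as_root or topic.topic_id >= PROTECTED_BELOW


def should_merge(a, b, as_root=False):
    """Merge when both topics are eligible and share an identifier or locator."""
    if not (eligible(a, as_root) and eligible(b, as_root)):
        return False
    return bool(a.subject_identifiers & b.subject_identifiers
                or a.subject_locators & b.subject_locators)


def merge_all(topics, as_root=False):
    """Single-pass merge (a simplification: no transitive re-merging).
    Input topics are copied, so the originals are left untouched."""
    merged = []
    for topic in topics:
        for existing in merged:
            if should_merge(existing, topic, as_root):
                existing.names |= topic.names
                existing.subject_identifiers |= topic.subject_identifiers
                existing.subject_locators |= topic.subject_locators
                break
        else:
            merged.append(Topic(topic.topic_id, set(topic.names),
                                set(topic.subject_identifiers),
                                set(topic.subject_locators)))
    return merged


curated = Topic(1, {"Topic Maps"}, {"http://example.org/topic-maps"})
spam = Topic(5000, {"Cheap pills"}, {"http://example.org/topic-maps"})
later = Topic(5001, {"TM"}, {"http://example.org/topic-maps"})

public_view = merge_all([curated, spam, later])
root_view = merge_all([curated, spam, later], as_root=True)
print(len(public_view), len(root_view))
```

    In the public view the curated topic stays as it was while the two later topics merge with each other. Only when merging is invoked with as_root=True do all three collapse into one.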

    I am sure there are other ways to prevent merging of some topics but this seems like a simple way to achieve that end.

    Unfortunately it does not address the larger question of the “technical debt” incurred to maintain a topic map of any degree of sophistication.


    I first saw this in a tweet by Elias Ponvert.