## The Top 10 Posts of 2014 from the Cloudera Engineering Blog

December 18th, 2014

The Top 10 Posts of 2014 from the Cloudera Engineering Blog by Justin Kestelyn.

From the post:

Our “Top 10″ list of blog posts published during a calendar year is a crowd favorite (see the 2013 version here), in particular because it serves as informal, crowdsourced research about popular interests. Page views don’t lie (although skew for publishing date—clearly, posts that publish earlier in the year have pole position—has to be taken into account).

In 2014, a strong interest in various new components that bring real time or near-real time capabilities to the Apache Hadoop ecosystem is apparent. And we’re particularly proud that the most popular post was authored by a non-employee.

See Justin’s post for the top ten (10) list!

The Cloudera blog always has high-quality content, so this is the cream of the crop!

Enjoy!

## Announcing Apache Storm 0.9.3

December 18th, 2014

From the post:

With Apache Hadoop YARN as its architectural center, Apache Hadoop continues to attract new engines to run within the data platform, as organizations want to efficiently store their data in a single repository and interact with it for batch, interactive and real-time streaming use cases. Apache Storm brings real-time data processing capabilities to help capture new business opportunities by powering low-latency dashboards, security alerts, and operational enhancements integrated with other applications running in the Hadoop cluster.

Now there’s an early holiday surprise!

Enjoy!

## GovTrack’s Summer/Fall Updates

December 18th, 2014

GovTrack’s Summer/Fall Updates by Josh Tauberer.

From the post:

Here’s what’s been improved on GovTrack in the summer and fall of this year.

• Permalinks to individual paragraphs in bill text is now provided (example).
• We now ask for your congressional district so that we can customize vote and bill pages to show how your Members of Congress voted.
• Our bill action/status flow charts on bill pages now include activity on certain related bills, which are often crucially important to the main bill.
• The bill cosponsors list now indicates when a cosponsor of a bill is no longer serving (i.e. because of retirement or death).
• We switched to gender neutral language when referring to Members of Congress. Instead of “congressman/woman”, we now use “representative.”
• Our historical votes database (1979-1989) from voteview.com was refreshed to correct long-standing data errors.
• We dropped support for Internet Explorer 6 in order to address with POODLE SSL security vulnerability that plagued most of the web.
• We dropped support for Internet Explorer 7 in order to allow us to make use of more modern technologies, which has always been the point of GovTrack.

The comment I posted was:

Great work! But I read the other day about legislation being “snuck” by the House (Senate changes), US Congress OKs ‘unprecedented’ codification of warrantless surveillance.

Do you have plans for a diff utility that warns members of either house of changes to pending legislation?

In case you aren’t familiar with GovTrack.us.

GovTrack.us, a project of Civic Impulse, LLC now in its 10th year, is one of the worldʼs most visited government transparency websites. The site helps ordinary citizens find and track bills in the U.S. Congress and understand their representatives’ legislative record.

In 2013, GovTrack.us was used by 8 million individuals. We sent out 3 million legislative update email alerts. Our embeddable widgets were deployed on more than 80 official websites of Members of Congress.

We bring together the status of U.S. federal legislation, voting records, congressional district maps, and more (see the table at the right) and make it easier to understand. Use GovTrack to track bills for updates or get alerts about votes with email updates and RSS feeds. We also have unique statistical analyses to put the information in context. Read the «Analysis Methodology».

GovTrack openly shares the data it brings together so that other websites can build other tools to help citizens engage with government. See the «Developer Documentation» for more.

## A Survey of Monte Carlo Tree Search Methods

December 18th, 2014

A Survey of Monte Carlo Tree Search Methods by Cameron Browne, et al.

Abstract:

Monte Carlo Tree Search (MCTS) is a recently proposed search method that combines the precision of tree search with the generality of random sampling. It has received considerable interest due to its spectacular success in the difficult problem of computer Go, but has also proved beneficial in a range of other domains. This paper is a survey of the literature to date, intended to provide a snapshot of the state of the art after the first five years of MCTS research. We outline the core algorithm’s derivation, impart some structure on the many variations and enhancements that have been proposed, and summarise the results from the key game and non-game domains to which MCTS methods have been applied. A number of open research questions indicate that the field is ripe for future work.

At almost fifty (50) pages, this review of the state of the art for MCTS research as of 2012 should keep even dedicated readers occupied for several days. The extensive bibliography will enhance your reading experience!
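To make the abstract's "tree search plus random sampling" concrete, here is a minimal sketch of the four usual MCTS phases (selection, expansion, simulation, backpropagation) with a UCT-style selection rule. It is my own illustration, not code from the survey: the toy game (players alternately take 1 to 3 stones, and whoever takes the last stone wins), the names `Node`, `uct_child`, `rollout`, `mcts_best_move`, and the exploration constant `c=1.4` are all assumptions chosen for brevity.

```python
import math
import random


class Node:
    def __init__(self, stones, parent=None, move=None):
        self.stones = stones      # stones left for the player about to move
        self.parent = parent
        self.move = move          # move that led from the parent to this node
        self.children = []
        self.untried = [m for m in (1, 2, 3) if m <= stones]
        self.visits = 0
        self.wins = 0.0           # wins credited to the player who moved INTO this node


def uct_child(node, c=1.4):
    """Pick the child with the best win rate plus exploration bonus."""
    return max(
        node.children,
        key=lambda ch: ch.wins / ch.visits
        + c * math.sqrt(math.log(node.visits) / ch.visits),
    )


def rollout(stones, rng):
    """Random playout. True if the player to move at `stones` takes the last stone."""
    starter_to_move = True
    while stones > 0:
        stones -= rng.randint(1, min(3, stones))
        if stones == 0:
            return starter_to_move
        starter_to_move = not starter_to_move
    return False  # no stones left: the previous player already took the last one


def mcts_best_move(stones, iterations=2000, seed=0):
    rng = random.Random(seed)
    root = Node(stones)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend while the node is fully expanded and not terminal.
        while not node.untried and node.children:
            node = uct_child(node)
        # 2. Expansion: add one untried move as a new child.
        if node.untried:
            move = node.untried.pop()
            child = Node(node.stones - move, parent=node, move=move)
            node.children.append(child)
            node = child
        # 3. Simulation: random playout from the new node.
        mover_wins = rollout(node.stones, rng)
        # 4. Backpropagation: flip the winner's perspective at every level.
        just_moved_won = not mover_wins
        while node is not None:
            node.visits += 1
            if just_moved_won:
                node.wins += 1.0
            just_moved_won = not just_moved_won
            node = node.parent
    # The most visited root child is the usual "robust" choice.
    return max(root.children, key=lambda ch: ch.visits).move
```

Calling `mcts_best_move(5)` searches the game from five stones. In this toy game the sound move is to leave the opponent a multiple of four, which gives you something to compare the search result against.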

I first saw this in a tweet by Ebenezer Fogus.

## Google’s alpha-stage email encryption plugin lands on GitHub

December 18th, 2014

Google’s alpha-stage email encryption plugin lands on GitHub by David Meyer.

From the post:

Google has updated its experimental End-to-End email encryption plugin for Chrome and moved the project to GitHub. The firm said in a Tuesday blog post that it had “always believed strongly that End-To-End must be an open source project.” The alpha-stage, OpenPGP-based extension now includes the first contributions from Yahoo’s chief security officer, Alex Stamos. Google will also make its new crypto library available to several other projects that have expressed interest. However, product manager Stephan Somogyi said the plugin still wasn’t ready for the Chrome Web Store, and won’t be widely released until Google is happy with the usability of its key distribution and management mechanisms.

Not to mention that being open source makes it harder to lean on management to make compromises to suit governments. Imagine that, the strength to resist tyranny in openness.

If you are looking for a “social good” project for 2015, it is hard to imagine a better one in the IT area.

## DeepDive

December 18th, 2014

DeepDive

From the homepage:

DeepDive is a new type of system that enables developers to analyze data on a deeper level than ever before. DeepDive is a trained system: it uses machine learning techniques to leverage on domain-specific knowledge and incorporates user feedback to improve the quality of its analysis.

DeepDive differs from traditional systems in several ways:

• DeepDive is aware that data is often noisy and imprecise: names are misspelled, natural language is ambiguous, and humans make mistakes. Taking such imprecisions into account, DeepDive computes calibrated probabilities for every assertion it makes. For example, if DeepDive produces a fact with probability 0.9 it means the fact is 90% likely to be true.
• DeepDive is able to use large amounts of data from a variety of sources. Applications built using DeepDive have extracted data from millions of documents, web pages, PDFs, tables, and figures.
• DeepDive allows developers to use their knowledge of a given domain to improve the quality of the results by writing simple rules that inform the inference (learning) process. DeepDive can also take into account user feedback on the correctness of the predictions, with the goal of improving the predictions.
• DeepDive is able to use the data to learn "distantly". In contrast, most machine learning systems require tedious training for each prediction. In fact, many DeepDive applications, especially at early stages, need no traditional training data at all!
• DeepDive’s secret is a scalable, high-performance inference and learning engine. For the past few years, we have been working to make the underlying algorithms run as fast as possible. The techniques pioneered in this project are part of commercial and open source tools including MADlib, Impala, a product from Oracle, and low-level techniques, such as Hogwild!. They have also been included in Microsoft's Adam.
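The claim about calibrated probabilities in the list above can be checked empirically: group assertions by their stated probability and compare each group's average probability with the fraction that turned out to be true. The sketch below is a generic calibration table, not DeepDive's API. The function name `calibration_table`, the `(probability, is_true)` input format and the ten-bin default are my assumptions.

```python
from collections import defaultdict


def calibration_table(predictions, n_bins=10):
    """Bin (probability, is_true) pairs and compare stated vs. observed rates.

    Returns {bin_index: (mean_probability, observed_frequency, count)}.
    """
    bins = defaultdict(list)
    for prob, is_true in predictions:
        # A probability of exactly 1.0 goes into the top bin.
        idx = min(int(prob * n_bins), n_bins - 1)
        bins[idx].append((prob, bool(is_true)))

    table = {}
    for idx, items in sorted(bins.items()):
        mean_p = sum(p for p, _ in items) / len(items)
        freq = sum(1 for _, t in items if t) / len(items)
        table[idx] = (mean_p, freq, len(items))
    return table


# Hypothetical labeled sample: ten facts asserted at 0.9, nine of them true.
sample = [(0.9, True)] * 9 + [(0.9, False)]
table = calibration_table(sample)
```

For a well-calibrated system the first two numbers in each row should be close. A large gap in a well-populated bin suggests the probabilities are over- or under-confident in that range.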

This is an example of why I use Twitter for current awareness. My odds for encountering DeepDive on a web search, due primarily to page-ranked search results, are very, very low. From the change log, it looks like DeepDive was announced in March of 2014, which isn’t very long to build up a page-rank.

You do have to separate the wheat from the chaff with Twitter, but DeepDive is an example of what you may find. You won’t find it with search, not for another year or two, perhaps longer.

How does that go? He said he had a problem and was going to use search to find a solution? Now he has two problems?

I first saw this in a tweet by Stian Danenbarger.

PS: Take a long and careful look at DeepDive. Unless I find other means, I am likely to be using DeepDive to extract text and the redactions (character length) from a redacted text.

## Michael Brown – Grand Jury Witness Index – Part 1

December 17th, 2014

I have completed the first half of the grand jury witness index for the Michael Brown case, covering volumes 1 – 12. (The index for volumes 13 – 24 is forthcoming.)

The properties listed with each witness, along with others, will be used to identify that witness using a topic map.

Donate here to support this ongoing effort.

1. Volume 1 Page 25 Line: 7 – Medical legal investigator – His report is Exhibit #1. (in released documents, 2014-5143-narrative-report-01.pdf)

2. Volume 2 Page 20 Line: 6 – Crime Scene Detective with St. Louis County Police
3. Volume 3 Page 7 Line: 7 – Crime Scene Detective with St. Louis County Police – 22 years with St. Louis – 14 years as crime scene detective
4. Volume 3 Page 51 Line: 12 – Forensic Pathologist – St Louis City Medical Examiner’s Office (assistant medical examiner)
5. Volume 4 Page 17 Line: 7 – Dorian Johnson
6. Volume 5 Page 12 Line: 9 – Police Sergeant – Ferguson Police – Since December 2001 (Volume 5 Page 14 – Prepared no written report)
7. Volume 5 Page 75 Line: 11 – Detective St. Louis Police Department Two and 1/2 years
8. Volume 5 Page 140 Line: 11 – Female FBI agent three and one-half years
9. Volume 5 Page 196 Line: 23 – Darren Wilson (Volume 5 Page 197 talked to prosecutor before appearing)
10. Volume 6 Page 149 Line: 18 – Witness #10
11. Volume 6 Page 232 Line: 5 – Witness with marketing firm
12. Volume 7 Page 9 Line: 1 – Canfield Green Apartments (female, no #)
13. Volume 7 Page 153 Line: 9 – coming from a young lady’s house, passenger in white Monte Carlo
14. Volume 8 Page 97 Line: 14 – Canfield Green Apartments, second floor, collecting Social Security, brother and his wife come over
15. Volume 8 Page 173 Line: 9 – Detective St. Louis County Police Department – Since March 2008 (as detective) **primary case officer**
16. Volume 8 Page 196 Line: 2 – Previously testified on Sept. 9th, page 7 Crime Scene Detective with St. Louis County Police – 22 years with St. Louis – 14 years as crime scene detective
17. Volume 9 Page 7 Line: 7 – Sales consultant – Canfield Drive
18. Volume 9 Page 68 Line: 15 – Visitor to Canfield Green Apartment Complex with wife
19. Volume 10 Page 7 Line: 10 – Wife of witness in volume 9? visitor to complex
20. Volume 10 Page 68 Line: 24 – Police officer, St. Louis County Police Department, assigned as a firearm and tool mark examiner in the crime laboratory.
21. Volume 10 Page 128 Line: 8 – Detective, Crime Scene Unit for St. Louis County, 18 years as police officer, 3 years with crime scene – photographed Darren Wilson
22. Volume 11 Page 6 Line: 21 – Canfield Apartment Complex, top floor, Living with girlfriend
23. Volume 11 Page 59 Line: 7 – Girlfriend of witness at volume 11, page 6 – prosecutor has her renounce prior statements
24. Volume 11 Page 80 Line: 7 – Drug chemist – crime lab
25. Volume 11 Page 111 Line: 7 – Latent (fingerprint) examiner for the St. Louis County Police Department.
26. Volume 11 Page 137 Line: 7 – Canfield Green Apartment Complex, fiancee for 3 1/2 to 4 years, south end of building, one floor above them, has children (boys)
27. Volume 11 Page 169 Line: 16 – Doesn’t live at the Canfield Apartments, returning on August 9th to return?, in a van with husband, two daughters and granddaughter
28. Volume 12 Page 11 Line: 7 – Husband of the witness driving the van, volume 11, page 169
29. Volume 12 Page 51 Line: 15 – Special agent with the FBI assigned to the St. Louis field office, almost 24 years
30. Volume 12 Page 102 Line: 18 – Lives in Northwinds Apartments, white ’99 Monte Carlo
31. Volume 12 Page 149 Line: 6 – Contractor, retaining wall and brick patios
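For anyone who wants to work with the index programmatically, the entries above are regular enough to parse. The following is a rough sketch, not the topic map itself: the record type `Appearance`, the helpers `parse_entry` and `shared_properties`, and the idea of comparing witnesses by overlapping property strings are illustrative assumptions. The pattern tolerates either a space or a hyphen after "Volume".

```python
import re
from dataclasses import dataclass

# "<n>. Volume <v> Page <p> Line: <l> – <description>" (en dash separators)
ENTRY = re.compile(
    r"^(\d+)\.\s+Volume[- ](\d+)\s+Page\s+(\d+)\s+Line:\s+(\d+)\s+–\s+(.*)$"
)


@dataclass(frozen=True)
class Appearance:
    number: int        # position in the index
    volume: int
    page: int
    line: int
    properties: tuple  # description split on " – "


def parse_entry(text):
    """Turn one index line into an Appearance record."""
    match = ENTRY.match(text.strip())
    if match is None:
        raise ValueError(f"not an index entry: {text!r}")
    number, volume, page, line, rest = match.groups()
    props = tuple(part.strip() for part in rest.split(" – "))
    return Appearance(int(number), int(volume), int(page), int(line), props)


def shared_properties(a, b):
    """Properties two appearances have in common: a hint they may be one witness."""
    return set(a.properties) & set(b.properties)


first = parse_entry(
    "3. Volume 3 Page 7 Line: 7 – Crime Scene Detective with St. Louis County "
    "Police – 22 years with St. Louis – 14 years as crime scene detective"
)
second = parse_entry(
    "16. Volume 8 Page 196 Line: 2 – Previously testified on Sept. 9th, page 7 "
    "Crime Scene Detective with St. Louis County Police – 22 years with "
    "St. Louis – 14 years as crime scene detective"
)
overlap = shared_properties(first, second)
```

Entries 3 and 16 share their years-of-service properties, which is the kind of overlap that suggests two appearances by the same witness. Exact string matching is crude, and a real topic map would use richer identity rules.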

Caution: This list presents witnesses as they appeared and does not include the playing of prior statements and interviews. Those will be included in a separate index of statements because they play a role in identifying the witnesses who appeared before the grand jury.

The outcome of the Michael Brown grand jury was not the fault of the members of the grand jury. It was a result engineered by departing from usual and customary practices, distorting evidence and misleading the grand jury about applicable law, among other things. All of that is hiding in plain sight in the grand jury transcripts.

### Other Michael Brown Posts

Missing From Michael Brown Grand Jury Transcripts December 7, 2014. (The witness index I propose to replace.)

New recordings, documents released in Michael Brown case [LA Times Asks If There’s More?] Yes! December 9, 2014 (before the latest document dump on December 14, 2014).

Michael Brown Grand Jury – Presenting Evidence Before Knowing the Law December 10, 2014.

How to Indict Darren Wilson (Michael Brown Shooting) December 12, 2014.

More Missing Evidence In Ferguson (Michael Brown) December 15, 2014.

Michael Brown – Grand Jury Witness Index – Part 1 December 17, 2014. (above)

## History & Philosophy of Computational and Genome Biology

December 17th, 2014

History & Philosophy of Computational and Genome Biology by Mark Boguski.

A nice collection of books and articles on computational and genome biology. It concludes with this anecdote:

Despite all of the recent books and biographies that have come out about the Human Genome Project, I think there are still many good stories to be told. One of them is the origin of the idea for whole-genome shotgun and assembly. I recall a GRRC (Genome Research Review Committee) review that took place in late 1996 or early 1997 where Jim Weber proposed a whole-genome shotgun approach. The review panel, at first, wanted to unceremoniously “NeRF” (Not Recommend for Funding) the grant but I convinced them that it deserved to be formally reviewed and scored, based on Jim’s pioneering reputation in the area of genetic polymorphism mapping and its impact on the positional cloning of human disease genes and the origins of whole-genome genotyping. After due deliberation, the GRRC gave the Weber application a non-fundable score (around 350 as I recall) largely on the basis of Weber’s inability to demonstrate that the “shotgun” data could be assembled effectively.

Some time later, I was giving a ride to Jim Weber who was in Bethesda for a meeting. He told me why his grant got a low score and asked me if I knew any computer scientists that could help him address the assembly problem. I suggested he talk with Gene Myers (I knew Gene and his interests well since, as one of the five authors of the BLAST algorithm, he was a not infrequent visitor to NCBI).

The following May, Weber and Myers submitted a “perspective” for publication in Genome Research entitled “Human whole-genome shotgun sequencing“. This article described computer simulations which showed that assembly was possible and was essentially a rebuttal to the negative review and low priority score that came out of the GRRC. The editors of Genome Research (including me at the time) sent the Weber/Myers article to Phil Green (a well-known critic of shotgun sequencing) for review. Phil’s review was extremely detailed and actually longer that the Weber/Myers paper itself! The editors convinced Phil to allow us to publish his critique entitled “Against a whole-genome shotgun” as a point-counterpoint feature alongside the Weber-Myers article in the journal.

The rest, as they say, is history, because only a short time later, Craig Venter (whose office at TIGR had requested FAX copies of both the point and counterpoint as soon as they were published) and Mike Hunkapiller announced their shotgun sequencing and assembly project and formed Celera. They hired Gene Myers to build the computational capabilities and assemble their shotgun data which was first applied to the Drosophila genome as practice for tackling a human genome which, as is now known, was Venter’s own. Three of my graduate students (Peter Kuehl, Jiong Zhang and Oxana Pickeral) and I participated in the Drosophila annotation “jamboree” (organized by Mark Adams of Celera and Gerry Rubin) working specifically on an analysis of the counterparts of human disease genes in the Drosophila genome. Other aspects of the Jamboree are described in a short book by one of the other participants, Michael Ashburner.

The same kinds of stories exist not only from the early days of computer science but from the years since as well: stories that will capture the imaginations of potential CS majors as well as illuminate areas where computer science can or can’t be useful.

How many of those stories have you captured?

I first saw this in a tweet by Neil Saunders.

## U.S. Says Europeans Tortured by Assad’s Death Machine

December 17th, 2014

U.S. Says Europeans Tortured by Assad’s Death Machine by Josh Rogin.

From the post:

The U.S. State Department has concluded that up to 10 European citizens have been tortured and killed while in the custody of the Syrian regime and that evidence of their deaths could be used for war crimes prosecutions against Bashar al-Assad in several European countries.

The new claim, made by the State Department’s ambassador at large for war crimes, Stephen Rapp, in an interview with me, is based on a newly completed FBI analysis of 27,000 photographs smuggled out of Syria by the former military photographer known as “Caesar.” The photos show evidence of the torture and murder of over 11,000 civilians in custody. The FBI spent months pouring over the photos and comparing them to consular databases with images of citizens from countries around the world.

Last month, the FBI gave the State Department its report, which included a group of photos that had been tentatively matched to individuals who were already in U.S. government files. “The group included multiple individuals who were non-Syrian, but none who had a birthplace in the United States, according to our information,” Rapp told me. “There were Europeans within that group.”

The implications could be huge for the international drive to prosecute Assad and other top Syrian officials for war crimes and crimes against humanity. While it’s unlikely that multilateral organizations such as the United Nations or the International Criminal Court will pursue cases against Assad in the near term, due to opposition by Assad’s allies including Russia, legal cases against the regime could be brought in individual countries whose citizens were victims of torture and murder.

Is this a “heads up” from the State Department that lists of war criminals in the CIA Torture Report should be circulated in European countries?

Even if they won’t be actively prosecuted, the threat of arrest might help keep Europe free of known American war criminals. Unfortunately, that would mean they would still be in the United States, but the American public supported them, so that seems fair.

I first saw this in a tweet by the U.S. Dept. of Fear.

## Endless Parentheses

December 17th, 2014

Endless Parentheses

Endless Parentheses is a blog about Emacs. It features concise posts on improving your productivity and making Emacs life easier in general.

Code included is predominantly emacs-lisp, lying anywhere in the complexity spectrum with a blatant disregard for explanations or tutorials. The outcome is that the posts read quickly and pleasantly for experienced Emacsers, while new enthusiasts are invited to digest the code and ask questions in the comments.

### What you can expect:

• Posts are always at least weekly, coming out on every weekend and on the occasional Wednesday.
• Posts are always about Emacs. Within this constraint you can expect anything, from sophisticated functions to brief comments on my keybind preferences.
• Posts are usually short, 5-minute reads, as opposed to 20+-minute investments. Don’t expect huge tutorials.

Emacs: the editor if productivity is your goal.

I first saw this blog mentioned in a tweet by Anna Pawlicka.

## Learn Physics by Programming in Haskell

December 17th, 2014

Abstract:

We describe a method for deepening a student’s understanding of basic physics by asking the student to express physical ideas in a functional programming language. The method is implemented in a second-year course in computational physics at Lebanon Valley College. We argue that the structure of Newtonian mechanics is clarified by its expression in a language (Haskell) that supports higher-order functions, types, and type classes. In electromagnetic theory, the type signatures of functions that calculate electric and magnetic fields clearly express the functional dependency on the charge and current distributions that produce the fields. Many of the ideas in basic physics are well-captured by a type or a function.

A nice combination of two subjects of academic importance!

Anyone working on the use of the NLTK to teach David Copperfield or Great Expectations?

I first saw this in a tweet by José A. Alonso.

## Orleans Goes Open Source

December 17th, 2014

Orleans Goes Open Source

From the post:

Since the release of the Project “Orleans” Public Preview at //build/ 2014 we have received a lot of positive feedback from the community. We took your suggestions and fixed a number of issues that you reported in the Refresh release in September.

Now we decided to take the next logical step, and do the thing many of you have been asking for – to open-source “Orleans”. The preparation work has already commenced, and we expect to be ready in early 2015. The code will be released by Microsoft Research under an MIT license and published on GitHub. We hope this will enable direct contribution by the community to the project. We thought we would share the decision to open-source “Orleans” ahead of the actual availability of the code, so that you can plan accordingly.

The real excitement for me comes from a post just below this announcement, A Framework for Cloud Computing:

To avoid these complexities, we built the Orleans programming model and runtime, which raises the level of the actor abstraction. Orleans targets developers who are not distributed system experts, although our expert customers have found it attractive too. It is actor-based, but differs from existing actor-based platforms by treating actors as virtual entities, not as physical ones. First, an Orleans actor always exists, virtually. It cannot be explicitly created or destroyed. Its existence transcends the lifetime of any of its in-memory instantiations, and thus transcends the lifetime of any particular server. Second, Orleans actors are automatically instantiated: if there is no in-memory instance of an actor, a message sent to the actor causes a new instance to be created on an available server. An unused actor instance is automatically reclaimed as part of runtime resource management. An actor never fails: if a server S crashes, the next message sent to an actor A that was running on S causes Orleans to automatically re-instantiate A on another server, eliminating the need for applications to supervise and explicitly re-create failed actors. Third, the location of the actor instance is transparent to the application code, which greatly simplifies programming. And fourth, Orleans can automatically create multiple instances of the same stateless actor, seamlessly scaling out hot actors.

Overall, Orleans gives developers a virtual “actor space” that, analogous to virtual memory, allows them to invoke any actor in the system, whether or not it is present in memory. Virtualization relies on indirection that maps from virtual actors to their physical instantiations that are currently running. This level of indirection provides the runtime with the opportunity to solve many hard distributed systems problems that must otherwise be addressed by the developer, such as actor placement and load balancing, deactivation of unused actors, and actor recovery after server failures, which are notoriously difficult for them to get right. Thus, the virtual actor approach significantly simplifies the programming model while allowing the runtime to balance load and recover from failures transparently. (emphasis added)
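The "virtual actor" idea in the passage above can be sketched in a few lines: callers address an actor by identity only, and the runtime creates an in-memory instance on demand. This is a single-process toy in Python, not the Orleans API (Orleans is a .NET framework). The names `VirtualActorRuntime`, `send`, `deactivate` and `Greeter` are hypothetical, and placement across servers and persistent state are left out entirely.

```python
class VirtualActorRuntime:
    """Toy stand-in for a runtime that activates actors on first use."""

    def __init__(self, actor_factory):
        self._factory = actor_factory  # builds an in-memory instance from an id
        self._active = {}              # actor id -> live instance
        self.activations = 0           # how many instances have been created

    def send(self, actor_id, message):
        """Deliver a message; instantiate the actor if no instance exists."""
        actor = self._active.get(actor_id)
        if actor is None:
            actor = self._factory(actor_id)
            self._active[actor_id] = actor
            self.activations += 1
        return actor.receive(message)

    def deactivate(self, actor_id):
        """Drop the instance (idle reclamation, or a simulated server crash)."""
        self._active.pop(actor_id, None)

    def is_active(self, actor_id):
        return actor_id in self._active


class Greeter:
    def __init__(self, actor_id):
        self.actor_id = actor_id
        self.seen = 0  # in-memory only; lost on deactivation in this sketch

    def receive(self, message):
        self.seen += 1
        return f"{self.actor_id} got {message!r} (message {self.seen} for this instance)"


runtime = VirtualActorRuntime(Greeter)
runtime.send("player-42", "hello")   # first message activates the actor
runtime.deactivate("player-42")      # instance reclaimed or lost
reply = runtime.send("player-42", "still there?")  # transparently re-activated
```

The caller never creates, looks up or restarts `player-42`. It only sends messages, which is the indirection the quoted passage compares to virtual memory.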

Not in a distributed computing context, but the “look and it’s there” model is something I recall from HyTime. So nice to see good ideas resurface!

Just imagine doing that with topic maps, including having properties of a topic, should you choose to look for them. If you don’t need a topic, why carry the overhead around? Wait for someone to ask for it.

This week alone, Microsoft continued its fight for users and announced an open source project that will make me at least read about .Net. ;-) I think Microsoft merits a lot of kudos and good wishes for the holiday season!

I first saw this at: Microsoft open sources cloud framework that powers Halo by Jonathan Vanian.

## The Closed United States Government

December 17th, 2014

U.S. providing little information to judge progress against Islamic State by Nancy A. Youssef.

From the post:

The American war against the Islamic State has become the most opaque conflict the United States has undertaken in more than two decades, a fight that’s so underreported that U.S. officials and their critics can make claims about progress, or lack thereof, with no definitive data available to refute or bolster their positions.

The result is that it’s unclear what impact more than 1,000 airstrikes on Iraq and Syria have had during the past four months. That confusion was on display at a House Foreign Affairs Committee hearing earlier this week, where the topic – “Countering ISIS: Are We Making Progress?” – proved to be a question without an answer.

“Although the administration notes that 60-plus countries having joined the anti-ISIS campaign, some key partners continue to perceive the administration’s strategy as misguided,” Rep. Ed Royce, R-Calif., the committee’s chairman, said in his opening statement at the hearing, using a common acronym for the Islamic State. “Meanwhile, there are grave security consequences to allowing ISIS to control a territory of the size of western Iraq and eastern Syria.”

Nancy does a great job teasing out reasons for the opaqueness of the war against ISIS, which include:

1. Disclosure of the lack of coordination between any policy goal and military action
2. Disclosure of odd alliances with countries and “groups” (terrorist groups?)
3. Disclosure of timing and location of attacks might be used to detect trends

The first two are classic reasons for openness. If the public knew what was happening in the war with ISIS, it might well have Congress defund the war as being incompetently led. Take it up some other time with better leadership.

But the public can’t make that call so long as the government remains a closed (not open) government and the press remains too timid to seek facts out for itself.

I don’t credit #3 at all because ISIS should know with a fair degree of accuracy where bombing raids are occurring and when. Unless the military is bombing sand to throw off their trend analysis.

Lack of openness from the government, about wars, about torture, about its alliances, will lead to future generations asking Americans: “How could you have supported a government like that?” Are you really going to say that you didn’t know?

## Leveraging UIMA in Spark

December 17th, 2014

Leveraging UIMA in Spark by Philip Ogren.

Description:

Much of the Big Data that Spark welders tackle is unstructured text that requires text processing techniques. For example, performing named entity extraction on tweets or sentiment analysis on customer reviews are common activities. The Unstructured Information Management Architecture (UIMA) framework is an Apache project that provides APIs and infrastructure for building complex and robust text analytics systems. A typical system built on UIMA defines a collection of analysis engines (such as e.g. a tokenizer, part-of-speech tagger, named entity recognizer, etc.) which are executed according to arbitrarily complex flow control definitions. The framework makes it possible to have interoperable components in which best-of-breed solutions can be mixed and matched and chained together to create sophisticated text processing pipelines. However, UIMA can seem like a heavy weight solution that has a sprawling API, is cumbersome to configure, and is difficult to execute. Furthermore, UIMA provides its own distributed computing infrastructure and run time processing engines that overlap, in their own way, with Spark functionality. In order for Spark to benefit from UIMA, the latter must be light-weight and nimble and not impose its architecture and tooling onto Spark.

In this talk, I will introduce a project that I started called uimaFIT which is now part of the UIMA project (http://uima.apache.org/uimafit.html). With uimaFIT it is possible to adopt UIMA in a very light-weight way and leverage it for what it does best: text processing. An entire UIMA pipeline can be encapsulated inside a single function call that takes, for example, a string input parameter and returns named entities found in the input string. This allows one to call a Spark RDD transform (e.g. map) that performs named entity recognition (or whatever text processing tasks your UIMA components accomplish) on string values in your RDD. This approach requires little UIMA tooling or configuration and effectively reduces UIMA to a text processing library that can be called rather than requiring full-scale adoption of another platform. I will prepare a companion resource for this talk that will provide a complete, self-contained, working example of how to leverage UIMA using uimaFIT from within Spark.
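The shape of the approach in the description, a whole text pipeline hidden behind one function that is then mapped over a collection of strings, can be shown without either framework. The sketch below imports neither UIMA nor Spark. `tokenize`, `tag_capitalized_runs` and `extract_entities` are hypothetical stand-ins for real analysis engines, and the built-in `map` stands in for the RDD `map` transform the description mentions.

```python
import re


def tokenize(text):
    """Stand-in tokenizer: alphabetic words, allowing inner apostrophes and hyphens."""
    return re.findall(r"[A-Za-z][A-Za-z'-]*", text)


def tag_capitalized_runs(tokens):
    """Toy 'named entity recognizer': consecutive capitalized tokens form one entity."""
    entities, run = [], []
    for token in tokens:
        if token[0].isupper():
            run.append(token)
        else:
            if run:
                entities.append(" ".join(run))
            run = []
    if run:
        entities.append(" ".join(run))
    return entities


def extract_entities(text):
    """The whole pipeline behind a single call: string in, entity strings out."""
    return tag_capitalized_runs(tokenize(text))


docs = [
    "we met Philip Ogren at the meetup",
    "the talk covered Spark and uimaFIT",
]
# With Spark, this step would be a map over an RDD of strings instead of a list.
results = list(map(extract_entities, docs))
```

The point of the design is that the caller sees only `extract_entities`. Swapping the toy components for real analysis engines would not change the mapping step.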

The necessity of creating light-weight ways to bridge the gaps between applications and frameworks is a signal that every solution is trying to be the complete solution. Since we have different views of what any “complete” solution would look like, wheels are re-invented time and time again, along with all the parts necessary to use those wheels. The result is a tremendous duplication of effort.

A component-based approach attempts to do one thing. Doing any one thing well is challenging enough. (Self-test: How many applications do more than one thing well? Assuming they do one thing well. BTW, for programmers, the test isn’t that other programs fail to do it any better.)

Until more demand results in components that are easy to pipeline, Philip’s uimaFIT is a great way to incorporate text processing from UIMA into Spark.

Enjoy!

## Sony Breach Result of Self Abuse

December 17th, 2014

In Sony Pictures Demands That News Agencies Delete ‘Stolen’ Data I wrote in part:

The bitching and catching by Sony are sure signs that something went terribly wrong internally. The current circus is an attempt to distract the public from that failure. Probably a member of management with highly inappropriate security clearance because “…they are important!”

Inappropriate security clearances for management to networks is a sign of poor systems administration. I wonder when that shoe is going to drop? (emphasis added)

The other shoe dropping did not take long! Later that same day, Sony employees filed a suit largely to the same effect: Sony employees file lawsuit, blame company over hacked data by Jeff John Roberts.

Jeff writes in part:

They accuse Sony of negligence for failing to secure its network, and not taking adequate steps to protect employees once the company knew the information was compromised.

The complaint also cites various security and news reports to say that Sony lost the cryptographic “keys to the kingdom,” which allowed the hackers to root around in its system undetected for as long as a year.

That is the other reason for the obsession with secrecy in the computer security business. The management that signs the checks for security contractors is the same management that is responsible for the security breaches.

Honest security reporting (which does happen) bites the hand that feeds it.

Just so you know, before I signed off for the day, the following appeared in the New York Times: U.S. Links North Korea to Sony Hacking by David E. Sanger and Nicole Perlroth.

There is one tiny problem with the story:

It is not clear how the United States came to its determination that the North Korean regime played a central role in the Sony attacks.

Buried about half-way down in the story.

Sanger and Perlroth report no independent confirmation that what was told to them by unnamed sources is true. Unnamed sources from an administration that has repeatedly demonstrated its willingness to lie, cheat, even murder, in the pursuit of some secret agenda.

Broadcasting re-edited broadsides from a group of known liars without independent verification of the claims is a disservice to the reading public. With the U.S. government, I would require two independent sources of confirmation before reporting their claims at all and then with a caution about the government’s reliability.

Update: In Sony hack: White House views attack as security issue, the BBC reports the White House refuses to confirm if North Korea is responsible for the attack on Sony. Private FUD and public denial?

At least the BBC offers these options under Four possible suspects in the Sony hack:

• A nation state, most likely North Korea
• Supporters of North Korean regime, based in China
• Hackers with a money-making motive
• Hackers or a lone individual with another motive, such as revenge

Whatever the “factual” outcome, the North Korean 9/11 on Sony has already passed into folklore for computer security discussions, at least at the policy level. What failing policies will result, like those following 9/11, such as useless operations in Afghanistan and Iraq, remains to be seen.

Update:

Jody Westby’s Instead Of A Real Response, Perennially Hacked Sony Is Acting Like A Spoiled Teenager is as instructive for potential hacking victims as it is amusing. A joyful read for the holidays and counter to the gloom and doom folks selling less than stellar cybersecurity services.

## Tracking Government/Terrorist Financing

December 17th, 2014

From the post:

Terrorism impacts our lives each and every day; whether directly through acts of violence by terrorists, reduced liberties from new anti-terrorism laws, or increased taxes to support counter terrorism activities. A vital component of terrorism is the means through which these activities are financed, through legal and illicit financial activities. Recognizing the necessity to limit these financial activities in order to reduce terrorism, many nation states have agreed to a framework of global regulations, some of which have been realized through regulatory programs such as the Bank Secrecy Act (BSA).

As part of the BSA (and other similar regulations), governed financial services institutions are required to determine if the financial transactions of a person or entity are related to financing terrorism. This is a specific report requirement found in Response 30, of Section 2, in the FinCEN Suspicious Activity Report (SAR). For every financial transaction moving through a given banking system, the institution needs to determine if it is suspicious and, if so, is it part of a larger terrorist activity. In the event that it is, the financial services institution is required to immediately file a SAR and call FinCEN.

The process of determining if a financial transaction is terrorism related is not merely a compliance issue, but a national security imperative. No solution exists today that adequately addresses this requirement. As such, I was asked to speak on the issue as a data scientist practicing in the private intelligence community. These are some of the relevant points from that discussion.

Jerry has a great outline of the capabilities you will need for tracking government/terrorist financing. Depending upon your client’s interest, you may be required to monitor data flows in order to trigger the filing of a SAR and calling FinCEN or to avoid triggering the filing of a SAR and calling FinCEN. For either goal the tools and techniques are largely the same.

Or for monitoring government funding for torture or groups to carry out atrocities on its behalf. Same data mining techniques apply.

Have you ever noticed that government data leaks rarely involve financial records? Thinking of the consequences of the accounts payable ledger that listed all the organizations and people paid by the Bush administration, sans all the SS and retirement recipients.

That would be near the top of my most wanted data leaks list.

You?

## Apache Spark I & II [Pacific Northwest Scala 2014]

December 16th, 2014

Description:

This session introduces you to Spark by starting with something basic: Scala collections and functional data transforms. We then look at how Spark expands the functional collection concept to enable massively distributed, fast computations. The second half of the talk is for those of you who want to know the secrets to make Spark really fly for querying tabular datasets. We will dive into row vs columnar datastores and the facilities that Spark has for enabling interactive data analysis, including Spark SQL and the in-memory columnar cache. Learn why Scala’s functional collections are the best foundation for working with data!

Description:

In this talk we will step into Spark over Cassandra with Spark Streaming and Kafka. Then put it in the context of an event-driven Akka application for real-time delivery of meaning at high velocity. We will do this by showing how to easily integrate Apache Spark and Spark Streaming with Apache Cassandra and Apache Kafka using the Spark Cassandra Connector. All within a common use case: working with time-series data, which Cassandra excels at for data locality and speed.

Back to back excellent presentations on Spark!

I need to replace my second monitor (died last week) so I can run the video at full screen with a REPL open!

Enjoy!

## Cartography with complex survey data

December 16th, 2014

Cartography with complex survey data by David Smith.

From the post:

Visualizing complex survey data is something of an art. If the data has been collected and aggregated to geographic units (say, counties or states), a choropleth is one option. But if the data aren't so neatly arranged, making visual sense often requires some form of smoothing to represent it on a map.

R, of course, has a number of features and packages to help you, not least the survey package and the various mapping tools. Swmap (short for "survey-weighted maps") is a collection of R scripts that visualize some public data sets, for example this cartogram of transportation share of household spending based on data from the 2012-2013 Consumer Expenditure Survey.

In addition to finding data, there is also the problem of finding tools to process found data.

Ideally, when I follow a link to a resource, that link would also be submitted to a repository of other things associated with the data set I am requesting, such as the current locations of its authors, tools for processing the data, articles written using the data, etc.

That’s a long way off but at least today you can record having found one more cache of tools for data processing.

## Type systems and logic

December 16th, 2014

Type systems and logic by Alyssa Carter (From Code Word – Hacker School)

From the post:

An important result in computer science and type theory is that a type system corresponds to a particular logic system.

How does this work? The basic idea is that of the Curry-Howard Correspondence. A type is interpreted as a proposition, and a value is interpreted as a proof of the proposition corresponding to its type. Most standard logical connectives can be derived from this idea: for example, the values of the pair type (A, B) are pairs of values of types A and B, meaning they’re pairs of proofs of A and B, which means that (A, B) represents the logical conjunction “A && B”. Similarly, logical disjunction (“A || B”) corresponds to what’s called a “tagged union” type: a value (proof) of Either A B is either a value (proof) of A or a value (proof) of B.

This might be a lot to take in, so let’s take a few moments for concrete perspective.

Types like Int and String are propositions – you can think of simple types like these as just stating that “an Int exists” or “a String exists”. 1 is a proof of Int, and "hands" is a proof of String. (Int, String) is a simple tuple type, stating that “there exists an Int and there exists a String”. (1, "hands") is a proof of (Int, String). Finally, the Either type is a bit more mysterious if you aren’t familiar with Haskell, but the type Either a b can contain values of type a tagged as the “left” side of an Either or values of type b tagged as the “right” side of an Either. So Either Int String means “either there exists an Int or there exists a String”, and it can be proved by either Left 1 or Right "hands". The tags ensure that you don’t lose any information if the two types are the same: Either Int Int can be proved by Left 1 or Right 1, which can be distinguished from each other by their tags.
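The quoted examples use Haskell notation. As a rough sketch, the same pair and tagged-union types can be written in Python with type hints. The `Left`, `Right` and `Either` names below are hypothetical helpers defined here, not part of the standard library, and Python does not check the annotations at runtime, so a separate static type checker would be needed to enforce the "proof" reading.

```python
from dataclasses import dataclass
from typing import Generic, Tuple, TypeVar, Union

A = TypeVar("A")
B = TypeVar("B")


@dataclass(frozen=True)
class Left(Generic[A]):
    # A value tagged as the "left" side of a disjunction.
    value: A


@dataclass(frozen=True)
class Right(Generic[B]):
    # A value tagged as the "right" side of a disjunction.
    value: B


# Hypothetical alias mirroring Haskell's `Either a b` (logical "or").
Either = Union[Left[A], Right[B]]

# Conjunction: a pair carries a proof of each proposition.
both: Tuple[int, str] = (1, "hands")

# Disjunction: either tagged value is a proof of "Int or String".
from_int: Either[int, str] = Left(1)
from_str: Either[int, str] = Right("hands")


def describe(proof: Either[int, int]) -> str:
    # The tag keeps the two cases apart even when both sides are int.
    if isinstance(proof, Left):
        return f"left proof: {proof.value}"
    return f"right proof: {proof.value}"


print(describe(Left(1)))
print(describe(Right(1)))
```

`Left(1)` and `Right(1)` compare unequal, which is the "you don’t lose any information" point made at the end of the quoted passage.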

It has gems like:

truth is useless for computation and proofs are not

I would have far fewer objections to some logic/ontology discussions if they limited their claims to computation.

People are free to accept or reject any result of computation. Depends on their comparison of the result to their perception of the world.

Case in point, the five year old who could not board a plane because they shared a name with someone on the no-fly list.

One person, a dull TSA agent, could not see beyond the result of a calculation on the screen.

Everyone else could see a five year old who, while cranky, wasn’t on the no-fly list.

I first saw this in a tweet by Rahul Goma Phulore.

## Slooh

December 16th, 2014

Slooh: I want to be an ~~astronaut~~ astronomer.

From the webpage:

Robotic control of Slooh’s three telescopes in the northern (Canary Islands) and southern hemispheres (Chile)

Schedule time and point the telescopes at any object in the night sky. You can make up to five reservations at a time in five or ten minute increments depending on the observatory. There are no limitations on the total number of reservations you can book in any quarter.

Capture, collect, and share images, including PNG and FITS files. You can view and take images from any of the 250+ “missions” per night, including those scheduled by other members.

Watch hundreds of hours of live and recorded space shows with expert narration featuring 10+ years of magical moments in the night sky including eclipses, transits, solar flares, NEA, comets, and more.

See and discuss highlights from the telescopes, featuring member research, discoveries, animations, and more.

Join groups with experts and fellow citizen astronomers to learn and discuss within areas of interest, from astrophotography and tracking asteroids to exoplanets and life in the Universe.

Access Slooh activities with step by step how-to instructions to master the art and science of astronomy.

A reminder that for all the grim data that is available for analysis/mining, there is an equal share of interesting and/or beautiful data as well.

There is a special on right now: for $1.00 you can obtain four (4) weeks of membership. The fine print says every yearly quarter of membership is $74.85. A quarter is three months, so $74.85 / 3 = $24.95 per month, or $74.85 × 4 = $299.40 per year. Less than cable and/or cellphone service.

It also has the advantage of not making you dumber. Surprised they didn’t mention that.

I first saw this in a tweet by Michael Peter Edson.

## UX Newsletter

December 16th, 2014

Our New Ebook: The UX Reader

From the post:

This week, MailChimp published its first ebook, The UX Reader. I could just tell you that it features revised and updated pieces from our UX Newsletter, that you can download it here for $5, and that all proceeds go to RailsBridge. But instead, I’m hearing the voice of Mrs. McLogan, my high school physics teacher:

“Look, I know you’ve figured out the answer, but I want you to show your work.”

Just typing those words makes me sweat—I still get nervous when I’m asked to show how to solve a problem, even if I’m confident in the solution. But I always learn new things and get valuable feedback whenever I do.

So today I want to show you the work of putting together The UX Reader and talk more about the problem it helped us solve.

After you read this post, you too will be a subscriber to the UX Newsletter. Not to mention having a copy of the updated book, The UX Reader.

Worth the time to read and put into practice what it reports.

Or as I told an old friend earlier today:

The greatest technology/paradigm without use is only interesting, not compelling or game changing.

## Melville House to Publish CIA Torture Report:… [Publishing Gone Awry?]

December 16th, 2014

From the post:

In what must be considered a watershed moment in contemporary publishing, Brooklyn-based independent publisher Melville House will release the Senate Intelligence Committee’s executive summary of a government report — “Study of the Central Intelligence Agency’s Detention and Interrogation Program” — that is said to detail the monstrous torture methods employed by the Central Intelligence Agency in its counter-terrorism efforts.

Melville House’s co-publisher and co-founder Dennis Johnson has called the report “probably the most important government document of our generation, even one of the most significant in the history of our democracy.”

Melville House’s press release confirms that they are releasing both print and digital editions on December 30, 2014.

As of December 30, 2014, I can read and mark my copy, print or digital and you can mark your copy, print or digital, but no collaboration on the torture report.

For the “…most significant [document] in the history of our democracy” that seems rather sad. Each of us is going to be limited to whatever we know or can find out when we are reading our copies of the same report.

If there was ever a report (and there have been others) that merited a collaborative reading/annotation, the CIA Torture Report would be one of them.

Given the large number of people who worked on this report and the diverse knowledge required to evaluate it, that sounds like a bad publishing choice. Or at least that there are better publishing choices available.

What about casting the entire report into the form of wiki pages, broken down by paragraphs? Once proofed, the original text can be locked and comments only allowed on the text. Free to view but $fee to comment.

What do you think? Viable way to present such a text? Other ways to host the text?

PS: Unlike other significant government reports, major publishing houses did not receive incentives to print the report. Jerry attributes that to Dianne Feinstein not wanting to favor any particular publisher. That’s one explanation. Another would be that if published in hard copy at all, a small press will mean it fades more quickly from public view. Your call.

## Graph data from MySQL database in Python

December 16th, 2014

Graph data from MySQL database in Python

From the webpage:

All Python code for this tutorial is available online in this IPython notebook.

Thinking of using Plotly at your company? See Plotly’s on-premise, Plotly Enterprise options.

Note on operating systems: While this tutorial can be followed by Windows or Mac users, it assumes a Ubuntu operating system (Ubuntu Desktop or Ubuntu Server). If you don’t have a Ubuntu server, it’s possible to set up a cloud one with Amazon Web Services (follow the first half of this tutorial). If you’re using a Mac, we recommend purchasing and downloading VMware Fusion, then installing Ubuntu Desktop through that. You can also purchase an inexpensive laptop or physical server from Zareason, with Ubuntu Desktop or Ubuntu Server preinstalled.

Reading data from a MySQL database and graphing it in Python is straightforward, and all the tools that you need are free and online. This post shows you how. If you have questions or get stuck, email feedback@plot.ly, write in the comments below, or tweet to @plotlygraphs.

Just in case you want to start on adding a job skill over the holidays!

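For a feel of the read-then-plot flow the tutorial describes, here is a minimal sketch under loud assumptions. It is not the tutorial’s code: the standard library’s `sqlite3` stands in for MySQL so that it runs without a server, a plain-text bar chart stands in for Plotly, and the `country` table, its columns and its rows are all made up for illustration.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for a MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country (name TEXT, population INTEGER)")  # hypothetical table
conn.executemany(
    "INSERT INTO country VALUES (?, ?)",
    [("Aland", 3), ("Borduria", 7), ("Cagliostro", 5)],  # made-up rows
)

# Step 1: read the rows with an ordinary SQL query.
rows = conn.execute("SELECT name, population FROM country ORDER BY name").fetchall()
conn.close()

# Step 2: split the rows into the x and y series a plotting library expects.
names = [name for name, _ in rows]
values = [value for _, value in rows]


def text_bar_chart(labels, heights):
    # Stand-in for a real chart: one row of '#' marks per label.
    width = max(len(label) for label in labels)
    return "\n".join(
        f"{label.ljust(width)} | {'#' * height}"
        for label, height in zip(labels, heights)
    )


chart = text_bar_chart(names, values)
print(chart)
```

With a real MySQL server only the connection step should need to change; the query, the split into series and the hand-off to a plotting call keep the same shape.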
Whenever I see “graph” used in this sense, I wish it were some appropriate form of “visualize.” Unfortunately, “graphing” of data stuck too long ago to expect anyone to change now. To be fair, it is marking nodes on an edge, except that we treat all the space on one side or the other of the edge as significant.

Perhaps someone has treated the “curve” of a graph as a hyperedge? Connecting multiple nodes? I don’t know. You?

Whether they have or haven’t, I will continue to think of this type of “graphing” as visualization. Very useful but not the same thing as graphs with nodes/edges, etc.

## Warning: Verizon Scam – Secure Cypher

December 16th, 2014

Scams during the holiday season are nothing new but this latest scam has a “…man bites dog” quality to it. The scam in this case is being run by the vendor offering the service: Verizon.

Karl Bode writes in: Verizon Offers Encrypted Calling With NSA Backdoor At No Additional Charge:

Verizon’s marketing materials for the service feature young, hip, privacy-conscious users enjoying the “industry’s most secure voice communication” platform:

Verizon says it’s initially pitching the $45 per phone service to government agencies and corporations, but would ultimately love to offer it to consumers as a line item on your bill. Of course by “end-to-end encryption,” Verizon means that the new $45 per phone service includes an embedded NSA backdoor free of charge. Apparently, in Verizon-land, “end-to-end encryption” means something entirely different than it does in the real world:

“Cellcrypt and Verizon both say that law enforcement agencies will be able to access communications that take place over Voice Cypher, so long as they’re able to prove that there’s a legitimate law enforcement reason for doing so. Seth Polansky, Cellcrypt’s vice president for North America, disputes the idea that building technology to allow wiretapping is a security risk. “It’s only creating a weakness for government agencies,” he says. “Just because a government access option exists, it doesn’t mean other companies can access it.”

What do you think? Is the added * Includes Free NSA Backdoor sufficient notice to consumers?

I am more than willing to donate my rights to this image to Verizon for advertising purposes. Perhaps you should forward a copy to them and your friends on Verizon.

## LT-Accelerate

December 16th, 2014

LT-Accelerate: LT-Accelerate is a conference designed to help businesses, researchers and public administrations discover business value via Language Technology.

LT-Accelerate is a joint production of LT-Innovate, the European Association of the Language Technology Industry, and Alta Plana Corporation, a Washington DC based strategy consultancy headed by analyst Seth Grimes.

The conference was held December 4-5, 2014 in Brussels, and the website offers seven (7) interviews with key speakers and slides from thirty-eight speakers.

Not as in depth as papers nor as useful as videos of the presentations but still capable of sparking new ideas as you review the slides.

For example, the slides from Multi-Dimensional Sentiment Analysis by Stephen Pulman made me wonder what sentiment detection design would be appropriate for the Michael Brown grand jury transcripts.

Sentiment detection has been successfully used with tweets (140 character limit) and I am reliably informed that most of the text strings in the Michael Brown grand jury transcript are far longer than one hundred and forty (140) characters.

Any sentiment detectives in the audience?

## US Congress OKs ‘unprecedented’ codification of warrantless surveillance

December 16th, 2014

From the post:

Congress last week quietly passed a bill to reauthorize funding for intelligence agencies, over objections that it gives the government “virtually unlimited access to the communications of every American”, without warrant, and allows for indefinite storage of some intercepted material, including anything that’s “enciphered”.

That’s how it was summed up by Rep. Justin Amash, a Republican from Michigan, who pitched and lost a last-minute battle to kill the bill.

The bill is titled the Intelligence Authorization Act for Fiscal Year 2015.

Amash said that the bill was “rushed to the floor” of the house for a vote, following the Senate having passed a version with a new section – Section 309 – that the House had never considered.

Lisa reports that the bill codifies Executive Order 12333, a Ronald Reagan remnant from an earlier attempt to dismantle the United States Constitution.

There is a petition underway to ask President Obama to veto the bill. Are you a large bank? Skip the petition and give the President a call.

From Lisa’s report, it sounds like Congress needs a DEW Line for legislation:

Rep. Zoe Lofgren, a California Democrat who voted against the bill, told the National Journal that the Senate’s unanimous passage of the bill was sneaky and ensured that the House would rubberstamp it without looking too closely:

If this hadn’t been snuck in, I doubt it would have passed. A lot of members were not even aware that this new provision had been inserted last-minute. Had we been given an additional day, we may have stopped it.

How do you “sneak in” legislation in a public body?

Suggestions on an early warning system for changes to legislation between the two houses of Congress?
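One low-tech starting point for such an early warning system is a line-level comparison of the two chambers’ texts that reports anything present in one version but not the other. The sketch below uses Python’s standard `difflib`. The function name `added_lines` is hypothetical, the sample section text is placeholder wording rather than actual bill language, and it assumes both versions are available as plain text with one provision per line.

```python
import difflib
from typing import List


def added_lines(house_text: str, senate_text: str) -> List[str]:
    # Return lines that appear in the Senate text but not in the House text.
    house = house_text.splitlines()
    senate = senate_text.splitlines()
    matcher = difflib.SequenceMatcher(a=house, b=senate, autojunk=False)
    added: List[str] = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" both introduce text the House has not seen.
        if tag in ("insert", "replace"):
            added.extend(senate[j1:j2])
    return added


# Placeholder text only; not the wording of any real bill.
house_version = "Section 307. (placeholder)\nSection 308. (placeholder)"
senate_version = (
    "Section 307. (placeholder)\n"
    "Section 308. (placeholder)\n"
    "Section 309. (placeholder for a newly inserted section)"
)

for line in added_lines(house_version, senate_version):
    print("NEW IN SENATE VERSION:", line)
```

Run against each new engrossed text, a report like this would at least flag that a section had appeared, leaving the question of what it means to human readers.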

## More Missing Evidence In Ferguson (Michael Brown)

December 15th, 2014

Saturday’s data dump from St. Louis County Prosecutor Robert McCulloch is still short at least two critical pieces of evidence. There is no copy of the “documents that we gave you to help in your deliberation.” And, there is no copy of the police map to “…guide the grand jury.”

### I. The “documents that we gave you to help in your deliberation”

The prosecutors gave the grand jury written documents that supplemented their various oral misstatements of the law in this case.

From Volume 24 - November 21, 2014 - Page  138:
...

2 You have all the information you need in

3 those documents that we gave you to help in your

4 deliberation.
...


That follows a verbal misstatement of the law by Ms. Whirley:

Volume 24 - November 21, 2014 - Page  137

...

13 	    MS. WHIRLEY: Is that in order to vote

14 true bill, you also must consider whether you

15 believe Darren Wilson, you find probable cause,

16 that's the standard to believe that Darren Wilson

17 committed the offense and the offenses are what is

18 in the indictment and you must find probable cause

19 to believe that Darren Wilson did not act in lawful

20 self-defense, and you've got the last sheet talks

22 force, because then you must also have probable

23 cause to believe that Darren Wilson did not use

24 lawful force in making an arrest. So you are

25 considering self-defense and use of force in making

Volume 24 - November 21, 2014 - Page  138

Grand Jury — Ferguson Police Shooting Grand Jury 11/21/2014

1 an arrest.
...


Where are the “documents that we gave you to help in your deliberation?”

Have you seen those documents? I haven’t.

And consider this additional misstatement of the law:

Volume 24 - November 21, 2014 - Page  139

...
8 And the one thing that Sheila has

9 explained as far as what you must find and as she

10 said, it is kind of in Missouri it is kind of, the

11 State has to prove in a criminal trial, the State

12 has to prove that the person did not act in lawful

13 self-defense or did not use lawful force in making,

14 it is kind of like we have to prove the negative.

15 So in this case because we are talking

16 about probable cause, as we've discussed, you must

17 find probable cause to believe that he committed the

18 offense that you're considering and you must find

19 probable cause to believe that he did not act in

20 lawful self-defense. Not that he did, but that he

21 did not and that you find probable cause to believe

22 that he did not use lawful force in making the

23 arrest.
...


Just for emphasis:

the State has to prove that the person did not act in lawful self-defense or did not use lawful force in making, it is kind of like we have to prove the negative.

How hard is it to prove a negative? James Randi, in James Randi Lecture @ Caltech – Cant Prove a Negative, points out that proving a negative is a logical impossibility.

The grand jury was given a logically impossible task in order to indict Darren Wilson.

What choice did the grand jury have but to return a “no true bill?”

### II. More Misguidance: The police map, Grand Jury 101

A police map was created to guide the jury in its deliberations, a map that reflected the police view of the location of witnesses.

Volume 24 - November 21, 2014 - Page  26

Grand Jury — Ferguson Police Shooting Grand Jury 11/21/2014

...

10	 Q (By Ms. Alizadeh) Extra, okay, that's

11 right. And you indicated that you, along with other

12 investigators prepared this, which is your

13 interpretation based upon the statements made of

14 witnesses as to where various eyewitnesses were

15 during, when I say shooting, obviously, there was a

16 time period that goes along, the beginning of the

17 time of the beginning of the incident until after

18 the shooting had been done. And do you still feel

19 that this map accurately reflects where witnesses

20 said they were?

21 A I do.

22	 Q And just for your instruction, this just,

24 and if you disagree with anything that's on the map,

25 these little sticky things come right off. So

Volume 24 - November 21, 2014 - Page  27

Grand Jury — Ferguson Police Shooting Grand Jury 11/21/2014

1 supposedly they come right off.

2 A They do.

3	 Q If you feel that this witness is not in

4 the right place, you can move any of these stickers

5 that you want and put them in the places where you

6 think they belong.

7 This is just something that is

8 representative of what this witness believes where

9 people were. If you all do with this what you will.

10 Also there was a legend that was

11 provided for all of you regarding the numbers

12 because the numbers that were assigned witnesses are

13 not the same numbers as the witnesses testimony in

14 this grand jury.

...


Two critical statements:



11... And you indicated that you, along with other

12 investigators prepared this, which is your

13 interpretation based upon the statements made of

14 witnesses as to where various eyewitnesses were

15 during, when I say shooting,



So the map represents the detective’s opinion about other witnesses, and:


3	 Q If you feel that this witness is not in

4 the right place, you can move any of these stickers

5 that you want and put them in the places where you

6 think they belong.



The witness gave the grand jury a map to guide its deliberations, but we will never know what map that was, because the stickers can be moved.

Pretty neat trick, giving the grand jury guidance that can never be disclosed to others.

### Summary:

You have seen the quote from the latest data dump from the prosecutor’s office:

McCulloch apologized in a written statement for any confusion that may have occurred by failing to initially release all of the interview transcripts. He said he believes he has now released all of the grand jury evidence, except for photos of Brown’s body and anything that could lead to witnesses being identified.

The written instructions to the grand jury and the now unknowable map (Grand Jury 101) aren’t pictures of Brown’s body or anything that could identify a witness. Where are they?

Please make a donation to support further research on the grand jury proceedings concerning Michael Brown. Future work will include:

• A witness index to the grand jury transcripts
• An exhibit index to the grand jury transcripts
• Analysis of the grand jury transcript for patterns by the prosecuting attorneys, both expected and unexpected
• A concordance of the grand jury transcripts
• Suggestions?

Donations will enable continued analysis of the grand jury transcripts, which, along with other evidence, may establish a pattern of conduct that was not happenstance or coincidence but was, in fact, enemy action.

### Other Michael Brown Posts

Missing From Michael Brown Grand Jury Transcripts December 7, 2014. (The witness index I propose to replace.)

New recordings, documents released in Michael Brown case [LA Times Asks If There’s More?] Yes! December 9, 2014 (before the latest document dump on December 14, 2014).

Michael Brown Grand Jury – Presenting Evidence Before Knowing the Law December 10, 2014.

How to Indict Darren Wilson (Michael Brown Shooting) December 12, 2014.

More Missing Evidence In Ferguson (Michael Brown) December 15, 2014. (above)

## Tweet Steganography?

December 15th, 2014

Hacking The Tweet Stream by Brett Lawrie.

Brett covers two popular methods for escaping the 140 character limit of Twitter: Tweetstorms and inline screen shots of text.

Brett comes down in favor of inline screen shots over Tweetstorms but see his post to get the full flavor of his comments.

What puzzled me was that Brett did not mention the potential for the use of steganography with inline screen shots, whether they are of text or not. They could very well be screen shots of portions of the 1611 edition of the King James Version (KJV) of the Bible with embedded information that some find offensive if not dangerous.

Or I suppose the sharper question is, How do you know that isn’t happening right now? On Flickr, Instagram, Twitter, one of many other photo sharing sites, blogs, etc.

Oh, I just remembered, I have an image for you.

(Image from a scan hosted at the Schoenberg Center for Electronic Text and Image (UPenn))

A downside to Twitter text images is that they won’t be easily indexed. Assuming you want your content to be findable. Sometimes you don’t.
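To make the steganography point concrete, here is a minimal sketch of one classic technique, least-significant-bit (LSB) embedding, applied to a raw sequence of 8-bit pixel values. The function names `embed` and `extract` are hypothetical, the "image" is just a byte string rather than a real PNG or JPEG, and nothing here is claimed about how any particular site stores uploads. Lossy recompression by a hosting service would likely destroy data hidden this naively.

```python
def embed(pixels: bytes, message: bytes) -> bytes:
    # Hide each bit of the message in the lowest bit of one pixel value.
    bits = [(byte >> shift) & 1 for byte in message for shift in range(7, -1, -1)]
    if len(bits) > len(pixels):
        raise ValueError("message too long for this cover image")
    out = bytearray(pixels)
    for index, bit in enumerate(bits):
        out[index] = (out[index] & 0xFE) | bit
    return bytes(out)


def extract(pixels: bytes, length: int) -> bytes:
    # Rebuild `length` bytes from the lowest bits, most significant bit first.
    message = bytearray()
    for start in range(0, length * 8, 8):
        byte = 0
        for value in pixels[start:start + 8]:
            byte = (byte << 1) | (value & 1)
        message.append(byte)
    return bytes(message)


cover = bytes(range(256))  # stand-in for raw pixel values of a screen shot
stego = embed(cover, b"hi")
recovered = extract(stego, 2)
print(recovered)
```

No pixel value changes by more than one, which is why the altered image is visually indistinguishable from the original and why the "how do you know?" question is hard to answer by inspection.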

## Some tools for lifting the patent data treasure

December 15th, 2014

From the post:

…Our work can be summarized as follows:

1. We provide an algorithm that allows researchers to find the duplicates inside Patstat in an efficient way
2. We provide an algorithm to connect Patstat to other kinds of information (CITL, Amadeus)
3. We publish the results of our work in the form of source code and data for Patstat Oct. 2011.

More technically, we used or developed probabilistic supervised machine-learning algorithms that minimize the need for manual checks on the data, while keeping performance at a reasonably high level.

The post has links for source code and data for these three papers:

A flexible, scaleable approach to the international patent “name game” by Mark Huberty, Amma Serwaah, and Georg Zachmann

In this paper, we address the problem of having duplicated patent applicants’ names in the data. We use an algorithm that efficiently de-duplicates the data, needs minimal manual input and works well even on consumer-grade computers. Comparisons between entries are not limited to their names, and thus this algorithm is an improvement over earlier ones that required extensive manual work or overly cautious clean-up of the names.

A scaleable approach to emissions-innovation record linkage by Mark Huberty, Amma Serwaah, and Georg Zachmann

PATSTAT has patent applications as its focus. This means it lacks important information on the applicants and/or the inventors. In order to have more information on the applicants, we link PATSTAT to the CITL database. This way the patenting behaviour can be linked to climate policy. Because of the structure of the data, we can adapt the deduplication algorithm to use it as a matching tool, retaining all of its advantages.

Remerge: regression-based record linkage with an application to PATSTAT by Michele Peruzzi, Georg Zachmann, Reinhilde Veugelers

We further extend the information content in PATSTAT by linking it to Amadeus, a large database of companies that includes financial information. Patent microdata is now linked to financial performance data of companies. This algorithm compares records using multiple variables, learning their relative weights by asking the user to find the correct links in a small subset of the data. Since it is not limited to comparisons among names, it is an improvement over earlier efforts and is not overly dependent on the name-cleaning procedure in use. It is also relatively easy to adapt the algorithm to other databases, since it uses the familiar concept of regression analysis.

Record linkage is a form of merging that originated in epidemiology in the late 1940s. To “link” (read merge) records across different formats, records were transposed into a uniform format and “linking” characteristics were chosen to gather matching records together. It is a very powerful technique that has been in continuous use and development ever since.
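A minimal sketch of that idea, assuming two tiny made-up data sets: names are transposed into a uniform format, then records are compared on more than one characteristic. The weights and threshold are hand-set assumptions of mine; the papers above learn such weights from data, which this sketch does not attempt.

```python
import re
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Transpose a raw name into a uniform format."""
    name = name.lower()
    name = re.sub(r"[^a-z0-9 ]", " ", name)                  # drop punctuation
    name = re.sub(r"\b(inc|ltd|gmbh|corp|co)\b", " ", name)  # drop legal forms
    return " ".join(name.split())


def similarity(a: dict, b: dict) -> float:
    """Combine two linking characteristics with assumed, hand-set weights."""
    name_score = SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()
    country_score = 1.0 if a["country"] == b["country"] else 0.0
    return 0.8 * name_score + 0.2 * country_score


def link(left: list, right: list, threshold: float = 0.85) -> list:
    """Pair each left record with its best right match, if it is good enough."""
    links = []
    for a in left:
        best = max(right, key=lambda b: similarity(a, b))
        if similarity(a, best) >= threshold:
            links.append((a["id"], best["id"]))
    return links


# Made-up records for illustration only.
patents = [
    {"id": "P1", "name": "ACME Widgets, Inc.", "country": "US"},
    {"id": "P2", "name": "Globex GmbH", "country": "DE"},
]
companies = [
    {"id": "C1", "name": "Acme Widgets", "country": "US"},
    {"id": "C2", "name": "Initech Ltd", "country": "GB"},
]
links = link(patents, companies)
```

Comparing every record against every other record is fine for a toy example but would not scale to a database the size of PATSTAT without some way of limiting the candidate pairs.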

One major difference from topic maps is that record linkage has undisclosed subjects, that is, the subjects that make up the common format and the association of the original data sets with that format. I assume in many cases the mapping is documented but it doesn’t appear as part of the final work product, thereby rendering the merging process opaque and inaccessible to future researchers. All you can say is “…this is the data set that emerged from the record linkage.”

That is sufficient for some purposes, but if you want to reduce the 80% of your time that is spent munging data that has been munged before, it is better to have the mapping documented and to use disclosed subjects with identifying properties.
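As a rough sketch of what a documented mapping could look like, the snippet below carries the mapping and the identifying properties along with each transposed record instead of discarding them. All field names, the `MAPPING` table, and the `provenance` key are hypothetical and chosen only for illustration.

```python
# Hypothetical field names throughout; neither source schema comes from the post.
MAPPING = {
    "name": {"source_a": "applicant", "source_b": "company_name"},
    "country": {"source_a": "ctry", "source_b": "country_code"},
}
IDENTIFYING = ["name", "country"]  # properties used to decide two records match


def to_common(record: dict, source: str) -> dict:
    """Transpose a record into the common format and note where each value came from."""
    common = {field: record[cols[source]] for field, cols in MAPPING.items()}
    common["provenance"] = {
        "source": source,
        "fields": {field: cols[source] for field, cols in MAPPING.items()},
        "identified_by": IDENTIFYING,
    }
    return common


a = to_common({"applicant": "Acme Widgets", "ctry": "US"}, "source_a")
b = to_common({"company_name": "Acme Widgets", "country_code": "US"}, "source_b")
same_subject = all(a[p] == b[p] for p in IDENTIFYING)
```

The merged output then answers the question a later researcher will ask: which source fields were treated as the same thing, and on what basis two records were judged to be the same subject.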

Having said all of that, these are tools you can use now on patents and/or extend to other data sets. The disambiguation problems addressed for patents are the common ones you have encountered with names for other entities.

If a topic map underlies your analysis, you will spend less time on the next analysis of the same information. Think of it as reducing your intellectual overhead in subsequent data sets.

Income – Less overhead = Greater revenue for you.

PS: Don’t be confused, you are looking for the EPO Worldwide Patent Statistical Database (PATSTAT). Naturally there is a US organization, http://www.patstats.org/, that covers only patent litigation statistics.

PPS: Sam Hunting, the source of so many interesting resources, pointed me to this post.

## Infinit.e Overview

December 15th, 2014

Infinit.e Overview by Alex Piggott.

From the webpage:

Infinit.e is a scalable framework for collecting, storing, processing, retrieving, analyzing, and visualizing unstructured documents and structured records.

[Image omitted. Too small in my theme to be useful.]

Let’s provide some clarification on each of the often overloaded terms used in that previous sentence:

• It is a "framework" (or "platform") because it is configurable and extensible by configuration (DSLs) or by various plug-ins types – the default configuration is expected to be useful for a range of typical analysis applications but to get the most out of Infinit.e we anticipate it will usually be customized.
• Another element of being a framework is being designed to integrate with existing infrastructures as well run standalone.
• By "scalable" we mean that new nodes (or even more granular: new components) can be added to meet increasing workload (either more users or more data), and that provision of new resources are near real-time.
• Further, the use of fundamentally cloud-based components means that there are no bottlenecks at least to the ~100 node scale.
• By "unstructured documents" we mean anything from a mostly-textual database record to a multi-page report – but Infinit.e’s "sweet spot" is in the range of database records that would correspond to a paragraph or more of text ("semi-structured records"), through web pages, to reports of 10 pages or less.
• Smaller "structured records" are better handled by structured analysis tools (a very saturated space), though Infinit.e has the ability to do limited aggregation, processing and integration of such datasets. Larger reports can still be handled by Infinit.e, but will be most effective if broken up first.
• By "processing" we mean the ability to apply complex logic to the data. Infinit.e provides some standard "enrichment", such as extraction of entities (people/places/organizations.etc) and simple statistics; and also the ability to "plug in" domain specific processing modules using the Hadoop API.
• By "retrieving" we mean the ability to search documents and return them in ranking order, but also to be able to retrieve "knowledge" aggregated over all documents matching the analyst’s query.
• By "query"/"search" we mean the ability to form complex "questions about the data" using a DSL (Domain Specific Language).
• By "analyzing" we mean the ability to apply domain-specific logic (visual/mathematical/heuristic/etc) to "knowledge" returned from a query.

We refer to the processing/retrieval/analysis/visualization chain as document-centric knowledge discovery:

• "document-centric": means the basic unit of storage is a generically-formatted document (eg useful without knowledge of the specific data format in which it was encoded)
• "knowledge discovery": means using statistical and text parsing algorithms to extract useful information from a set of documents that a human can interpret in order to understand the most important knowledge contained within that dataset.

One important aspect of the Infinit.e is our generic data model. Data from all sources (from large unstructured documents to small structured records) is transformed into a single, simple data model that allows common queries, scoring algorithms, and analytics to be applied across the entire dataset. …

I saw this in a tweet by Gregory Piatetsky yesterday and so haven’t had time to download or test any of the features of Infinit.e.

The list of features is a very intriguing one.

Definitely worth the time to throw another VM on the box and try it out with a dataset of interest.

Would appreciate your doing the same and sending comments and/or pointers to posts with your experiences. Suspect we will have different favorite features and hit different limitations.

Thanks!