Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

December 15, 2014

Some tools for lifting the patent data treasure

Filed under: Deduplication,Patents,Record Linkage,Text Mining — Patrick Durusau @ 11:57 am

Some tools for lifting the patent data treasure by Michele Peruzzi and Georg Zachmann.

From the post:

…Our work can be summarized as follows:

  1. We provide an algorithm that allows researchers to find the duplicates inside Patstat in an efficient way
  2. We provide an algorithm to connect Patstat to other kinds of information (CITL, Amadeus)
  3. We publish the results of our work in the form of source code and data for Patstat Oct. 2011.

More technically, we used or developed probabilistic supervised machine-learning algorithms that minimize the need for manual checks on the data, while keeping performance at a reasonably high level.

The post has links for source code and data for these three papers:

A flexible, scaleable approach to the international patent “name game” by Mark Huberty, Amma Serwaah, and Georg Zachmann

In this paper, we address the problem of having duplicated patent applicants’ names in the data. We use an algorithm that efficiently de-duplicates the data, needs minimal manual input and works well even on consumer-grade computers. Comparisons between entries are not limited to their names, and thus this algorithm is an improvement over earlier ones that required extensive manual work or overly cautious clean-up of the names.

A scaleable approach to emissions-innovation record linkage by Mark Huberty, Amma Serwaah, and Georg Zachmann

PATSTAT has patent applications as its focus. This means it lacks important information on the applicants and/or the inventors. In order to have more information on the applicants, we link PATSTAT to the CITL database. This way the patenting behaviour can be linked to climate policy. Because of the structure of the data, we can adapt the deduplication algorithm to use it as a matching tool, retaining all of its advantages.

Remerge: regression-based record linkage with an application to PATSTAT by Michele Peruzzi, Georg Zachmann, Reinhilde Veugelers

We further extend the information content in PATSTAT by linking it to Amadeus, a large database of companies that includes financial information. Patent microdata is now linked to financial performance data of companies. This algorithm compares records using multiple variables, learning their relative weights by asking the user to find the correct links in a small subset of the data. Since it is not limited to comparisons among names, it is an improvement over earlier efforts and is not overly dependent on the name-cleaning procedure in use. It is also relatively easy to adapt the algorithm to other databases, since it uses the familiar concept of regression analysis.

Record linkage is a form of merging that originated in epidemiology in the late 1940s. To “link” (read: merge) records across different formats, records were transposed into a uniform format and “linking” characteristics were chosen to gather matching records together. It is a very powerful technique that has been in continuous use and development ever since.
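To make the technique concrete, here is a toy sketch of that transpose-then-compare pattern. It is my own illustration, not the Bruegel code; the field names, weights, and threshold are invented:

```python
# Toy record linkage: transpose source-specific records into a common
# format, then score candidate pairs on chosen linking fields.
# Field names, weights and the threshold are invented for illustration.
import re

def clean(s):
    return re.sub(r"[^a-z0-9 ]", "", s.lower()).strip()

def normalize(record, name_field, city_field):
    """Transpose a source-specific record into the common format."""
    return {"name": clean(record[name_field]), "city": clean(record[city_field])}

def similarity(a, b):
    """Crude token-overlap similarity between two strings."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)

def link_score(r1, r2, weights):
    """Weighted score over the linking fields; a threshold decides a match."""
    return sum(w * similarity(r1[f], r2[f]) for f, w in weights.items())

patstat_rec = normalize({"person_name": "ACME Corporation Ltd.", "person_ctry": "Berlin"},
                        "person_name", "person_ctry")
amadeus_rec = normalize({"company": "Acme Corporation", "city": "Berlin"},
                        "company", "city")

score = link_score(patstat_rec, amadeus_rec, weights={"name": 0.8, "city": 0.2})
print(score > 0.5)  # True -> treat the two records as the same applicant
```

The papers linked above learn the weights and decision rule from labeled examples rather than hard-coding them, but the transpose/compare skeleton is the same.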

One major difference from topic maps is that record linkage has undisclosed subjects, that is, the subjects that make up the common format and the association of the original data sets with that format. I assume in many cases the mapping is documented, but it doesn’t appear as part of the final work product, thereby rendering the merging process opaque and inaccessible to future researchers. All you can say is “…this is the data set that emerged from the record linkage.”

That is sufficient for some purposes, but if you want to reduce the 80% of your time that is spent munging data that has been munged before, it is better to have the mapping documented and to use disclosed subjects with identifying properties.

Having said all of that, these are tools you can use now on patents and/or extend them to other data sets. The disambiguation problems addressed for patents are the common ones you have encountered with other names for entities.

If a topic map underlies your analysis, you will spend less time on the next analysis of the same information. Think of it as reducing your intellectual overhead in subsequent data sets.

Income – Less overhead = Greater revenue for you. 😉

PS: Don’t be confused, you are looking for the EPO Worldwide Patent Statistical Database (PATSTAT). Naturally there is a US site, http://www.patstats.org/, which covers only patent litigation statistics.

PPS: Sam Hunting, the source of so many interesting resources, pointed me to this post.

Infinit.e Overview

Filed under: Data Analysis,Data Mining,Structured Data,Unstructured Data,Visualization — Patrick Durusau @ 11:04 am

Infinit.e Overview by Alex Piggott.

From the webpage:

Infinit.e is a scalable framework for collecting, storing, processing, retrieving, analyzing, and visualizing unstructured documents and structured records.

[Image omitted. Too small in my theme to be useful.]

Let’s provide some clarification on each of the often overloaded terms used in that previous sentence:

  • It is a "framework" (or "platform") because it is configurable and extensible by configuration (DSLs) or by various plug-in types – the default configuration is expected to be useful for a range of typical analysis applications but to get the most out of Infinit.e we anticipate it will usually be customized.
    • Another element of being a framework is being designed to integrate with existing infrastructures as well as run standalone.
  • By "scalable" we mean that new nodes (or even more granular: new components) can be added to meet increasing workload (either more users or more data), and that provision of new resources is near real-time.
    • Further, the use of fundamentally cloud-based components means that there are no bottlenecks at least to the ~100 node scale.
  • By "unstructured documents" we mean anything from a mostly-textual database record to a multi-page report – but Infinit.e’s "sweet spot" is in the range of database records that would correspond to a paragraph or more of text ("semi-structured records"), through web pages, to reports of 10 pages or less.
    • Smaller "structured records" are better handled by structured analysis tools (a very saturated space), though Infinit.e has the ability to do limited aggregation, processing and integration of such datasets. Larger reports can still be handled by Infinit.e, but will be most effective if broken up first.
  • By "processing" we mean the ability to apply complex logic to the data. Infinit.e provides some standard "enrichment", such as extraction of entities (people/places/organizations.etc) and simple statistics; and also the ability to "plug in" domain specific processing modules using the Hadoop API.
  • By "retrieving" we mean the ability to search documents and return them in ranking order, but also to be able to retrieve "knowledge" aggregated over all documents matching the analyst’s query.
    • By "query"/"search" we mean the ability to form complex "questions about the data" using a DSL (Domain Specific Language).
  • By "analyzing" we mean the ability to apply domain-specific logic (visual/mathematical/heuristic/etc) to "knowledge" returned from a query.

We refer to the processing/retrieval/analysis/visualization chain as document-centric knowledge discovery:

  • "document-centric": means the basic unit of storage is a generically-formatted document (eg useful without knowledge of the specific data format in which it was encoded)
  • "knowledge discovery": means using statistical and text parsing algorithms to extract useful information from a set of documents that a human can interpret in order to understand the most important knowledge contained within that dataset.

One important aspect of Infinit.e is our generic data model. Data from all sources (from large unstructured documents to small structured records) is transformed into a single, simple data model that allows common queries, scoring algorithms, and analytics to be applied across the entire dataset. …
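As a rough illustration of what such a generic document model buys you, here is a guess at the shape of the idea. The field names are hypothetical, not Infinit.e's actual schema:

```python
# A guessed-at "generic document" shape: every source, structured or not,
# is reduced to the same few fields so common queries and scoring apply.
# Field names are hypothetical, not Infinit.e's real data model.
from datetime import datetime, timezone

def to_generic_document(source, title, text, entities=None, metadata=None):
    return {
        "source": source,              # where the record came from
        "title": title,
        "text": text,                  # the unstructured payload
        "entities": entities or [],    # e.g. extracted people/places/organizations
        "metadata": metadata or {},    # leftover structured fields
        "ingested": datetime.now(timezone.utc).isoformat(),
    }

web_page = to_generic_document(
    "web", "Example page", "Alice met Bob in Boston.",
    entities=[{"type": "person", "value": "Alice"},
              {"type": "person", "value": "Bob"},
              {"type": "place", "value": "Boston"}])
db_row = to_generic_document(
    "crm", "Account 42", "Renewal call notes.",
    metadata={"account_id": 42, "region": "EMEA"})

# A "common query" now works across both, regardless of the original format.
docs = [web_page, db_row]
print([d["title"] for d in docs if "Boston" in d["text"]])
```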

I saw this in a tweet by Gregory Piatetsky yesterday and so haven’t had time to download or test any of the features of Infinit.e.

The list of features is a very intriguing one.

Definitely worth the time to throw another VM on the box and try it out with a dataset of interest.

Would appreciate your doing the same and sending comments and/or pointers to posts with your experiences. Suspect we will have different favorite features and hit different limitations.

Thanks!

PS: Downloads.

Sony Pictures Demands That News Agencies Delete ‘Stolen’ Data

Filed under: News,Reporting,Text Analytics,Text Mining — Patrick Durusau @ 10:31 am

Sony Pictures Demands That News Agencies Delete ‘Stolen’ Data by Michael Cieply and Brooks Barnes.

From the article:

Sony Pictures Entertainment warned media outlets on Sunday against using the mountains of corporate data revealed by hackers who raided the studio’s computer systems in an attack that became public last month.

In a sharply worded letter sent to news organizations, including The New York Times, David Boies, a prominent lawyer hired by Sony, characterized the documents as “stolen information” and demanded that they be avoided, and destroyed if they had already been downloaded or otherwise acquired.

The studio “does not consent to your possession, review, copying, dissemination, publication, uploading, downloading or making any use” of the information, Mr. Boies wrote in the three-page letter, which was distributed Sunday morning.

Since I wrote about the foolish accusations against North Korea by Sony, I thought it only fair to warn you that the idlers at Sony have decided to threaten everyone else.

A rather big leap from trash talking about North Korea to accusing the rest of the world of being interested in their incestuous bickering.

I certainly don’t want a copy of their movies, released or unreleased. Too much noise and too little signal for the space they would take. But, since Sony has gotten on its “let’s threaten everybody” hobby-horse, I do hope the location of the Sony documents suddenly appears in many more inboxes. patrick@durusau.net. 😉

How would you display choice snippets and those who uttered them when a webpage loads?

The bitching and catching by Sony are sure signs that something went terribly wrong internally. The current circus is an attempt to distract the public from that failure. Probably a member of management with highly inappropriate security clearance because “…they are important!”

Giving management inappropriate security clearances to networks is a sign of poor systems administration. I wonder when that shoe is going to drop?

American Institute of Physics: Oral Histories

Filed under: Archives,Audio,Physics,Science — Patrick Durusau @ 9:56 am

American Institute of Physics: Oral Histories

From the webpage:

The Niels Bohr Library & Archives holds a collection of over 1,500 oral history interviews. These range in date from the early 1960s to the present and cover the major areas and discoveries of physics from the past 100 years. The interviews are conducted by members of the staff of the AIP Center for History of Physics as well as other historians and offer unique insights into the lives, work, and personalities of modern physicists.

Read digitized oral history transcripts online

I don’t have a large audio data-set (see: Shining a light into the BBC Radio archives) but there are lots of other people who do.

If you are teaching or researching physics for the last 100 years, this is a resource you should not miss.

Integrating audio resources such as this one, at less than the full recording level (think of it as audio transclusion), into teaching materials would be a great step forward. To say nothing of being able to incorporate such granular resources into a library catalog.

I did not find an interview with Edward Teller but a search of the transcripts turned up three hundred and five (305) “hits” where he is mentioned in interviews. A search for J. Robert Oppenheimer netted four hundred and thirty-six (436) results.

If you know your atomic bomb history, you can guess between Teller and Oppenheimer which one would support the “necessity” defense for the use of torture. It would be an interesting study to see how the interviewees saw these two very different men.

Shining a light into the BBC Radio archives

Filed under: Archives,Audio,Auto Tagging,BBC,British Library,British Museum,Radio — Patrick Durusau @ 9:23 am

Shining a light into the BBC Radio archives by Yves Raimond, Matt Hynes, and Rob Cooper.

From the post:


One of the biggest challenges for the BBC Archive is how to open up our enormous collection of radio programmes. As we’ve been broadcasting since 1922 we’ve got an archive of almost 100 years of audio recordings, representing a unique cultural and historical resource.

But the big problem is how to make it searchable. Many of the programmes have little or no meta-data, and the whole collection is far too large to process through human efforts alone.

Help is at hand. Over the last five years or so, technologies such as automated speech recognition, speaker identification and automated tagging have reached a level of accuracy where we can start to get impressive results for the right type of audio. By automatically analysing sound files and making informed decisions about the content and speakers, these tools can effectively help to fill in the missing gaps in our archive’s meta-data.

The Kiwi set of speech processing algorithms

COMMA is built on a set of speech processing algorithms called Kiwi. Back in 2011, BBC R&D were given access to a very large speech radio archive, the BBC World Service archive, which at the time had very little meta-data. In order to build our prototype around this archive we developed a number of speech processing algorithms, reusing open-source building blocks where possible. We then built the following workflow out of these algorithms:

  • Speaker segmentation, identification and gender detection (using the LIUM diarization toolkit, diarize-jruby and ruby-lsh). This process is also known as diarisation. Essentially an audio file is automatically divided into segments according to the identity of the speaker. The algorithm can show us who is speaking and at what point in the sound clip.
  • Speech-to-text for the detected speech segments (using CMU Sphinx). At this point the spoken audio is translated as accurately as possible into readable text. This algorithm uses models built from a wide range of BBC data.
  • Automated tagging with DBpedia identifiers. DBpedia is a large database holding structured data extracted from Wikipedia. The automatic tagging process creates the searchable meta-data that ultimately allows us to access the archives much more easily. This process uses a tool we developed called ‘Mango’.

…

COMMA is due to launch some time in April 2015. If you’d like to be kept informed of our progress you can sign up for occasional email updates here. We’re also looking for early adopters to test the platform, so please contact us if you’re a cultural institution, media company or business that has large audio data-set you want to make searchable.

This article was written by Yves Raimond (lead engineer, BBC R&D), Matt Hynes (senior software engineer, BBC R&D) and Rob Cooper (development producer, BBC R&D)

I don’t have a large audio data-set but I am certainly going to be following this project. The results should be useful in and of themselves, to say nothing of being a good starting point for further tagging. I wonder if the BBC Sanskrit broadcasts are going to be available? I will have to check on that.
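The COMMA workflow described above, diarisation, then speech-to-text per segment, then entity tagging, is easy to picture as a pipeline. Here is a sketch of that shape, with toy stand-ins where LIUM, CMU Sphinx and the DBpedia tagger would sit (nothing here is the BBC's Kiwi code):

```python
# Shape of a diarise -> transcribe -> tag pipeline, with toy stand-ins
# where the real components (LIUM, CMU Sphinx, DBpedia tagging) would sit.
def diarise(audio):
    """Stand-in: split the audio into (speaker, segment) pairs."""
    half = len(audio) // 2
    return [("speaker_1", audio[:half]), ("speaker_2", audio[half:])]

def transcribe(segment):
    """Stand-in for speech-to-text; a real system would return recognised words."""
    return "the BBC World Service reporting from London"

def tag(text, gazetteer=("London", "BBC")):
    """Stand-in for DBpedia tagging: return entities spotted in the transcript."""
    return [term for term in gazetteer if term.lower() in text.lower()]

def process(audio):
    records = []
    for speaker, segment in diarise(audio):
        text = transcribe(segment)
        records.append({"speaker": speaker, "text": text, "tags": tag(text)})
    return records

print(process(list(range(16000))))  # one second of fake 16 kHz audio
```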

Without diminishing the achievements of other institutions, the efforts of the BBC, the British Library, and the British Museum are truly remarkable.

I first saw this in a tweet by Mike Jones.

TweepsMap

Filed under: Mapping,Twitter — Patrick Durusau @ 8:43 am

TweepsMap

A Twitter analysis service that:

  • Maps your followers by geographic location
  • Measures growth (or decline) of followers over time
  • Listens to what your followers are talking about
  • Produces action reports: how well you did yesterday
  • Analyzes anyone (competitors, for example)
  • Assesses followers/following
  • Tracks hashtags/keywords (down to the city level)

You could do all of this for yourself, but TweepsMap has the convenience of simply working. That makes it suitable for passing on to less CS-literate co-workers.

A free account requires you to log in with your Twitter account (of course), but the resulting mapping may surprise you.

I didn’t see it offered, but being able to analyze the people you follow would be a real plus. Not just geographically (to make sure you are getting a diverse world view) but by groupings of hashtags: take the hashtags that mark identifiable groups of users, so you can judge the groups that you are following.
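That sort of hashtag grouping is easy to prototype once you have the recent tweets of the accounts you follow. A minimal sketch, with invented data standing in for what the Twitter API would return:

```python
# Toy sketch: group followed accounts by shared hashtag usage.
# The tweets are invented; a real version would pull them from the Twitter API.
from collections import defaultdict
from itertools import combinations

tweets_by_account = {
    "@alice": ["#clojure is fun", "loving #emacs"],
    "@bob":   ["#clojure katas", "#datomic notes"],
    "@carol": ["#infosec news", "#privacy matters"],
}

hashtags = {acct: {w.lower() for t in ts for w in t.split() if w.startswith("#")}
            for acct, ts in tweets_by_account.items()}

# Accounts that share at least one hashtag fall into the same group.
groups = defaultdict(set)
for a, b in combinations(hashtags, 2):
    shared = hashtags[a] & hashtags[b]
    if shared:
        groups[frozenset(shared)].update({a, b})

for shared, accounts in groups.items():
    print(sorted(shared), "->", sorted(accounts))
```

A real version would also weight hashtags by frequency rather than treating any overlap as a group.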

    I first saw this in a tweet from Alyona Medelyan.

    wonderland-clojure-katas

    Filed under: Clojure,Programming — Patrick Durusau @ 8:21 am

    wonderland-clojure-katas by Carin Meier.

    From the webpage:

These are a collection of Clojure katas inspired by Lewis Carroll and Alice in Wonderland

    Which of course makes me curious, is anyone working on Clojure katas based on The Hunting of the Snark?


    Other suggestions for kata inspiring works?

    Deep learning for… chess

    Filed under: Amazon Web Services AWS,Deep Learning,Games,GPU — Patrick Durusau @ 5:38 am

    Deep learning for… chess by Erik Bernhardsson.

    From the post:

    I’ve been meaning to learn Theano for a while and I’ve also wanted to build a chess AI at some point. So why not combine the two? That’s what I thought, and I ended up spending way too much time on it. I actually built most of this back in September but not until Thanksgiving did I have the time to write a blog post about it.

    Chess sets are a common holiday gift so why not do something different this year?

Pretty-print a copy of this post and include a gift certificate from AWS for a GPU instance for, say, a week to ten days.

I don’t think AWS sells gift certificates, but they certainly should. Great stocking stuffer, anniversary/birthday/graduation present, etc. Not so great for Valentine’s Day.

    If you ask AWS for a gift certificate, mention my name. They don’t know who I am so I could use the publicity. 😉

    I first saw this in a tweet by Onepaperperday.

    December 14, 2014

    Inheritance Patterns in Citation Networks Reveal Scientific Memes

    Filed under: Citation Analysis,Language,Linguistics,Meme,Social Networks — Patrick Durusau @ 8:37 pm

    Inheritance Patterns in Citation Networks Reveal Scientific Memes by Tobias Kuhn, Matjaž Perc, and Dirk Helbing. (Phys. Rev. X 4, 041036 – Published 21 November 2014.)

    Abstract:

    Memes are the cultural equivalent of genes that spread across human culture by means of imitation. What makes a meme and what distinguishes it from other forms of information, however, is still poorly understood. Our analysis of memes in the scientific literature reveals that they are governed by a surprisingly simple relationship between frequency of occurrence and the degree to which they propagate along the citation graph. We propose a simple formalization of this pattern and validate it with data from close to 50 million publication records from the Web of Science, PubMed Central, and the American Physical Society. Evaluations relying on human annotators, citation network randomizations, and comparisons with several alternative approaches confirm that our formula is accurate and effective, without a dependence on linguistic or ontological knowledge and without the application of arbitrary thresholds or filters.

    Popular Summary:

    It is widely known that certain cultural entities—known as “memes”—in a sense behave and evolve like genes, replicating by means of human imitation. A new scientific concept, for example, spreads and mutates when other scientists start using and refining the concept and cite it in their publications. Unlike genes, however, little is known about the characteristic properties of memes and their specific effects, despite their central importance in science and human culture in general. We show that memes in the form of words and phrases in scientific publications can be characterized and identified by a simple mathematical regularity.

    We define a scientific meme as a short unit of text that is replicated in citing publications (“graphene” and “self-organized criticality” are two examples). We employ nearly 50 million digital publication records from the American Physical Society, PubMed Central, and the Web of Science in our analysis. To identify and characterize scientific memes, we define a meme score that consists of a propagation score—quantifying the degree to which a meme aligns with the citation graph—multiplied by the frequency of occurrence of the word or phrase. Our method does not require arbitrary thresholds or filters and does not depend on any linguistic or ontological knowledge. We show that the results of the meme score are consistent with expert opinion and align well with the scientific concepts described on Wikipedia. The top-ranking memes, furthermore, have interesting bursty time dynamics, illustrating that memes are continuously developing, propagating, and, in a sense, fighting for the attention of scientists.

    Our results open up future research directions for studying memes in a comprehensive fashion, which could lead to new insights in fields as disparate as cultural evolution, innovation, information diffusion, and social media.

    You definitely should grab the PDF version of this article for printing and a slow read.

    From Section III Discussion:


    We show that the meme score can be calculated exactly and exhaustively without the introduction of arbitrary thresholds or filters and without relying on any kind of linguistic or ontological knowledge. The method is fast and reliable, and it can be applied to massive databases.

Fair enough, but “black,” “inflation,” and “traffic flow” all appear in the top fifty memes in physics. I don’t know that I would consider any of them to be “memes.”

There is much left to be discovered about memes, such as who is good at propagating them. It would not hurt if your research paper were the origin of a very popular meme.
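For the curious, the core idea, frequency multiplied by a propagation term over the citation graph, can be played with directly. The following is my own crude approximation on an invented five-paper graph, not the authors' exact formula:

```python
# Crude toy of "frequency x propagation along the citation graph".
# Not the paper's exact meme score, just the general shape of the idea.
papers = {
    # paper_id: which papers it cites, and which candidate terms it carries
    "p1": {"cites": [],     "terms": {"graphene"}},
    "p2": {"cites": ["p1"], "terms": {"graphene"}},
    "p3": {"cites": ["p2"], "terms": {"graphene", "black"}},
    "p4": {"cites": ["p1"], "terms": {"black"}},
    "p5": {"cites": [],     "terms": {"black"}},
}

def meme_score(term):
    carriers = {p for p, d in papers.items() if term in d["terms"]}
    frequency = len(carriers) / len(papers)
    # Propagation: of the papers citing a carrier, how many carry the term too?
    citing_carrier = [p for p, d in papers.items()
                      if any(c in carriers for c in d["cites"])]
    if not citing_carrier:
        return 0.0
    sticking = sum(1 for p in citing_carrier if term in papers[p]["terms"])
    return frequency * sticking / len(citing_carrier)

for term in ("graphene", "black"):
    print(term, round(meme_score(term), 3))   # graphene 0.4, black 0.0
```

Even on this toy graph, “black” scores lower than “graphene” at the same frequency because it does not propagate along citations, which is the intuition behind the measure.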

    I first saw this in a tweet by Max Fisher.

    North Korea As Bogeyman

    Filed under: Cybersecurity,Security — Patrick Durusau @ 8:07 pm

    The Sony hack: how it happened, who is responsible, and what we’ve learned by Timothy B. Lee.

    From the post:

    However, North Korea has denied involvement in the attack, and on Wednesday the FBI said that it didn’t have evidence linking the attacks to the North Korean regime. And there are other reasons to doubt the North Koreans are responsible. As Kim Zetter has argued, “nation-state attacks don’t usually announce themselves with a showy image of a blazing skeleton posted to infected machines or use a catchy nom-de-hack like Guardians of Peace to identify themselves.”

    There’s some evidence that the hackers may have been aggrieved about last year’s big layoffs at Sony, which doesn’t seem like something the North Korean regime would care about. And the hackers demonstrated detailed knowledge of Sony’s network that could indicate they had help from inside the company.

    In the past, these kinds of attacks have often been carried out by young men with too much time on their hands. The 2011 LulzSec attacks, for example, were carried out by a loose-knit group from the United States, the United Kingdom, and Ireland with no obvious motive beyond wanting to make trouble for powerful institutions and generate publicity for themselves.

I assume you have heard the bed wetters in the United States government decrying North Korea as the bogeyman responsible for hacking Sony Pictures (November 2014, just to distinguish it from other hacks of Sony).

If you have ever seen a picture of North Korea at night (below), you will understand why I doubt North Korea is the technology badass imagined by US security “experts.”

    North Korea at night

    Not that you have to waste a lot of energy on outside lighting to have a competent computer hacker community but it is one indicator.

    A more likely explanation is that Sony forgot to reset a sysadmin password and it is a “hack” only because a non-current employee carried it out.

    Until some breach other than a valid login by a non-employee is confirmed by independent security experts, I would discard any talk of this being North Korea attacking Sony.

The only reason to blame North Korea is to create a smokescreen to avoid accepting blame for internal lax security. Watch for Sony to make a film about its fight for freedom of speech against the axis of evil (which includes North Korea; wait a couple of weeks to learn who else).

    When Sony wants to say something, it is freedom of speech. When you want to repeat it, it is a criminal copyright violation. Funny how that works. Tell Sony to clean up its internal security and only then to worry about outsiders.

    GearPump

    Filed under: Actor-Based,Akka,Hadoop YARN,Samza,Spark,Storm,Tez — Patrick Durusau @ 7:30 pm

    GearPump (GitHub)

    From the wiki homepage:

    GearPump is a lightweight, real-time, big data streaming engine. It is inspired by recent advances in the Akka framework and a desire to improve on existing streaming frameworks. GearPump draws from a number of existing frameworks including MillWheel, Apache Storm, Spark Streaming, Apache Samza, Apache Tez, and Hadoop YARN while leveraging Akka actors throughout its architecture.

    What originally caught my attention was this passage on the GitHub page:

    Per initial benchmarks we are able to process 11 million messages/second (100 bytes per message) with a 17ms latency on a 4-node cluster.

    Think about that for a second.

    Per initial benchmarks we are able to process 11 million messages/second (100 bytes per message) with a 17ms latency on a 4-node cluster.

    The GitHub page features a word count example and pointers to the wiki with more examples.

    What if every topic “knew” the index value of every topic that should merge with it on display to a user?

    When added to a topic map it broadcasts its merging property values and any topic with those values responds by transmitting its index value.

    When you retrieve a topic, it has all the IDs necessary to create a merged view of the topic on the fly and on the client side.

    There would be redundancy in the map but de-duplication for storage space went out with preferences for 7-bit character values to save memory space. So long as every topic returns the same result, who cares?

Well, it might make a difference when the CIA wants to give every contractor full access to its datastores 24×7 via their cellphones. But, until that is an actual requirement, I would not worry about the storage space overmuch.
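To be clear about what I have in mind, here is a mock-up of the broadcast-and-respond idea in a few lines. It is only the data-structure side of the thought experiment, nothing to do with GearPump itself, and the identifiers are invented:

```python
# Mock-up of the idea: each topic ends up holding the index values of every
# topic it should merge with, so a client can build the merged view itself.
# A thought-experiment sketch only, not GearPump or any topic map engine.
from collections import defaultdict

topics = {
    1: {"merge_with": set(),
        "subject_identifiers": {"http://example.org/si/lenin"}},
    2: {"merge_with": set(),
        "subject_identifiers": {"http://example.org/si/lenin",
                                "http://example.org/si/ulyanov"}},
    3: {"merge_with": set(),
        "subject_identifiers": {"http://example.org/si/trotsky"}},
}

# "Broadcast": index topics by their merging property values...
by_si = defaultdict(set)
for tid, t in topics.items():
    for si in t["subject_identifiers"]:
        by_si[si].add(tid)

# ...and each topic records the index values of the topics that answered.
for tid, t in topics.items():
    for si in t["subject_identifiers"]:
        t["merge_with"] |= by_si[si] - {tid}

def merged_view(tid):
    """Client-side merge: union the properties of the topic and its partners."""
    members = {tid} | topics[tid]["merge_with"]
    return set().union(*(topics[m]["subject_identifiers"] for m in members))

print(merged_view(1))  # lenin + ulyanov identifiers, assembled on the client
```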

    I first saw this in a tweet from Suneel Marthi.

    Everything You Need To Know About Social Media Search

    Filed under: Facebook,Instagram,Social Media,Twitter — Patrick Durusau @ 7:07 pm

    Everything You Need To Know About Social Media Search by Olsy Sorokina.

    From the post:

    For the past decade, social networks have been the most universally consistent way for us to document our lives. We travel, build relationships, accomplish new goals, discuss current events and welcome new lives—and all of these events can be traced on social media. We have created hashtags like #ThrowbackThursday and apps like Timehop to reminisce on all the past moments forever etched in the social web in form of status updates, photos, and 140-character phrases.

    Major networks demonstrate their awareness of the role they play in their users’ lives by creating year-end summaries such as Facebook’s Year in Review, and Twitter’s #YearOnTwitter. However, much of the emphasis on social media has been traditionally placed on real-time interactions, which often made it difficult to browse for past posts without scrolling down for hours on end.

    The bias towards real-time messaging has changed in a matter of a few days. Over the past month, three major social networks announced changes to their search functions, which made finding old posts as easy as a Google search. If you missed out on the news or need a refresher, here’s everything you need to know.

    I suppose Olsy means in addition to search in general sucking.

Interesting tidbit on Facebook:


    This isn’t Facebook’s first attempt at building a search engine. The earlier version of Graph Search gave users search results in response to longer-form queries, such as “my friends who like Game of Thrones.” However, the semantic search never made it to the mobile platforms; many supposed that using complex phrases as search queries was too confusing for an average user.

    Does anyone have any user research on the ability of users to use complex phrases as search queries?

    I ask because if users have difficulty authoring “complex” semantics and difficulty querying with “complex” semantics, it stands to reason they may have difficulty interpreting “complex” semantic results. Yes?

    If all three of those are the case, then how do we impart the value-add of “complex” semantics without tripping over one of those limitations?

Olsy also covers Instagram and Twitter. Twitter’s advanced search looks like the standard include/exclude, etc. type of “advanced” search. “Advanced” maybe forty years ago in the early OPACs but not really “advanced” now.

    Catch up on these new search features. They will provide at least a minimum of grist for your topic map mill.

    How Scientists Are Learning to Write

    Filed under: Communication,Writing — Patrick Durusau @ 5:39 pm

    How Scientists Are Learning to Write by Alexandra Ossola.

    From the post:

    The students tried not to look sheepish as their professor projected the article on the whiteboard, waiting for their work to be devoured by their classmates. It was the second class for the nine students, all of whom are Ph.D. candidates or post-doctoral fellows. Their assignment had been to distill their extensive research down to just three paragraphs so that the average person could understand it, and, as in any class, some showed more aptitude than others. The piece on the board was by one of the students, a Russian-born biologist.

    The professor, the journalist and author Stephen Hall (with whom I took a different writing workshop last year), pointed to the word “sequencing.” “That’s jargon-ish,” he said, circling it on the board. “Even some people in the sciences don’t have an intuitive understanding of what that means.” He turned to another student in the class, an Italian native working on his doctorate in economics, for confirmation. “Yes, I didn’t know what was going on,” he said, turning to the piece’s author. The biology student wrote something in her notebook.

    Why is better writing important?:

    But explaining science is just as valuable for the lay public as it is for the scientists themselves. “Science has become more complex, more specialized—every sub-discipline has its own vocabulary,” Hall said. Scientists at all levels have to work hard to explain niche research to the general public, he added, but it’s increasingly important for the average person to understand. That’s because their research has become central to many other elements of society, influencing realms that may have previously seemed far from scientific rigors.

    Olivia Wilkins, a post-doctoral fellow who studies plant genetics at New York University’s Center for Genomics and Systems Biology, recently took Hall’s four-session workshop. She wanted to be a better writer, she said, because she wanted her research to matter. “Science is a group effort. We may be in different labs at different universities, but ultimately, many of us are working towards the same goals. I want to get other people as excited about my work as I am, and I believe that one of the ways to do this is through better writing.”

    How about that? Communicating with other people who are just as bright as you, but who don’t share the same vocabulary? Does that sound like a plausible reason to you?

    I really like the closer:

    “…Writing takes a lot of practice like anything else—if you don’t do it, you don’t get better. (emphasis added)

I review standards and even offer editing advice from time to time. If you think scientists are the only ones not born with the ability to write, you should check out standards drafts by editors unfamiliar with how to write standards.

    Citations in a variety of home grown formats, to publications that may or may not exist or be suitable for normative citation, to terminology that isn’t defined, anywhere, to contradictions between different parts, to conformance clauses that are too vague for anyone to know what is required, and many things in between.

    If anything should be authored with clarity, considering that conformance should make applications interoperable, it is IT standards. Take the advice in Alexandra’s post to heart and seek out a writing course near you.

    I edit and review standards so ping me if you want an estimate on how to improve your latest standard draft. (References available on request.)

    I first saw this in a tweet by Gretchen Ritter.

    Instant Hosting of Open Source Projects with GitHub-style Ribbons

    Filed under: Open Source,OpenShift — Patrick Durusau @ 5:15 pm

    Instant Hosting of Open Source Projects with GitHub-style Ribbons by Ryan Jarvinen.

    From the post:

    In this post I’ll show you how to create your own GitHub-style ribbons for launching open source projects on OpenShift.

    The popular “Fork me on GitHub” ribbons provide a great way to raise awareness for your favorite open source projects. Now, the same technique can be used to instantly launch clones of your application, helping to rapidly grow your community!

    Take advantage of [the following link is broken as of 12/14/2014] OpenShift’s web-based app creation workflow – streamlining installation, hosting, and management of instances – by crafting a workflow URL that contains information about your project.

    I thought this could be useful in the not too distant future.

    Better to blog about it here than to search for it in the nightmare of my bookmarks. 😉

    What Is the Relationship Between HCI Research and UX Practice?

    Filed under: HCIR,Interface Research/Design,UX — Patrick Durusau @ 5:01 pm

    What Is the Relationship Between HCI Research and UX Practice? by Stuart Reeves

    From the post:

    Human-computer interaction (HCI) is a rapidly expanding academic research domain. Academic institutions conduct most HCI research—in the US, UK, Europe, Australasia, and Japan, with growth in Southeast Asia and China. HCI research often occurs in Computer Science departments, but retains its historically strong relationship to Psychology and Human Factors. Plus, there are several large, prominent corporations that both conduct HCI research themselves and engage with the academic research community—for example, Microsoft Research, PARC, and Google.

    If you aren’t concerned with the relationship between HCI research and UX practice you should be.

    I was in a meeting discussing the addition of RDFa to ODF when a W3C expert commented that the difficulty users have with RDFa syntax was a “user problem.”

    Not to pick on RDFa, I think many of us in the topic map camp felt that users weren’t putting enough effort into learning topic maps. (I will only confess that for myself. Others can speak for themselves.)

Anytime an advocate and/or developer takes the view that syntax, interfaces or interaction with a program is a “user problem,” they are pointing the stick the wrong way.

    They should be pointing at the developers, designers, advocates who have not made interaction with their program/software intuitive for the “targeted audience.”

    If your program is a LaTeX macro targeted at physicists who eat LaTeX for breakfast, lunch and dinner, that’s one audience.

If your program is an editing application targeted at users crippled by the typical office suite menus, then you had best make different choices.

    That is assuming that use of your application is your measure of success.

Otherwise you can strive to be the second longest-running non-profitable software project in history (Xanadu, started in 1960, holds first place).

    Rather than being right, or saving the world, or any of the other …ologies, I would prefer to have software that users find useful and do in fact use.

Use is a pre-condition to any software or paradigm changing the world.

    Yes?

PS: Don’t get me wrong, Xanadu is a great project, but its adoption of web browsers as a means of delivery is a mistake. True, they are everywhere, but they are also subject to the crippled design of web security, which prevents transclusion. Which ties you to a server where the NSA can more conveniently scoop up your content.

Better would be a document browser that uses web protocols and ignores web security rules, thus enabling client-side transclusion. Fork one of the open source browsers and be done with it. Only use digitally signed PDFs or PDFs from particular sources. Once utility is demonstrated in a PDF-only universe, the demand will grow for extending it to other sources as well.

    True, some EU/US trade delegates and others will get caught in phishing schemes but I consider that grounds for dismissal and forfeiture of all retirement benefits. (Yes, I retain a certain degree of users be damned but not about UI/UX experiences. 😉 )

    My method of avoiding phishing schemes is to never follow links in emails. If there is an offer I want to look at, I log directly into the site from my browser and not via email. Even for valid messages, which they rarely are.

    I first saw this in a tweet by Raffaele Boiano.

    Machine Learning: The High-Interest Credit Card of Technical Debt (and Merging)

    Filed under: Machine Learning,Merging,Topic Maps — Patrick Durusau @ 1:55 pm

    Machine Learning: The High-Interest Credit Card of Technical Debt by D. Sculley, et al.

    Abstract:

    Machine learning offers a fantastically powerful toolkit for building complex systems quickly. This paper argues that it is dangerous to think of these quick wins as coming for free. Using the framework of technical debt, we note that it is remarkably easy to incur massive ongoing maintenance costs at the system level when applying machine learning. The goal of this paper is highlight several machine learning specific risk factors and design patterns to be avoided or refactored where possible. These include boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, changes in the external world, and a variety of system-level anti-patterns.

    Under “entanglement” (referring to inputs) the authors announce the CACE principle:

    Changing Anything Changes Everything

    The net result of such changes is that prediction behavior may alter, either subtly or dramatically, on various slices of the distribution. The same principle applies to hyper-parameters. Changes in regularization strength, learning settings, sampling methods in training, convergence thresholds, and essentially every other possible tweak can have similarly wide ranging effects.

    Entanglement is a native condition in topic maps as a result of the merging process. Yet, I don’t recall there being much discussion of how to evaluate the potential for unwanted entanglement or how to avoid entanglement (if desired).

    You may have topics in a topic map where merging with later additions to the topic map is to be avoided. Perhaps to avoid the merging of spam topics that would otherwise overwhelm your content.

    One way to avoid that and yet allow users to use links reported as subjectIdentifiers and subjectLocators under the TMDM would be to not report those properties for some set of topics to the topic map engine. The only property they could merge on would be their topicID, which hopefully you have concealed from public users.

Not unlike the Unix tradition where low-numbered ports are unavailable to any user other than root. Topics with IDs below N are skipped by the topic map engine for merging purposes, unless the merging is invoked by the equivalent of root.

    No change in current syntax or modeling required, although a filter on topic IDs would need to be implemented to add this to current topic map applications.
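A filter of that kind is only a few lines. The sketch below assumes an invented in-memory topic representation and threshold, not any particular topic map engine's API:

```python
# Sketch of "don't report merging properties for privileged topic IDs".
# The topic representation and the RESERVED_BELOW threshold are invented.
RESERVED_BELOW = 1000   # topics below this ID never expose merge keys...

def merge_keys(topic, privileged=False):
    """Return the properties a TM engine may merge on for this topic."""
    if topic["id"] < RESERVED_BELOW and not privileged:
        return set()    # ...unless the caller is the equivalent of root
    return (set(topic.get("subject_identifiers", []))
            | set(topic.get("subject_locators", [])))

core_topic = {"id": 7,    "subject_identifiers": ["http://example.org/si/core"]}
user_topic = {"id": 4711, "subject_identifiers": ["http://example.org/si/core"]}

print(merge_keys(core_topic))                   # set(): spam topics cannot latch on
print(merge_keys(core_topic, privileged=True))  # root sees the real merge keys
print(merge_keys(user_topic))                   # ordinary topics merge as usual
```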

    I am sure there are other ways to prevent merging of some topics but this seems like a simple way to achieve that end.

    Unfortunately it does not address the larger question of the “technical debt” incurred to maintain a topic map of any degree of sophistication.

    Thoughts?

    I first saw this in a tweet by Elias Ponvert.

    December 13, 2014

    Hadoop

    Filed under: BigData,Hadoop — Patrick Durusau @ 7:40 pm

    Hadoop: What it is and how people use it: my own summary by Bob DuCharme.

    From the post:

    The web offers plenty of introductions to what Hadoop is about. After reading up on it and trying it out a bit, I wanted to see if I could sum up what I see as the main points as concisely as possible. Corrections welcome.

    Hadoop is an open source Apache project consisting of several modules. The key ones are the Hadoop Distributed File System (whose acronym is trademarked, apparently) and MapReduce. The HDFS lets you distribute storage across multiple systems and MapReduce lets you distribute processing across multiple systems by performing your “Map” logic on the distributed nodes and then the “Reduce” logic to gather up the results of the map processes on the master node that’s driving it all.

    This ability to spread out storage and processing makes it easier to do large-scale processing without requiring large-scale hardware. You can spread the processing across whatever boxes you have lying around or across virtual machines on a cloud platform that you spin up for only as long as you need them. This ability to inexpensively scale up has made Hadoop one of the most popular technologies associated with the buzzphrase “Big Data.”
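The map/reduce split Bob describes can be mimicked in a few lines of plain Python, no cluster required, just to show the shape of the computation:

```python
# Word count with the Map/Reduce shape: map emits (word, 1) pairs,
# a shuffle groups them by key, reduce sums each group.
# Plain single-machine Python, only mimicking what Hadoop distributes.
from collections import defaultdict

documents = ["the quick brown fox", "the lazy dog", "the fox again"]

# Map phase (would run on the nodes holding each block of data).
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the intermediate pairs by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: combine each group into the final result.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # {'the': 3, 'fox': 2, ...}
```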

    If you aren’t already familiar with Hadoop or if you are up to your elbows in Hadoop and need a literate summary to forward to others, I think this post does the trick.

    Bob covers the major components of the Hadoop ecosystem without getting lost in the weeds.

    Recommended reading.

    Password Cracking and Countermeasures in Computer Security: A Survey

    Filed under: Cybersecurity,Security — Patrick Durusau @ 7:27 pm

    Password Cracking and Countermeasures in Computer Security: A Survey by Aaron L.-F. Han, Derek F. Wong, and Lidia S. Chao.

    Abstract:

    With the rapid development of internet technologies, social networks, and other related areas, user authentication becomes more and more important to protect the data of the users. Password authentication is one of the widely used methods to achieve authentication for legal users and defense against intruders. There have been many password cracking methods developed during the past years, and people have been designing the countermeasures against password cracking all the time. However, we find that the survey work on the password cracking research has not been done very much. This paper is mainly to give a brief review of the password cracking methods, import technologies of password cracking, and the countermeasures against password cracking that are usually designed at two stages including the password design stage (e.g. user education, dynamic password, use of tokens, computer generations) and after the design (e.g. reactive password checking, proactive password checking, password encryption, access control). The main objective of this work is offering the abecedarian IT security professionals and the common audiences with some knowledge about the computer security and password cracking, and promoting the development of this area.

    As you know from Strong Passwords – Myths of CS?, there are cases where strong passwords are still useful.

    This is an overview of the state of password research and so is neither a practical guide nor does it offer new information for password professionals.
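The “proactive password checking” the survey mentions amounts to rejecting weak choices before they ever reach the password store. A bare-bones illustration, with rules that are arbitrary examples rather than a recommended policy:

```python
# Bare-bones proactive password check: reject obviously weak choices
# before they ever reach the password store. Rules are illustrative only.
import string

COMMON = {"password", "123456", "qwerty", "letmein"}

def check_password(pw):
    problems = []
    if len(pw) < 12:
        problems.append("shorter than 12 characters")
    if pw.lower() in COMMON:
        problems.append("appears in a common-password list")
    classes = sum(any(c in s for c in pw) for s in
                  (string.ascii_lowercase, string.ascii_uppercase,
                   string.digits, string.punctuation))
    if classes < 3:
        problems.append("uses fewer than 3 character classes")
    return problems

print(check_password("letmein"))                 # several problems reported
print(check_password("correct-Horse7battery"))   # [] -> accepted
```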

    You need to look up abecedarian before you try to use it in a conversation. 😉

    Scala eXchange 2014 (videos)

    Filed under: Conferences,Functional Programming,Scala — Patrick Durusau @ 5:32 pm

    Scala eXchange 2014 Videos are online! Thanks to the super cool folks at Skills Matter for making them available!

    As usual, I have sorted the videos by author. I am not sure about using “scala” as a keyword at a Scala conference but suspect it was to permit searching in a database with videos from other conferences.

    If you watch these with ear buds while others are watching sporting events, remember to keep the sound down enough that you can hear curses or cheers from others in the room. Mimic their sentiments and no one will be any wiser, except you for having watched these videos. 😉

    PS: I could have used a web scraper to obtain the data but found manual extraction to be a good way to practice regexes in Emacs.

    The Ultimate Guide to Learning Clojure for Free

    Filed under: Clojure,ClojureScript,Programming — Patrick Durusau @ 10:20 am

    The Ultimate Guide to Learning Clojure for Free by Eric Normand.

    From the post:

    Are you interested in learning Clojure? Would you like to get started without risking your money on a book? Is there something that can help you even know whether you want to learn Clojure?

    Well, yes.

    I learned Clojure for free. In fact, I’m a bit of a cheapskate. For many years I would avoid spending money if there was a free alternative. I also keep an eye on resources that pop up around the internet.

    I have assembled what I believe to be the best resources to start with if you want to get into Clojure but want to spend $0.

    Eric has a focus on beginner with a few intermediate resources.

    Are there any additions you would suggest?

    I first saw this in a tweet by Christophe Lalanne.

    Every time you cite a paper w/o reading it,

    Filed under: Bibliography,Science — Patrick Durusau @ 9:51 am

    Every time you cite a paper w/o reading it, b/c someone else cited it, a science fairy dies. (A tweet by realscientists.)

    The tweet points to the paper, Mother’s Milk, Literature Sleuths, and Science Fairies by Katie Hinde.

    Katie encountered an article that offered a model that was right on point for a chapter she was writing. But rather than simply citing that article, Katie started backtracking from that article to the articles it cited. After quite a bit of due diligence, Katie discovered that the cited articles did not make the claims for which they were cited. Not no way, not no how.

    Some of the comments to Katie’s post suggest that students in biological sciences should learn from her example.

    I would go further than that and say that all students, biological sciences, physical sciences, computer sciences, the humanities, etc., should all learn from Katie’s example.

    If you can’t or don’t verify cited work, don’t cite it. (full stop)

I haven’t kept statistics on it but it isn’t uncommon to find citations in computer science work that don’t exist, are cited incorrectly and/or don’t support the claims made for them. Most of the “don’t exist” class appear to be conference papers that weren’t accepted or were never completed, but were cited as “going to appear…”

    Someday soon linking of articles will make verification of references much easier than it is today. How will your publications fare on that day?

    December 12, 2014

    Cipherli.st Strong Ciphers for Apache, nginx and Lighttpd

    Filed under: Cybersecurity,Security — Patrick Durusau @ 8:39 pm

    Cipherli.st Strong Ciphers for Apache, nginx and Lighttpd

    Not for the faint of heart or non-sysadmins. I mention it because the news lags behind the latest surveillance outrages of various governments. Your security is your own responsibility.

Network/computer security is a full-time job, which is why your organization should not task someone to be in charge of security as a part-time responsibility or as they have time from other duties.

Over-tasking someone with network/computer security is not a sign of thriftiness, only dumbness.

    Building a Better Word Cloud

    Filed under: R,Visualization,Word Cloud — Patrick Durusau @ 8:28 pm

    Building a Better Word Cloud by Drew Conway.

    From the post:

    A few weeks ago I attended the NYC Data Visualization and Infographics meetup, which included a talk by Junk Charts blogger Kaiser Fung. Given the topic of his blog, I was a bit shocked that the central theme of his talk was comparing good and bad word clouds. He even stated that the word cloud was one of the best data visualizations of the last several years. I do not think there is such a thing as a good word cloud, and after the meetup I left unconvinced; as evidenced by the above tweet.

    This tweet precipitated a brief Twitter debate about the value of word clouds, but from that straw poll it seemed the Nays had the majority. My primary gripe is that space is meaningless in word clouds. They are meant to summarize a single statistics—word frequency—yet they use a two dimensional space to express that. This is frustrating, since it is very easy to abuse the flexibility of these dimensions and conflate the position of a word with its frequency to convey dubious significance.

    This came up on Twitter today even though Drew’s post dates from 2011. Great post though as Drew tries to improve upon the standard word cloud.

Not Drew’s fault, but after reading his post I am where he was at the beginning on word clouds: I don’t see their utility. Perhaps your experience will be different.
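For what it is worth, the single statistic a word cloud encodes is trivial to compute and to present in a form where position does carry meaning, a ranked list rather than a two-dimensional layout:

```python
# The one statistic a word cloud shows, as a plain ranked list
# where position actually carries meaning.
from collections import Counter
import re

text = """Word clouds summarize a single statistic, word frequency,
yet spend two dimensions of space doing it."""

words = re.findall(r"[a-z]+", text.lower())
for word, count in Counter(words).most_common(5):
    print(f"{word:<10} {'#' * count}")
```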

    SlamData

    Filed under: MongoDB,SlamData — Patrick Durusau @ 8:16 pm

    SlamData

    From the about page:

SlamData was formed in early 2014 in recognition that the primary methods for analytics on NoSQL data were far too complex and resource intensive. Even simple questions required learning new technologies, writing complex ETL processes or even coding. We created the SlamData project to address this problem.

    In contrast to legacy vendors, which emphasize trying to make the data fit legacy analytics infrastructure, SlamData focuses on trying to make the analytics infrastructure fit the data.

    The SlamData solution provides a common ANSI SQL compatible interface to NoSQL data. This makes modern NoSQL data accessible to anyone. SlamData retains the leading developers of the SlamData open source project and provides commercial support and training around the open source analytics technology.

    I first encountered SlamData in MongoDB gets its first native analytics tool by Andrew C. Oliver, who writes in part:


    In order to deal with the difference between documents and tables, SlamData extends SQL with an XPath-like notation. Rather than querying from a table name (or collection name), you might query FROM person[*].address[*].city. This should represent a short learning curve for SQL-loving data analysts or power business users, while being inconsequential for developers.

    The power of SlamData resides in its back-end SlamEngine, which implements a multidimensional relational algorithm and deals with the data without reformatting the infrastructure. The JVM (Scala) back end supplies a REST interface, which allows developers to access SlamData’s algorithm for their own uses.

    The SlamData front end and SlamEngine are both open source and waiting for you to download them.

    My major curiosity is about the extension to SQL and the SlamEngine’s “multidimensional relational algorithm.”

    I was planning on setting up MongoDB for something else so perhaps this will be the push to get that project started.

    Enjoy!

    RoboBrain: The World’s First Knowledge Engine For Robots

    Filed under: Artificial Intelligence,Machine Learning — Patrick Durusau @ 8:01 pm

    RoboBrain: The World’s First Knowledge Engine For Robots

    From the post:

    One of the most exciting changes influencing modern life is the ability to search and interact with information on a scale that has never been possible before. All this is thanks to a convergence of technologies that have resulted in services such as Google Now, Siri, Wikipedia and IBM’s Watson supercomputer.

    This gives us answers to a wide range of questions on almost any topic simply by whispering a few words into a smart phone or typing a few characters into a laptop. Part of what makes this possible is that humans are good at coping with ambiguity. So the answer to a simple question such as “how to make cheese on toast” can result in very general instructions that an ordinary person can easily follow.

    For robots, the challenge is quite different. These machines require detailed instructions even for the simplest task. For example, a robot asking a search engine “how to bring sweet tea from the kitchen” is unlikely to get the detail it needs to carry out the task since it requires all kinds of incidental knowledge such as the idea that cups can hold liquid (but not when held upside down), that water comes from taps and can be heated in a kettle or microwave, and so on.

    The truth is that if robots are ever to get useful knowledge from search engines, these databases will have to contain a much more detailed description of every task that they might need to carry out.

    Enter Ashutosh Saxena at Stanford University in Palo Alto and a number of pals, who have set themselves the task of building such knowledge engine for robots.

    These guys have already begun creating a kind of Google for robots that can be freely accessed by any device wishing to carry out a task. At the same time, the database gathers new information about these tasks as robots perform them, thereby learning as it goes. They call their new knowledge engine RoboBrain.


An overview of RoboBrain: Large-Scale Knowledge Engine for Robots (arxiv.org/abs/1412.0691).

    See the website as well: RoboBrain.me

    Not quite AI but something close.

    If nothing else, the project should identify a large amount of tacit knowledge that is generally overlooked.

    Deep Neural Networks are Easily Fooled:…

    Filed under: Deep Learning,Machine Learning,Neural Networks — Patrick Durusau @ 7:47 pm

    Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images by Anh Nguyen, Jason Yosinski, Jeff Clune.

    Abstract:

    Deep neural networks (DNNs) have recently been achieving state-of-the-art performance on a variety of pattern-recognition tasks, most notably visual classification problems. Given that DNNs are now able to classify objects in images with near-human-level performance, questions naturally arise as to what differences remain between computer and human vision. A recent study revealed that changing an image (e.g. of a lion) in a way imperceptible to humans can cause a DNN to label the image as something else entirely (e.g. mislabeling a lion a library). Here we show a related result: it is easy to produce images that are completely unrecognizable to humans, but that state-of-the-art DNNs believe to be recognizable objects with 99.99% confidence (e.g. labeling with certainty that white noise static is a lion). Specifically, we take convolutional neural networks trained to perform well on either the ImageNet or MNIST datasets and then find images with evolutionary algorithms or gradient ascent that DNNs label with high confidence as belonging to each dataset class. It is possible to produce images totally unrecognizable to human eyes that DNNs believe with near certainty are familiar objects. Our results shed light on interesting differences between human vision and current DNNs, and raise questions about the generality of DNN computer vision.

    This is a great paper for weekend reading, even if computer vision isn’t your field, in part because the results were unexpected. Computer science is moving towards being an experimental science, at least in some situations.

    Before you read the article, spend a few minutes thinking about how DNNs and human vision differ.

    I haven’t run it to ground yet, but I wonder whether the authors have stumbled upon a way to deceive deep neural networks outside of computer vision applications. If so, does that suggest experiments that could identify ways to deceive other classification algorithms? And how would you detect such deceptions if they were employed? Still confident about your data processing results?

    I first saw this in a tweet by Gregory Piatetsky.

    Introducing Atlas: Netflix’s Primary Telemetry Platform

    Filed under: BigData,Graphs,Visualization — Patrick Durusau @ 5:15 pm

    Introducing Atlas: Netflix’s Primary Telemetry Platform

    From the post:

    Various previous Tech Blog posts have referred to our centralized monitoring system, and we’ve presented at least one talk about it previously. Today, we want to both discuss the platform and ecosystem we built for time-series telemetry and its capabilities and announce the open-sourcing of its underlying foundation.

    atlas image

    How We Got Here

    While working in the datacenter, telemetry was split between an IT-provisioned commercial product and a tool a Netflix engineer wrote that allowed engineers to send in arbitrary time-series data and then query that data. This tool’s flexibility was very attractive to engineers, so it became the primary system of record for time series data. Sadly, even in the datacenter we found that we had significant problems scaling it to about two million distinct time series. Our global expansion, increase in platforms and customers and desire to improve our production systems’ visibility required us to scale much higher, by an order of magnitude (to 20M metrics) or more. In 2012, we started building Atlas, our next-generation monitoring platform. In late 2012, it started being phased into production, with production deployment completed in early 2013.

    The use of arbitrary key/value pairs to determine a metric’s identity merits a slow read. So does the query language for metrics, which is said “…to allow arbitrarily complex graph expressions to be encoded in a URL friendly way.”
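    To see why tag-based identity merits that slow read, compare it to a single dotted metric name: with key/value tags, the same measurements can be filtered or grouped on any dimension at query time. Here is a small sketch of the data model in Python; the structures are hypothetical and are not Atlas’s actual API (Atlas itself runs on the JVM).

    # Sketch of tag-based metric identity: a metric is identified by a set of
    # key/value pairs rather than a single dotted name, so queries can group or
    # filter on any dimension. Hypothetical structures, not Atlas's actual API.

    from collections import defaultdict


    class TimeSeriesStore:
        def __init__(self):
            # identity = frozenset of (key, value) tags -> list of (timestamp, value)
            self.series = defaultdict(list)

        def record(self, tags, timestamp, value):
            self.series[frozenset(tags.items())].append((timestamp, value))

        def query(self, **filters):
            """Return every series whose tags match all given key/value filters."""
            wanted = set(filters.items())
            return {tags: points for tags, points in self.series.items()
                    if wanted.issubset(tags)}


    store = TimeSeriesStore()
    store.record({"name": "requests", "app": "api", "status": "200"}, 0, 120)
    store.record({"name": "requests", "app": "api", "status": "500"}, 0, 3)
    store.record({"name": "requests", "app": "web", "status": "200"}, 0, 80)

    # Slice the same metric by any tag: all API traffic, or all errors, etc.
    print(store.query(name="requests", app="api"))
    print(store.query(name="requests", status="500"))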

    Posted to Github with a longer introduction here.

    The Wikipedia entry on time series offers this synopsis on time series data:

    A time series is a sequence of data points, typically consisting of successive measurements made over a time interval. Examples of time series are ocean tides, counts of sunspots, and the daily closing value of the Dow Jones Industrial Average. Time series are very frequently plotted via line charts. Time series are used in statistics, signal processing, pattern recognition, econometrics, mathematical finance, weather forecasting, earthquake prediction, electroencephalography, control engineering, astronomy, communications engineering, and largely in any domain of applied science and engineering which involves temporal measurements.

    It looks to me like a number of user communities should be interested in this release from Netflix!

    Speaking of series, it occurs to me that if you count the character lengths of the blanks in the Senate CIA torture report, you should be able to make some fairly good guesses about some of the names.
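    A toy illustration of that guessing game, assuming you have already measured the blank lengths and have a list of candidate names. Real redactions are measured in rendered width rather than character counts, so treat this as nothing more than a first-pass filter:

    # Toy sketch of guessing redacted names by length: given the character width
    # of each blank and a list of candidate names, keep the candidates whose
    # length fits within a small tolerance. Purely illustrative.

    candidates = ["John Smith", "Jane Doe", "A. Officer", "Example Name"]

    def plausible_names(blank_length, names, tolerance=1):
        """Return candidate names whose length is within `tolerance` characters."""
        return [n for n in names if abs(len(n) - blank_length) <= tolerance]

    redacted_blank_lengths = [8, 10, 12]   # measured from the released document

    for length in redacted_blank_lengths:
        print(length, "->", plausible_names(length, candidates))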

    I am hopeful it doesn’t come to that because anyone with access to the full 6,000 page uncensored report has a moral obligation to post it to public servers. Surely there is one person with access to that report with a moral conscience.

    I first saw this in a tweet by Roy Rapoport.

    Global Open Data Index

    Filed under: Government,Government Data,Open Data — Patrick Durusau @ 3:59 pm

    Global Open Data Index

    From the about page:

    For more information on the Open Data Index, you may contact the team at: index@okfn.org

    Each year, governments are making more data available in an open format. The Global Open Data Index tracks whether this data is actually released in a way that is accessible to citizens, media and civil society and is unique in crowd-sourcing its survey of open data releases around the world. Each year the open data community and Open Knowledge produces an annual ranking of countries, peer reviewed by our network of local open data experts.

    Crowd-sourcing this data provides a tool for communities around the world to learn more about the open data available locally and by country, and ensures that the results reflect the experience of civil society in finding open information, rather than government claims. It also ensures that those who actually collect the information that builds the Index are the very people who use the data and are in a strong position to advocate for more and higher quality open data.

    The Global Open Data Index measures and benchmarks the openness of data around the world, and then presents this information in a way that is easy to understand and use. This increases its usefulness as an advocacy tool and broadens its impact.

    In 2014 we are expanding to more countries (from 70 in 2013) with an emphasis on countries of the Global South.

    See the blog post launching the 2014 Index. For more information, please see the FAQ and the methodology section. Join the conversation with our Open Data Census discussion list.

    It is better to have some data rather than none but look at the data by which countries are ranked for openness:

    Transport Timetables, Government Budget, Government Spending, Election Results, Company Register, National Map, National Statistics, Postcodes/Zipcodes, Pollutant Emissions.

    A listing of data that gives the United Kingdom a 97% score and first place.

    It is hard to imagine a less threatening set of data than those listed. I am sure someone will find a use for them, but in the grand scheme of things, they are a distraction from the data that isn’t being released.

    Off-hand, in the United States at least, public data should include who meets with appointed or elected members of government, along with transcripts of those meetings (including phone calls). It should also include all personal or corporate donations of more than $100.00 made to any organization for any reason. It should include documents prepared and/or submitted to the U.S. government and its agencies. And those are just the ones that come to mind rather quickly.

    Current disclosures by the U.S. government are a fiction of openness that conceals a much larger dark data set, waiting to be revealed at some future date.

    I first saw this in a tweet by ChemConnector.

    How to Indict Darren Wilson (Michael Brown Shooting)

    Filed under: Ferguson,Skepticism — Patrick Durusau @ 3:08 pm

    The Missouri Attorney General’s office needs to remove St. Louis Prosecuting Attorney Robert P. McCulloch from the Michael Brown case. Then it should convene a grand jury conducted to represent the public’s interest and not that of Darren Wilson.

    As we saw in Michael Brown Grand Jury – Presenting Evidence Before Knowing the Law, an indictment of Darren Wilson for second degree murder in the death of Michael Brown only requires probable cause (“a reasonable belief that a person has committed a crime”) to find that:

    1. Darren Wilson (a person)
    2. intentionally shot (knowingly causes)
    3. to kill Michael Brown (another person) or
    4. to inflict serious injury on Michael Brown (another person)
    5. and Michael Brown dies (death)

    It need not be a long and drawn out grand jury like the first one.

    Just in case the Missouri Attorney General takes my advice (yeah, right), here is a thumbnail sketch to avoid a repetition of the prior defective grand jury process.

    First witness, the chief investigating officer. Establish a scale map of the area and the locations of Darren Wilson’s vehicle, Darren Wilson’s claimed position and the final location of Michael Brown.

    A map something like:

    Michael Brown map

    (See this map in full at: http://www.washingtonpost.com/wp-srv/special/national/ferguson-witness-map/, it was authored by Richard Johnson.)

    Elicit the following facts from the chief investigating officer:

    1. Michael Brown was in fact unarmed.
    2. Officer Darren Wilson said that he shot Michael Brown. (hearsay is admissible in grand jury proceedings)
    3. Officer Darren Wilson was also armed with police issued Mace at the time of the shooting.
    4. Officer Darren Wilson had pursued Michael Brown for over 100 feet from any initial contact.
    5. Michael Brown’s body had no traces of Mace on it.
    6. Officer Darren Wilson’s issued Mace was unused.
    7. Michael Brown was shot eight (8) times, three of them in the head.
    8. The medical examiner concluded that Michael Brown died as a result of gun shot wounds on 9 August 2014.

    Unnecessary, but to give the grand jury the human side of the story, call the witness from the second floor of the apartment building who testified to the first grand jury:


    Volume 8 – September 30, 2014 – Pages 114–119

    A: Okay. Then my brother noticed, he said wait a minute, looks like they’re struggling. We are looking at the car, we can see them tussling, all right. His head was above the truck for a moment and then it went below it.

    Q: Okay.

    A: All right. And it was still tussling. His friend had backed up a step back on the sidewalk, then we heard a shot. His friend ran this direction, Michael ran to this driveway right here, beside this building.

    Q: Just so we can be clear, this street is Copper Creek Court?

    A: Right.

    Q: So you are saying, you had the pointer, the little laser ——

    A: Right, right here.

    Q: —— at the corner of Canfield Drive and Copper Creek Court?

    A: Right, he had ran towards this way. As he’s running ——

    Q: He’s running east down Canfield?

    A: As he’s running this way, the officer got out of his truck, came around from the back, got to this side where he was now on the driver’s side because he had a clear line of Michael over here. Then he assumed his position with the pistol. As he turned around, as he came around, he was coming up with the gun. He held the gun up like this. (indicating) When he got to here, Michael was standing right on the grass and he was like looking down at his body.

    Q: Okay. Let me stop you here. At this point have you seen anything in Michael’s hands?

    A: No.

    Q: When he was stopped, when they were talking down the street, did you see anything in his hands?

    A: No.

    Q: How about the other boy, anything in his hands?

    A: No.

    Q: They weren’t carrying anything that you saw?

    A: No.

    Q: And then you said, you know how important some of this gesturing has been, right?

    A: Uh—huh, right.

    Q: So they are here to actually witness what you are going to do. And so you say when Michael Brown gets to, is he in the grass actually?

    A: He’s is standing at the very edge. Okay. The driveways are blacktop, he is stopped right at the blacktop right, at the very edge.

    Q: Okay.

    A: His back was turned to the officer.

    Q: Okay.

    A: And he had his hands like this, like he’s looking down at his body to see.

    Q: Okay. Can I ask you to stand up that will really help them to see what you’re doing and he’s stopped now?

    A: He’s stopped with his back towards the officer and he stopped and he was doing this. As he was trying to see where he was shot.

    Q: Okay.

    A: All right.

    Q: Uh—huh.

    A: As he was turning, at that time the officer had already been around to the back of his truck and got into his spot. By the time he got there, while Michael was there, he was slowly turning around and the officer said stop. When Michael turned around, he just put his hands up like this. They were shoulder high, they weren’t above his head, but he did have them up. He had them out like this, all right, palms facing him like this. The officer said stop again. Michael then took a step, a few steps it took for him to get from that blacktop to the street. When he stepped out on the street, the officer said stop one more time and then he fired. He fired three to four shots. When he hit him, he went back. Can I stand?

    Q: Sure.

    A: When he hit him he, did like this, and he went like, like his balance —— he started staggering and he looked up at the officer like why.

    Q: Now, just to be clear, you can’t hear him say anything?

    A: I can’t hear him say that, but he’s looking at him and he is doing, you know. So then as he’s stopped, he’s trying to steady, he starts staggering, my brother says, he’s not going to stand up, he’s getting ready to fall, he’s getting ready to fall. He looks like he was trying to stay on his feet, and he started staggering toward the police officer and he still had his hands up. At some point between the officer’s truck, which by that time this is about 30, 35 feet, when he reached out into the street, he started walking toward the officer, the officer took three steps back and he yelled out stop to Michael again three times. Michael’s steadily walking toward him. More or less to me and to my brothers, he was staggering.

    Q: Okay. To your brothers, did you have more than one brother?

    A: Well, I mean my brother. I didn’t mean to say brothers, my brother. He was staggering, you know. And as he was staggering forward, his head, his body kind of went down at an angle. He was like this, more or less fighting to stay up. You could see his legs wobbling.

    Q: Were his hands the way you had them?

    A: His hands were coming down like this, all right. And he had his head up and he’s facing the officer like this and he is steadily moving, and the officer was moving back, stop. He yelled stop the third time, he let off four more shops, but as he was firing, Michael was falling. After he stopped firing, Michael, he went down face first, smack.


    What do you think? Probable cause for:

    1. Darren Wilson (a person)
    2. intentionally shot – 8 times (knowingly causes)
    3. to kill Michael Brown (another person) or
    4. to inflict serious injury on Michael Brown (another person)
    5. and Michael Brown dies (death)

    Unless you think a police officer yelling “stop” is a license to kill, there is more than enough evidence for probable cause to indict for second degree murder. Total grand jury time, perhaps a day or a day and a half.

    Should the grand jury ask about self-defense, lawful arrest, etc., the proper response is that those are great questions but, under Missouri law, the responsibility for answering them resides with the trier of fact, whether that is a judge or a jury. In a trial, both sides are represented, with a judge to ensure that each side has an opportunity to present its case. In a grand jury proceeding, only the State is represented, so it would be unfair for the State to attempt to represent both sides.

    Don’t be fooled into “accepting” the grand jury’s decision. Another grand jury can and should be chosen to properly consider the Michael Brown shooting. Even more importantly, all those connected to the first grand jury should be investigated to determine who decided to throw the first grand jury. I can’t believe that an assistant prosecutor made that decision all on their own.

    Clojure eXchange 2014

    Filed under: Clojure,Functional Programming,Programming — Patrick Durusau @ 10:59 am

    Clojure eXchange 2014 Videos.

    The short version, sorted by author:

    Resource management in clojure by Rob Ashton

    Super Charging Cyanite by Tom Coupland

    Pathogens & Parentheses: How we use Clojure in the molecular surveillance of infectious disease by Russell Dunphy

    Journey through the looking glass by Chris Ford

    Flow – learnings from writing a ClojureScript DSL by James Henderson

    Tesser: Another Level of Indirection by Kyle Kingsbury

    More Open-Source systems, please by Thomas Kristensen

    Reactive GUI Implemented in Clojure by Denys Lebedev

    Ephemeral-first data structures by Michal Marczyk

    My Componentised Clojure Crusade by Adrian Mowat

    BirdWatch / Building a System in Clojure by Matthias Nehlsen

    Dragonmark: distributed core.async by David Pollak

    Clojure in the service of Her Majesty’s Government by Philip Potter and Rachel Newstead

    DataScript for web development by Nikita Prokopov

    Herding cattle with Clojure at MixRadio by Neil Prosser

    Trojan Horsing Clojure with Javascript by Robert Rees

    Automation, Animation, Art and Dance by Nick Rothwell

    Pragmatic Clojure Performance Testing by Korny Sietsma

    The Future of Clojure by Bodil Stokke

    Developing Clojure in the cloud by Martin Trojer

    Just in case those marathon holiday movies start early this year. At least you can retreat to your computer or suitable mobile device.

    Enjoy!
