An A to Z of…D3 force layout

July 23rd, 2014

An A to Z of extra features for the D3 force layout by Simon Raper.

From the post:

Since d3 can be a little inaccessible at times I thought I’d make things easier by starting with a basic skeleton force directed layout (Mike Bostock’s original example) and then giving you some blocks of code that can be plugged in to add various features that I have found useful.

The idea is that you can pick the features you want and slot in the code. In other words I’ve tried to make things sort of modular. The code I’ve taken from various places and adapted so thank you to everyone who has shared. I will try to provide the credits as far as I remember them!

A great read and an even greater bookmark for graph layouts.

In Simon’s alphabet:

A is for arrows.

B is for breaking links.

C is for collision detection.

F is for fisheye.

H is for highlighting.

L is for labels.

P is for pinning down nodes.

S is for search.

T is for tooltip.

Not only does Simon show the code, he also shows the result of the code.

A model of how to post useful information on D3.

NLTK 3.0 Beta!

July 23rd, 2014

NLTK 3.0 Beta!

The official name is nltk 3.0.0b1 but I thought 3.0 beta rolls off the tongue better. ;-)

Interface changes.

Grab the latest, contribute bug reports, etc.

Commonplace Books at Harvard

July 22nd, 2014

Commonplace Books

From the webpage:

In the most general sense, a commonplace book contains a collection of significant or well-known passages that have been copied and organized in some way, often under topical or thematic headings, in order to serve as a memory aid or reference for the compiler. Commonplace books serve as a means of storing information, so that it may be retrieved and used by the compiler, often in his or her own work.

The commonplace book has its origins in antiquity in the idea of loci communes, or “common places,” under which ideas or arguments could be located in order to be used in different situations. The florilegium, or “gathering of flowers,” of the Middle Ages and early modern era, collected excerpts primarily on religious and theological themes. Commonplace books flourished during the Renaissance and early modern period: students and scholars were encouraged to keep commonplace books for study, and printed commonplace books offered models for organizing and arranging excerpts. In the 17th, 18th, and 19th centuries printed commonplace books, such as John Locke’s A New Method of Making Common-Place-Books (1706), continued to offer new models of arrangement. The practice of commonplacing continued to thrive in the modern era, as writers appropriated the form for compiling passages on various topics, including the law, science, alchemy, ballads, and theology. The manuscript commonplace books in this collection demonstrate varying degrees and diverse methods of organization, reflecting the idiosyncratic interests and practices of individual readers.

A great collection of selections from commonplace books!

I am rather “lite” on posts for the day because I tried to chase down John Locke’s publication of A New Method of Making Common-Place-Books in French, circa 1686/87.

Unfortunately, the scanned version of Bibliotheque Universelle et Historique I was using listed “volumes” when they were actually four (4) issues per year, and the issue containing Locke’s earlier publication is missing. A translation that appears in John Locke, The Works of John Locke in Nine Volumes, (London: Rivington, 1824 12th ed.). Vol. 2 gives this reference:

Translated out of the French from the second volume of Bibliotheque Universelle.

You can view an image of the work at: http://lf-oll.s3.amazonaws.com/titles/762/0128-02df_Bk.pdf on page 441.

Someone who could not read Roman numerals gave varying dates for the “volumes” of Bibliotheque Universelle et Historique which didn’t improve my humor. I will try to find a complete scanned set tomorrow and try to chase down the earlier version of A New Method of Making Common-Place-Books. My concern is the graphic that appears in the translation and what appears to be examples at the end. I wanted to confirm that both appear in the original French version.

Enjoy!

PS: I know, this isn’t as “practical” as functional programming, writing Pig or Cuda code, but on the other hand, understanding where you are going is at least as important as getting there quickly. Yes?

Readings in conflict-free replicated data types

July 22nd, 2014

Readings in conflict-free replicated data types by Christopher Meiklejohn.

From the post:

This is a work in progress post outlining research topics related to conflict-free replicated data types, or CRDTs.

Yesterday, Basho announced the release of Riak 2.0.0 RC1, which contains a comprehensive set of “data types” that can be used for building more robust distributed applications. For an overview of how to use these data types in Riak to avoid custom, and error prone, merge functions, see the Basho documentation site.

You’re probably more familiar with another name for these data types: conflict-free replicated data types (CRDTs). Simply put, CRDTs are data structures which capture some aspect of causality, along with providing interfaces for safely operating over the value and correctly merging state with diverged and concurrently edited structures.

This provides a very useful property when combined with an eventual consistency, or AP-focused, data store: Strong Eventual Consistency (SEC). Strong Eventual Consistency is an even stronger convergence property than eventual consistency: given that all updates are delivered to all replicas, there is no need for conflict resolution, given the conflict-free merge properties of the data structure. Simply put, correct replicas which have received all updates have the same state.

Here’s a great overview by one of the inventors of CRDTs, Marc Shapiro, where he discusses conflict-free replicated data types and their relation to strong eventual consistency.

In this Hacker News thread, there was an interesting discussion about why one might want to implement these on the server, why implementing them is non-trivial, and what the most recent research related to them consists of.

This post serves as a reading guide on the various areas of conflict-free replicated data types. Papers are broken down into various areas and sorted in reverse chronologically.

Relevant to me because the new change tracking in ODF is likely to be informed by CRDTs and because eventually consistent merging is important for distributed topic maps.

Confusion would be the result if the order of merging topics results in different topic maps.

CRDTs are an approach to avoid that unhappy outcome.
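To make the merge-order point concrete, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest CRDTs. It is my own illustration, not code from Christopher’s post or from Riak, and the class name and replica ids are hypothetical. Each replica bumps only its own slot and a merge takes the element-wise maximum, so merging in any order gives the same state.

```python
class GCounter:
    """Grow-only counter: one slot per replica, merged by element-wise max."""

    def __init__(self, counts=None):
        self.counts = dict(counts or {})

    def increment(self, replica_id, amount=1):
        # A replica only ever increases its own slot.
        self.counts[replica_id] = self.counts.get(replica_id, 0) + amount

    def merge(self, other):
        # Element-wise max is commutative, associative and idempotent,
        # so the order in which replicas are merged does not matter.
        keys = self.counts.keys() | other.counts.keys()
        return GCounter(
            {k: max(self.counts.get(k, 0), other.counts.get(k, 0)) for k in keys}
        )

    def value(self):
        return sum(self.counts.values())


# Three hypothetical replicas that have diverged.
a, b, c = GCounter(), GCounter(), GCounter()
a.increment("a", 2)
b.increment("b", 3)
c.increment("c", 1)

one_order = a.merge(b).merge(c)
another_order = c.merge(a).merge(b)
```

Both merge orders end in the same per-replica counts, which is the property you would want from merging topics in a distributed topic map.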

Enjoy!

PS: Remember to grab a copy of Riak 2.0.0 RC1.

Clojure in the Large

July 22nd, 2014

Clojure in the Large by Stuart Sierra.

From the summary:

Stuart Sierra discusses various Clojure features: protocols, records, DI, managing startup/shutdown of components, dynamic binding, interactive development workflow, testing and mocking.

Stuart describes this presentation as “intermediate” level.

Great examples to get you past common problems and “thinking” in Clojure.

Announcing Apache Pig 0.13.0

July 22nd, 2014

Announcing Apache Pig 0.13.0 by Daniel Dai.

From the post:

The Apache Pig community released Pig 0.13. earlier this month. Pig uses a simple scripting language to perform complex transformations on data stored in Apache Hadoop. The Pig community has been working diligently to prepare Pig to take advantage of the DAG processing capabilities in Apache Tez. We also improved usability and performance.

This blog post summarizes the progress we’ve made.

Pig 0.13 improvements

If you missed the Pig 0.13.0 release (I did), here’s a chance to catch up on the latest improvements.

Interactive Documents with R

July 22nd, 2014

Interactive Documents with R by Ramnath Vaidyanathan.

From the webpage:

The main goal of this tutorial is to provide a comprehensive overview of the workflow required to create, customize and share interactive web-friendly documents from R. We will cover a range of topics from how to create interactive visualizations and dashboards, to web pages and applications, straight from R. At the end of this tutorial, attendees will be able to apply these learnings to turn their own analyses and reports into interactive, web-friendly content.

Ramnath gave this tutorial at UseR2014. The slides have now been posted at: http://ramnathv.github.io/user2014-idocs-slides

The tutorial is listed as six (6) separate tutorials:

  1. Interactive Documents
  2. Slidify
  3. Frameworks
  4. Layouts
  5. Widgets
  6. How Slidify Works

I am always impressed by useful interactive web pages. Leaving aside the ones that jump, pop and whizz with no discernible purpose, interactive documents add value to their content for readers.

Enjoy!

I first saw this in a tweet by Gregory Piatetsky.

Why Functional Programming Matters

July 22nd, 2014

Why Functional Programming Matters by John Hughes.

Abstract:

As software becomes more and more complex, it is more and more important to structure it well. Well-structured software is easy to write and to debug, and provides a collection of modules that can be reused to reduce future programming costs. In this paper we show that two features of functional languages in particular, higher-order functions and lazy evaluation, can contribute significantly to modularity. As examples, we manipulate lists and trees, program several numerical algorithms, and implement the alpha-beta heuristic (an algorithm from Artificial Intelligence used in game-playing programs). We conclude that since modularity is the key to successful programming, functional programming offers important advantages for software development.
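For a taste of the two features the abstract names, here is a rough Python sketch that loosely follows the paper’s Newton-Raphson square root example. The paper uses its own functional notation; the Python names below are mine, and generators stand in for lazy lists. The point is that the endless stream of approximations and the rule for stopping are separate pieces that can be reused independently.

```python
from itertools import islice


def iterate(f, x):
    """Lazily yield x, f(x), f(f(x)), ... without end."""
    while True:
        yield x
        x = f(x)


def within(eps, approximations):
    """Consume a lazy sequence until two successive values differ by at most eps."""
    previous = next(approximations)
    for current in approximations:
        if abs(current - previous) <= eps:
            return current
        previous = current


def newton_step(n):
    """Higher-order function: build the Newton-Raphson step for sqrt(n)."""
    return lambda x: (x + n / x) / 2


def newton_sqrt(n, start=1.0, eps=1e-12):
    # The generator and the stopping rule are glued together only here.
    return within(eps, iterate(newton_step(n), start))


# Only as many approximations are computed as the consumer asks for.
first_four = list(islice(iterate(newton_step(2.0), 1.0), 4))
root = newton_sqrt(2.0)
```

Swapping `within` for a different stopping rule, or `newton_step` for a different iteration, needs no change to the other pieces, which is the kind of modularity the paper argues for.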

There’s a bottom line issue you can raise with your project leader or manager. “[R]educe future programming costs.” That it is also true isn’t the relevant point. Management isn’t capable of tracking programming costs well enough to know. It is the rhetoric of cheaper, faster, better that moves money down the management colon.

The same is very likely true for semantic integration tasks.

Many people have pointed out that topic maps can make the work that goes into semantic integration efforts re-usable. Which is true and would be a money saver to boot, but none of that really matters to management. Budgets and their share of a budget for a project are management motivators.

Perhaps the better way to sell the re-usable semantics of topic maps is to say the builders of integration systems can re-use the semantics of a topic map. Customers/users, on the other hand, just get the typical semantically opaque results. Just like now.

So the future cost for semantic integration experts to extend or refresh a current semantic integration solution goes down. They have re-usable semantics that they can re-apply to the customer’s situation. Either script it or even have interns do the heavy lifting. Which helps their bottom line.

Thinking about it that way, creating disclosed semantics for popular information resources would have the same impact. Something to think about.

I first saw this in a tweet by Computer Science.

TeXnicCenter

July 22nd, 2014

TeXnicCenter

From the webpage:

  • integrated LaTeX environment for Windows
  • powerful LaTeX editor with auto completion
  • full UTF-8 support
  • document navigator for easy navigation and referencing
  • tight viewer integration with forward and inverse search
  • quick setup wizard for MiKTeX
  • trusted by millions of users around the world
  • free and open source (GPL)

I need to install this so I can run it in a VM on Ubuntu. ;-)

Seriously, this looks really cool.

Let me know what you think. This could make a great recommendation for Windows users.

I first saw this in a tweet by TeX tips.

Commonplace Books

July 21st, 2014

Commonplace Books as a Source for Networked Knowledge and Combinatorial Creativity by Shane Parrish.

From the post:

“You know that I voluntarily communicated this method to you, as I have done to many others, to whom I believed it would not be unacceptable.”

There is an old saying that the truest form of poverty is “when if you have occasion for any thing, you can’t use it, because you know not where it is laid.”

The flood of information is nothing new.

“In fact,” the Harvard historian Ann Blair writes in her book Too Much to Know: Managing Scholarly Information Before the Modern Age, “many of our current ways of thinking about and handling information descend from patterns of thought and practices that extend back for centuries.” Her book explores “the history of one of the longest-running traditions of information management— the collection and arrangement of textual excerpts designed for consultation.” She calls them reference books.

Large collections of textual material, consisting typically of quotations, examples, or bibliographical references, were used in many times and places as a way of facilitating access to a mass of texts considered authoritative. Reference books have sometimes been mined for evidence about commonly held views on specific topics or the meanings of words, and some (encyclopedias especially) have been studied for the genre they formed.

[...]

No doubt we have access to and must cope with a much greater quantity of information than earlier generations on almost every issue, and we use technologies that are subject to frequent change and hence often new. Nonetheless, the basic methods we deploy are largely similar to those devised centuries ago in early reference books. Early compilations involved various combinations of four crucial operations: storing, sorting, selecting, and summarizing, which I think of as the four S’s of text management. We too store, sort, select, and summarize information, but now we rely not only on human memory, manuscript, and print, as in earlier centuries, but also on computer chips, search functions, data mining, and Wikipedia, along with other electronic techniques.

Knowing some of the background on the commonplace book will be helpful:

Commonplace books (or commonplaces) were a way to compile knowledge, usually by writing information into books. Such books were essentially scrapbooks filled with items of every kind: medical recipes, quotes, letters, poems, tables of weights and measures, proverbs, prayers, legal formulas. Commonplaces were used by readers, writers, students, and scholars as an aid for remembering useful concepts or facts they had learned. Each commonplace book was unique to its creator’s particular interests. They became significant in Early Modern Europe.

“Commonplace” is a translation of the Latin term locus communis (from Greek tópos koinós, see literary topos) which means “a theme or argument of general application”, such as a statement of proverbial wisdom. In this original sense, commonplace books were collections of such sayings, such as John Milton‘s commonplace book. Scholars have expanded this usage to include any manuscript that collects material along a common theme by an individual.

Commonplace books are not diaries nor travelogues, with which they can be contrasted: English Enlightenment philosopher John Locke wrote the 1706 book A New Method of Making a Common Place Book, “in which techniques for entering proverbs, quotations, ideas, speeches were formulated. Locke gave specific advice on how to arrange material by subject and category, using such key topics as love, politics, or religion. Commonplace books, it must be stressed, are not journals, which are chronological and introspective.” By the early eighteenth century they had become an information management device in which a note-taker stored quotations, observations and definitions. They were even used by influential scientists. Carl Linnaeus, for instance, used commonplacing techniques to invent and arrange the nomenclature of his Systema Naturae (which is still used by scientists today).

[footnote links omitted]

Have you ever had a commonplace book?

Impressed enough by Shane’s post to think about keeping one. In hard copy.

Curious how you would replicate a commonplace book in software?

Or perhaps better, what aspects of a commonplace book can you capture in software and what aspects can’t be captured.
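As a starting point for the easy-to-capture part, here is a bare-bones sketch built around Blair’s four S’s (storing, sorting, selecting, summarizing). Everything in it is an assumption of mine: the class name, the choice to key entries by a single lowercase heading, and the idea that a count per heading is a fair “summary.” It says nothing about the harder part, such as the judgment involved in choosing a passage and a heading.

```python
from collections import defaultdict


class CommonplaceBook:
    """Passages filed under topical headings (a hypothetical, minimal model)."""

    def __init__(self):
        self._entries = defaultdict(list)  # heading -> [(passage, source), ...]

    def store(self, heading, passage, source):
        # Store: file the passage under a case-insensitive heading.
        self._entries[heading.lower()].append((passage, source))

    def headings(self):
        # Sort: an alphabetical index of the headings in use.
        return sorted(self._entries)

    def select(self, heading):
        # Select: every passage filed under one heading.
        return list(self._entries.get(heading.lower(), []))

    def summarize(self):
        # Summarize: how many passages sit under each heading.
        return {heading: len(self._entries[heading]) for heading in self.headings()}


book = CommonplaceBook()
book.store("Information", "The flood of information is nothing new.", "Shane Parrish")
book.store("Retrieval", "you know not where it is laid", "old saying, via Shane Parrish")
```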

I first saw this in a tweet by Aaron Kirschenfeld.

You’re not allowed bioinformatics anymore

July 21st, 2014

You’re not allowed bioinformatics anymore by Mick Watson.

Bump this to the head of your polemic reading list! Excellent writing.

To be fair, collaboration with others is a two-way street.

That is, both communities in this tale needed to be reaching out to the other on a continuous basis. It isn’t enough that you offered once or twice and were rebuffed so now you will wait them out.

Successful collaborations don’t start with grudges and bad attitudes about prior failures to collaborate.

I know of two organizations that share common members, operate in the same area and despite both being more than a century old, have had only one, brief, collaborative project.

The collaboration fell apart because leadership in both was waiting for the other to call.

It is hard to sustain a collaboration when both parties consider themselves to be the center of the universe. (I have it on good authority that neither one of them is the center of the universe.)

I can’t promise fame, professional success, etc., but reaching out and genuinely collaborating with others will advance your field of endeavor. Promise.

Enjoy the story.

I first saw this in a tweet by Neil Saunders.

Graffeine

July 21st, 2014

Graffeine by Julian Browne

From the webpage:

Caffeinated Graph Exploration for Neo4J

Graffeine is both a useful interactive demonstrator of graph capability and a simple visual administration interface for small graph databases.

Here it is with the, now canonical, Dr Who graph loaded up:

Dr. Who graph

From the description:

Graffeine plugs into Neo4J and renders nodes and relationships as an interactive D3 SVG graph so you can add, edit, delete and connect nodes. It’s not quite as easy as a whiteboard and a pen, but it’s close, and all interactions are persisted in Neo4J.

You can either make a graph from scratch or browse an existing one using search and paging. You can even “fold” your graph to bring different aspects of it together on the same screen.

Nodes can be added, updated, and removed. New relationships can be made using drag and drop and existing relationships broken.

It’s by no means phpmyadmin for Neo4J, but one day it could be (maybe).

A great example of D3 making visual editing possible.

Christmas in July?

July 21st, 2014

It won’t be Christmas in July but bioinformatics folks will feel like it with the release of the full annotation of the human genome assembly (GRCh38) due to drop at the end of July 2014.

Dan Murphy covers progress on the annotation and information about the upcoming release in: The new human annotation is almost here!

This is an important big data set.

How would you integrate it with other data sets?

I first saw this in a tweet by Neil Saunders.

Deploying Dionaea…

July 21st, 2014

Deploying Dionaea on a Raspberry Pi using MHN

A complete guide, with screenshots, to installing Dionaea on a Raspberry Pi.

MHN = Modern Honey Network.

With enough honeypots, do you think a “crowd” could capture most malware within days of its appearance?

I guess the NSA needs to run a honeypot inside its network firewalls. ;-)

I first saw this in a tweet by Jason Trost.

Security Data Science Papers

July 21st, 2014

Security Data Science Papers by Jason Trost

From the webpage:

Over the past several years I have collected and read many security research papers/slides and have started a small catalog of sorts. The topics of these papers range from intrusion detection, anomaly detection, machine learning/data mining, Internet scale data collection, malware analysis, and intrusion/breach reports. I figured this collection might be useful to others. All links lead to PDFs hosted here.

I hope to clean this up (add author info, date, and publication) when I get some more time as well as adding some detailed notes I have on the various features, models, algorithms, and datasets used in many of these papers.

Here are some of my favorites (nice uses of machine learning, graph analytics, and/or anomaly detection to solve interesting security problems):

Nice looking collection but it doesn’t help a reader decide:

  • Is this the latest word on this problem?
  • What has this author written that is more recent? On this problem or others?
  • Does this paper cover concept X?
  • What does this paper say about concept X?
  • What other papers are there on concept X?
  • How does this paper fit into the domain as defined by it and other papers?

Not that I am picking on Jason. I do the same thing all the time.

Question: What information is the most useful, beyond location for a paper?

Serious question. I know what I look for related to an interesting paper. What do you look for?

I first saw this in a tweet by Adam Sealey.

Ninety-Nine Haskell Problems [Euler/Clojure too]

July 21st, 2014

Ninety-Nine Haskell Problems

From the webpage:

These are Haskell translations of Ninety-Nine Lisp Problems, which are themselves translations of Ninety-Nine Prolog Problems.

Also listed are:

Naming isn’t the only hard problem in computer science. The webpage points out that due to gaps and use of letters, there are 88 problems and not 99.

If you want something a bit more challenging, consider the Project Euler problems. No peeking but there is a wiki with some Clojure answers, http://clojure-euler.wikispaces.com/.

Enjoy!

I first saw this in a tweet by Computer Science.

Non-Moral Case For Diversity

July 21st, 2014

Groups of diverse problem solvers can outperform groups of high-ability problem solvers by Lu Hong and Scott E. Page.

Abstract:

We introduce a general framework for modeling functionally diverse problem-solving agents. In this framework, problem-solving agents possess representations of problems and algorithms that they use to locate solutions. We use this framework to establish a result relevant to group composition. We find that when selecting a problem-solving team from a diverse population of intelligent agents, a team of randomly selected agents outperforms a team comprised of the best-performing agents. This result relies on the intuition that, as the initial pool of problem solvers becomes large, the best-performing agents necessarily become similar in the space of problem solvers. Their relatively greater ability is more than offset by their lack of problem-solving diversity.

I have heard people say that diverse teams are better, but always in the context of contending for members of one group or another to be included on a team.

Reading the paper carefully, I don’t think that is the authors’ point at all.

From the conclusion:

The main result of this paper provides conditions under which, in the limit, a random group of intelligent problem solvers will outperform a group of the best problem solvers. Our result provides insights into the trade-off between diversity and ability. An ideal group would contain high-ability problem solvers who are diverse. But, as we see in the proof of the result, as the pool of problem solvers grows larger, the very best problem solvers must become similar. In the limit, the highest-ability problem solvers cannot be diverse. The result also relies on the size of the random group becoming large. If not, the individual members of the random group may still have substantial overlap in their local optima and not perform well. At the same time, the group size cannot be so large as to prevent the group of the best problem solvers from becoming similar. This effect can also be seen by comparing Table 1. As the group size becomes larger, the group of the best problem solvers becomes more diverse and, not surprisingly, the group performs relatively better.

A further implication of our result is that, in a problem-solving context, a person’s value depends on her ability to improve the collective decision (8). A person’s expected contribution is contextual, depending on the perspectives and heuristics of others who work on the problem. The diversity of an agent’s problem-solving approach, as embedded in her perspective-heuristic pair, relative to the other problem solvers is an important predictor of her value and may be more relevant than her ability to solve the problem on her own. Thus, even if we were to accept the claim that IQ tests, Scholastic Aptitude Test scores, and college grades predict individual problem-solving ability, they may not be as important in determining a person’s potential contribution as a problem solver as would be measures of how differently that person thinks. (emphasis added)

Some people accept gender, race, nationality, etc. as markers for thinking differently and no doubt that is true in some cases. But presuming it is just as uninformed as presuming no differences in how people of different gender, race, and nationalities think.

You could ask. Such as presenting candidates for a team with open ended problems that are capable of multiple solutions. Group similar solutions together and then pick randomly across the solution groups.
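A rough sketch of that selection step, under some loud assumptions: each candidate’s answer has already been labelled by hand with the approach it took (the grouping of “similar solutions” is the hard part and is not automated here), and the function name, candidate names and labels are all hypothetical. It takes one random candidate from each approach group in turn until the team is full.

```python
import random
from collections import defaultdict


def pick_diverse_team(solutions, team_size, seed=None):
    """solutions maps a candidate's name to a label for the approach they took."""
    rng = random.Random(seed)

    # Group candidates whose solutions took the same approach.
    groups = defaultdict(list)
    for candidate, approach in solutions.items():
        groups[approach].append(candidate)

    # Shuffle within each group and shuffle the group order, then take
    # one candidate per group, cycling until the team is full.
    pools = [rng.sample(members, len(members)) for members in groups.values()]
    rng.shuffle(pools)
    team = []
    while len(team) < team_size and any(pools):
        for pool in pools:
            if pool and len(team) < team_size:
                team.append(pool.pop())
    return team


# Hypothetical candidates and hand-assigned approach labels.
solutions = {
    "Ann": "greedy", "Bob": "greedy", "Cy": "greedy",
    "Dee": "simulation", "Eli": "simulation",
    "Fay": "analogy",
}
team = pick_diverse_team(solutions, team_size=3, seed=1)
```

With three approach groups and a team of three, every member comes from a different group, no matter how many candidates crowd into the most popular approach.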

You may have a gender, race, nationality diverse team but if they think the same way, say Antonin Scalia and Clarence Thomas, then your team isn’t usefully diverse.

Diversity of thinking should be your goal, not diversity of markers of diversity.

I first saw this in a tweet by Chris Dixon.

Hadoop Doesn’t Cure HIV

July 21st, 2014

If I were Gartner, I could get IBM to support my stating the obvious. I would have to dress it up by repeating a lot of other obvious things but that seems to be the role for some “analysts.”

If you need proof of that claim, consider this report: Hadoop Is Not a Data Integration Solution. Really? Did any sane person familiar with Hadoop think otherwise?

The “key” findings from the report:

  • Many Hadoop projects perform extract, transform and load workstreams. Although these serve a purpose, the technology lacks the necessary key features and functions of commercially-supported data integration tools.
  • Data integration requires a method for rationalizing inconsistent semantics, which helps developers rationalize various sources of data (depending on some of the metadata and policy capabilities that are entirely absent from the Hadoop stack).
  • Data quality is a key component of any appropriately governed data integration project. The Hadoop stack offers no support for this, other than the individual programmer’s code, one data element at a time, or one program at a time.
  • Because Hadoop workstreams are independent — and separately programmed for specific use cases — there is no method for relating one to another, nor for identifying or reconciling underlying semantic differences.

All true, all obvious and all a function of Hadoop’s design. It never had data integration as a requirement so finding that it doesn’t do data integration isn’t a surprise.

If you switch between “commercially-supported data integration tools,” you will be working “…one data element at a time,” because common data integration tools don’t capture their own semantics. Which means you can’t re-use your prior data integration with one tool when you transition to another. Does that sound like vendor lock-in?

Odd that Gartner didn’t mention that.

Perhaps that’s stating the obvious as well.

A topic mapping of your present data integration solution will enable you to capture and re-use your investment in its semantics, with any data integration solution.

Did I hear someone say “increased ROI?”

German Record Linkage Center

July 20th, 2014

German Record Linkage Center

From the webpage:

The German Record Linkage Center (GermanRLC) was established in 2011 to promote research on record linkage and to facilitate practical applications in Germany. The Center will provide several services related to record linkage applications as well as conduct research on central topics of the field. The services of the GermanRLC are open to all academic disciplines.

Wikipedia describes record linkage as:

Record linkage (RL) refers to the task of finding records in a data set that refer to the same entity across different data sources (e.g., data files, books, websites, databases). Record linkage is necessary when joining data sets based on entities that may or may not share a common identifier (e.g., database key, URI, National identification number), as may be the case due to differences in record shape, storage location, and/or curator style or preference. A data set that has undergone RL-oriented reconciliation may be referred to as being cross-linked. Record Linkage is called Data Linkage in many jurisdictions, but is the same process.

While very similar to topic maps, record linkage relies upon the creation of a common record for further processing, as opposed to pointing into an infoverse to identify subjects in their natural surroundings.

Another difference in practice is that the subjects (headers, fields, etc.) that contain subjects are not themselves treated as subjects with identity. That is to say that how a mapping from an original form was made to the target form is opaque to a subsequent researcher.
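For readers new to the idea, here is a toy sketch of record linkage across two sources that share no identifier. It is only an illustration under my own assumptions: the records are hypothetical, names are compared after a crude normalization, similarity comes from the standard library’s `difflib.SequenceMatcher`, and the 0.85 threshold is arbitrary. Real record linkage uses far more careful blocking and scoring.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop punctuation, and sort the tokens so word order is ignored."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(sorted(cleaned.split()))


def similarity(a, b):
    """Similarity in [0, 1] between two names after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def link_records(left, right, threshold=0.85):
    """Pair each left record with its best-scoring right record, if good enough."""
    links = []
    for record in left:
        best = max(right, key=lambda other: similarity(record["name"], other["name"]))
        score = similarity(record["name"], best["name"])
        if score >= threshold:
            links.append((record["id"], best["id"], round(score, 2)))
    return links


# Hypothetical records from two sources with no shared identifier.
source_a = [{"id": "a1", "name": "Wilf, Herbert S."}, {"id": "a2", "name": "John Hughes"}]
source_b = [{"id": "b7", "name": "Herbert S. Wilf"}, {"id": "b9", "name": "Scott E. Page"}]

links = link_records(source_a, source_b)
```

Note that the output keeps only the pair and a score. The normalization rule and the threshold that produced the match are not recorded anywhere, which is the opacity described above.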

I first saw this in a tweet by Lars Marius Garshol.

Combinatorial Algorithms

July 20th, 2014

Combinatorial Algorithms For Computers and Calculators by Albert Nijenhuis and Herbert S. Wilf. (PDF)

I suspect the word “calculators” betrays the age of this item. This edition was published in 1978. Still, Amazon is currently asking > $50.00 U.S. for it so at least knowing about it can save you some money.

Not that the algorithms covered have changed but the authors say that combinatorics changed enough between 1975 and 1978 to warrant a second edition.

I suspect that is true several times over between 1978 and 2014.

Still, there is value in a different presentation than you would see today, even without the latest content.

I first saw this in a tweet by Atabey Kaygun.

Learn Haskell

July 20th, 2014

Learn Haskell by Chris Allen.

Chris has created a GitHub repository on his recommended path for learning Haskell.

Quite a list of resources but if he has missed anything, please file a pull request.

I first saw this in a tweet by Debasish Ghosh.

SciPy Videos – Title Sort Order

July 20th, 2014

You have probably seen that the SciPy 2014 videos are up! Good News! SciPy 2014.

You may also have noticed that the videos are in no discernible order. Not so good news.

However, I have created a list of the SciPy Videos in Title Sort Order.

Enjoy!

Ad-hoc Biocuration Workflows?

July 19th, 2014

Text-mining-assisted biocuration workflows in Argo by Rafal Rak, et al. (Database (2014) 2014 : bau070 doi: 10.1093/database/bau070)

Abstract:

Biocuration activities have been broadly categorized into the selection of relevant documents, the annotation of biological concepts of interest and identification of interactions between the concepts. Text mining has been shown to have a potential to significantly reduce the effort of biocurators in all the three activities, and various semi-automatic methodologies have been integrated into curation pipelines to support them. We investigate the suitability of Argo, a workbench for building text-mining solutions with the use of a rich graphical user interface, for the process of biocuration. Central to Argo are customizable workflows that users compose by arranging available elementary analytics to form task-specific processing units. A built-in manual annotation editor is the single most used biocuration tool of the workbench, as it allows users to create annotations directly in text, as well as modify or delete annotations created by automatic processing components. Apart from syntactic and semantic analytics, the ever-growing library of components includes several data readers and consumers that support well-established as well as emerging data interchange formats such as XMI, RDF and BioC, which facilitate the interoperability of Argo with other platforms or resources. To validate the suitability of Argo for curation activities, we participated in the BioCreative IV challenge whose purpose was to evaluate Web-based systems addressing user-defined biocuration tasks. Argo proved to have the edge over other systems in terms of flexibility of defining biocuration tasks. As expected, the versatility of the workbench inevitably lengthened the time the curators spent on learning the system before taking on the task, which may have affected the usability of Argo. The participation in the challenge gave us an opportunity to gather valuable feedback and identify areas of improvement, some of which have already been introduced.

Database URL: http://argo.nactem.ac.uk

From the introduction:

Data curation from biomedical literature had been traditionally carried out as an entirely manual effort, in which a curator handpicks relevant documents and creates annotations for elements of interest from scratch. To increase the efficiency of this task, text-mining methodologies have been integrated into curation pipelines. In curating the Biomolecular Interaction Network Database (1), a protein–protein interaction extraction system was used and was shown to be effective in reducing the curation work-load by 70% (2). Similarly, a usability study revealed that the time needed to curate FlyBase records (3) was reduced by 20% with the use of a gene mention recognizer (4). Textpresso (5), a text-mining tool that marks up biomedical entities of interest, was used to semi-automatically curate mentions of Caenorhabditis elegans proteins from the literature and brought about an 8-fold increase in curation efficiency (6). More recently, the series of BioCreative workshops (http://www.biocreative.org) have fostered the synergy between biocuration efforts and text-mining solutions. The user-interactive track of the latest workshop saw nine Web-based systems featuring rich graphical user interfaces designed to perform text-mining-assisted biocuration tasks. The tasks can be broadly categorized into the selection of documents for curation, the annotation of mentions of relevant biological entities in text and the annotation of interactions between biological entities (7).

Argo is a truly impressive text-mining-assisted biocuration application but the first line of a biocuration article needs to read:

Data curation from biomedical literature had been traditionally carried out as an entirely ad-hoc effort, after the author has submitted their paper for publication.

There is an enormous backlog of material that desperately needs biocuration and Argo (and other systems) have a vital role to play in that effort.

However, the situation of ad-hoc biocuration is never going to improve unless and until biocuration is addressed in the authoring of papers to appear in biomedical literature.

Who better to answer questions or resolve ambiguities that arise in biocuration than the authors of the papers?

That would require working to extend MS Office and Apache OpenOffice, to name two of the more common authoring platforms.

But the return would be higher quality publications earlier in the publication cycle, which would enable publishers to provide enhanced services based upon higher quality products and enhance tracing and searching of the end products.

No offense to ad-hoc efforts but higher quality sooner in the publication process seems like an unbeatable deal.

…Ad-hoc Contextual Inquiry

July 19th, 2014

Honing Your Research Skills Through Ad-hoc Contextual Inquiry by Will Hacker.

From the post:

It’s common in our field to hear that we don’t get enough time to regularly practice all the types of research available to us, and that’s often true, given tight project deadlines and limited resources. But one form of user research–contextual inquiry–can be practiced regularly just by watching people use the things around them and asking a few questions.

I started thinking about this after a recent experience returning a rental car to a national brand at the Phoenix, Arizona, airport.

My experience was something like this: I pulled into the appropriate lane and an attendant came up to get the rental papers and send me on my way. But, as soon as he started, someone farther up the lane called loudly to him saying he’d been waiting longer. The attendant looked at me, said “sorry,” and ran ahead to attend to the other customer.

A few seconds later a second attendant came up, took my papers, and jumped into the car to check it in. She was using an app on a tablet that was attached to a large case with a battery pack, which she carried over her shoulder. She started quickly tapping buttons, but I noticed she kept navigating back to the previous screen to tap another button.

Curious being that I am, I asked her if she had to go back and forth like that a lot. She said “yes, I keep hitting the wrong thing and have to go back.”

Will expands his story into why and how to explore random user interactions with technology.

If you want to become better at contextual inquiry and observation, Will has the agenda for you.

He concludes:

Although exercises like this won’t tell us the things we’d like to know about the products we work on, they do let us practice the techniques of contextual inquiry and observation and make us more sensitive to various design issues. These experiences may also help us build the case in more companies for scheduling time and resources for in-field research with our actual customers.

Government-Grade Stealth Malware…

July 19th, 2014

Government-Grade Stealth Malware In Hands Of Criminals by Sara Peters.

From the post:

Malware originally developed for government espionage is now in use by criminals, who are bolting it onto their rootkits and ransomware.

The malware, dubbed Gyges, was first discovered in March by Sentinel Labs, which just released an intelligence report outlining their findings. From the report: “Gyges is an early example of how advanced techniques and code developed by governments for espionage are effectively being repurposed, modularized and coupled with other malware to commit cybercrime.”

Sentinel was able to detect Gyges with on-device heuristic sensors, but many intrusion prevention systems would miss it. The report states that Gyges’ evasion techniques are “significantly more sophisticated” than the payloads attached. It includes anti-detection, anti-tampering, anti-debugging, and anti-reverse-engineering capabilities.

The figure I keep hearing quoted is that cybersecurity attackers are ten years ahead of cybersecurity defenders.

Is that what you hear?

Whatever the actual gap, what makes me curious is why the gap exists at all. I assume the attackers and defenders are on par as far as intelligence, programming skills, financial support, etc., so what is the difference that accounts for the gap?

I don’t have the answer, or even a suspicion of a suggestion, but I suspect someone else does.

Pointers anyone?

First complex, then simple

July 19th, 2014

First complex, then simple by James D Malley and Jason H Moore. (BioData Mining 2014, 7:13)

Abstract:

At the start of a data analysis project it is often suggested that the researcher look first at multiple simple models. That is, always begin with simple, one variable at a time analyses, such as multiple single-variable tests for association or significance. Then, later, somehow (how?) pull all the separate pieces together into a single comprehensive framework, an inclusive data narrative. For detecting true compound effects with more than just marginal associations, this is easily defeated with simple examples. But more critically, it is looking through the data telescope from the wrong end.

I would have titled this article: “Data First, Models Later.”

That is, the authors start with no formal theories about what the data will prove and, upon finding signals in the data, then generate simple models to explain the signals.

I am sure their questions of the data are driven by a suspicion of what the data may prove, but that isn’t the same thing as asking questions designed to prove a model generated before the data is queried.
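
The abstract's claim that one-variable-at-a-time analysis is "easily defeated with simple examples" can be made concrete. The sketch below is an illustration chosen here, not an example taken from the paper: it uses a balanced XOR data set (the names `a`, `b`, `y` and the helper `pearson` are hypothetical), where the outcome depends entirely on two variables together yet neither variable alone shows any association.

```python
def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Balanced toy data: every (a, b) combination appears equally often.
pairs = [(0, 0), (0, 1), (1, 0), (1, 1)] * 250
a = [p[0] for p in pairs]
b = [p[1] for p in pairs]

# The outcome is a pure compound effect: y is 1 only when a and b differ.
y = [ai ^ bi for ai, bi in zip(a, b)]

# Single-variable tests see nothing.
marginal_a = pearson(a, y)
marginal_b = pearson(b, y)

# Looking at both variables together recovers the effect completely.
joint = pearson([ai ^ bi for ai, bi in zip(a, b)], y)

print("a alone:", marginal_a)
print("b alone:", marginal_b)
print("a and b together:", joint)
```

A screen that keeps only variables with a significant marginal association would discard both `a` and `b` here, so the "pull the pieces together later" step would never see them.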

Government Software Design Questions

July 19th, 2014

10 questions to ask when reviewing design work by Ben Terrett.

Ben and a colleague reduced a list of design review questions by Jason Fried down to ten:

10 questions to ask when reviewing design work

1. What is the user need?

2. Is what it says and what it means the same thing?

3. What’s the take away after 3 seconds? (We thought 8 seconds was a bit long.)

4. Who needs to know that?

5. What does someone know now that they didn’t know before?

6. Why is that worth a click?

7. Are we assuming too much?

8. Why that order?

9. What would happen if we got rid of that?

10. How can we make this more obvious?

 

I’m Ben, Director of Design at GDS. You can follow me on twitter @benterrett

A great list for reviewing any design!

Where design doesn’t just mean an interface but presentation of data as well.

I am now following @benterrett and you should too.

It is a healthy reminder that not everyone in government wants to harm their own citizens and others. A minority do but let’s not forget true public servants while opposing tyrants.

I first saw the ten questions post in Nat Torkington’s Four short links: 18 July 2014.

What is deep learning, and why should you care?

July 19th, 2014

What is deep learning, and why should you care? by Pete Warden.

From the post:

When I first ran across the results in the Kaggle image-recognition competitions, I didn’t believe them. I’ve spent years working with machine vision, and the reported accuracy on tricky tasks like distinguishing dogs from cats was beyond anything I’d seen, or imagined I’d see anytime soon. To understand more, I reached out to one of the competitors, Daniel Nouri, and he demonstrated how he used the Decaf open-source project to do so well. Even better, he showed me how he was quickly able to apply it to a whole bunch of other image-recognition problems we had at Jetpac, and produce much better results than my conventional methods.

I’ve never encountered such a big improvement from a technique that was largely unheard of just a couple of years before, so I became obsessed with understanding more. To be able to use it commercially across hundreds of millions of photos, I built my own specialized library to efficiently run prediction on clusters of low-end machines and embedded devices, and I also spent months learning the dark arts of training neural networks. Now I’m keen to share some of what I’ve found, so if you’re curious about what on earth deep learning is, and how it might help you, I’ll be covering the basics in a series of blog posts here on Radar, and in a short upcoming ebook.

Pete gives a brief sketch of “deep learning” and promises more posts and a short ebook to follow.

Along those same lines you will want to see:

Microsoft Challenges Google’s Artificial Brain With ‘Project Adam’ by Daniela Hernandez (WIRED).

If you want in depth (technical) coverage, see: Deep Learning…moving beyond shallow machine learning since 2006! The reading list and references here should keep you busy for some time.

BTW, on “…shallow machine learning…”: you do know the “Dark Ages” really weren’t “dark”? They were so named in the Renaissance to portray a fall into darkness (the Fall of Rome), followed by the return of “light” in the Renaissance. See: Dark Ages (historiography).

Don’t overly credit characterizations of ages or technologies by later ages or newer technologies. They too will be found primitive and superstitious.

HOGWILD!

July 19th, 2014

Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent by Feng Niu, Benjamin Recht, Christopher Ré and Stephen J. Wright.

Abstract:

Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve state-of-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performance-destroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented without any locking. We present an update scheme called Hogwild! which allows processors access to shared memory with the possibility of over-writing each other’s work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then Hogwild! achieves a nearly optimal rate of convergence. We demonstrate experimentally that Hogwild! outperforms alternative schemes that use locking by an order of magnitude. (emphasis in original)
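
To make the idea in the abstract concrete, here is a minimal sketch of the lock-free update pattern, not the authors' implementation. Everything in it is an assumption made for illustration: the toy sparse least-squares problem, the helper names, the step size, and the choice to give each thread a fixed shard of examples rather than sampling at random. Python threads also share a global interpreter lock, so this shows the unsynchronized access pattern only and will not reproduce the paper's speedups.

```python
import random
import threading


def make_sparse_problem(dim, n_examples, nnz, seed=0):
    """Build a noiseless sparse least-squares problem (hypothetical toy data)."""
    rng = random.Random(seed)
    w_true = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    examples = []
    for _ in range(n_examples):
        idx = rng.sample(range(dim), nnz)          # only nnz coordinates are touched
        vals = [rng.uniform(0.5, 1.5) for _ in idx]
        target = sum(w_true[i] * v for i, v in zip(idx, vals))
        examples.append((idx, vals, target))
    return w_true, examples


def mean_squared_error(w, examples):
    total = 0.0
    for idx, vals, target in examples:
        err = sum(w[i] * v for i, v in zip(idx, vals)) - target
        total += err * err
    return total / len(examples)


def worker(w, shard, lr, epochs):
    """Run SGD over one shard, writing to the shared list w with no lock."""
    for _ in range(epochs):
        for idx, vals, target in shard:
            err = sum(w[i] * v for i, v in zip(idx, vals)) - target
            for i, v in zip(idx, vals):
                # Unsynchronized read-modify-write: another thread may
                # overwrite this coordinate between the read and the write.
                w[i] -= lr * err * v


def hogwild_style_sgd(examples, dim, n_threads=4, lr=0.1, epochs=5):
    w = [0.0] * dim                                 # shared decision variable
    shards = [examples[k::n_threads] for k in range(n_threads)]
    threads = [
        threading.Thread(target=worker, args=(w, shard, lr, epochs))
        for shard in shards
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w


w_true, examples = make_sparse_problem(dim=20, n_examples=2000, nnz=2)
initial_mse = mean_squared_error([0.0] * 20, examples)
w = hogwild_style_sgd(examples, dim=20)
final_mse = mean_squared_error(w, examples)
print("MSE before:", initial_mse, "after:", final_mse)
```

Because each example touches only two of the twenty coordinates, most concurrent updates do not collide, which is the sparsity condition the abstract points to.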

From further in the paper:

Our second graph cut problem sought a multi-way cut to determine entity recognition in a large database of web data. We created a data set of clean entity lists from the DBLife website and of entity mentions from the DBLife Web Crawl [11]. The data set consists of 18,167 entities and 180,110 mentions and similarities given by string similarity. In this problem each stochastic gradient step must compute a Euclidean projection onto a simplex of dimension 18,167.

A 9X speedup on 10 cores. (Against Vowpal Wabbit.)

A must read paper.
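
For readers wondering what the "Euclidean projection onto a simplex" in the excerpt involves, here is a small sketch of one widely used sort-based method. It assumes the standard probability simplex (non-negative entries summing to `radius`); the excerpt does not say which algorithm the authors used, so treat the function name and approach as illustrative only.

```python
def project_onto_simplex(v, radius=1.0):
    """Euclidean projection of v onto {x : x_i >= 0, sum(x) == radius}."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        candidate = (cumulative - radius) / k
        # Keep the threshold from the largest k for which the k-th
        # sorted entry would still be positive after shifting.
        if u_k - candidate > 0:
            theta = candidate
    # Shift every entry by the threshold and clip negatives to zero.
    return [max(x - theta, 0.0) for x in v]


p = project_onto_simplex([2.0, 0.0])
q = project_onto_simplex([0.2, 0.1, -0.5])
print(p)
print(q, sum(q))
```

The sort makes each projection O(n log n), which gives a sense of why a step on an 18,167-dimensional simplex is not a trivial per-iteration cost.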

I first saw this in Nat Torkington’s Four short links: 15 July 2014. Nat says:

the algorithm that Microsoft credit with the success of their Adam deep learning system.

Artificial Intelligence | Natural Language Processing

July 18th, 2014

Artificial Intelligence | Natural Language Processing by Christopher Manning.

From the webpage:

This course is designed to introduce students to the fundamental concepts and ideas in natural language processing (NLP), and to get them up to speed with current research in the area. It develops an in-depth understanding of both the algorithms available for the processing of linguistic information and the underlying computational properties of natural languages. Word-level, syntactic, and semantic processing from both a linguistic and an algorithmic perspective are considered. The focus is on modern quantitative techniques in NLP: using large corpora, statistical models for acquisition, disambiguation, and parsing. Also, it examines and constructs representative systems.

Lectures with notes.

If you are new to natural language processing, it would be hard to point at a better starting point.

Enjoy!