USDA Nutrient DB (R Data Package)

August 25th, 2014

USDA Nutrient DB (R Data Package) by Hadley Wickham.

From the webpage:

This package contains all data from the USDA National Nutrient Database, “Composition of Foods Raw, Processed, Prepared”, release 26.

From the data documentation:

The USDA National Nutrient Database for Standard Reference (SR) is the major source of food composition data in the United States. It provides the foundation for most food composition databases in the public and private sectors. As information is updated, new versions of the database are released. This version, Release 26 (SR26), contains data on 8,463 food items and up to 150 food components. It replaces SR25 issued in September 2012.

Updated data have been published electronically on the USDA Nutrient Data Laboratory (NDL) web site since 1992. SR26 includes composition data for all the food groups and nutrients published in the 21 volumes of “Agriculture Handbook 8” (U.S. Department of Agriculture 1976-92), and its four supplements (U.S. Department of Agriculture 1990-93), which superseded the 1963 edition (Watt and Merrill, 1963). SR26 supersedes all previous releases, including the printed versions, in the event of any differences.

The ingredient calculators at most recipe sites are wimpy by comparison. If you really are interested in what you are ingesting on a day-to-day basis, take a walk through this data set.

Some other links of interest:

Release 26 Web Interface

Release 26 page

Correlating this data with online shopping options could be quite useful.

An Introduction to Congress.gov

August 25th, 2014

An Introduction to Congress.gov by Robert Brammer.

From the post:

Barbara Bavis, Ashley Sundin, and I are happy to bring you an introduction to Congress.gov. This video provides a brief explanation of how to use the new features in the latest release, such as accounts, saved searches, member remarks in the Congressional Record, and executive nominations. If you would like more in-depth training on Congress.gov, we hold bi-monthly webinars that are free and available to the public. Our next webinar is scheduled from 2-3 p.m. on September 25, 2014, and you can sign up for it on Law.gov. Do you have an opinion on Congress.gov that you would like to share with us, such as new features that you would like to see added to the site? Please let us know by completing the following survey. Also, if there is something you would like us to cover in a future video, please leave us a comment below.

There are mid-term elections this year (2014) and information on current members of Congress will be widely sought.

The video is only twenty (20) minutes but will help you quickly search a variety of information concerning Congress.

Take special note that once you discover information, the system does not bundle it together for the next searcher.

Exploring a SPARQL endpoint

August 25th, 2014

Exploring a SPARQL endpoint by Bob DuCharme.

From the post:

In the second edition of my book Learning SPARQL, a new chapter titled “A SPARQL Cookbook” includes a section called “Exploring the Data,” which features useful queries for looking around a dataset that you know little or nothing about. I was recently wondering about the data available at the SPARQL endpoint http://data.semanticweb.org/sparql, so to explore it I put several of the queries from this section of the book to work.

An important lesson here is how easy SPARQL and RDF make it to explore a dataset that you know nothing about. If you don’t know about the properties used, or whether any schema or schemas were used and how much they were used, you can just query for this information. Most hypertext links below will execute the queries they describe using semanticweb.org’s SNORQL interface.
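If you want to try this style of exploration programmatically, the flavor of such queries can be sketched in Python. The query strings and the `format=json` parameter below are my own illustration, not queries copied from Bob’s book; check your endpoint’s documentation for its supported result formats.

```python
from urllib.parse import urlencode

# Generic "what is in this endpoint?" queries, in the spirit of the
# "Exploring the Data" section. The exact queries are my own sketch.
EXPLORATION_QUERIES = {
    # Which classes are used, and how heavily?
    "classes": (
        "SELECT ?class (COUNT(?s) AS ?uses) "
        "WHERE { ?s a ?class } "
        "GROUP BY ?class ORDER BY DESC(?uses)"
    ),
    # Which properties are used, and how heavily?
    "properties": (
        "SELECT ?p (COUNT(*) AS ?uses) "
        "WHERE { ?s ?p ?o } "
        "GROUP BY ?p ORDER BY DESC(?uses)"
    ),
}

def endpoint_request(endpoint, query):
    """Build (but do not send) a GET URL for a SPARQL endpoint."""
    return endpoint + "?" + urlencode({"query": query, "format": "json"})
```

Paste the resulting URL into a browser, or fetch it with any HTTP client, and you have a rough census of the dataset’s vocabulary.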

Bob’s ease at using SPARQL reminds me of a story of an ex-spy who was going through customs for the first time in years. As part of that process, he accused a customs officer of having memorized print that was too small to read easily. To which the officer replied, “I am familiar with it.” ;-)

Bob’s book on SPARQL and his blog will help you become a competent SPARQL user.

I don’t suppose SPARQL is any worse off semantically than SQL, which has been in use for decades. It is troubling that I can discover dc:title but have no way to investigate how it was used by a particular content author.

Oh, to be sure, the term dc:title makes sense to me, but that sense is my own smoothing as a reader and may or may not be the same “sense” as occurred to the person who chose such a term.

You can read data sets using your own understanding of tokens but I would do so with a great deal of caution.

Information Aversion

August 25th, 2014

Information Aversion by John Baez.

ostrich

Why do ostriches stick their heads under the sand when they’re scared?

They don’t. So why do people say they do? A Roman named Pliny the Elder might be partially to blame. He wrote that ostriches “imagine, when they have thrust their head and neck into a bush, that the whole of their body is concealed.”

That would be silly—birds aren’t that dumb. But people will actually pay to avoid learning unpleasant facts. It seems irrational to avoid information that could be useful. But people do it. It’s called information aversion.

John reports on an interesting experiment where people really did pay to avoid learning information (about themselves).

Do you think this extends to learning unpleasant information about their present IT software or practices?

Clojure Digest

August 25th, 2014

Clojure Digest by Eric Normand.

Annotated report of four to five resources each week that are relevant to Clojure.

Enough useful information to keep you moving towards effective use of Clojure, but not enough to become a nest of electronic debris (as opposed to a paper one).

Introducing Splainer…

August 25th, 2014

Introducing Splainer — The Open Source Search Sandbox That Tells You Why by Doug Turnbull.

Splainer is a step towards addressing two problems:

From the post:

  • Collaboration: At OpenSource Connections, we believe that collaboration with non-techies is the secret ingredient of search relevancy. We need to arm business analysts and content experts with a human readable version of the explain information so they can inform the search tuning process.
  • Usability: I want to paste a Solr URL, full of query parameters and all, and go! Then, once I see more helpful explain information, I want to tweak (and tweak and tweak) until I get the search results I want. Much like some of my favorite regex tools. Get out of the way and let me tune!
  • ….

    We hope you’ll give it a spin and let us know how it can be improved. We welcome your bugs, feedback, and pull requests. And if you want to try the Splainer experience over multiple queries, with diffing, results grading, a development history, and more — give Quepid a spin for free!

Improving the information content of the tokens you are searching is another way to improve search results.

Desperately Seeking Algorithms!

August 25th, 2014

I don’t know for sure that Christophe Grand is “desperately” seeking algorithms but he has tweeted a request for “favorite algorithms” to be cast into posts similar to:

Tarjan’s strongly connected components algorithm

I dislike algorithms that are full of indices and mutations. Not because they are bad but because I always have the feeling that the core idea is buried. As such, Tarjan’s SCC algorithm irked me.

So I took the traditional algorithm, implemented it in Clojure with explicit environment passing, then I replaced indices by explicit stacks (thanks to persistence!) and after some tweaks, I realized that I had gone full circle and could switch to stack lengths instead of the stacks themselves and get rid of the loop. However the whole process made the code cleaner to my eye. You can look at the whole history.

Here is the resulting code:

See the Tarjan post for the Clojure version. Something similar is planned for “favorite” algorithms.
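For readers who do not want to follow the Clojure, here is a conventional recursive sketch of Tarjan’s algorithm in Python, indices, mutation and all, which is exactly the style Christophe set out to avoid:

```python
def tarjan_scc(graph):
    """Return the strongly connected components of a directed graph.

    graph: dict mapping node -> iterable of successor nodes.
    """
    index = {}        # discovery order of each node
    lowlink = {}      # smallest index reachable from the node's subtree
    on_stack = set()
    stack = []
    sccs = []
    counter = [0]

    def strongconnect(v):
        index[v] = lowlink[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                strongconnect(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:   # v is the root of an SCC
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            sccs.append(component)

    for v in graph:
        if v not in index:
            strongconnect(v)
    return sccs
```

Comparing this against the Clojure version makes Christophe’s point: the stack bookkeeping here is correct, but the core idea is buried in it.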

What algorithm are you going to submit?

Pass this along.

Research topics in e-discovery

August 25th, 2014

Research topics in e-discovery by William Webber.

From the post:

Dr. Dave Lewis is visiting us in Melbourne on a short sabbatical, and yesterday he gave an interesting talk at RMIT University on research topics in e-discovery. We also had Dr. Paul Hunter, Principal Research Scientist at FTI Consulting, in the audience, as well as research academics from RMIT and the University of Melbourne, including Professor Mark Sanderson and Professor Tim Baldwin. The discussion amongst attendees was almost as interesting as the talk itself, and a number of suggestions for fruitful research were raised, many with fairly direct relevance to application development. I thought I’d capture some of these topics here:

E-discovery, if you don’t know, is found in civil litigation and government investigations. Think of it as hacking with rules: the purpose of e-discovery is to find information that supports your claims or defenses. E-discovery is high-stakes data mining that pays very well. Need I say more?

Webber lists the following research topics:

  1. Classification across heterogeneous document types
  2. Automatic detection of document types
  3. Faceted categorization
  4. Label propagation across related documents
  5. Identifying unclassifiable documents
  6. Identifying poor training examples
  7. Identifying significant fragments in non-significant text
  8. Routing of documents to specialized trainers
  9. Total cost of annotation

“Label propagation across related documents” looks like a natural fit for topic maps, but searching over defined properties that identify subjects, as opposed to opaque tokens, would enhance the results for a number of these topics.
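To make topic 4 concrete, here is one naive form of label propagation, a majority vote over a document-similarity graph. This is my own illustration of the idea, not a method proposed in the talk:

```python
from collections import Counter, defaultdict

def propagate(labels, edges, rounds=5):
    """Spread labels from labeled documents to unlabeled neighbors.

    labels: dict doc -> label for the seed (human-reviewed) documents.
    edges:  pairs of related documents (e.g. near-duplicates, same thread).
    Each round, an unlabeled document takes the majority label of its
    labeled neighbors; repeat until convergence or the round limit.
    """
    neigh = defaultdict(set)
    for a, b in edges:
        neigh[a].add(b)
        neigh[b].add(a)
    labels = dict(labels)            # don't mutate the caller's seed labels
    for _ in range(rounds):
        updates = {}
        for doc in neigh:
            if doc not in labels:
                votes = Counter(labels[n] for n in neigh[doc] if n in labels)
                if votes:
                    updates[doc] = votes.most_common(1)[0][0]
        if not updates:              # converged
            break
        labels.update(updates)
    return labels
```

With a topic map behind it, the edges would be typed associations between subjects rather than bare similarity links, which is where the enhancement above comes in.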

Speedy Short and Long DNA Reads

August 25th, 2014

Acceleration of short and long DNA read mapping without loss of accuracy using suffix array by Joaquín Tárraga, et al. (Bioinformatics (2014) doi: 10.1093/bioinformatics/btu553)

Abstract:

HPG Aligner applies suffix arrays for DNA read mapping. This implementation produces a highly sensitive and extremely fast mapping of DNA reads that scales up almost linearly with read length. The approach presented here is faster (over 20x for long reads) and more sensitive (over 98% in a wide range of read lengths) than the current, state-of-the-art mappers. HPG Aligner is not only an optimal alternative for current sequencers but also the only solution available to cope with longer reads and growing throughputs produced by forthcoming sequencing technologies.

Always nice to see an old friend, suffix arrays, in the news!
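The idea is easy to demonstrate. A toy Python version of suffix-array read mapping, nothing like HPG Aligner’s optimized implementation, looks like this:

```python
def suffix_array(text):
    """Sorted starting positions of every suffix (naive O(n^2 log n) build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def map_read(text, sa, read):
    """Binary-search the suffix array for every exact occurrence of read."""
    lo, hi = 0, len(sa)
    while lo < hi:                       # find the first suffix >= read
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(read)] < read:
            lo = mid + 1
        else:
            hi = mid
    hits = []                            # collect suffixes prefixed by read
    while lo < len(sa) and text[sa[lo]:sa[lo] + len(read)] == read:
        hits.append(sa[lo])
        lo += 1
    return sorted(hits)
```

Real aligners add compressed index structures and tolerate mismatches, but the core trick, binary search over sorted suffixes, is the same.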

Source code: https://github.com/opencb/hpg-aligner.

For documentation and software: http://wiki.opencb.org/projects/hpg/doku.php?id=aligner:overview

I first saw this in a tweet by Bioinfocipf.

Where to get help with Common Lisp

August 25th, 2014

Where to get help with Common Lisp by Zach Beane.

An annotated listing of help resources for Common Lisp.

I especially appreciated his quote/comment from one mailing list:

Beware that “the accuracy of postings made to the list is not guaranteed.” (I think that goes for every mailing list, ever.)

;-)

I first saw this in a tweet by Planet Lisp.

Who Dat?

August 24th, 2014

Dat

From the about page:

Dat is a grant-funded, open source project housed in the US Open Data Institute. While dat is a general purpose tool, we have a focus on open science use cases.

The high level goal of the dat project is to build a streaming interface between every database and file storage backend in the world. By building tools to build and share data pipelines we aim to bring to data a style of collaboration similar to what git brings to source code.

The first alpha release is now out!

More on this project later this coming week.

I first saw this in Nat Torkington’s Four short links: 21 August 2014.

Introducing Riffmuse

August 24th, 2014

Introducing Riffmuse by Dave Yarwood.

From the post:

I’ve written a simple command line app in Clojure that will take a musical scale as a command line argument and algorithmically generate a short musical idea or “riff” using the notes in that scale. I call it Riffmuse.

Here’s how Riffmuse works in a nutshell: it takes the command line argument(s), figures out what scale you want (For this, I used the awesome parser generator library Instaparse to create a parser that allows a flexible syntax in specifying scales. C major can be represented as, e.g., “C major,” “CMAJ” or even just “c”), determines what notes are in that scale, comes up with a rhythmic pattern for the notes (represented as 16 “slots” that can either have a note or no note), and then fills the slots with notes from the scale you’ve specified.
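The slot-filling step is simple to sketch. The following Python is my own toy reconstruction of the idea, not Riffmuse’s Clojure code; the scale constant and the density parameter are assumptions:

```python
import random

# One scale as a plain note list; Riffmuse parses these from flexible
# command-line syntax via Instaparse.
C_MAJOR = ["C", "D", "E", "F", "G", "A", "B"]

def riff(scale, slots=16, density=0.6, seed=None):
    """Return a riff: one note name per sounding slot, None for a rest.

    density is the probability that any given slot sounds at all.
    """
    rng = random.Random(seed)
    return [rng.choice(scale) if rng.random() < density else None
            for _ in range(slots)]
```

Seeding the generator makes a riff reproducible, which is handy if you want to keep one you like.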

On the theory that you never know what will capture someone’s interest, this is a post that needs to be shared. It may spark an interest in some future Clojure or music rock star!

I first saw this in a tweet by coderpost.

Pro Daniel

August 23rd, 2014

Michael Daniel, cybersecurity coordinator for the White House is being taken to task for saying:

“You don’t have to be a coder in order to really do well in this position,” Daniel said, when asked if his job required knowledge of the technology behind information security. “In fact, actually, I think being too down in the weeds at the technical level could actually be a little bit of a distraction.”

“You can get taken up and enamored with the very detailed aspects of some of the technical solutions,” he explained, arguing that “the real issue is looking at the broad strategic picture.”

That quote, from White House cybersecurity czar brags about his lack of technical expertise by Timothy B. Lee, has provoked all manner of huffing and puffing across the computer security community.

It reminds me of candidates for the county commission who would brag about their expertise at running graders, backhoes, and similar heavy equipment. Always puzzled me because I assumed county government would hire people with those skills. Commissioners needed skills at representing the county for grants, making policy decisions, etc.

The security community, or at least the reporters purporting to speak for it, doesn’t appear to understand the difference between cybersecurity software and cybersecurity policy. You need coders for the former and policy wonks for the latter. Someone could be both but that’s fairly unlikely.

For example, assume a new security algorithm is discovered that can encrypt telephone and email communications with very little overhead for encryption/decryption. Further assume that Daniel has been assured by none other than Bruce Schneier that the algorithm and software that implements it, performs as advertised. And assume Daniel understands none of the details about the algorithm and software.

How does his ignorance impact the formulation of cybersecurity policy with regard to this algorithm or software?

The FBI opposes it because the FBI prefers non-encrypted communications like in the old days when it could just plug into a phone junction box.

The NSA opposes it, at least for others, because then it could not easily tap into email and phone conversations.

The Department of Defense opposes it, primarily because it has long term contractual relationships for security services with firms that don’t have access to the algorithm.

The Library of Congress supports it, at least those outside of the copyright office support it.

Various other groups take positions that seem reasonable to them.

So, how are coding skills going to help Daniel balance the political, social, agency and other politics for a policy concerning such an algorithm?

We all know the answer to that question.

Not at all.

PS: If my example looks like a strawman, come up with one of your own. Technical expertise Daniel can hire; policy expertise, meaning a sense of what is expedient given the stakeholders and their influence, not so much.

Large-Scale Object Classification…

August 23rd, 2014

Large-Scale Object Classification using Label Relation Graphs by Jia Deng, et al.

Abstract:

In this paper we study how to perform object classification in a principled way that exploits the rich structure of real world labels. We develop a new model that allows encoding of flexible relations between labels. We introduce Hierarchy and Exclusion (HEX) graphs, a new formalism that captures semantic relations between any two labels applied to the same object: mutual exclusion, overlap and subsumption. We then provide rigorous theoretical analysis that illustrates properties of HEX graphs such as consistency, equivalence, and computational implications of the graph structure. Next, we propose a probabilistic classification model based on HEX graphs and show that it enjoys a number of desirable properties. Finally, we evaluate our method using a large-scale benchmark. Empirical results demonstrate that our model can significantly improve object classification by exploiting the label relations.

Let’s hear it for “real world labels!”

By which the authors mean:

  • An object can have more than one label.
  • There are relationships between labels.

From the introduction:

We first introduce Hierarchy and Exclusion (HEX) graphs, a new formalism allowing flexible specification of relations between labels applied to the same object: (1) mutual exclusion (e.g. an object cannot be dog and cat), (2) overlapping (e.g. a husky may or may not be a puppy and vice versa), and (3) subsumption (e.g. all huskies are dogs). We provide theoretical analysis on properties of HEX graphs such as consistency, equivalence, and computational implications.

Next, we propose a probabilistic classification model leveraging HEX graphs. In particular, it is a special type of Conditional Random Field (CRF) that encodes the label relations as pairwise potentials. We show that this model enjoys a number of desirable properties, including flexible encoding of label relations, predictions consistent with label relations, efficient exact inference for typical graphs, learning labels with varying specificity, knowledge transfer, and unification of existing models.
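A minimal sketch of the HEX idea in Python, my own reading of the three relations rather than the authors’ CRF model: close a label set under subsumption, then test every pair for mutual exclusion.

```python
def consistent(labels, exclusions, subsumptions):
    """Check a label assignment against HEX-style relations.

    exclusions:   set of frozenset pairs of mutually exclusive labels.
    subsumptions: dict mapping child label -> parent label.
    Labels related by neither edge simply overlap, which is always allowed.
    Returns (ok, closed) where closed is the subsumption-closed label set.
    """
    closed = set(labels)
    changed = True
    while changed:                  # transitive closure over child -> parent
        changed = False
        for child, parent in subsumptions.items():
            if child in closed and parent not in closed:
                closed.add(parent)
                changed = True
    for a in closed:
        for b in closed:
            if a != b and frozenset((a, b)) in exclusions:
                return False, closed
    return True, closed
```

So “husky” closes to {husky, dog}, and “husky plus cat” is rejected because the closure trips the dog/cat exclusion.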

Having more than one label is trivially possible in topic maps. The more interesting case is the authors choosing to treat semantic labels as subjects and to define permitted associations between those subjects.

A world of possibilities opens up when you can treat something as a subject that can have relationships defined to other subjects. Noting that those relationships can also be treated as subjects should someone desire to do so.

I first saw this at: Is that husky a puppy?

Data + Design

August 23rd, 2014

Data + Design: A simple introduction to preparing and visualizing information by Trina Chiasson, Dyanna Gregory and others.

From the webpage:

ABOUT

Information design is about understanding data.

Whether you’re writing an article for your newspaper, showing the results of a campaign, introducing your academic research, illustrating your team’s performance metrics, or shedding light on civic issues, you need to know how to present your data so that other people can understand it.

Regardless of what tools you use to collect data and build visualizations, as an author you need to make decisions around your subjects and datasets in order to tell a good story. And for that, you need to understand key topics in collecting, cleaning, and visualizing data.

This free, Creative Commons-licensed e-book explains important data concepts in simple language. Think of it as an in-depth data FAQ for graphic designers, content producers, and less-technical folks who want some extra help knowing where to begin, and what to watch out for when visualizing information.

As of today, Data + Design is the product of fifty (50) volunteers from fourteen (14) countries. At eighteen (18) chapters and just shy of three-hundred (300) pages, this is a solid introduction to data and its visualization.

The source code is on GitHub, along with information on how you can contribute to this project.

A great starting place, but my social science background is responsible for my caution concerning chapters 3 and 4 on survey design and questions.

All of the information and advice in those chapters is good, but it leaves the impression that you (the reader) can design an effective survey instrument. There is a big difference between an “effective” survey instrument and a series of questions pretending to be a survey instrument. Both will measure “something,” but the question is whether a survey instrument provides you with actionable intelligence.

For a survey on anything remotely mission critical, like user feedback on an interface or service, get as much professional help as you can afford.

When was the last time you heard of a candidate for political office or a serious vendor using Survey Monkey? There’s a reason for that. Can you guess it?

I first saw this in a tweet by Meta Brown.

NLM RSS Feeds

August 23rd, 2014

National Library of Medicine RSS Feeds

RSS feeds covering a broad range of National Library of Medicine activities.

I am reporting it here because as soon as I don’t, I will need the listing.

NLM Technical Bulletin

August 23rd, 2014

NLM Technical Bulletin

A publication of the U.S. National Library of Medicine (NLM). The about page for NLM gives the following overview:

The National Library of Medicine (NLM), on the campus of the National Institutes of Health in Bethesda, Maryland, has been a center of information innovation since its founding in 1836. The world’s largest biomedical library, NLM maintains and makes available a vast print collection and produces electronic information resources on a wide range of topics that are searched billions of times each year by millions of people around the globe. It also supports and conducts research, development, and training in biomedical informatics and health information technology. In addition, the Library coordinates a 6,000-member National Network of Libraries of Medicine that promotes and provides access to health information in communities across the United States.

The bulletin about page says:

The NLM Technical Bulletin, your source for the latest searching information, is produced by: MEDLARS Management Section, National Library of Medicine, Bethesda, Maryland, USA.

Which is true but seems inadequate to describe the richness of what you can find at the bulletin.

For example, in 2014 July–August No. 399 you find:

MeSH on Demand Update: How to Find Citations Related to Your Text

New CMT Subsets Available

New Tutorial: Searching Drugs or Chemicals in PubMed

If medical terminology touches your field of interest, this is a must read.

MeSH on Demand Tool:…

August 23rd, 2014

MeSH on Demand Tool: An Easy Way to Identify Relevant MeSH Terms by Dan Cho.

From the post:

Currently, the MeSH Browser allows for searches of MeSH terms, text-word searches of the Annotation and Scope Note, and searches of various fields for chemicals. These searches assume that users are familiar with MeSH terms and using the MeSH Browser.

Wouldn’t it be great if you could find MeSH terms directly from your text such as an abstract or grant summary? MeSH on Demand has been developed in close collaboration among MeSH Section, NLM Index Section, and the Lister Hill National Center for Biomedical Communications to address this need.

Using MeSH on Demand

Use MeSH on Demand to find MeSH terms relevant to your text up to 10,000 characters. One of the strengths of MeSH on Demand is its ease of use without any prior knowledge of the MeSH vocabulary and without any downloads.

Now there’s a clever idea!

Imagine extending it just a bit so that it produces topics for subjects it detects in your text and associations with the text and author of the text. I would call that assisted topic map authoring. You?
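A toy version of the idea in Python. MeSH on Demand itself relies on NLM’s indexing machinery, not simple string matching, and the descriptor IDs below are placeholders, not real MeSH identifiers:

```python
import re

def suggest_terms(text, vocabulary):
    """Return vocabulary terms that appear (case-insensitively) in the text.

    vocabulary: dict mapping a term string -> its descriptor ID.
    """
    found = {}
    for term, descriptor in vocabulary.items():
        # Whole-word match so "asthma" does not fire on "asthmatic" alone.
        if re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE):
            found[term] = descriptor
    return found
```

The assisted-authoring step would then turn each hit into a topic and record an association between the topic, the text, and its author.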

I followed a tweet by Michael Hoffman, which led to: MeSH on Demand Update: How to Find Citations Related to Your Text, which describes an enhancement to MeSH on Demand that finds ten (10) relevant citations based on your text.

The enhanced version mimics the traditional method of writing court opinions. A judge writes his decision and then a law clerk finds cases that support the positions taken in the opinion. You really thought it worked some other way? ;-)

Imaging Planets and Disks [Not in our Solar System]

August 22nd, 2014

Videos From the 2014 Sagan Summer Workshop On-line

From the post:

The NASA Exoplanet Science Center (NEXScI) hosts the Sagan Workshops, annual themed conferences aimed at introducing the latest techniques in exoplanet astronomy to young researchers. The workshops emphasize interaction with data, and include hands-on sessions where participants use their laptops to follow step-by-step tutorials given by experts. This year’s conference topic was “Imaging Planets and Disks”. It covered topics such as:

  • Properties of Imaged Planets
  • Integrating Imaging and RV Datasets
  • Thermal Evolution of Planets
  • The Challenges and Science of Protostellar And Debris Disks…

You can see the agenda and the presentations here, and the videos have been posted here. Some of the talks are also on youtube at https://www.youtube.com/channel/UCytsRiMvdj5VTZWfj6dBadQ

The presentations showcase the extraordinary richness of exoplanet research. If you are unfamiliar with NASA’s exoplanet program, Gary Lockwood provides an introduction (not available for embedding – visit the web page). My favorite talk, of many good ones, was Travis Barman speaking on the “Crown Jewels of Young Exoplanets.”

Looking to expand your data processing horizons? ;-)

Enjoy!

Manhattan District History

August 22nd, 2014

Manhattan District History

From the post:

General Leslie Groves, head of the Manhattan Engineer District, in late 1944 commissioned a multi-volume history of the Manhattan Project called the Manhattan District History. Prepared by multiple authors under the general editorship of Gavin Hadden, a longtime civil employee of the Army Corps of Engineers, the classified history was “intended to describe, in simple terms, easily understood by the average reader, just what the Manhattan District did, and how, when, and where.” The volumes record the Manhattan Project’s activities and achievements in research, design, construction, operation, and administration, assembling a vast amount of information in a systematic, readily available form. The Manhattan District History contains extensive annotations, statistical tables, charts, engineering drawings, maps, photographs, and detailed indices. Only a handful of copies of the history were prepared. The Department of Energy’s Office of History and Heritage Resources is custodian of one of these copies.

The history is arranged in thirty-six volumes grouped in eight books. Some of the volumes were further divided into stand-alone chapters. Several of the volumes and stand-alone chapters were never security classified. Many of the volumes and chapters were declassified at various times and were available to the public on microfilm. Parts of approximately a third of the volumes remain classified.

The Office of Classification and the Office of History and Heritage Resources, in collaboration with the Department’s Office of Science and Technical Information, have made the full-text of the entire thirty-six volume Manhattan District History available on this OpenNet website. Unclassified and declassified volumes have been scanned and posted. Classified volumes were declassified in full or with redactions, i.e., still classified terms, phrases, sentences, and paragraphs were removed and the remaining unclassified parts made available to the public. All volumes have been posted.

In case you are interested in the Manhattan project generally or want to follow its participants into the late 20th century, this is the resource for you!

It just occurred to me that the 1940 Census Records are now online. What other records would you want to map together from this time period?

I first saw this in a tweet by Michael Nielsen.

Getty Thesaurus of Geographic Names (TGN)

August 22nd, 2014

Getty Thesaurus of Geographic Names Released as Linked Open Data by James Cuno.

From the post:

We’re delighted to announce that the Getty Research Institute has released the Getty Thesaurus of Geographic Names (TGN)® as Linked Open Data. This represents an important step in the Getty’s ongoing work to make our knowledge resources freely available to all.

Following the release of the Art & Architecture Thesaurus (AAT)® in February, TGN is now the second of the four Getty vocabularies to be made entirely free to download, share, and modify. Both data sets are available for download at vocab.getty.edu under an Open Data Commons Attribution License (ODC BY 1.0).

What Is TGN?

The Getty Thesaurus of Geographic Names is a resource of over 2,000,000 names of current and historical places, including cities, archaeological sites, nations, and physical features. It focuses mainly on places relevant to art, architecture, archaeology, art conservation, and related fields.

TGN is powerful for humanities research because of its linkages to the three other Getty vocabularies—the Union List of Artist Names, the Art & Architecture Thesaurus, and the Cultural Objects Name Authority. Together the vocabularies provide a suite of research resources covering a vast range of places, makers, objects, and artistic concepts. The work of three decades, the Getty vocabularies are living resources that continue to grow and improve.

Because they serve as standard references for cataloguing, the Getty vocabularies are also the conduits through which data published by museums, archives, libraries, and other cultural institutions can find and connect to each other.

A resource where you could lose some serious time!

Try this entry for London.

Or Paris.

Bear in mind the data that underlies this rich display is now available for free downloading.

The Truth About Triplestores [Opaqueness]

August 22nd, 2014

The Truth About Triplestores

A vendor “truth” document from Ontotext. Not that being from a vendor is a bad thing, but you should always consider the source of a document when evaluating its claims.

Quite naturally I jumped to: “6. Data Integration & Identity Resolution: Identifying the same entity across disparate data sources.”

With so many different databases and systems existing inside any single organization, how do companies integrate all of their data? How do they recognize that an entity in one database is the same entity in a completely separate database?

Resolving identities across disparate sources can be tricky. First, they need to be identified and then linked.

To do this effectively, you need two things. Earlier, we mentioned that through the use of text analysis, the same entity spelled differently can be recognized. Once this happens, the references to entities need to be stored correctly in the triplestore. The triplestore needs to support predicates that can declare two different Universal Resource Indicators (URIs) as one in the same. By doing this, you can align the same real-world entity used in different data sources. The most standard and powerful predicate used to establish mappings between multiple URIs of a single object is owl:sameAs. In turn, this allows you to very easily merge information from multiple sources including linked open data or proprietary sources. The ability to recognize entities across multiple sources holds great promise helping to manage your data more effectively and pinpointing connections in your data that may be masked by slightly different entity references. Merging this information produces more accurate results, a clearer picture of how entities are related to one another and the ability to improve the speed with which your organization operates.

In case you are unfamiliar with owl:sameAs, here is an example from the OWL Web Ontology Language Reference:

<rdf:Description rdf:about="#William_Jefferson_Clinton">
  <owl:sameAs rdf:resource="#BillClinton"/>
</rdf:Description>

The owl:sameAs in this case is opaque because there is no way to express why an author thought #William_Jefferson_Clinton and #BillClinton were about the same subject. You could argue that any prostitute in Colombia would recognize that mapping, so let’s try a harder case.

<rdf:Description rdf:about="#United States of America">
  <owl:sameAs rdf:resource="#الولايات المتحدة الأمريكية"/>
</rdf:Description>

Less confident than you were about the first one?

The problem with owl:sameAs is its opaqueness. You don’t know why an author used owl:sameAs. You don’t know what property or properties they saw that caused them to use one of the various understandings of owl:sameAs.

Without knowing those properties, accepting any owl:sameAs mapping is buying a pig in a poke. Not a proposition that interests me. You?
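What would a less opaque mapping look like? Here is one sketch (my own invention, not any W3C standard — the `ex:` vocabulary is hypothetical): instead of a bare owl:sameAs, record the mapping as a resource of its own, carrying the properties that justified it:

```xml
<!-- A sketch only: "ex:" is a hypothetical vocabulary for recording
     the basis of an identity mapping, not a standard. -->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/mapping#">
  <rdf:Description rdf:about="#ClintonMapping">
    <ex:mapsFrom rdf:resource="#William_Jefferson_Clinton"/>
    <ex:mapsTo rdf:resource="#BillClinton"/>
    <ex:basis>Shared birth date (1946-08-19) and office
      (42nd President of the United States).</ex:basis>
  </rdf:Description>
</rdf:RDF>
```

A later user (or program) can then accept or reject the mapping on the stated basis, rather than taking the author’s word for it.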

I first saw this in a tweet by graphityhq.

Computer Science – Know Thyself!

August 22nd, 2014

Putting the science in computer science by Felienne Hermans.

From the description:

Programmers love science! At least, so they say. Because when it comes to the ‘science’ of developing code, the most used tool is brutal debate. Vim versus emacs, static versus dynamic typing, Java versus C#, this can go on for hours on end. In this session, software engineering professor Felienne Hermans will present the latest research in software engineering that tries to understand and explain what programming methods, languages and tools are best suited for different types of development.

Great slides from Felienne’s keynote at ALE 2014.

I mention this to emphasize the need for social science research techniques and methodologies in application development. Investigating computer science debates with such methods may lead to less resistance to applying them to user-facing issues.

Perhaps a recognition that we are all “users,” bringing common human experiences to different interfaces with computers, will result in better interfaces for all.

Data Carpentry (+ Sorted Nordic Scores)

August 21st, 2014

Data Carpentry by David Mimno.

From the post:

The New York Times has an article titled For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights. Mostly I really like it. The fact that raw data is rarely usable for analysis without significant work is a point I try hard to make with my students. I told them “do not underestimate the difficulty of data preparation”. When they turned in their projects, many of them reported that they had underestimated the difficulty of data preparation. Recognizing this as a hard problem is great.

What I’m less thrilled about is calling this “janitor work”. For one thing, it’s not particularly respectful of custodians, whose work I really appreciate. But it also mischaracterizes what this type of work is about. I’d like to propose a different analogy that I think fits a lot better: data carpentry.

Note: data carpentry seems to already be a thing

I’m not convinced that “carpentry” is the best prestige target.

The first mention of carpenters on a sorted version of the Nordic Scores (Colorado Adoption Project: Resources for Researchers. Institute for Behavioral Genetics, University of Colorado Boulder) is at 147.*

I would go for data scientist since mercenary isn’t listed as an occupation. ;-)

The usual cautions apply. Prestige is as difficult to measure as any other social construct, perhaps more so. The data is from 1989 and so may not reflect “current” prestige rankings.

*(I have removed the classes and sorted by prestige score, to create Sorted Nordic Scores.)

…Loosely Consistent Distributed Programming

August 21st, 2014

Language Support for Loosely Consistent Distributed Programming by Neil Conway.

Abstract:

Driven by the widespread adoption of both cloud computing and mobile devices, distributed computing is increasingly commonplace. As a result, a growing proportion of developers must tackle the complexity of distributed programming—that is, they must ensure correct application behavior in the face of asynchrony, concurrency, and partial failure.

To help address these difficulties, developers have traditionally relied upon system infrastructure that provides strong consistency guarantees (e.g., consensus protocols and distributed transactions). These mechanisms hide much of the complexity of distributed computing—for example, by allowing programmers to assume that all nodes observe the same set of events in the same order. Unfortunately, providing such strong guarantees becomes increasingly expensive as the scale of the system grows, resulting in availability and latency costs that are unacceptable for many modern applications.

Hence, many developers have explored building applications that only require loose consistency guarantees—for example, storage systems that only guarantee that all replicas eventually converge to the same state, meaning that a replica might exhibit an arbitrary state at any particular time. Adopting loose consistency involves making a well-known tradeoff: developers can avoid paying the latency and availability costs incurred by mechanisms for achieving strong consistency, but in exchange they must deal with the full complexity of distributed computing. As a result, achieving correct application behavior in this environment is very difficult.

This thesis explores how to aid developers of loosely consistent applications by providing programming language support for the difficulties they face. The language level is a natural place to tackle this problem: because developers that use loose consistency have fewer system facilities that they can depend on, consistency concerns are naturally pushed into application logic. In part, our goal has been to recognize, formalize, and automate application-level consistency patterns.

We describe three language variants that each tackle a different challenge in distributed programming. Each variant is a modification of Bloom, a declarative language for distributed programming we have developed at UC Berkeley. The first variant of Bloom, Bloom^L, enables deterministic distributed programming without the need for distributed coordination. Second, Edelweiss allows distributed storage reclamation protocols to be generated in a safe and automatic fashion. Finally, Bloom^PO adds sophisticated ordering constraints that we use to develop a declarative, high-level implementation of concurrent editing, a particularly difficult class of loosely consistent programs.
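Bloom itself is beyond a short excerpt, but the flavor of “loose consistency that still converges” can be sketched with a grow-only counter CRDT (my illustration, not code from the thesis): replicas update independently and merge state in any order, yet always end up agreeing.

```python
# A grow-only counter (G-counter) CRDT: each replica increments only its
# own slot; merge takes the element-wise maximum. Because merge is
# commutative, associative, and idempotent, replicas converge regardless
# of the order in which updates are exchanged -- no coordination needed.

class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica_id -> count

    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other):
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

    def value(self):
        return sum(self.counts.values())

# Two replicas diverge, then exchange state.
a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b)   # a now reflects both updates
b.merge(a)   # b converges to the same state
print(a.value(), b.value())  # both are 5
```

The price, as the abstract notes, is that anything more complicated than a monotonically growing value pushes the consistency reasoning back into application logic.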

Unless you think of topic maps as static files, recent developments in “loosely consistent distributed programming” should be high on your reading list.

It’s entirely possible to have a topic map that is a static file, even one that has been printed out to paper. But that seems like a poor target for development. Captured information begins progressing towards staleness from the moment of its capture.

I first saw this in a tweet by Peter Bailis.

The Little Book of Semaphores

August 21st, 2014

The Little Book of Semaphores by Allen Downey.

From the webpage:

The Little Book of Semaphores is a free (in both senses of the word) textbook that introduces the principles of synchronization for concurrent programming.

In most computer science curricula, synchronization is a module in an Operating Systems class. OS textbooks present a standard set of problems with a standard set of solutions, but most students don’t get a good understanding of the material or the ability to solve similar problems.

The approach of this book is to identify patterns that are useful for a variety of synchronization problems and then show how they can be assembled into solutions. After each problem, the book offers a hint before showing a solution, giving students a better chance of discovering solutions on their own.

The book covers the classical problems, including “Readers-writers,” “Producer-consumer”, and “Dining Philosophers.” In addition, it collects a number of not-so-classical problems, some written by the author and some by other teachers and textbook writers. Readers are invited to create and submit new problems.

If you want a deep understanding of concurrency, this looks like a very good place to start!

Some of the more colorful problem names:

  • The dining savages problem
  • The Santa Claus problem
  • The unisex bathroom problem
  • The Senate Bus problem

There are problems (and patterns) for your discovery and enjoyment!
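As a taste of the book’s subject matter (this sketch is mine, not Downey’s), here is the classic producer-consumer problem built from two semaphores and a lock:

```python
import threading
from collections import deque

# Producer-consumer with semaphores: `items` counts filled slots,
# `spaces` counts free slots, and a lock guards the shared buffer.
buffer, CAPACITY = deque(), 3
items = threading.Semaphore(0)
spaces = threading.Semaphore(CAPACITY)
lock = threading.Lock()
results = []

def producer():
    for i in range(5):
        spaces.acquire()          # wait for a free slot
        with lock:
            buffer.append(i)
        items.release()           # signal one filled slot

def consumer():
    for _ in range(5):
        items.acquire()           # wait for a filled slot
        with lock:
            results.append(buffer.popleft())
        spaces.release()          # signal one free slot

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(results)  # [0, 1, 2, 3, 4]
```

With one producer and one consumer the FIFO buffer preserves order; the interesting cases in the book multiply the producers, consumers, or constraints.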

I first saw this in a tweet by Computer Science.

CSV Fingerprints

August 21st, 2014

CSV Fingerprints by Victor Powell.

From the post:

CSV is a simple and common format for tabular data that uses commas to separate columns and newlines to separate rows. Nearly every spreadsheet and database program lets users import from and export to CSV. But until recently, these programs varied in how they treated special cases, like when the data itself has a comma in it.

It’s easy to make a mistake when you try to make a CSV file fit a particular format. To make it easier to spot mistakes, I’ve made a “CSV Fingerprint” viewer (named after the “Fashion Fingerprints” from The New York Times’s “Front Row to Fashion Week” interactive). The idea is to provide a bird’s-eye view of the file without too much distracting detail. The idea is similar to Tufte’s Image Quilts…a qualitative view, as opposed to a rendering of the data in the file itself. In this sense, the CSV Fingerprint is a sort of meta visualization.

This is very clever. Not only can you test a CSV snippet on the webpage, but the source code is on GitHub: https://github.com/setosa/csv-fingerprint.

Of course, it does rely on the most powerful image processing system known to date. Err, that would be you. ;-)

Pass this along. I can imagine any number of data miners who will be glad you did.
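The special case the post mentions — a comma inside the data itself — is handled by quoting, which a real CSV parser respects and a naive string split does not. A small example of my own, using Python’s standard csv module:

```python
import csv
import io

# A field containing a comma must be quoted; csv handles this, while a
# naive split(",") mangles the field.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["name", "motto"])
writer.writerow(["ACME", "Faster, cheaper, better"])
text = out.getvalue()
print(text)
# name,motto
# ACME,"Faster, cheaper, better"

rows = list(csv.reader(io.StringIO(text)))
print(rows[1])                          # ['ACME', 'Faster, cheaper, better']
print(text.splitlines()[1].split(","))  # naive split breaks the field apart
```

Variations in exactly this kind of quoting and escaping behavior across programs are what the fingerprint view helps you spot.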

Math for machine learning

August 20th, 2014

Math for machine learning by Zygmunt Zając.

From the post:

Sometimes people ask what math they need for machine learning. The answer depends on what you want to do, but in short our opinion is that it is good to have some familiarity with linear algebra and multivariate differentiation.

Linear algebra is a cornerstone because everything in machine learning is a vector or a matrix. Dot products, distance, matrix factorization, eigenvalues etc. come up all the time.

Differentiation matters because of gradient descent. Again, gradient descent is almost everywhere*. It found its way even into the tree domain in the form of gradient boosting – a gradient descent in function space.

We file probability under statistics and that’s why we don’t mention it here.

Following this introduction you will find a series of books, MOOCs, etc. on linear algebra, calculus and other math resources.

Pass it along!
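The gradient descent the post treats as ubiquitous fits in a few lines. Here is a sketch of my own, minimizing a one-dimensional quadratic:

```python
# Gradient descent on f(x) = (x - 3)^2, whose gradient is 2*(x - 3).
# Each step moves against the gradient; the minimum is at x = 3.

def grad(x):
    return 2 * (x - 3)

x, learning_rate = 0.0, 0.1
for _ in range(100):
    x -= learning_rate * grad(x)

print(round(x, 6))  # converges to 3.0
```

The same loop, with the gradient computed by the chain rule over vectors and matrices, is the workhorse behind most model fitting — which is why the post pairs multivariate differentiation with linear algebra.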

Mapping Out Lambda Land:…

August 20th, 2014

Mapping Out Lambda Land: An Introduction to Functional Programming by Katie Miller.

From the post:

Anyone who has met me will probably know that I am wildly enthusiastic about functional programming (FP). I co-founded a group for women in FP, have presented a series of talks and workshops about functional concepts, and have even been known to create lambda-branded clothing and jewellery. In this blog post, I will try to give some insight into what the fuss is about. I will briefly explain what functional programming is, why you should care, and how you can use OpenShift to learn more about FP.

With the publicity around OpenShift and functional programming, it seems entirely reasonable to put them together.

Katie gives you a quick overview of functional programming along with resources and next steps for your OpenShift account.

I first saw this in a post by Jonathan Murray.

Web Annotation Working Group (Preventing Semantic Rot)

August 20th, 2014

Web Annotation Working Group

From the post:

The W3C Web Annotation Working Group is chartered to develop a set of specifications for an interoperable, sharable, distributed Web annotation architecture. The chartered specs consist of:

  1. Abstract Annotation Data Model
  2. Data Model Vocabulary
  3. Data Model Serializations
  4. HTTP API
  5. Client-side API

The working group intends to use the Open Annotation Data Model and Open Annotation Extension specifications, from the W3C Open Annotation Community Group, as a starting point for development of the data model specification.

The Robust Link Anchoring specification will be jointly developed with the WebApps WG, where many client-side experts and browser implementers participate.

Some good news for the middle of a week!

Shortcomings to watch for:

Can annotations be annotated?

Can non-Web addressing schemes be used by annotators?

Can the structure of files (visible or not) in addition to content be annotated?

If we don’t have all three of those capabilities, then the semantics of annotations will rot, just as the semantics of earlier times have rotted away. The main distinction is that, unlike us, most of our ancestors never had a choice about allowing that rot to happen.
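On the first question — annotating annotations — the Open Annotation Community Group model the working group starts from already allows an annotation to be the target of another annotation. A hedged sketch (the example URIs are my inventions; the oa: properties are from the Open Annotation model):

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:oa="http://www.w3.org/ns/oa#">
  <!-- An annotation on a web page. -->
  <oa:Annotation rdf:about="http://example.org/anno1">
    <oa:hasBody rdf:resource="http://example.org/comment1"/>
    <oa:hasTarget rdf:resource="http://example.org/page.html"/>
  </oa:Annotation>
  <!-- A second annotation whose target is the first annotation. -->
  <oa:Annotation rdf:about="http://example.org/anno2">
    <oa:hasBody rdf:resource="http://example.org/reply1"/>
    <oa:hasTarget rdf:resource="http://example.org/anno1"/>
  </oa:Annotation>
</rdf:RDF>
```

Whether the chartered specifications preserve that capability, and extend addressing beyond Web URIs and document content, is what to watch for.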

I first saw this in a tweet by Rob Sanderson.