ROCK, RACK And Knowledge Retention

April 18th, 2014

From the post:

Back in May 2013, we held a Roundtable on Knowledge Retention Techniques. Carla Newman, Shaharudin Mohd Ishak and Ashad Ahmed very graciously shared with us their journey and experiences in Knowledge Retention.

Three videos: Carla Newman on ROCK (Retention of Critical Knowledge), Shaharudin Mohd Ishak on IE Singapore’s RACK (Retention of All Critical Knowledge), and Ashad Ahmed on Knowledge Retention.

Any knowledge problem of interest to Shell Oil Company is of interest to me!

At what junctures in a knowledge retention process would topic maps have the greatest impact?

Not really interested in disrupting current approaches or processes but in discovering where topic maps could be a value add to existing systems.

Saving Output of nltk Text.Concordance()

April 18th, 2014

Saving Output of NLTK Text.Concordance() by Kok Hua.

From the post:

In NLP, sometimes users would like to search for series of phrases that contain particular keyword in a passage or web page.

NLTK provides the function concordance() to locate and print series of phrases that contain the keyword. However, the function only print the output. The user is not able to save the results for further processing unless redirect the stdout.

Below function will emulate the concordance function and return the list of phrases for further processing. It uses the NLTK concordance Index which keeps track of the keyword index in the passage/text and retrieve the surrounding words.
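The function from the post is not reproduced here. As a minimal sketch of the same idea, the following uses only the Python standard library: it builds a keyword index over a token list and returns the surrounding words instead of printing them. It is not Kok Hua's code and does not call NLTK; the names build_index, concordance_lines and window are hypothetical, chosen for illustration.

```python
from collections import defaultdict


def build_index(tokens):
    """Map each lowercased token to the list of positions where it occurs."""
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        index[token.lower()].append(position)
    return index


def concordance_lines(tokens, keyword, window=3):
    """Return (rather than print) every occurrence of `keyword`
    together with up to `window` tokens of context on each side."""
    index = build_index(tokens)
    lines = []
    for position in index.get(keyword.lower(), []):
        start = max(0, position - window)
        lines.append(" ".join(tokens[start:position + window + 1]))
    return lines


# Whitespace tokenization is an assumption; a real tokenizer would do better.
text = "Topic maps merge subjects . A topic map can merge with another topic map ."
tokens = text.split()

for line in concordance_lines(tokens, "merge", window=2):
    print(line)
```

Because the result is an ordinary list of strings, it can be filtered to the “important” terms, written to a file, or passed to another tool.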

Text mining is a very common part of topic map construction so tools that help with that task are always welcome.

To be honest, I am citing this because it may become part of several small tools for processing standards drafts. Concordance software is not rare but a full concordance of a document seems to frighten some proof readers.

The current thinking is that if only the “important” terms are highlighted in context, some proof readers will be more likely to use the work product.

The same principle applies to the authoring of topic maps as well.

Learn Clojure…

April 18th, 2014

Learn Clojure – Clojure Koans Walkthrough in Light Table IDE

You have heard of Clojure and no doubt the Clojure Koans.

Now there are videos solving the Clojure Koans using the Light Table IDE.

I first saw this at Clojure Koans by Christopher Bare.

Marketing Strategy for Topic Maps?

April 18th, 2014

Should you reveal a P = NP algorithm? by Lance Fortnow.

I think Lance captures the best topic map marketing strategy when he says:

Once again though no one will take you seriously unless you really have a working algorithm. If you just tell Google you have an algorithm for NP-complete problem they will just ignore you. If you hand them their private keys then they will listen. (emphasis added)

I can think of endless areas that I think would benefit from topic maps, but where foresight fails is in choosing an area and degree of granularity that would interest others.

A topic map about Justin Bieber would be useful in some sense but there are sites that organize that data already. What degree of “better” would be required for data about a sulky Canadian to be successful?

Or take OMB data for example. You may remember my post: Free Sequester Data Here! where I list a series of posts on converting OMB data into machine readable files and the issues with that data.

Apparently I was the only person in the United States who did not realize OMB reports are stylized fiction meant to serve political ends. The report in question had no connection to reality other than mostly getting department and program names correct.

Putting the question to you: What would make a good (doesn’t have to be great) demonstration of topic maps? One that would resonate with potential customers? BTW, you need to say why it would be a good demonstration of topic maps. In what respects are other resources in the area deficient and why would that deficiency drive someone to seek out topic maps?

I know that is a very broad question but it is an important one to ask and hopefully, to work towards a useful answer.

A/B Tests and Facebook

April 18th, 2014

From the post:

Two years ago, Wired breathlessly extolled the virtues of A/B testing (link). A lot of Web companies are in the forefront of running hundreds or thousands of tests daily. The reality is that most A/B tests fail.

A/B tests fail for many reasons. Typically, business leaders consider a test to have failed when the analysis fails to support their hypothesis. “We ran all these tests varying the color of the buttons, and nothing significant ever surfaced, and it was all a waste of time!” For smaller websites, it may take weeks or even months to collect enough samples to read a test, and so business managers are understandably upset when no action can be taken at its conclusion. It feels like waiting for the train which is running behind schedule.

Bad outcome isn’t the primary reason for A/B test failure. The main ways in which A/B tests fail are:

1. Bad design (or no design);
2. Bad execution;
3. Bad measurement.

These issues are often ignored or dismissed. They may not even be noticed if the engineers running the tests have not taken a proper design of experiments class. However, even though I earned an A at school, it wasn’t until I started running real-world experiments that I really learned the subject. This is an area in which theory and practice are both necessary.

The Facebook Data Science team just launched an open platform for running online experiments, called PlanOut. This looks like a helpful tool to avoid design and execution problems. I highly recommend looking into how to integrate it with your website. An overview is here, and a more technical paper (PDF) is also available. There is a github page.

The rest of this post gets into some technical, sausage-factory stuff, so be warned.

For all of your software tests, do you run any A/B tests on your interface?

Or is your response to UI criticism, “…well, but all of us like it.” That’s a great test for a UI.

If you don’t read any other blog post this weekend, read Kaiser’s take on A/B testing.
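To make the point about small sites waiting weeks to “read a test” concrete, here is a rough sketch of a two-sided, two-proportion z-test using only the standard library. It is a textbook calculation, not anything taken from Kaiser's post or from PlanOut, and the conversion counts are invented for illustration.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference between two
    conversion rates, using the pooled standard error."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# The same 10% vs. 13% conversion rates at two (made-up) sample sizes.
z_small, p_small = two_proportion_z_test(10, 100, 13, 100)
z_large, p_large = two_proportion_z_test(1000, 10000, 1300, 10000)

print(f"100 visitors per arm:    z = {z_small:.2f}, p = {p_small:.3f}")
print(f"10,000 visitors per arm: z = {z_large:.2f}, p = {p_large:.3g}")
```

The observed lift is identical in both cases; only the larger sample gives a p-value below the usual 0.05 threshold, which is why a low-traffic site can run a test for a long time and still have nothing to act on.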

VOWL: Visual Notation for OWL Ontologies

April 18th, 2014

VOWL: Visual Notation for OWL Ontologies

Abstract:

The Visual Notation for OWL Ontologies (VOWL) defines a visual language for the user-oriented representation of ontologies. It provides graphical depictions for elements of the Web Ontology Language (OWL) that are combined to a force-directed graph layout visualizing the ontology.

This specification focuses on the visualization of the ontology schema (i.e. the classes, properties and datatypes, sometimes called TBox), while it also includes recommendations on how to depict individuals and data values (the ABox). Familiarity with OWL and other Semantic Web technologies is required to understand this specification.

At the end of the specification there is an interesting example but as a “force-directed graph layout” it captures one of the difficulties I have with that approach.

I have this unreasonable notion that a node I select and place in the display should stay where I have placed it, not shift about because I have moved some other node. Quite annoying and I don’t find it helpful at all.

I first saw this at: VOWL: Visual Notation for OWL Ontologies

Non-Painful Presentations

April 18th, 2014

Looking to give fewer painful presentations?

Want advice to forward on non-painful presenting?

If you answered “yes,” to either of those questions, read: This Advice From IDEO’s Nicole Kahn Will Transform the Way You Give Presentations.

Nicole Kahn has short and engaging advice that boils down to three (3) touchstones for making a non-painful and perhaps even compelling presentation.

It’s easier to sell an idea, technology or product if the audience isn’t in pain after your presentation.

Precision from Disaggregation

April 18th, 2014

Building Precise Maps with Disser by Brandon Martin-Anderson.

From the post:

Spatially aggregated statistics are pretty great, but what if you want more precision? Here at Conveyal we built a utility to help with that: aggregate-disser. Let me tell you how it works.

Let’s start with a classic aggregated data set – the block-level population counts from the US Census. Here’s a choropleth map of total population for blocks around lower Manhattan and Brooklyn. The darkest shapes contain about five thousand people.

Brandon combines census data with other data sets to go from 5,000-person census blocks to placing every job and bed in Manhattan in an individual building.

Very cool!

Not to mention instructive when you encounter group subjects that need to be disaggregated before being combined with other data.
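As a rough illustration of what disaggregating a group subject can look like, the sketch below splits one block-level count across buildings in proportion to a weight. It is not the method aggregate-disser uses; the function name, the floor-area weights and the numbers are all assumptions made up for the example.

```python
def disaggregate(total, weights):
    """Split an integer `total` across the keys of `weights` in proportion
    to each weight, using largest-remainder rounding so that the parts
    always sum back to `total`."""
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive number")

    exact = {key: total * w / weight_sum for key, w in weights.items()}
    counts = {key: int(share) for key, share in exact.items()}

    # Hand out the units lost to truncation, largest fractional part first.
    leftover = total - sum(counts.values())
    by_remainder = sorted(exact, key=lambda k: exact[k] - counts[k], reverse=True)
    for key in by_remainder[:leftover]:
        counts[key] += 1
    return counts


# Hypothetical block of 5,000 people and three buildings weighted by floor area.
floor_area = {"building_a": 6000.0, "building_b": 3000.0, "building_c": 1000.0}
print(disaggregate(5000, floor_area))
```

The choice of weight is the interesting part: it is where the second data set (floor area, unit counts, land use) enters, and it determines how believable the building-level figures are.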

I first saw this in a tweet by The O.C.R.

Announcing Schema.org Actions

April 17th, 2014

Announcing Schema.org Actions

From the post:

When we launched schema.org almost 3 years ago, our main focus was on providing vocabularies for describing entities — people, places, movies, restaurants, … But the Web is not just about static descriptions of entities. It is about taking action on these entities — from making a reservation to watching a movie to commenting on a post.

Today, we are excited to start the next chapter of schema.org and structured data on the Web by introducing vocabulary that enables websites to describe the actions they enable and how these actions can be invoked.

The new actions vocabulary is the result of over two years of intense collaboration and debate amongst the schema.org partners and the larger Web community. Many thanks to all those who participated in these discussions, in particular to members of the Web Schemas and Hydra groups at W3C. We are hopeful that these additions to schema.org will help unleash new categories of applications.

Thing > Action

An action performed by a direct agent and indirect participants upon a direct object. Optionally happens at a location with the help of an inanimate instrument. The execution of the action may produce a result. Specific action sub-type documentation specifies the exact expectation of each argument/role.

Fairly coarse but I can see how it would be useful.

BTW, the examples are only available in JSON-LD. Just in case you were wondering.
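For a feel of how the roles in that definition line up, here is a small sketch that builds a JSON-LD description of an action as a Python dictionary and prints it. WatchAction and the agent, object, location, instrument and result properties are schema.org terms; every name and value filled in below is a made-up placeholder, not one of the schema.org examples.

```python
import json

# Each key after @type corresponds to one role in the Action definition.
action = {
    "@context": "https://schema.org",
    "@type": "WatchAction",
    "agent": {"@type": "Person", "name": "Example Viewer"},       # direct agent
    "object": {"@type": "Movie", "name": "Example Movie"},        # direct object
    "location": {"@type": "Place", "name": "Example Cinema"},     # where it happens
    "instrument": {"@type": "Thing", "name": "Example Projector"},  # inanimate instrument
    "result": {"@type": "Thing", "name": "Example Review"},       # what the action produces
}

print(json.dumps(action, indent=2))
```

Five slots cover a great many actions, which is exactly the coarseness at issue.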

Given the coarseness of schema.org and its success, due consideration should be given to semantics of “appropriate” coarseness for any particular task.

…lotteries to pick NIH research-grant recipients

April 17th, 2014

Wall Street Journal op-ed advocates lotteries to pick NIH research-grant recipients by Steven T. Corneliussen

From the post:

The subhead for the Wall Street Journal op-ed “Taking the Powerball approach to funding medical research” summarizes its coauthors’ argument about research funding at the National Institutes of Health (NIH): “Winning a government grant is already a crapshoot. Making it official by running a lottery would be an improvement.”

The coauthors, Ferric C. Fang and Arturo Casadevall, serve respectively as a professor of laboratory medicine and microbiology at the University of Washington School of Medicine and as professor and chairman of microbiology and immunology at the Albert Einstein College of Medicine of Yeshiva University.

At a time when funding levels are historically low, they note, grant peer review remains expensive. The NIH Center for Scientific Review has a $110 million annual budget. Grant-submission and grant-review processes extract an additional high toll from participants. Within this context, the coauthors summarize criticisms of NIH peer review. They mention a 2012 Nature commentary that argued, they say, that the system’s structure “encourages conformity.” In particular, after mentioning a study in the journal Circulation Research, they propose that concerning projects judged good enough for funding, “NIH peer reviewers fare no better than random chance when it comes to predicting how well grant recipients will perform.”

Nature should use a “mock” lottery to judge the acceptance of papers alongside its normal peer review process. Publish the results after a year of peer review “competing” with a lottery.

Care to speculate on the results as evaluated by Nature readers?

Tutorial Statistical Graph Analysis

April 17th, 2014

Tutorial Statistical Graph Analysis by Aapo Kyrola.

From the post:

GraphChi-DB can be used for efficiently querying induced subgraphs from very large networks. Thus, you can for example, easily sample a vertex, and retrieve induced neighborhood graph of the vertex. Or you can choose a random set of vertices and compute their induced subgraph.

Assuming you have the data loaded in database, and the GraphChiDatabase object in a value “DB”, here is how you request edges for the induced subgraph of a set of vertices:

GraphChi-DB was released two days ago (see GraphChi-DB [src released]), and today you have a tutorial written for it.

Not bad, not bad at all.

Clojure Procedural Dungeons

April 17th, 2014

Clojure Procedural Dungeons

From the webpage:

When making games, there are two ways to make a dungeon. The common method is to design one in the CAD tool of our choice (or to draw one in case of 2D games).
The alternative is to automatically generate random Dungeons by using a few very powerful algorithms. We could automatically generate a whole game world if we wanted to, but let’s take one step after another.

In this Tutorial we will implement procedural Dungeons in Clojure, while keeping everything as simple as possible so everyone can understand it.

Just in case you are interested in a gaming approach for a topic maps interface.

Not as crazy as that may sound. One of the brightest CS types I ever knew spent a year playing a version of Myst from start to finish.

Think about app sales if you can make your interface addictive.

Suggestion: Populate your topic map authoring interface with trolls (accounting), smiths (manufacturing), cavalry (shipping), royalty (management), wizards (IT), etc. and make collection of information about their information into tokens, spells, etc. Sprinkle in user preference activities and companions.

That would be a lot of work but I suspect you would get volunteers to create new levels as your information resources evolve.

Cloudera Live (beta)

April 17th, 2014

Cloudera Live (beta)

From the webpage:

Try a live demo of Hadoop, right now.

Cloudera Live is a new way to get started with Apache Hadoop, online. No downloads, no installations, no waiting. Watch tutorial videos and work with real-world examples of the complete Hadoop stack included with CDH, Cloudera’s completely open source Hadoop platform, to:

• Learn Hue, the Hadoop User Interface developed by Cloudera
• Query data using popular projects like Apache Hive, Apache Pig, Impala, Apache Solr, and Apache Spark (new!)
• Develop workflows using Apache Oozie

Great news for people interested in Hadoop!

Question: Will this become the default delivery model for test driving software and training?

Enjoy!

A Gentle Introduction to Scikit-Learn…

April 17th, 2014

A Gentle Introduction to Scikit-Learn: A Python Machine Learning Library by Jason Brownlee.
From the post:

If you are a Python programmer or you are looking for a robust library you can use to bring machine learning into a production system then a library that you will want to seriously consider is scikit-learn.

In this post you will get an overview of the scikit-learn library and useful references of where you can learn more.

Nothing new if you are already using Scikit-Learn but a very nice introduction with additional resources to pass onto others.

Save yourself some time in gathering materials to spread the use of Scikit-Learn. Bookmark and forward today!

Reproducible Research/(Mapping?)

April 17th, 2014

Implementing Reproducible Research edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng.

From the webpage:

In many of today’s research fields, including biomedicine, computational tools are increasingly being used so that the results can be reproduced. Researchers are now encouraged to incorporate software, data, and code in their academic papers so that others can replicate their research results. Edited by three pioneers in this emerging area, this book is the first one to explore this groundbreaking topic. It presents various computational tools useful in research, including cloud computing, data repositories, virtual machines, R’s Sweave function, XML-based programming, and more. It also discusses legal issues and case studies.

There is a growing concern over the ability of scientists to reproduce the published results of other scientists. The Economist rang one of many alarm bells when it published: Trouble at the lab [Data Skepticism].

From the introduction to Reproducible Research:

Literate statistical programming is a concept introduced by Rossini that builds on the idea of literate programming as described by Donald Knuth. With literate statistical programming, one combines the description of a statistical analysis and the code for doing the statistical analysis into a single document.
Subsequently, one can take the combined document and produce either a human-readable document (i.e. PDF) or a machine readable code file. An early implementation of this concept was the Sweave system of Leisch which uses R as its programming language and LaTeX as its documentation language. Yihui Xie describes his knitr package which builds substantially on Sweave and incorporates many new ideas developed since the initial development of Sweave. Along these lines, Tanu Malik and colleagues describe the Science Object Linking and Embedding framework for creating interactive publications that allow authors to embed various aspects of computational research in document, creating a complete research compendium.

Of course, we all cringe when we read that a drug company can reproduce only 1/4 of 67 “seminal” studies.

What has me curious is why we don’t have the same reaction when enterprise IT systems require episodic remapping, which requires the mappers to relearn what was known at the time of the last remapping?

We all know that enterprise (and other) IT systems change and evolve, but practically speaking, no effort is made to capture the knowledge that would reduce the time, cost and expense of every future remapping.

We can see the expense and danger of science not being reproducible, but when our own enterprise data mappings are not reproducible, that’s just the way things are.

Take inspiration from the movement towards reproducible science and work towards reproducible semantic mappings.

I first saw this in a tweet by Victoria Stodden.

Expert vs. Volunteer Semantics

April 17th, 2014

The variability of crater identification among expert and community crater analysts by Stuart J. Robbins, et al.

Abstract:

The identification of impact craters on planetary surfaces provides important information about their geological history. Most studies have relied on individual analysts who map and identify craters and interpret crater statistics.
However, little work has been done to determine how the counts vary as a function of technique, terrain, or between researchers. Furthermore, several novel internet-based projects ask volunteers with little to no training to identify craters, and it was unclear how their results compare against the typical professional researcher. To better understand the variation among experts and to compare with volunteers, eight professional researchers have identified impact features in two separate regions of the moon. Small craters (diameters ranging from 10 m to 500 m) were measured on a lunar mare region and larger craters (100s m to a few km in diameter) were measured on both lunar highlands and maria. Volunteer data were collected for the small craters on the mare. Our comparison shows that the level of agreement among experts depends on crater diameter, number of craters per diameter bin, and terrain type, with differences of up to ∼±45%. We also found artifacts near the minimum crater diameter that was studied. These results indicate that caution must be used in most cases when interpreting small variations in crater size-frequency distributions and for craters ≤10 pixels across. Because of the natural variability found, projects that emphasize many people identifying craters on the same area and using a consensus result are likely to yield the most consistent and robust information.

The identification of craters on the Moon may seem far removed from your topic map authoring concerns but I would suggest otherwise.

True, the paper is domain specific in some of its concerns (crater age, degradation, etc.) but the most important question was whether volunteers in aggregate could be as useful as experts in the identification of craters.
The authors conclude:

Except near the minimum diameter, volunteers are able to identify craters just as well as the experts (on average) when using the same interface (the Moon Mappers interface), resulting in not only a similar number of craters, but also a similar size distribution. (page 34)

I find that suggestive for mapping semantics because unlike moon craters, what words mean (and implicitly why) are a daily concern for users, including ones in your enterprise.

You can, of course, employ experts to re-interpret what they have been told by some of your users into the expert’s language and produce semantic integration based on the expert’s understanding or mis-understanding of your domain.

Or, you can use your own staff, with experts to facilitate encoding their understanding of your enterprise semantics, as in a topic map.

Recalling that the semantics for your enterprise aren’t “out there” in the ether but residing within the staff that make up your enterprise.

I still see an important role for experts but it isn’t as the source of your semantics, rather as the hunters who assist in capturing your semantics.

I first saw this in a tweet by astrobites that led me to: Crowd-Sourcing Crater Identification by Brett Deaton.

GraphLab Create: Upgrade

April 17th, 2014

GraphLab Create: Upgrade

From the webpage:

The latest version of graphlab-create is 0.2 beta. See what’s new for information about new features and the release notes for detailed additions and compatibility changes.

From the what’s new page:

GraphLab Data Structures:

SFrame (Scalable tabular data structure):

Graph:

Machine Learning Toolkits:

Recommender functionality:

General Machine Learning:

Cloud:

• Support for all AWS regions
• Secured client server communication using strong, standards-based encryption
• CIDR rule specification for Amazon EC2 instance identification

For detailed information about additional features and compatibility changes, see the release notes.
For known issues and feature requests visit the forum!

Cool!

Be sure to pass this news along!

Future Values of Merges

April 17th, 2014

Multilisp: A Language for Concurrent Symbolic Computation by Robert H. Halstead. (ACM Transactions on Programming Languages and Systems, Vol. 7, No. 4, October 1985, Pages 501-538.)

Abstract:

Multilisp is a version of the Lisp dialect Scheme extended with constructs for parallel execution. Like Scheme, Multilisp is oriented toward symbolic computation. Unlike some parallel programming languages, Multilisp incorporates constructs for causing side effects and for explicitly introducing parallelism. The potential complexity of dealing with side effects in a parallel context is mitigated by the nature of the parallelism constructs and by support for abstract data types: a recommended Multilisp programming style is presented which, if followed, should lead to highly parallel, easily understandable programs.

Multilisp is being implemented on the 32-processor Concert multiprocessor; however, it is ultimately intended for use on larger multiprocessors. The current implementation, called Concert Multilisp, is complete enough to run the Multilisp compiler itself and has been run on Concert prototypes including up to eight processors. Concert Multilisp uses novel techniques for task scheduling and garbage collection. The task scheduler helps control excessive resource utilization by means of an unfair scheduling policy; the garbage collector uses a multiprocessor algorithm based on the incremental garbage collector of Baker.

Of particular interest:

Multilisp’s principal construct for both creating tasks and synchronizing among them is the future. The construct ( future X ) immediately returns a future for the value of the expression X and concurrently begins evaluating X. When the evaluation of X yields a value, that value replaces the future.
The future is said to be initially undetermined; it becomes determined when its value has been computed. An operation (such as addition) that needs to know the value of an undetermined future will be suspended until the future becomes determined, but many operations, such as assignment and parameter passing, do not need to know anything about the values of their operands and may be performed quite comfortably on undetermined futures. The use of futures often exposes surprisingly large amounts of parallelism in a program, as illustrated by a Quicksort program given in Figure 1.

The use of a future value for merge operations, which could avoid re-processing a topic map for each cycle of merging, sounds promising.

Deferral of the results isn’t just an old Lisp idea as you will find in: Counting complex disordered states by efficient pattern matching: chromatic polynomials and Potts partition functions by Marc Timme, Frank van Bussel, Denny Fliegner, and Sebastian Stolzenberg.

Abstract:

Counting problems, determining the number of possible states of a large system under certain constraints, play an important role in many areas of science. They naturally arise for complex disordered systems in physics and chemistry, in mathematical graph theory, and in computer science. Counting problems, however, are among the hardest problems to access computationally. Here, we suggest a novel method to access a benchmark counting problem, finding chromatic polynomials of graphs. We develop a vertex-oriented symbolic pattern matching algorithm that exploits the equivalence between the chromatic polynomial and the zero-temperature partition function of the Potts antiferromagnet on the same graph. Implementing this bottom-up algorithm using appropriate computer algebra, the new method outperforms standard top-down methods by several orders of magnitude, already for moderately sized graphs.
As a first application, we compute chromatic polynomials of samples of the simple cubic lattice, for the first time computationally accessing three-dimensional lattices of physical relevance. The method offers straightforward generalizations to several other counting problems.

In lay person’s terms the work by Timme and company visits each node in a graph and records an expression that includes unknowns (futures?), that is, the values at other nodes in the graph. Using pattern matching techniques, the algorithm then solves all of the unknowns and replaces them with appropriate values. How effective is this?

“The existing algorithm copies the whole network for each stage of the calculation and only changes one aspect of it each time,” explains Frank van Bussel of the Max Planck Institute for Dynamics and Self-Organization (MPIDS). Increasing the number of nodes dramatically increases the calculation time. For a square lattice the size of a chess board, this is estimated to be many billions of years. The new algorithm developed by the Göttingen-based scientists is significantly faster. “Our calculation for the chess board lattice only takes seven seconds,” explains Denny Fliegner from MPIDS. (A New Kind of Counting)

Hmmm, “many billions of years” versus “seven seconds.”

For further details on the pattern matching work see the project page at: Complex Disordered Systems: Statistical Physics and Symbolic Computation

Deferral of results looks like a fruitful area for research for topic maps in general and parallel processing of topic maps in particular.

I first saw the Multilisp paper in a tweet by Mom of Finland.

Hitchhiker’s Guide to Clojure

April 16th, 2014

Hitchhiker’s Guide to Clojure

From the webpage:

The following is a cautionary example of the unpredictable combination of Clojure, a marathon viewing of the BBC’s series “The Hitchhiker’s Guide to the Galaxy”, and a questionable amount of cheese.
There have been many tourism guides to the Clojure programming language. Some that easily come to mind for their intellectual erudition and prose are “The Joy of Touring Clojure”, “Touring Clojure”, “Clojure Touring”, and the newest edition of “Touring Clojure Touring”. However, none has surpassed the wild popularity of “The Hitchhiker’s Guide to Clojure”. It has sold over 500 million copies and has been on the “BigInt’s Board of Programming Language Tourism” for the past 15 years. While, arguably, it lacked the in-depth coverage of the other guides, it made up for it in useful practical tips, such as what to do if you find a nil in your pistachio. Most of all, the cover had the following words printed in very large letters: Don’t Worry About the Parens.

To tell the story of the book, it is best to tell the story of two people whose lives were affected by it: Amy Denn, one of the last remaining Pascal developers in Cincinnati, and Frank Pecan, a time traveler, guidebook researcher, and friend of Amy.

There isn’t any rule (that I’m aware of) that says computer texts must be written to be unfunny.

I think my only complaint is that the story is too short.

Enjoy!

Top secret MI5 files of First World War go online

April 16th, 2014

Top secret MI5 files of First World War go online

Well, that depends on your definition of “online.”

I was hopeful this collection would demonstrate the lack of need for long term secrecy for “top secret” files. That files are kept “secret” more to guard the interest of security staff than any legitimate reason.

Then I selected the record for:

‘Mata Hari’ alias MCCLEOD Margaretha Geertruida (Marguerite Gertrude): executed by the…

I found that I can view it for free, at The National Archives.

I’m not sure that is “online” in any important sense of the word. You?

WordNet RDF

April 16th, 2014

WordNet RDF

From the webpage:

WordNet is supported by the National Science Foundation under Grant Number 0855157.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the creators of WordNet and do not necessarily reflect the views of the National Science Foundation.

This is the RDF version of WordNet, created by mapping the existing WordNet data into RDF. The data is structured according to the lemon model. In addition, links have been added from the following sources:

These links increase the usefulness of the WordNet data. If you would like to contribute extra linking to WordNet please Contact us.

Curious if you find it easier to integrate WordNet RDF with other data or the more traditional WordNet?

I first saw this in a tweet by Bob DuCharme.

NLTK-like Wordnet Interface in Scala

April 16th, 2014

NLTK-like Wordnet Interface in Scala by Sujit Pal.

From the post:

I recently figured out how to setup the Java WordNet Library (JWNL) for something I needed to do at work. Prior to this, I have been largely unsuccessful at figuring out how to access Wordnet from Java, unless you count my one attempt to use the Java Wordnet Interface (JWI) described here. I think there are two main reason for this. First, I just didn’t try hard enough, since I could get by before this without having to hook up Wordnet from Java. The second reason was the over-supply of libraries (JWNL, JWI, RiTa, JAWS, WS4j, etc), each of which annoyingly stops short of being full-featured in one or more significant ways.

The one Wordnet interface that I know that doesn’t suffer from missing features comes with the Natural Language ToolKit (NLTK) library (written in Python). I have used it in the past to access Wordnet for data pre-processing tasks. In this particular case, I needed to call it at runtime from within a Java application, so I finally bit the bullet and chose a library to integrate into my application – I chose JWNL based on seeing it being mentioned in the Taming Text book (and used in the code samples).
I also used code snippets from Daniel Shiffman’s Wordnet page to learn about the JWNL API.

After I had successfully integrated JWNL, I figured it would be cool (and useful) if I could build an interface (in Scala) that looked like the NLTK Wordnet interface. Plus, this would also teach me how to use JWNL beyond the basic stuff I needed for my webapp. My list of functions were driven by the examples from the Wordnet section (2.5) from the NLTK book and the examples from the NLTK Wordnet Howto. My Scala class implements most of the functions mentioned on these two pages. The following session will give you an idea of the coverage – even though it looks a Python interactive session, it was generated by my JUnit test. I do render the Synset and Word (Lemma) objects using custom format() methods to preserve the illusion (and to make the output readable), but if you look carefully, you will notice the rendering of List() is Scala’s and not Python’s.

NLTK is amazing in its own right and creating a Scala interface will give you an excuse to learn Scala. That’s a win-win situation!

Confirmation: Gov. at War with Business

April 16th, 2014

The Heartbleed computer security bug is many things: a catastrophic tech failure, an open invitation to criminal hackers and yet another reason to upgrade our passwords on dozens of websites. But more than anything else, Heartbleed reveals our neglect of Internet security.

The United States spends more than $50 billion a year on spying and intelligence, while the folks who build important defense software – in this case a program called OpenSSL that ensures that your connection to a website is encrypted – are four core programmers, only one of whom calls it a full-time job.

In a typical year, the foundation that supports OpenSSL receives just $2,000 in donations. The programmers have to rely on consulting gigs to pay for their work. “There should be at least a half dozen full time OpenSSL team members, not just one, able to concentrate on the care and feeding of OpenSSL without having to hustle commercial work,” says Steve Marquess, who raises money for the project.

Is it any wonder that this Heartbleed bug slipped through the cracks?

Dan Kaminsky, a security researcher who saved the Internet from a similarly fundamental flaw back in 2008, says that Heartbleed shows that it’s time to get “serious about figuring out what software has become Critical Infrastructure to the global economy, and dedicating genuine resources to supporting that code.”

I said last week in NSA … *ucked Up …TCP/IP that the NSA was at war with the U.S. computer industry and ecommerce in general.

Turns out I was overly optimistic: the entire United States government has an undeclared war against the U.S. computer industry and ecommerce in general.

If I were one of the $million high-tech donors to any political party, I would put everyone on notice that the money tap is off.

At least until there is verifiable progress towards building a robust architecture for ecommerce and an end to attacks by U.S. agencies on the U.S. computer industry, whether active or by neglect.

Only the sound of money will get the attention of the 530+ members of the U.S. Congress. When the money goes silent, you will have their full and undivided attention.

Regular expressions unleashed

April 16th, 2014

Regular expressions unleashed by Hans-Juergen Schoenig.

From the post:

When cleaning up some old paperwork this weekend I stumbled over a very old tutorial. In fact, I have received this little handout during a UNIX course I attended voluntarily during my first year at university. It seems that those two days have really changed my life – the price tag: 100 Austrian Schillings which translates to something like 7 Euros in today’s money.

When looking at this old thing I noticed a nice example showing how to test regular expression support in grep. Over the years I had almost forgotten this little test. Here is the idea: There is no single way to print the name of Libya’s former dictator. According to this example there are around 30 ways to do it:…

Thirty (30) sounds a bit low to me but it’s sufficient to point out that mining all thirty (30) is going to give you a number of false positives, when searching for news on the former dictator of Libya.

The regex to capture all thirty (30) variant forms in a PostgreSQL database is great but once you have it, now what?

Particularly if you have sorted out the dictator from the non-dictators and/or placed them in other categories.

Do you pass that sorting and classifying onto the next user or do you flush the knowledge toilet and all that hard work just drains away?
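
To make the variant-spelling problem concrete, here is a minimal sketch in Python rather than PostgreSQL. The pattern is my own illustration and covers only a handful of common romanizations of the surname, not the thirty forms from the tutorial; the function name and sample sentences are hypothetical.

```python
import re

# Illustrative pattern only (an assumption, not the regex from the post):
# it accepts a few common romanizations such as Gaddafi, Qadhafi, Khadafy.
SURNAME = re.compile(r"\b[GKQ]h?add?h?af[iy]\b", re.IGNORECASE)


def find_mentions(texts):
    """Return (text index, matched spelling) pairs for every surname hit."""
    return [
        (index, match.group(0))
        for index, text in enumerate(texts)
        for match in SURNAME.finditer(text)
    ]


# Hypothetical sample sentences, made up for this sketch.
samples = [
    "Gaddafi addressed the assembly.",
    "Reports mention Col. Qadhafi and Khadafy interchangeably.",
    "Nothing relevant here.",
]
mentions = find_mentions(samples)
```

Keeping the matched spelling next to the text it came from is the smallest step towards passing the sorting along: a later reader can mark each hit as the dictator or a false positive without re-running the search.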

Learn regex the hard way

April 16th, 2014

Learn regex the hard way by Zed A. Shaw.

From the preface:

This is a rough in-progress dump of the book. The grammar will probably be bad, there will be sections missing, but you get to watch me write the book and see how I do things.

Finally, don’t forget that I have Learn Python The Hard Way, 2nd Edition (http://learnpythonthehardway.org) which you should read if you can’t code yet.

Exercises 1 – 16 have some content (out of 27) so it is incomplete but still a goodly amount of material.

Zed has other “hard way” titles on:

Regexes are useful in all contexts so you won’t regret learning or brushing up on them.

…Generalized Language Models…

April 16th, 2014

René reports on the core of his dissertation work.

From the post:

When you want to assign a probability to a sequence of words you will run into the Problem that longer sequences are very rare. People fight this problem by using smoothing techniques and interpolating longer order models (models with longer word sequences) with lower order language models. While this idea is strong and helpful it is usually applied in the same way. In order to use a shorter model the first word of the sequence is omitted. This will be iterated. The Problem occurs if one of the last words of the sequence is the really rare word. In this way omitting words in the front will not help.

So the simple trick of Generalized Language models is to smooth a sequence of n words with n-1 shorter models which skip a word at position 1 to n-1 respectively.

Then we combine everything with Modified Kneser Ney Smoothing just like it was done with the previous smoothing methods.

Unlike some white papers, webinars and demos, you don’t have to register, list your email and phone number, etc. to see both the test data and code that implements René’s ideas.

Please send René useful feedback as a way to say thank you for sharing both data and code.
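
The skipping step in the quoted description can be sketched in a few lines of Python. This is my reading of “skip a word at position 1 to n-1”, not René’s code: the function name and the `*` placeholder are assumptions, and the Modified Kneser-Ney smoothing step is not shown.

```python
def skip_patterns(words, wildcard="*"):
    """Return the n-1 sequences made by skipping one word at positions 1..n-1.

    The last word is never skipped. The wildcard marker is an assumption
    made for this sketch.
    """
    patterns = []
    for position in range(len(words) - 1):  # 0-based index of the skipped word
        pattern = list(words)
        pattern[position] = wildcard
        patterns.append(tuple(pattern))
    return patterns


# A sequence of n = 4 words yields n - 1 = 3 skipped variants.
example = skip_patterns(["the", "quick", "brown", "fox"])
```

Each pattern keeps the final word in place, so a rare word near the end of the sequence can still be backed by counts from contexts that differ in only one earlier position.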

‘immersive intelligence’ [Topic Map-like application]

April 16th, 2014

Long: NGA is moving toward ‘immersive intelligence’ by Sean Lyngaas.

From the post:

Of the 17 U.S. intelligence agencies, the National Geospatial-Intelligence Agency is best suited to turn big data into actionable intelligence, NGA Director Letitia Long said. She told FCW in an April 14 interview that mapping is what her 14,500-person agency does, and every iota of intelligence can be attributed to some physical point on Earth.

“We really are the driver for intelligence integration because everything is somewhere on the Earth at a point in time,” Long said. “So we give that ability for all of us who are describing objects to anchor it to the Map of the World.”

NGA’s Map of the World entails much more minute information than the simple cartography the phrase might suggest. It is a mix of information from top-secret, classified and unclassified networks made available to U.S. government agencies, some of their international partners, commercial users and academic experts. The Map of the World can tap into a vast trove of satellite and social media data, among other sources.

NGA has made steady progress in developing the map, Long said. Nine data layers are online and available now, including those for maritime and aeronautical data. A topography layer will be added in the next two weeks, and two more layers will round out the first operational version of the map in August.

Not surprisingly, the National Geospatial-Intelligence Agency sees geography as the organizing principle for intelligence integration. Or as NGA Director Long says: “…everything is somewhere on the Earth at a point in time.” I can’t argue with the accuracy of that statement, save for extraterrestrial events, satellites, space-based weapons, etc.

On the other hand, you could gather intelligence by point of origin, places referenced, people mentioned (their usual locations), etc., in languages spoken by more than thirty (30) million people and you could have a sack with intelligence in forty (40) languages. (See: List of languages by number of native speakers.)

When I say “topic map-like” application, I mean that the NGA has chosen geographic locations as the organizing principle for intelligence as opposed to using subjects as the organizing principle for intelligence, of which geographic location is only one type. Noting that with a broader organizing principle, it would be easier to integrate data from other agencies who have their own organizational principles for the intelligence they gather.

I like the idea of “layers” as described in the post. In part because a topic map can exist as an additional layer on top of the current NGA layers to integrate other intelligence data on a subject basis with the geographic location system of the NGA.

Think of topic maps as being “in addition to” and not “instead of” your current integration technology.

What’s your principle for organizing intelligence? Would it be useful to integrate data organized around other principles for organizing intelligence? And still find the way back to the original data?

PS: Do you remember the management book “Who Moved My Cheese?” Moving intelligence from one system to another can result in: “Who Moved My Intelligence?,” when it can no longer be discovered by its originator. Not to mention the intelligence will lack the context of its point of origin.

Titan: Scalable Graph Database

April 15th, 2014

Titan: Scalable Graph Database by Matthias Broecheler.

Conference presentation so long on imagery but short on detail.

However, useful to walk your manager through as a pitch for support to investigate further.

When that support is given, check out: http://thinkaurelius.github.io/titan/. Links to source code, other resources, etc.

Enter, Update, Exit… [D3.js]

April 15th, 2014

From the webpage:

Over the past couple of years, D3, the groundbreaking JavaScript library for data-driven document manipulation developed by Mike Bostock, has become the Swiss Army knife of web-based data visualization. However, talking to other designers or developers who use D3 in their projects, I noticed that one of the core concepts of it remains somewhat obscure and is often referred to as »D3’s magic«: Data joins and selections.

Given a solid command of basic JavaScript, this article should help you to wrap your head around these two fundamental concepts and get you started using D3 for your dataviz projects.

If you encounter anyone not already using D3.js, pass this page along to them.

I first saw this in a tweet by Halftone.
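
For readers who want the idea behind the title before opening the article, the three groups a data join produces can be sketched in plain Python. This is not D3’s API (D3 is JavaScript); the function name and keys are hypothetical and the sketch only models the bookkeeping: items with no element yet enter, items that already have one update, and elements with no item left exit.

```python
def data_join(existing_keys, data_keys):
    """Split keys into the enter, update and exit groups of a data join."""
    existing = list(existing_keys)
    incoming = list(data_keys)
    existing_set = set(existing)
    incoming_set = set(incoming)
    return {
        # New data with no matching element yet.
        "enter": [key for key in incoming if key not in existing_set],
        # Data that matches an element already on the page.
        "update": [key for key in incoming if key in existing_set],
        # Elements whose data has gone away.
        "exit": [key for key in existing if key not in incoming_set],
    }


# Elements a, b, c are on the page; the new data holds b, c, d.
result = data_join(["a", "b", "c"], ["b", "c", "d"])
```

Here `result["enter"]` holds `d`, `result["update"]` holds `b` and `c`, and `result["exit"]` holds `a`.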

GraphChi-DB [src released]

April 15th, 2014

GraphChi-DB

From the webpage:

GraphChi-DB is a scalable, embedded, single-computer online graph database that can also execute similar large-scale graph computation as GraphChi. It has been developed by Aapo Kyrola as part of his Ph.D. thesis.

GraphChi-DB is written in Scala, with some Java code. Generally, you need to know Scala quite well to be able to use it.

IMPORTANT: GraphChi-DB is early release, research code. It is buggy, it has awful API, and it is provided with no guarantees. DO NOT USE IT FOR ANYTHING IMPORTANT.

GraphChi-DB source code arrives!

Enjoy!