Draft Lucene 5.0 Release Highlights

January 23rd, 2015

Draft Lucene 5.0 Release Highlights

Just a draft of the Lucene 5.0 release notes, but it is a signal that the release is getting closer!

Or as the guy said in Star Wars, “…almost there!” Hopefully with happier results.

Update: My bad, I forgot to include the Solr 5.0 draft release notes as well!


DiRT Digital Research Tools

January 23rd, 2015

DiRT Digital Research Tools

From the post:

The DiRT Directory is a registry of digital research tools for scholarly use. DiRT makes it easy for digital humanists and others conducting digital research to find and compare resources ranging from content management systems to music OCR, statistical analysis packages to mindmapping software.

Interesting concept, but the annotations are too brief to convey much information. Not to mention that within a category, say Conduct linguistic research or Transcribe handwritten or spoken texts, the entries have no apparent order, or at least they are not arranged alphabetically by name. There may be some other order that is escaping me.

Some entries appear in the wrong categories, such as Xalan being found under Transcribe handwritten or spoken texts:

Xalan is an XSLT processor for transforming XML documents into HTML, text, or other XML document types. It implements XSL Transformations (XSLT) Version 1.0 and XML Path Language (XPath) Version 1.0.

Not what I think of when I think about transcribing handwritten or spoken texts. You?

I didn’t see a process for submitting corrections/comments on resources. I will check and post on this again. It could be a useful tool.

I first saw this in a tweet by Christophe Lalanne.

Digital Cartography [84]

January 22nd, 2015

Digital Cartography [84] by Visual Loop.

From the post:

Welcome to the year’s first edition of Digital Cartography, our weekly column where we feature the most recent interactive maps that came to our way. And being this the first issue of 2015, of course that it’s fully packed with more than 40 new interactive maps and cartographic-based narratives.

That means that you’ll need quite a bit of time to spend exploring these examples, but if that isn’t enough, there’s always the list with our 100 favorite interactive maps of 2014 (part one and two), guaranteed to keep you occupied for the next day or so.

…[M]ore than 40 new interactive maps and cartographic-based narratives.

How very cool!

With a couple of notable exceptions (see the article), these are mostly geography-based mappings. There’s nothing wrong with geography-based mappings, but it makes me curious why there isn’t more diversity in mapping.

Just as a preliminary thought, could it be that geography gives us a common starting point for making ourselves understood? Rather than undertaking a burden of persuasion before we can induce someone to use the map?

From what little I have heard (intentionally) about #Gamergate, I would say a mapping of the people, attitudes, expressions of same and the various forums would vary significantly from person to person. If you did a non-geographic mapping of that event(?) (sorry, I don’t have more precise language to use), what would it look like? What major attitudes, factors, positions would you use to lay out the territory?

Personally I don’t find the lack of a common starting point all that troubling. If a map is extensive enough, it will surely intersect some areas of interest, and a reader can start to work outwards from that intersection. They may or may not agree with what they find, but it would have the advantage of not being snippet-sized texts divorced from some overarching context.

A difficult mapping problem to be sure, one that poses far more difficulties than one that uses physical geography as a starting point. Would even an imperfect map be of use to those trying to sort through issues in such a case?

Streaming Big Data with Spark, Spark Streaming, Kafka, Cassandra and Akka

January 22nd, 2015

Webinar: Streaming Big Data with Spark, Spark Streaming, Kafka, Cassandra and Akka by Helena Edelson.

From the post:

On Tuesday, January 13 I gave a webinar on Apache Spark, Spark Streaming and Cassandra. Over 1700 registrants from around the world signed up. This is a follow-up post to that webinar, answering everyone’s questions. In the talk I introduced Spark, Spark Streaming and Cassandra with Kafka and Akka and discussed why these particular technologies are a great fit for lambda architecture due to some key features and strategies they all have in common, and their elegant integration together. We walked through an introduction to implementing each, then showed how to integrate them into one clean streaming data platform for real-time delivery of meaning at high velocity. All this in a highly distributed, asynchronous, parallel, fault-tolerant system.

Video | Slides | Code | Diagram

About The Presenter: Helena Edelson is a committer on several open source projects including the Spark Cassandra Connector, Akka and previously Spring Integration and Spring AMQP. She is a Senior Software Engineer on the Analytics team at DataStax, a Scala and Big Data conference speaker, and has presented at various Scala, Spark and Machine Learning Meetups.
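The “lambda architecture” mentioned in the talk combines a batch layer (recomputed from the full event log) with a speed layer (incremental updates) merged at query time. A toy sketch of that idea in plain Python, nothing like the Spark/Kafka/Cassandra stack in the webinar, just the shape of the pattern:

```python
from collections import Counter

def batch_view(events):
    """Batch layer: recompute totals from the full (immutable) event log."""
    return Counter(e["user"] for e in events)

def speed_view(recent_events):
    """Speed layer: incremental counts for events not yet in the batch view."""
    return Counter(e["user"] for e in recent_events)

def query(batch, speed):
    """Serving layer: merge both views to answer queries in real time."""
    return batch + speed

# Events already absorbed by the nightly batch job...
log = [{"user": "a"}, {"user": "b"}, {"user": "a"}]
# ...and events that arrived since then.
recent = [{"user": "a"}]

merged = query(batch_view(log), speed_view(recent))
print(merged["a"])  # 3
```

The real systems earn their keep on fault tolerance and scale; the merge-two-views idea is the part that stays constant.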

I have long contended that it is possible to have a webinar that has little if any marketing fluff and maximum technical content. Helena’s presentation is an example of that type of webinar.

Very much worth the time to watch.

BTW, the webinar was so full of content that questions were answered as part of this blog post. Technical webinars just don’t get any better organized than this one.

Perhaps technical webinars should be marked with TW and others with CW (for c-suite webinars). To prevent disorientation in the first case and disappointment in the second one.

Supremes “bitch slap” Patent Court

January 22nd, 2015

Supreme Court strips more power from controversial patent court by Jeff John Roberts.

From the post:

The Supreme Court issued a ruling Tuesday that will have a significant impact on the patent system by limiting the ability of the Federal Circuit, a specialized court that hears patent appeals, to review key findings by lower court judges.

The 7-2 patent decision, which came the same day as a high profile ruling by the Supreme Court on prisoner beards, concerns an esoteric dispute between two pharmaceutical companies, Teva and Sandoz, over the right way to describe the molecular weight of a multiple sclerosis drug.

The Justices of the Supreme Court, however, appear to have taken the case in part because it presented another opportunity to check the power of the Federal Circuit, which has been subject to a recent series of 9-0 reversals and which some regard as a “rogue court” responsible for distorting the U.S. patent system.

As for the legal decision on Tuesday, it turned on the question of whether the Federal Circuit judges can review patent claim findings as they please (“de novo”) or only in cases where there has been serious error. Writing for the majority, Justice Stephen Breyer concluded that the Federal Circuit could not second guess how lower courts interpret those claims (a process called “claim construction”) except on rare occasions.

There is no doubt the Federal Circuit has done its share of damage to the patent system but it hasn’t acted alone. Congress and the patent system itself bear a proportionate share of the blame.

Better search and retrieval technology can’t clean out the mire in the USPTO stables. That is going to require reform from Congress and a sustained effort at maintaining the system once it has been reformed.

In the meantime, knowing that another blow has been dealt the Federal Circuit on patent issues will have to sustain reform efforts.

The Leek group guide to genomics papers

January 22nd, 2015

The Leek group guide to genomics papers by Jeff Leek.

From the webpage:

When I was a student, my advisor John Storey made a list of papers for me to read on nights and weekends. That list was incredibly helpful for a couple of reasons.

  • It got me caught up on the field of computational genomics
  • It was expertly curated, so it filtered a lot of papers I didn’t need to read
  • It gave me my first set of ideas to try to pursue as I was reading the papers

I have often thought I should make a similar list for folks who may want to work with me (or who want to learn about statistical genomics). So this is my attempt at that list. I’ve tried to separate the papers into categories and I’ve probably missed important papers. I’m happy to take suggestions for the list, but this is primarily designed for people in my group so I might be a little bit parsimonious.

(reading list follows)

A very clever idea!

The value of such a list, when compared to the World Wide Web, is that it is “curated.” Someone who knows the field has chosen, and hopefully chosen well, from all the possible resources you could consult. By attending to those resources and not the page-rank randomness of search results, you should get a more rounded view of a particular area.

I find such lists from time to time but they are often not maintained, which seriously diminishes their value.

Perhaps the value-add proposition is shifting from making more data (read data, publications, discussion forums) available to filtering the sea of data into useful sized chunks. The user can always seek out more, but is enabled to start with a manageable and useful portion at first.

Hmmm, think of it as a navigational map, which lists longitude/latitude and major features. A map that, as you draw closer to any feature or upon request, can change its “resolution” to disclose more information about your present and impending location.
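A toy sketch of such a zoomable map (the feature names and structure are my own invention, not from any particular tool): each feature stores progressively detailed descriptions keyed by zoom level, and the viewer shows the most detail available at the current level.

```python
# Each feature stores progressively detailed descriptions, keyed by zoom level.
features = {
    "harbor": {
        1: "Harbor",
        2: "Harbor: commercial port, two piers",
        3: "Harbor: commercial port, two piers, depth 12m, pilot required",
    }
}

def describe(feature, zoom):
    """Return the most detailed description available at or below `zoom`."""
    levels = features[feature]
    best = max(level for level in levels if level <= zoom)
    return levels[best]

print(describe("harbor", 2))  # "Harbor: commercial port, two piers"
```

The same shape works for a reading list: a one-line annotation at “zoom 1,” a paragraph at “zoom 2,” the full review at “zoom 3.”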

For what area would you want to build such a navigational map?

I first saw this in a tweet by Christophe Lalanne

Lecture Slides for Coursera’s Data Analysis Class

January 22nd, 2015

Lecture Slides for Coursera’s Data Analysis Class by Jeff Leek.

From the webpage:

This repository contains the lecture slides for the Coursera course Data Analysis. The slides were created with the Slidify package in Rstudio.

From the course description:

You have probably heard that this is the era of “Big Data”. Stories about companies or scientists using data to recommend movies, discover who is pregnant based on credit card receipts, or confirm the existence of the Higgs Boson regularly appear in Forbes, the Economist, the Wall Street Journal, and The New York Times. But how does one turn data into this type of insight? The answer is data analysis and applied statistics. Data analysis is the process of finding the right data to answer your question, understanding the processes underlying the data, discovering the important patterns in the data, and then communicating your results to have the biggest possible impact. There is a critical shortage of people with these skills in the workforce, which is why Hal Varian (Chief Economist at Google) says that being a statistician will be the sexy job for the next 10 years.

This course is an applied statistics course focusing on data analysis. The course will begin with an overview of how to organize, perform, and write-up data analyses. Then we will cover some of the most popular and widely used statistical methods like linear regression, principal components analysis, cross-validation, and p-values. Instead of focusing on mathematical details, the lectures will be designed to help you apply these techniques to real data using the R statistical programming language, interpret the results, and diagnose potential problems in your analysis. You will also have the opportunity to critique and assist your fellow classmates with their data analyses.
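Of the methods the course description lists, linear regression is the simplest to show in a few lines. A minimal pure-Python illustration of ordinary least squares for one predictor (the course uses R; this is just the underlying arithmetic):

```python
def linear_fit(xs, ys):
    """Ordinary least squares for y = a + b*x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

# Data generated from y = 2 + 3x, so the fit should recover those coefficients.
xs = [0, 1, 2, 3, 4]
ys = [2, 5, 8, 11, 14]
a, b = linear_fit(xs, ys)
print(round(a, 6), round(b, 6))  # 2.0 3.0
```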

Once you master the basics of data analysis with R (or some other language), the best way to hone your data analysis skills is to look for data sets that are new to you. Don’t go so far afield that you can’t judge a useful result from a non-useful one, but going to the edges of your comfort zone is good practice as well.


I first saw this in a tweet by Christophe Lalanne.

USIs (Unidentified Security Incidents) – Security Through Obscurity

January 22nd, 2015

I was reading Cybersecurity Expert Warns Not Enough Being Done to Prevent Highly Destructive Cyberattacks on Critical Infrastructure wherein Steve Mustard, an industrial cybersecurity subject-matter expert of the International Society of Automation (ISA), etc., is reported to be sounding the horn for more industrial security.

From the post:

Mustard points to the steady flow of cyberattacks on industrial automation control systems (IACS) and supervisory control and data acquisition (SCADA) networks being tracked by the Repository of Industrial Security Incidents (RISI).

“There have been many incidents in the past 10 to 15 years that can be traced back to insufficient cybersecurity measures,” he says. “There are many every year, most of which escape public notice. In fact, it’s widely believed that there are many more that are never reported,” he discloses. “The RISI analysis shows time and again that these incidents are generally the result of the same basic cybersecurity control failures. It is often only the presence of external failsafe and protection mechanisms that these incidents do not lead to more catastrophic consequences. Many use these protection mechanisms to argue that the concern over the consequences of cyberattack is exaggerated, and yet incidents such as Deepwater Horizon should teach us that these protection mechanisms can and do fail.”

In case you didn’t follow the Deepwater Horizon link, let me give you the snippet from Wikipedia that covers what you need to know:

On 20 April 2010, while drilling at the Macondo Prospect, an explosion on the rig caused by a blowout killed 11 crewmen and ignited a fireball visible from 40 miles (64 km) away.[12] The resulting fire could not be extinguished and, on 22 April 2010, Deepwater Horizon sank, leaving the well gushing at the seabed and causing the largest offshore oil spill in U.S. history.[13] (emphasis added)

Do you see anything in the description of the events on the Deepwater Horizon that says “cybersecurity?” I’m not an “oil man” as they say in Louisiana but even I know the difference between a blowout (too much pressure from the well) and a cyberattack. Apparently Steve Mustard does not.

But the point of this post is that you can’t form an opinion about the rest of Steve Mustard’s claims. Or at least not at a reasonable cost.


Follow the link to the Repository of Industrial Security Incidents (RISI) and you will find that access to the Repository of Industrial Security Incidents is $995 for three months or $2995 per year.

So long as the “security” industry continues to play status and access games with security data, hackers are going to remain ahead of defenders. What part of that isn’t clear?

Sony-scale hacks will become the norm if the computer security industry continues its “security by obscurity” stance.

Project Blue Book Collection (UFO’s)

January 22nd, 2015

Project Blue Book Collection

From the webpage:

This site was created by The Black Vault to house 129,491 pages, comprising more than 10,000 cases of the Project Blue Book, Project Sign and Project Grudge files declassified. Project Blue Book (along with Sign and Grudge) was the name that was given to the official investigation by the United States military to determine what the Unidentified Flying Object (UFO) phenomena was. It lasted from 1947 – 1969. Below you will find the case files compiled for research, and available free to download.

The CNN report Air Force UFO files land on Internet by Emanuella Grinberg reports Roswell is omitted from these files.

You won’t find anything new here; the files have been available on microfilm for years. But being searchable and on the Internet is a step forward in terms of accessibility.

When I say “searchable,” the site notes:

1) A search is a good start — but is not 100% – There are more than 10,000 .pdf files here and although all of them are indexed in the search engine, the quality of the original documents, given the fact that many of them are more than 6 decades old, is very poor. This means that when they are converted to text for searching, many of the words are not readable to a computer. As a tip: make your search as basic as possible. Searching for a location? Just search a city, then the state, to see what comes up. Searching for a type of UFO? Use “saucer” vs. “flying saucer” or longer expression. It will increase the chances of finding what you are looking for.

2) The text may look garbled on the search results page (but not the .pdf!) – This is normal. For the same reason above… converting a sentence that may read ok to the human eye, may be gibberish to a computer due to the quality of the decades old state of many of the records. Don’t let that discourage you. Load the .PDF and see what you find. If you searched for “Hollywood” and a .pdf hit came up for Rome, New York, there is a reason why. The word “Hollywood” does appear in the file…so check it out!

3) Not everything was converted to .pdfs – There are a few case files in the Blue Book system that were simply too large to convert. They are:

undated/xxxx-xx-9667997-[BLANK][ 8,198 Pages ]
undated/xxxx-xx-9669100-[ILLEGIBLE]-[ILLEGIBLE]-/ [ 1,450 Pages ]
undated/xxxx-xx-9669191-[ILLEGIBLE]/ [ 3,710 Pages ]

These files will be sorted at a later date. If you are interested in helping, please email contact@theblackvault.com

I tried to access the files not yet processed but was redirected. I will check on what is required to view them and post about it again.

If you are interested in trying your skills at PDF conversion/improvement, the main data set should be more than sufficient.

If you are interested in automatic discovery of what or who was blacked out of government reports, this is also an interesting data set. Personally I think blacking out passages should be forbidden. People should have to accept the consequences of their actions, good or bad. We require that of citizens, why not government staff?

I assume crowd sourcing corrections has already been considered. 130K pages is a fairly small number when it comes to crowd sourcing. Surely there are more than 10,000 people interested in the data set, which works out to 13 pages each. If each person instead did 100 pages, you would have more than enough overlap to do statistics to choose the best corrections.
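The back-of-the-envelope arithmetic:

```python
pages = 130_000
volunteers = 10_000

# Minimum share if every volunteer does an equal slice.
per_person = pages / volunteers
print(per_person)  # 13.0

# If each volunteer instead corrects 100 pages, every page is seen
# multiple times: enough redundancy to vote on the best correction.
redundancy = (volunteers * 100) / pages
print(round(redundancy, 2))  # 7.69 passes over the collection
```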

For those of you who see patterns in UFO reports, a good way to reach across the myriad sightings and reports would be to topic map the entire collection.

Personally I suspect at least some of the reports do concern alien surveillance and that their absence in the intervening years indicates the aliens have lost interest. Given our performance since the 1940’s, that’s not hard to understand.

Balisage: The Markup Conference 2015

January 21st, 2015

Balisage: The Markup Conference 2015 – There is Nothing As Practical As A Good Theory

Key dates:
– 27 March 2015 — Peer review applications due
– 17 April 2015 — Paper submissions due
– 17 April 2015 — Applications for student support awards due
– 22 May 2015 — Speakers notified
– 17 July 2015 — Final papers due
– 10 August 2015 — Symposium on Cultural Heritage Markup
– 11–14 August 2015 — Balisage: The Markup Conference

Bethesda North Marriott Hotel & Conference Center, just outside Washington, DC (I know, no pool with giant head, etc. Do you think if we ask nicely they would put one in? And change the theme of the decorations about every 30 feet in the lobby?)

Balisage is the premier conference on the theory, practice, design, development, and application of markup. We solicit papers on any aspect of markup and its uses; topics include but are not limited to:

  • Cutting-edge applications of XML and related technologies
  • Integration of XML with other technologies (e.g., content management, XSLT, XQuery)
  • Web application development with XML
  • Performance issues in parsing, XML database retrieval, or XSLT processing
  • Development of angle-bracket-free user interfaces for non-technical users
  • Deployment of XML systems for enterprise data
  • Design and implementation of XML vocabularies
  • Case studies of the use of XML for publishing, interchange, or archiving
  • Alternatives to XML
  • Expressive power and application adequacy of XSD, Relax NG, DTDs, Schematron, and other schema languages
Detailed Call for Participation: http://balisage.net/Call4Participation.html
About Balisage: http://balisage.net/
Instructions for authors: http://balisage.net/authorinstructions.html

For more information: info@balisage.net or +1 301 315 9631

I wonder if the local authorities realize the danger in putting that many skilled markup people so close to the source of so much content? (Washington) With attendees sparking off against each other, who knows? We could see an accountable and auditable legislative and rule-making document flow arise. There may not be enough members of Congress in town to smother it.

The revolution may not be televised but it will be powered by markup and its advocates. Come join the crowd with the tools to make open data transparent.

Emacs is My New Window Manager

January 21st, 2015

Emacs is My New Window Manager by Howard Abrams.

From the post:

Most companies that employ me hand me a “work laptop” as I enter the building. Of course, I do not install personal software and keep a clear division between my “work life” and my “real life.”

However, I also don’t like to carry two computers just to jot down personal notes. My remedy is to install a virtualization system and create a “personal” virtual machine. (Building cloud software as my day job means I usually have a few VMs running all the time.)

Since I want this VM to have minimal impact on my work, I base it on a “Server” version of Ubuntu. However, I like some graphical features, so my most minimal after-market installation approach is:

Your mileage with Emacs is going to vary but this was too impressive to pass unremarked.

I first saw this in a tweet by Christophe Lalanne.

MrGeo (MapReduce Geo)

January 21st, 2015

MrGeo (MapReduce Geo)

From the webpage:

MrGeo was developed at the National Geospatial-Intelligence Agency (NGA) in collaboration with DigitalGlobe. The government has “unlimited rights” and is releasing this software to increase the impact of government investments by providing developers with the opportunity to take things in new directions. The software use, modification, and distribution rights are stipulated within the Apache 2.0 license.

MrGeo (MapReduce Geo) is a geospatial toolkit designed to provide raster-based geospatial capabilities that can be performed at scale. MrGeo is built upon the Hadoop ecosystem to leverage the storage and processing of hundreds of commodity computers. Functionally, MrGeo stores large raster datasets as a collection of individual tiles stored in Hadoop to enable large-scale data and analytic services. The co-location of data and analytics offers the advantage of minimizing the movement of data in favor of bringing the computation to the data; a more favorable compute method for Geospatial Big Data. This framework has enabled the servicing of terabyte scale raster databases and performed terrain analytics on databases exceeding hundreds of gigabytes in size.
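Tiled raster storage of the kind described above rests on a standard trick: index the globe by zoom level and tile coordinates. A minimal sketch of the usual Web Mercator (“slippy map”) tile math, not MrGeo’s actual code:

```python
import math

def lonlat_to_tile(lon, lat, zoom):
    """Map a lon/lat pair (degrees) to x/y tile indices at a zoom level."""
    n = 2 ** zoom  # number of tiles along each axis at this zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

# Washington, DC at zoom 10
print(lonlat_to_tile(-77.0369, 38.9072, 10))  # → (292, 391)
```

Co-locating analytics with tiles indexed this way is what lets terrain products like slope and aspect be computed tile-by-tile across a cluster.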

The use cases sound interesting:

Exemplar MrGeo Use Cases:

  • Raster Storage and Provisioning: MrGeo has been used to store, index, tile, and pyramid multi-terabyte scale image databases. Once stored, this data is made available through simple Tiled Map Services (TMS) and or Web Mapping Services (WMS).
  • Large Scale Batch Processing and Serving: MrGeo has been used to pre-compute global 1 ArcSecond (nominally 30 meters) elevation data (300+ GB) into derivative raster products : slope, aspect, relative elevation, terrain shaded relief (collectively terabytes in size)
  • Global Computation of Cost Distance: Given all pub locations in OpenStreetMap, compute 2 hour drive times from each location. The full resolution is 1 ArcSecond (30 meters nominally)

I wonder: if you started war-gaming attacks on well-known cities and posting maps of how the attacks could develop, would that be covered under free speech? Assuming your intent was to educate the general populace about areas that are more dangerous than others in case of a major incident.

I first saw this in a tweet by Marin Dimitrov.

How to share data with a statistician

January 21st, 2015

How to share data with a statistician by Robert M. Horton.

From the webpage:

This is a guide for anyone who needs to share data with a statistician. The target audiences I have in mind are:

  • Scientific collaborators who need statisticians to analyze data for them
  • Students or postdocs in scientific disciplines looking for consulting advice
  • Junior statistics students whose job it is to collate/clean data sets

The goals of this guide are to provide some instruction on the best way to share data to avoid the most common pitfalls and sources of delay in the transition from data collection to data analysis. The Leek group works with a large number of collaborators and the number one source of variation in the speed to results is the status of the data when they arrive at the Leek group. Based on my conversations with other statisticians this is true nearly universally.

My strong feeling is that statisticians should be able to handle the data in whatever state they arrive. It is important to see the raw data, understand the steps in the processing pipeline, and be able to incorporate hidden sources of variability in one’s data analysis. On the other hand, for many data types, the processing steps are well documented and standardized. So the work of converting the data from raw form to directly analyzable form can be performed before calling on a statistician. This can dramatically speed the turnaround time, since the statistician doesn’t have to work through all the pre-processing steps first.

My favorite part:

The code book

For almost any data set, the measurements you calculate will need to be described in more detail than you will sneak into the spreadsheet. The code book contains this information. At minimum it should contain:

  1. Information about the variables (including units!) in the data set not contained in the tidy data
  2. Information about the summary choices you made
  3. Information about the experimental study design you used
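A minimal machine-readable codebook along those lines, with a check that every column in the tidy data is documented (the variable names are invented for illustration):

```python
codebook = {
    "weight_kg": "Participant weight in kilograms, measured at intake",
    "dose_mg":   "Daily drug dose in milligrams (summary: mean over study period)",
    "arm":       "Study design: 'treatment' or 'control', randomized 1:1",
}

tidy_columns = ["weight_kg", "dose_mg", "arm"]

# Every column in the tidy data set should appear in the codebook.
undocumented = [c for c in tidy_columns if c not in codebook]
print(undocumented)  # []
```

A check this simple, run before the hand-off, catches the single most common question a statistician has to ask.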

Does a codebook exist for the data that goes into or emerges from your data processing?

If someone has to ask you what the variables mean, it’s not really “open” data, is it?

I first saw this in a tweet by Christophe Lalanne.

Free Speech: Burning Issue Last Week, This Week Not So Much

January 21st, 2015

As evidence of the emptiness of the free speech rallies last week in Paris, I offer the following story:

France – Don’t Criminalise Children For Speech by John Sargeant.

John details the arrest of a 16 year old who published a parody of a cover by Charlie Hebdo. Take the time to see it and the side-by-side comparison of the original cover and the parody.

Ask yourself, could I make a parody of the parody with certain elected officials of the United States in the parody?

Not a good area for experimentation. My suspicion, without looking in the United States Code, is that you would be arrested and severely punished.

Just so you know that you can’t parody some people in the United States. Feeling real robust about freedom of speech now?

I first saw this in a tweet by Alex Brown.

Be a 4clojure hero with Emacs

January 21st, 2015

Be a 4clojure hero with Emacs by Artur Malabarba.

From the post:

This year I made it my resolution to learn clojure. After reading through the unexpectedly engaging romance that is Clojure for the Brave and True, it was time to boldly venture through the dungeons of 4clojure. Sword in hand, I install 4clojure.el and start hacking, but I felt the interface could use some improvements.

It seems only proper, after Windows 10, to mention an editor that you can extend without needing a campus full of programmers and a gaggle of marketing folks. Not to mention it is easier to extend as well.

Artur has two suggestions/extensions that will help propel you along with 4clojure.

I first saw this in a tweet by Anna Pawlicka.

The next generation of Windows: Windows 10

January 21st, 2015

The next generation of Windows: Windows 10 by Terry Myerson.

From the post:

Today I had the honor of sharing new information about Windows 10, the new generation of Windows.

Our team shared more Windows 10 experiences and how Windows 10 will inspire new scenarios across the broadest range of devices, from big screens to small screens to no screens at all. You can catch the video on-demand presentation here.

Windows 10 is the first step to an era of more personal computing. This vision framed our work on Windows 10, where we are moving Windows from its heritage of enabling a single device – the PC – to a world that is more mobile, natural and grounded in trust. We believe your experiences should be mobile – not just your devices. Technology should be out of the way and your apps, services and content should move with you across devices, seamlessly and easily. In our connected and transparent world, we know that people care deeply about privacy – and so do we. That’s why everything we do puts you in control – because you are our customer, not our product. We also believe that interacting with technology should be as natural as interacting with people – using voice, pen, gestures and even gaze for the right interaction, in the right way, at the right time. These concepts led our development and you saw them come to life today.

I had to find a text equivalent to the video. I was looking for specific information I saw mentioned in an email and watching the entire presentation (2+ hours) just wasn’t in the cards.

I will be watching the comment lists on Windows 10 for the answers to two questions:

First, will I be able to run Windows 10 within a VM on Ubuntu?

Second, for “sharing” of annotations to documents, is the “sharing” protocol open so that annotations can be shared by users not using Windows 10?

Actually I did see some of the video and assuming you have the skills of a graphic artist, you are going to be producing some rocking content with Windows 10. People who struggle to doodle, not so much.

The devil will be in the details but I can say this is the first version of Windows that has ever made me consider upgrading from Windows XP. Haven’t decided and may have to run it on a separate box (share monitors with Ubuntu) but I can definitely say I am interested.

TM-Gen: A Topic Map Generator from Text Documents

January 21st, 2015

TM-Gen: A Topic Map Generator from Text Documents by Angel L. Garrido, et al.

From the post:

The vast amount of text documents stored in digital format is growing at a frantic rhythm each day. Therefore, tools able to find accurate information by searching in natural language information repositories are gaining great interest in recent years. In this context, there are especially interesting tools capable of dealing with large amounts of text information and deriving human-readable summaries. However, one step further is to be able not only to summarize, but to extract the knowledge stored in those texts, and even represent it graphically.

In this paper we present an architecture to generate automatically a conceptual representation of knowledge stored in a set of text-based documents. For this purpose we have used the topic maps standard and we have developed a method that combines text mining, statistics, linguistic tools, and semantics to obtain a graphical representation of the information contained therein, which can be coded using a knowledge representation language such as RDF or OWL. The procedure is language-independent, fully automatic, self-adjusting, and it does not need manual configuration by the user. Although the validation of a graphic knowledge representation system is very subjective, we have been able to take advantage of an intermediate product of the process to make an experimental validation of our proposal.

Of particular note on the automatic construction of topic maps:

Addition of associations:

TM-Gen adds to the topic map the associations between topics found in each sentence. These associations are given by the verbs present in the sentence. TM-Gen performs this task by searching for the subject included as a topic, and then it adds the verb as its association. Finally, it links the verb complement with the topic and with the association as a new topic.

    Depending on the archive one would expect associations between the authors and articles but also topics within articles, to say nothing of date, the publication, etc. Once established, a user can request a view that consists of more or less detail. If not captured, however, more detail will not be available.
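    The quoted procedure can be caricatured in a few lines of Python. This is only a toy sketch of the subject-verb-complement idea, not TM-Gen's actual pipeline (which uses real linguistic tools and statistics); the sentences and helper names are invented:

    ```python
    def extract_association(sentence):
        # Naive pattern: first word = subject, second = verb,
        # remainder = verb complement. Real systems use a parser.
        words = sentence.rstrip(".").split()
        subject, verb = words[0], words[1]
        complement = " ".join(words[2:])
        return subject, verb, complement

    def build_topic_map(sentences):
        topics, associations = set(), []
        for s in sentences:
            subj, verb, comp = extract_association(s)
            topics.update([subj, comp])              # subject and complement become topics
            associations.append((subj, verb, comp))  # the verb names the association
        return topics, associations

    topics, assocs = build_topic_map([
        "Garrido wrote TM-Gen.",
        "TM-Gen generates topic maps.",
    ])
    print(assocs)
    ```

    Even this toy version shows why the verb-as-association heuristic produces usable raw material for later editing.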

    There is only a general description of TM-Gen but enough to put you on the way to assembling something quite similar.

    TMR: A Semantic Recommender System using Topic Maps on the Items’ Descriptions

    January 21st, 2015

    TMR: A Semantic Recommender System using Topic Maps on the Items’ Descriptions by Angel L. Garrido and Sergio Ilarri.


    Recommendation systems have become increasingly popular these days. Their utility has been proved to filter and to suggest items archived at web sites to the users. Even though recommendation systems have been developed for the past two decades, existing recommenders are still inadequate to achieve their objectives and must be enhanced to generate appealing personalized recommendations effectively. In this paper we present TMR, a context-independent tool based on topic maps that works with items’ descriptions and reviews to provide suitable recommendations to users. TMR takes advantage of lexical and semantic resources to infer users’ preferences and thus the recommender is not restricted by the syntactic constraints imposed on some existing recommenders. We have verified the correctness of TMR using a popular benchmark dataset.

    One of the more exciting aspects of this paper is the building of topic maps from free texts that are then used in the recommendation process.
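    To make the general idea concrete, here is a minimal content-based recommender over item descriptions using plain term-frequency cosine similarity. It is a sketch of the broad approach, not TMR's topic-map method; the catalog is invented:

    ```python
    import math
    from collections import Counter

    def vector(text):
        # bag-of-words term frequencies
        return Counter(text.lower().split())

    def cosine(a, b):
        dot = sum(a[t] * b[t] for t in set(a) & set(b))
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def recommend(liked_description, catalog):
        # rank catalog items by similarity to a description the user liked
        liked = vector(liked_description)
        return sorted(catalog,
                      key=lambda title: cosine(liked, vector(catalog[title])),
                      reverse=True)

    catalog = {  # invented item descriptions
        "Graph Databases": "graph database neo4j cypher queries",
        "Cooking Basics":  "recipes kitchen cooking food",
        "Spark Streaming": "spark streaming fault tolerance data",
    }
    print(recommend("neo4j graph queries", catalog)[0])
    ```

    A real system would add IDF weighting plus the lexical and semantic resources the paper describes; the point is only that item descriptions alone already carry signal.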

    I haven’t seen the generated topic maps (yet) but suspect that editing an existing topic map is far easier than creating one ab initio.

    Command-line tools can be 235x faster than your Hadoop cluster

    January 21st, 2015

    Command-line tools can be 235x faster than your Hadoop cluster by Adam Drake.

    From the post:

    As I was browsing the web and catching up on some sites I visit periodically, I found a cool article from Tom Hayden about using Amazon Elastic Map Reduce (EMR) and mrjob in order to compute some statistics on win/loss ratios for chess games he downloaded from the millionbase archive, and generally have fun with EMR. Since the data volume was only about 1.75GB containing around 2 million chess games, I was skeptical of using Hadoop for the task, but I can understand his goal of learning and having fun with mrjob and EMR. Since the problem is basically just to look at the result lines of each file and aggregate the different results, it seems ideally suited to stream processing with shell commands. I tried this out, and for the same amount of data I was able to use my laptop to get the results in about 12 seconds (processing speed of about 270MB/sec), while the Hadoop processing took about 26 minutes (processing speed of about 1.14MB/sec). (emphasis added)

    BTW, Adam was using twice as much data as Tom in his analysis.

    The lesson here is to not be a one-trick pony as a data scientist. Most solutions (Hadoop, Spark, Titan) can solve most problems. However, anyone who merits the moniker “data scientist” should be able to choose the “best” solution for a given set of circumstances. In some cases that may be a simple shell script.
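    The stream-processing point is easy to demonstrate: a single pass over result lines, constant memory, no cluster required. A toy Python version (Adam used shell pipelines; the input format here is simplified to one result token per line):

    ```python
    import io

    def tally(lines):
        # One streaming pass over chess result lines; constant memory.
        counts = {"1-0": 0, "0-1": 0, "1/2-1/2": 0}
        for line in lines:
            token = line.strip()
            if token in counts:
                counts[token] += 1
        return counts

    sample = io.StringIO("1-0\n0-1\n1-0\n1/2-1/2\n")
    print(tally(sample))  # {'1-0': 2, '0-1': 1, '1/2-1/2': 1}
    ```

    Because the aggregation is associative, the same loop parallelizes trivially across files, which is essentially what Adam's shell pipeline does with `xargs` and `awk`.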

    I first saw this in a tweet by Atabey Kaygun.

    Flask and Neo4j

    January 20th, 2015

    Flask and Neo4j – An example blogging application powered by Flask and Neo4j. by Nicole White.

    From the post:

    I recommend that you read through Flask’s quickstart guide before reading this tutorial. The following is drawn from Flask’s tutorial on building a microblog application. This tutorial expands the microblog example to include social features, such as tagging posts and recommending similar users, by using Neo4j instead of SQLite as the backend database.
    (14 parts follow here)

    The fourteen parts take you all the way through deployment on Heroku.

    I don’t think you will abandon your current blogging platform but you will gain insight into Neo4j and Flask. A non-trivial outcome.

    10 Chemistry Blogs You Should Read

    January 20th, 2015

    10 Chemistry Blogs You Should Read by Aaron Oneal.

    If you are looking for reading in chemistry, Aaron has assembled ten very high quality blogs for you to follow. Each is listed with a short description so you can tune the reading to your taste.

    Personally I recommend taking a sip from each one. It is rare that I read a really good blog and don’t find something of interest, often relevant to other projects, that I would not have seen otherwise.

    The PokitDok HealthGraph

    January 20th, 2015

    The PokitDok HealthGraph by Denise Gosnell, PhD and Alec Macrae.

    From the post:


    While the front-end team has been busy putting together version 3 of our healthcare marketplace, the data science team has been hard at work on several things that will soon turn into new products. Today, I’d like to give you a sneak peek at one of these projects, one that we think will profoundly change the way you think about health data. We call it the PokitDok HealthGraph. Let’s ring in the New Year with some data science!

    Everyone’s been talking about Graph Theory, but what is it, exactly?

    And we aren’t talking about bar graphs and pie charts.

    Social networks have brought the world of graph theory to the forefront of conversation. Even though graph theory has been around since Euler solved the infamous Konigsberg bridge problem in the 1700’s, we can thank the current age of social networking for giving graph theory a modern revival.

    At the very least, graph theory is the art of connecting the dots, kind of like those sweet pictures you drew as a kid. A bit more formally, graph theory studies relationships between people, places and/or things. Take any ol’ social network – Facebook, for example, uses a graph database to help people find friends and interests. In graph theory, we represent this type of information with nodes (dots) and edges (lines) where the nodes are people, places and/or things and the lines represent their relationship.

    To make a long story short: healthcare is about you and connecting you with quality care. When data scientists think of connecting things together, graphs are most often the direction we go.

    At PokitDok, we like to look at your healthcare needs as a social network, aka: your personal HealthGraph. The HealthGraph is a network of doctors, other patients, insurance providers, common ailments and all of the potential connections between them.

    Hard to say in advance but it looks like Denise and Alec are close to the sweet spot on graph explanations for lay people. Having subject matter that is important to users helps. And using familiar names for the nodes of the graph works as well.
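    The nodes-and-edges idea needs nothing fancier than an adjacency list. A toy Python sketch (the names are invented, not PokitDok's data) that finds patients who share a doctor:

    ```python
    from collections import defaultdict

    # Relationships as (node, node) pairs: patients, doctors, etc.
    edges = [
        ("Alice", "Dr. Kim"),
        ("Bob",   "Dr. Kim"),
        ("Bob",   "Dr. Lee"),
    ]

    graph = defaultdict(set)  # undirected adjacency list
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)

    # Patients two hops from Alice, i.e. people who share a doctor with her:
    shares_doctor_with_alice = set()
    for doctor in graph["Alice"]:
        shares_doctor_with_alice |= graph[doctor]
    shares_doctor_with_alice.discard("Alice")
    print(shares_doctor_with_alice)  # {'Bob'}
    ```

    The two-hop traversal is exactly the "friend of a friend" query that makes social networks the standard lay explanation for graphs.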

    Worth following this series of posts to see if they continue along this path.

    Databases of Biological Databases (yes, plural)

    January 20th, 2015

    Mick Watson points out in a tweet today that there are at least two databases of biological databases.


    MetaBase is a user-contributed list of all the biological databases available on the internet. Currently there are 1,802 entries, each describing a different database. The databases are described in a semi-structured way by using templates and entries can carry various user comments and annotations (see a random entry). Entries can be searched, listed or browsed by category.

    The site uses the same MediaWiki technology that powers Wikipedia, probably the best known user-contributed resource on the internet. The Mediawiki system allows users to participate on many different levels, ranging from authors and editors to curators and designers.

    Database description

    MetaBase aims to be a flexible, user-driven (user-created) resource for the biological database community.

    The main focus of MetaBase is summarised below:

    • As a basic requirement, MB contains a list of databases, URLs and descriptions of the most commonly used biological databases currently available on the internet.
    • The system should be flexible, allowing users to contribute, update and maintain the data in different ways.
    • In the future we aim to generate more communication between the database developer and user communities.

    A larger, more ambitious list of aims is given here.

    The first point was achieved using data taken from the Molecular Biology Database Collection. Secondly, MetaBase has been implemented using MediaWiki. The final point will take longer, and is dependent on the community uptake of MB…

    DBD – Database of Biological Databases

    The DBD: Database of Biological Databases team is R.R. Siva Kiran and MVN Setty, Department of Biotechnology, MS Ramaiah Institute of Technology, MSR Nagar, Bangalore, India, and G. Hanumantha Rao, Center for Biotechnology, Department of Chemical Engineering, Andhra University, Visakhapatnam-530003, India. DBD consists of 1,200 database entries covering a wide range of databases useful for biological researchers.

    Be aware that the DBD database reports its last update as 30-July-2008. I have written to confirm if that is the correct date.

    Assuming it is, has anyone validated the links in the DBD database and/or compared them to the links in Metabase? That seems like a worthwhile service to the community.
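    Comparing the two databases' link lists would be a straightforward script. A sketch using only the standard library; the URLs are placeholders, not actual MetaBase or DBD entries:

    ```python
    from urllib.parse import urlparse

    def normalize(url):
        # Ignore scheme and trailing-slash differences when comparing entries.
        p = urlparse(url.strip().lower())
        return p.netloc + p.path.rstrip("/")

    metabase_links = ["http://example.org/db1/", "https://example.org/db2"]
    dbd_links      = ["https://example.org/db2/", "http://example.org/db3"]

    m = {normalize(u) for u in metabase_links}
    d = {normalize(u) for u in dbd_links}
    print(sorted(m & d))  # listed in both databases
    print(sorted(m - d))  # only in MetaBase
    ```

    Add an HTTP HEAD request per normalized URL and you have the link validator as well, though polite rate-limiting would be needed against 1,800+ entries.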

    Spark Summit East Agenda (New York, March 18-19 2015)

    January 20th, 2015

    Spark Summit East Agenda (New York, March 18-19 2015)


    The plenary and track sessions are on day one. Databricks is offering three training courses on day two.

    The track sessions were divided into developer, applications and data science tracks. To assist you in finding your favorite speakers, I have collapsed that listing and sorted it by the first listed speaker’s last name. I certainly hope all of these presentations will be video recorded!

    Take good notes and blog about your favorite sessions! Ping me with a pointer to your post. Thanks!

    I first saw this in a tweet by Helena Edelson.

    Modelling Data in Neo4j: Labels vs. Indexed Properties

    January 20th, 2015

    Modelling Data in Neo4j: Labels vs. Indexed Properties by Christophe Willemsen.

    From the post:

    A common question when planning and designing your Neo4j Graph Database is how to handle "flagged" entities. This could include users that are active, blog posts that are published, news articles that have been read, etc.


    In the SQL world, you would typically create a boolean|tinyint column; in Neo4j, the same can be achieved in the following two ways:

    • A flagged indexed property
    • A dedicated label

    Having faced this design dilemma a number of times, we would like to share our experience with the two presented possibilities and some Cypher query optimizations that will help you take full advantage of the graph database.

    Throughout the blog post, we'll use the following example scenario:

    • We have User nodes
    • User FOLLOWS other users
    • Each user writes multiple blog posts stored as BlogPost nodes
    • Some of the blog posts are drafted, others are published (active)

    This post will help you make the best use of labels in Neo4j.

    Labels are semantically opaque, so if your Neo4j database uses “German” to label books written in German, you are SOL if you need German for nationality.

    That is a weakness of semantically opaque tokens. Having type properties on labels would push the semantic opaqueness to the next level.
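    The performance side of the label-versus-property choice can be mimicked in plain Python: filtering on a property means touching every node, while a label acts like a ready-made index over a set of nodes. This is only an analogy for Neo4j's behavior, not its implementation; the nodes are invented:

    ```python
    posts = [  # invented BlogPost nodes
        {"id": 1, "title": "Graphs", "published": True},
        {"id": 2, "title": "Drafts", "published": False},
        {"id": 3, "title": "Cypher", "published": True},
    ]

    # 1) Property approach: every candidate node is examined.
    published_by_property = [p["id"] for p in posts if p["published"]]

    # 2) Label approach: membership is kept in a per-label set,
    #    so "all :PublishedPost nodes" is a direct lookup.
    label_index = {"PublishedPost": {1, 3}, "DraftPost": {2}}
    published_by_label = sorted(label_index["PublishedPost"])

    assert published_by_property == published_by_label == [1, 3]
    ```

    Both answers agree; the difference is how much work it took to get them, which is exactly the tradeoff the post benchmarks in Cypher.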

    pgcli [Inspiration for command line tool for XPath/XQuery?]

    January 20th, 2015


    From the webpage:

    Pgcli is a command line interface for Postgres with auto-completion and syntax highlighting.

    Postgres folks who don’t know about pgcli will be glad to see this post.

    But, having spent several days with XPath/XQuery/FO 3.1 syntax, I can only imagine the joy in XML circles for a similar utility for use with command line XML tools.

    Properly done, the increase in productivity would be substantial.
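    The core of such a tool is unglamorous: prefix completion over a known vocabulary. A toy completer over a few real XPath function names (the list is illustrative, not complete):

    ```python
    # A few real XPath function names; a real tool would load the full set
    # from the XPath/XQuery Functions and Operators specification.
    FUNCTIONS = [
        "starts-with",
        "string-length",
        "substring",
        "substring-after",
        "substring-before",
    ]

    def complete(prefix):
        # return candidates sharing the typed prefix, sorted for display
        return sorted(f for f in FUNCTIONS if f.startswith(prefix))

    print(complete("subs"))  # ['substring', 'substring-after', 'substring-before']
    ```

    Wire a function like this into Python's `readline` completer hook and you have the skeleton of a pgcli-style experience for XML command line tools.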

    The same applies for your favorite NoSQL query language. (Datomic?)

    Will SQL users be the only ones with such a command line tool?

    I first saw this in a tweet by elishowk.

    Improved Fault-tolerance and Zero Data Loss in Spark Streaming

    January 20th, 2015

    Improved Fault-tolerance and Zero Data Loss in Spark Streaming by Tathagata Das.

    From the post:

    Real-time stream processing systems must be operational 24/7, which requires them to recover from all kinds of failures in the system. Since its beginning, Spark Streaming has included support for recovering from failures of both driver and worker machines. However, for some data sources, input data could get lost while recovering from the failures. In Spark 1.2, we have added preliminary support for write ahead logs (also known as journaling) to Spark Streaming to improve this recovery mechanism and give stronger guarantees of zero data loss for more data sources. In this blog, we are going to elaborate on how this feature works and how developers can enable it to get those guarantees in Spark Streaming applications.


    Spark and its RDD abstraction is designed to seamlessly handle failures of any worker nodes in the cluster. Since Spark Streaming is built on Spark, it enjoys the same fault-tolerance for worker nodes. However, the demand of high uptimes of a Spark Streaming application require that the application also has to recover from failures of the driver process, which is the main application process that coordinates all the workers. Making the Spark driver fault-tolerant is tricky because it is an arbitrary user program with arbitrary computation patterns. However, Spark Streaming applications have an inherent structure in the computation — it runs the same Spark computation periodically on every micro-batch of data. This structure allows us to save (aka, checkpoint) the application state periodically to reliable storage and recover the state on driver restarts.

    For sources like files, this driver recovery mechanism was sufficient to ensure zero data loss as all the data was reliably stored in a fault-tolerant file system like HDFS or S3. However, for other sources like Kafka and Flume, some of the received data that was buffered in memory but not yet processed could get lost. This is because of how Spark applications operate in a distributed manner. When the driver process fails, all the executors running in a standalone/yarn/mesos cluster are killed as well, along with any data in their memory. In case of Spark Streaming, all the data received from sources like Kafka and Flume are buffered in the memory of the executors until their processing has completed. This buffered data cannot be recovered even if the driver is restarted. To avoid this data loss, we have introduced write ahead logs in Spark Streaming in the Spark 1.2 release.

    Solid piece on the principles and technical details you will need for zero data loss in Spark Streaming, with suggestions for changes that may be necessary to support zero data loss at no cost in throughput. The latter is a non-trivial consideration.
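    The write-ahead idea itself is simple: persist each record durably before acknowledging it, and replay the log after a restart. A minimal stand-alone Python sketch, not Spark's implementation (which writes the logs to a fault-tolerant store like HDFS or S3):

    ```python
    import os
    import tempfile

    class WriteAheadLog:
        """Append-only log: a record is made durable before it is acknowledged."""

        def __init__(self, path):
            self.path = path

        def append(self, record):
            with open(self.path, "a") as f:
                f.write(record + "\n")
                f.flush()
                os.fsync(f.fileno())  # force the record to stable storage

        def replay(self):
            # After a crash/restart, re-read everything that was logged.
            if not os.path.exists(self.path):
                return []
            with open(self.path) as f:
                return [line.rstrip("\n") for line in f]

    wal = WriteAheadLog(os.path.join(tempfile.mkdtemp(), "wal.log"))
    wal.append("event-1")
    wal.append("event-2")
    print(wal.replay())  # ['event-1', 'event-2'], even after a restart
    ```

    The throughput question comes from that `fsync`: durability per record costs latency, which is why the post's discussion of batching the log writes matters.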

    Curious: I understand that many systems require zero data loss, but do you have examples of systems where some data loss is acceptable? To what extent is data loss acceptable? (Given lost baggage rates, is airline baggage one of those?)

    Modelling Plot: On the “conversional novel”

    January 20th, 2015

    Modelling Plot: On the “conversional novel” by Andrew Piper.

    From the post:

    I am pleased to announce the acceptance of a new piece that will be appearing soon in New Literary History. In it, I explore techniques for identifying narratives of conversion in the modern novel in German, French and English. A great deal of new work has been circulating recently that addresses the question of plot structures within different genres and how we might or might not be able to model these computationally. My hope is that this piece offers a compelling new way of computationally studying different plot types and understanding their meaning within different genres.

    Looking over recent work, in addition to Ben Schmidt’s original post examining plot “arcs” in TV shows using PCA, there have been posts by Ted Underwood and Matthew Jockers looking at novels, as well as a new piece in LLC that tries to identify plot units in fairy tales using the tools of natural language processing (frame nets and identity extraction). In this vein, my work offers an attempt to think about a single plot “type” (narrative conversion) and its role in the development of the novel over the long nineteenth century. How might we develop models that register the novel’s relationship to the narration of profound change, and how might such narratives be indicative of readerly investment? Is there something intrinsic, I have been asking myself, to the way novels ask us to commit to them? If so, does this have something to do with larger linguistic currents within them – not just a single line, passage, or character, or even something like “style” – but the way a greater shift of language over the course of the novel can be generative of affective states such as allegiance, belief or conviction? Can linguistic change, in other words, serve as an efficacious vehicle of readerly devotion?

    While the full paper is available here, I wanted to post a distilled version of what I see as its primary findings. It’s a long essay that not only tries to experiment with the project of modelling plot, but also reflects on the process of model building itself and its place within critical reading practices. In many ways, it’s a polemic against the unfortunate binariness that surrounds debates in our field right now (distant/close, surface/depth etc.). Instead, I want us to see how computational modelling is in many ways conversional in nature, if by that we understand it as a circular process of gradually approaching some imaginary, yet never attainable centre, one that oscillates between both quantitative and qualitative stances (distant and close practices of reading).

    Andrew writes of “…critical reading practices….” I’m not sure that technology will increase the use of “…critical reading practices…” but it certainly offers the opportunity to “read” texts in different ways.

    I have done this with IT standards but never a novel: try reading it from back to front, a sentence at a time. At least when proofing your own writing, it provides a radically different perspective from the normal front-to-back reading. The first thing you notice is that it interrupts your reading/skimming speed, so you catch more errors as well as nuances in the text.
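    Piper's notion of linguistic change over the course of a text can be crudely approximated by comparing the word distributions of a text's two halves. A toy sketch, far simpler than the paper's models; the "texts" are invented:

    ```python
    import math
    from collections import Counter

    def half_distance(text):
        # Cosine distance between word counts of a text's two halves:
        # 0 = the halves share the same vocabulary, 1 = no overlap at all.
        words = text.lower().split()
        mid = len(words) // 2
        a, b = Counter(words[:mid]), Counter(words[mid:])
        dot = sum(a[w] * b[w] for w in set(a) & set(b))
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return 1 - (dot / (na * nb) if na and nb else 0.0)

    stable   = "the sea the sea the sea the sea"
    shifting = "darkness doubt despair light grace belief"
    assert half_distance(stable) < half_distance(shifting)
    ```

    A "conversional" novel, on this crude proxy, would score high: its later language departs from its earlier language.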

    Before you think that literary analysis is a bit far afield from “practical” application, remember that narratives (think literature) are what drive social policy and decision making.

    Take the “war on terrorism” narrative that is so popular and unquestioned in the United States. Ask anyone inside the beltway in D.C. and they will blather on and on about the need to defend against terrorism. But there is an absolute paucity of terrorists, at least by deed, in the United States. Why does the narrative persist in the absence of any evidence to support it?

    The various Red Scares in U.S. history were similar narratives that have never completely faded. They too had a radical disconnect between the narrative and the “facts on the ground.”

    Piper doesn’t offer answers to those sort of questions but a deeper understanding of narrative, such as is found in novels, may lead to hints with profound policy implications.

    Opportunistic “Information” on Sony Hack

    January 20th, 2015

    Why the US was so sure North Korea hacked Sony: it had a front-row seat by Lisa Vaas.

    From the post:

    We may finally know why the US was so confident about identifying North Korea’s hand in the Sony attack: it turns out the NSA had front-row seats to the cyber carnage, having infiltrated computers and networks of the country’s hackers years ago.

    According to the New York Times, a recently released top-secret document traces the NSA’s infiltration back to 2010, when it piggybacked on South Korean “implants” on North Korea’s networks and “sucked back the data”.

    The NSA didn’t find North Korea all that interesting, but that attitude changed as time went on, in part because the agency managed to intercept and repurpose a 0-day exploit – a “big win,” according to the document.

    Stories like this one make me wonder whether anyone follows the hyperlinks embedded in posts.

    The document, http://www.spiegel.de/media/media-35679.pdf, is composed of war stories, one of which was to answer the question:

    Is there “fifth party” collection?

    “Fourth party collection” refers to passively or actively obtaining data from some other actor’s CNE activity against a target. Has there ever been an instance of NSA obtaining information from Actor One exploiting Actor Two’s CNE activity against a target that NSA, Actor One, and Actor Two all care about?

    The response:

    Yes. There was a project that I was working last year with regard to the South Korean CNE program. While we weren’t super interested in SK (things changed a bit when they started targeting us a bit more), we were interested in North Korea and SK puts a lot of resources against them. At that point, our access to NK was next to nothing but we were able to make some inroads into the SK CNE program. We found a few instances where there were NK officials with SK implants in their boxes, so we got on the exfil points, and sucked back the data. Thats forth party. (TS//SI//REL) However, some of the individuals that SK was targeting were also part of the NK CNE program. So I guess that would be the fifth party collect you were talking about. But once that started happening, we ramped up efforts to target NK ourselves (as you don’t want to rely on an untrusted actor to do your work for you.) But some of the work that was done there was to help us gain access. (TS//SI//REL) I know of another instance (I will be more vague because I believe there are more compartments involved and parts are probably NF) where there was an actor we were going against. We realized another actor was also going against them and having great success because of a 0 day they wrote. We got the 0 day out of passive and were able to re-purpose it. Big win. (TS//SI//REL) But they were all still referred to as a fourth party.

    Origin: The document appears on the Free Snowden site under the title: ‘4th Party Collection’: Taking Advantage of Non-Partner Computer Network Exploitation Activity


    There are a couple of claims in Lisa’s account that are easy to dismiss on the basis of the document itself:

    Lisa says:

    The NSA didn’t find North Korea all that interesting, but that attitude changed as time went on, in part because the agency managed to intercept and repurpose a 0-day exploit – a “big win,” according to the document.

    Assuming that SK = South Korea and NK = North Korea, the document reports:

    While we weren’t super interested in SK (things changed a bit when they started targeting us a bit more), we were interested in North Korea and SK puts a lot of resources against them. (Emphasis added)

    I read that to say we weren’t “super interested” in South Korea until South Korea started targeting us more. Does anyone read that sentence differently?

    Lisa also says, in the same passage, that:

    The NSA didn’t find North Korea all that interesting, but that attitude changed as time went on, in part because the agency managed to intercept and repurpose a 0-day exploit – a “big win,” according to the document.

    The war story in question concludes the South Korea and North Korea account and then says:

    I know of another instance (I will be more vague because I believe there are more compartments involved and parts are probably NF) where there was an actor we were going against. We realized another actor was also going against them and having great success because of a 0 day they wrote. We got the 0 day out of passive and were able to re-purpose it. Big win. (TS//SI//REL) But they were all still referred to as a fourth party. (emphasis added)

    The “I know of another instance” signals to most readers a change in the narrative to start a different account from the one just concluded. In the second instance, only “actor” is used and there is no intimation that North Korea is one of those actors. Could certainly be but there is no apparent connection between the two accounts.

    Moreover, there is nothing in the war story to indicate that a permanent monitoring presence was established in any network, capable of the sort of monitoring that Lisa characterizes as having “a front-row seat.”


    The leaking of this document is an attempt to exploit uncertainty about government claims concerning the Sony hack.

    The document does not establish recovery of data from the North Korean network but only “…NK officials with SK implants in their boxes, so we got on the exfil points, and sucked back the data.”

    Moreover, the document establishes that South Korea attempts to conduct CNE operations against the United States and is considered “…an untrusted actor….”

    The zero day exploit may have been against North Korea, anything is possible but this document gives no basis for concluding it was against North Korea.

    Finally, this document does not establish any basis for concluding that the United States had achieved a network monitoring capability on North Korean CNE networks or operations.

    It is bad enough the United States government keeps inventing specious claims about the Sony hack. Let’s not assist it by manufacturing even less likely accounts.

    D-Lib Magazine January/February 2015

    January 19th, 2015

    D-Lib Magazine January/February 2015

    From the table of contents (see the original toc for abstracts):


    2nd International Workshop on Linking and Contextualizing Publications and Datasets by Laurence Lannom, Corporation for National Research Initiatives

    Data as “First-class Citizens” by Łukasz Bolikowski, ICM, University of Warsaw, Poland; Nikos Houssos, National Documentation Centre / National Hellenic Research Foundation, Greece; Paolo Manghi, Istituto di Scienza e Tecnologie dell’Informazione, Consiglio Nazionale delle Ricerche, Italy and Jochen Schirrwagen, Bielefeld University Library, Germany


    Semantic Enrichment and Search: A Case Study on Environmental Science Literature by Kalina Bontcheva, University of Sheffield, UK; Johanna Kieniewicz and Stephen Andrews, British Library, UK; Michael Wallis, HR Wallingford, UK

    A-posteriori Provenance-enabled Linking of Publications and Datasets via Crowdsourcing by Laura Drăgan, Markus Luczak-Rösch, Elena Simperl, Heather Packer and Luc Moreau, University of Southampton, UK; Bettina Berendt, KU Leuven, Belgium

    A Framework Supporting the Shift from Traditional Digital Publications to Enhanced Publications by Alessia Bardi and Paolo Manghi, Istituto di Scienza e Tecnologie dell’Informazione, Consiglio Nazionale delle Ricerche, Italy

    Science 2.0 Repositories: Time for a Change in Scholarly Communication by Massimiliano Assante, Leonardo Candela, Donatella Castelli, Paolo Manghi and Pasquale Pagano, Istituto di Scienza e Tecnologie dell’Informazione, Consiglio Nazionale delle Ricerche, Italy

    Data Citation Practices in the CRAWDAD Wireless Network Data Archive by Tristan Henderson, University of St Andrews, UK and David Kotz, Dartmouth College, USA

    A Methodology for Citing Linked Open Data Subsets by Gianmaria Silvello, University of Padua, Italy

    Challenges in Matching Dataset Citation Strings to Datasets in Social Science by Brigitte Mathiak and Katarina Boland, GESIS — Leibniz Institute for the Social Sciences, Germany

    Enabling Living Systematic Reviews and Clinical Guidelines through Semantic Technologies by Laura Slaughter, The Interventional Centre, Oslo University Hospital (OUS), Norway; Christopher Friis Berntsen and Linn Brandt, Internal Medicine Department, Innlandet Hospital Trust and MAGICorg, Norway and Chris Mavergames, Informatics and Knowledge Management Department, The Cochrane Collaboration, Germany

    Data without Peer: Examples of Data Peer Review in the Earth Sciences by Sarah Callaghan, British Atmospheric Data Centre, UK

    The Tenth Anniversary of Assigning DOI Names to Scientific Data and a Five Year History of DataCite by Jan Brase and Irina Sens, German National Library of Science and Technology, Germany and Michael Lautenschlager, German Climate Computing Centre, Germany


    N E W S   &   E V E N T S

    In Brief: Short Items of Current Awareness

    In the News: Recent Press Releases and Announcements

    Clips & Pointers: Documents, Deadlines, Calls for Participation

    Meetings, Conferences, Workshops: Calendar of Activities Associated with Digital Libraries Research and Technologies

    The quality of D-Lib Magazine meets or exceeds the quality claimed by pay-per-view publishers.