Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

September 12, 2014

Want to see how #SchemaOrg #Dbpedia and #SKOS taxonomies can be seamlessly integrated?

Filed under: DBpedia,Schema.org,SKOS — Patrick Durusau @ 10:53 am

Want to see how #SchemaOrg #Dbpedia and #SKOS taxonomies can be seamlessly integrated? Register for our webinar: http://www.poolparty.biz/webinar-taxonomy-management-content-management-well-integrated/

is how the tweet read.

From the webinar registration page:

With the arrival of semantic web standards and linked data technologies, new options for smarter content management and semantic search have become available. Taxonomies and metadata management shall play a central role in your content management system: By combining text mining algorithms with taxonomies and knowledge graphs from the web a more accurate annotation and categorization of documents and more complex queries over text-oriented repositories like SharePoint, Drupal, or Confluence are now possible.

Nevertheless, the predominant opinion that taxonomy management is a tedious process currently impedes a widespread implementation of professional metadata strategies.

In this webinar, key people from the Semantic Web Company will describe how content management and collaboration systems like SharePoint, Drupal or Confluence can benefit from professional taxonomy management. We will also discuss why taxonomy management is not necessarily a tedious process when well integrated into content management workflows.

I’ve had mixed luck with webinars this year. Some were quite good and others were equally bad.

I have fairly firm opinions about #Schema.org, #Dbpedia and #SKOS taxonomies but tedium isn’t one of them. 😉

You can register for free for: Webinar “Taxonomy management & content management – well integrated!”, October 8th, 2014.

Expect the usual marketing harvesting of contact information. Linux users will have to use a Windows VM or a Mac.

If you attend, be sure to look for my post reviewing the webinar and post your comments there.

Bokeh 0.6 release

Filed under: Graphics,Python,Visualization — Patrick Durusau @ 10:30 am

Bokeh 0.6 release by Bryan Van de Ven.

From the post:

Bokeh is a Python library for visualizing large and realtime datasets on the web. Its goal is to provide developers (and domain experts) with capabilities to easily create novel and powerful visualizations that extract insight from local or remote (possibly large) data sets, and to easily publish those visualizations to the web for others to explore and interact with.

This release includes many bug fixes and improvements over our most recent 0.5.2 release:

  • Abstract Rendering recipes for large data sets: isocontour, heatmap
  • New charts in bokeh.charts: Time Series and Categorical Heatmap
  • Full Python 3 support for bokeh-server
  • Much expanded User and Dev Guides
  • Multiple axes and ranges capability
  • Plot object graph query interface
  • Hit-testing (hover tool support) for patch glyphs

See the CHANGELOG for full details.

I’d also like to announce a new Github Organization for Bokeh: https://github.com/bokeh. Currently it is home to Scala and Julia language bindings for Bokeh, but the Bokeh project itself will be moved there before the next 0.7 release. Any implementors of new language bindings who are interested in hosting your project under this organization are encouraged to contact us.

In upcoming releases, you should expect to see more new layout capabilities (colorbar axes, better grid plots and improved annotations), additional tools, even more widgets and more charts, R language bindings, Blaze integration and cloud hosting for Bokeh apps.

Don’t forget to check out the full documentation, interactive gallery, and tutorial at

http://bokeh.pydata.org

as well as the Bokeh IPython notebook nbviewer index (including all the tutorials) at:

http://nbviewer.ipython.org/github/ContinuumIO/bokeh-notebooks/blob/master/index.ipynb

One of the examples from the gallery:

[plot graphic from the Bokeh gallery]

reminds me of U.S. foreign policy. The unseen attractors are defense contractors and other special interests.
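
If you want a feel for what a minimal Bokeh script looks like, here is a sketch of my own (not taken from the release notes, and the plotting API has shifted a bit from version to version, so treat it as illustrative):

```python
# Minimal Bokeh sketch: write a standalone HTML page containing a line plot.
# Illustrative only; check the docs for the API of the version you install.
from bokeh.plotting import figure, output_file, show

x = [1, 2, 3, 4, 5]
y = [6, 7, 2, 4, 5]

output_file("lines.html", title="minimal Bokeh example")

p = figure(title="simple line example", x_axis_label="x", y_axis_label="y")
p.line(x, y, line_width=2)

show(p)  # writes lines.html and opens it in a browser
```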

The Lesser Known Normal Forms of Database Design

Filed under: Database,Humor — Patrick Durusau @ 10:13 am

The Lesser Known Normal Forms of Database Design by John Myles White.

A refreshing retake on normal forms of database design!

Enjoy!

September 11, 2014

MRAPs And Bayonets: What We Know About The Pentagon’s 1033 Program

Filed under: Data,Data Analysis — Patrick Durusau @ 5:47 pm

MRAPs And Bayonets: What We Know About The Pentagon’s 1033 Program by Arezou Rezvani, Jessica Pupovac, David Eads, and Tyler Fisher. (NPR)

From the post:

Amid widespread criticism of the deployment of military-grade weapons and vehicles by police officers in Ferguson, Mo., President Obama recently ordered a review of federal efforts supplying equipment to local law enforcement agencies across the country.

So, we decided to take a look at what the president might find.

NPR obtained data from the Pentagon on every military item sent to local, state and federal agencies through the Pentagon’s Law Enforcement Support Office — known as the 1033 program — from 2006 through April 23, 2014. The Department of Defense does not publicly report which agencies receive each piece of equipment, but they have identified the counties that the items were shipped to, a description of each, and the amount the Pentagon initially paid for them.

We took the raw data, analyzed it and have organized it to make it more accessible. We are making that data set available to the public today.

This is a data set that raises more questions than it answers, as the post points out.

The top ten categories of items distributed (valued in the $millions): vehicles, aircraft, comm. & detection, clothing, construction, fire control, weapons, electric wire, medical equipment, and tractors.

Tractors? I can understand the military having tractors since it must be entirely self-reliant during military operations. Why any local law enforcement office needs a tractor is less clear. Or bayonets (11,959 of them).

The NPR post does a good job of raising questions but since there are 3,143 counties or their equivalents in the United States, connecting the dots with particular local agencies, uses, etc. falls on your shoulders.

Could be quite interesting. Is your local sheriff “training” on an amphibious vehicle to reach his deer blind during hunting season? (Utter speculation on my part. I don’t know if your local sheriff likes to hunt deer.)

September 10, 2014

How is a binary executable organized? Let’s explore it!

Filed under: Linux OS,Programming — Patrick Durusau @ 4:48 pm

How is a binary executable organized? Let’s explore it! by Julia Evans.

From the post:

I used to think that executables were totally impenetrable. I’d compile a C program, and then that was it! I had a Magical Binary Executable that I could no longer read.

It is not so! Executable file formats are regular file formats that you can understand. I’ll explain some simple tools to start! We’ll be working on Linux, with ELF binaries. (binaries are kind of the definition of platform-specific, so this is all platform-specific.) We’ll be using C, but you could just as easily look at output from any compiled language.
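
Her post works through command-line tools; as a rough companion sketch (mine, not hers), the ELF header is simple enough to poke at from Python 3 with nothing but the standard library:

```python
# Peek at an ELF header by hand. Offsets follow the ELF spec: the first
# 16 bytes are e_ident (magic, class, byte order), then e_type, e_machine.
import struct
import sys

def describe_elf(path):
    with open(path, "rb") as f:
        ident = f.read(16)
        if ident[:4] != b"\x7fELF":
            return "not an ELF file"
        bits = {1: "32-bit", 2: "64-bit"}.get(ident[4], "unknown class")
        order = {1: "little-endian", 2: "big-endian"}.get(ident[5], "unknown order")
        fmt = "<HH" if ident[5] == 1 else ">HH"
        e_type, e_machine = struct.unpack(fmt, f.read(4))
        kinds = {1: "relocatable", 2: "executable", 3: "shared object", 4: "core dump"}
        return "%s, %s, %s (machine=%d)" % (
            bits, order, kinds.get(e_type, "type %d" % e_type), e_machine)

if __name__ == "__main__":
    print(describe_elf(sys.argv[1] if len(sys.argv) > 1 else "/bin/ls"))
```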

I’ll be the first to admit that following Julia’s blog too closely carries the risk of changing you into a *nix kernel hacker.

I get a UTF-8 encoding error from her RSS feed so I have to follow her posts manually. Maybe the only thing that has saved me thus far. 😉

Seriously, Julia’s posts help you expand your knowledge of what is on the other side of the screen.

Enjoy!

PS: Julia is demonstrating a world of subjects that are largely unknown to the casual user. Not looking for a subject does not protect you from a defect in that subject.

Where Does Scope Come From?

Filed under: Computer Science,Mathematics — Patrick Durusau @ 4:29 pm

Where Does Scope Come From? by Michael Robert Bernstein.

From the post:

After several false starts, I finally sat down and watched the first of Frank Pfenning’s 2012 “Proof theory foundations” talks from the University of Oregon Programming Languages Summer School (OPLSS). I am very glad that I did.

Pfenning starts the talk out by pointing out that he will be covering the “philosophy” branch of the “holy trinity” of Philosophy, Computer Science and Mathematics. If you want to “construct a logic,” or understand how various logics work, I can’t recommend this video enough. Pfenning demonstrates the mechanics of many notions that programmers are familiar with, including “connectives” (conjunction, disjunction, negation, etc.) and scope.

Scope is demonstrated during this process as well. It turns out that in logic, as in programming, the difference between a sensible concept of scope and a tricky one can often mean the difference between a proof that makes no sense, and one that you can rest other proofs on. I am very interested in this kind of fundamental kernel – how the smallest and simplest ideas are absolutely necessary for a sound foundation in any kind of logical system. Scope is one of the first intuitions that new programmers build – can we exploit this fact to make the connections between logic, math, and programming clearer to beginners? (emphasis in the original)

Michael promises more detail on the treatment of scope in future posts.

The lectures run four (4) hours so it is going to take a while to do all of them. My curiosity is whether “scope” in this context refers only to variables in programming or whether it extends in some way to scope as used in topic maps.

More to follow.

TinkerPop3 M2 Delay for MetaProperties

Filed under: Graphs,TinkerPop — Patrick Durusau @ 4:11 pm

TinkerPop3 M2 Delay for MetaProperties by Marko A. Rodriguez.

From the post:

TinkerPop3 3.0.0.M2 was supposed to be released 1.5 weeks ago. We have delayed the release because we have now introduced MetaProperties into TinkerPop3. Matthias Bröcheler of Titan-fame has been pushing TinkerPop to provide this feature for over a year now. We had numerous discussions about it over the past year, and at one point, rejected the feature request. However, recently, a solid design proposal was presented by Matthias and Stephen and I went about implementing it over the last 1.5 weeks. With that said, TinkerPop3 now has MetaProperties.

What are meta-properties?

  1. Edges have Properties
  2. Vertices have MetaProperties
  3. MetaProperties have Properties

What are the consequences of meta-properties?

  1. A vertex can have multiple “name” properties (for example).
  2. A vertex’s properties (i.e. meta-properties) can have normal key/value properties (e.g. a “name” property can have an “acl:public” property).

What are the use cases?

  1. Provenance: different users have different declarations for Marko’s name: “marko”, “marko rodriguez,” “marko a. rodriguez.”
  2. Security: you can now do property-level security. Marko’s “age” has an acl:private property and his “name”(s) have acl:public properties.
  3. History: who mutated what and when did they do it? each vertex property can have a “creator:stephen” and a “createdAt:2014” property.

If you have ever had to build a graph application that required provenance, security, history, and the like, you realized how difficult it is with the current key/value property graph model. You end up, in essence, creating vertices for properties so you can express such higher order semantics. However, maintaining that becomes a nightmare as tools like Gremlin and GraphWrappers don’t know the semantics and you basically are left to create your own GremlinDSL-extensions and tools to process such a custom representation. Well now, you get it for free and TinkerPop will be able to provide (in the future) wrappers (called strategies in TP3) for provenance, security, history, etc.

I don’t grok the reason for a distinction between properties of vertices and properties of edges so I have posted a note asking about it.

Take the quoted portion as a sample of the quality of work being done on TinkerPop3.

Taxonomies and Toolkits of Regular Language Algorithms

Filed under: Algorithms,Automata,Finite State Automata,String Matching — Patrick Durusau @ 3:32 pm

Taxonomies and Toolkits of Regular Language Algorithms by Bruce William Watson.

From 1.1 Problem Statement:

A number of fundamental computing science problems have been extensively studied since the 1950s and the 1960s. As these problems were studied, numerous solutions (in the form of algorithms) were developed over the years. Although new algorithms still appear from time to time, each of these fields can be considered mature. In the solutions to many of the well-studied computing science problems, we can identify three deficiencies:

  1. Algorithms solving the same problem are difficult to compare to one another. This is usually due to the use of different programming languages, styles of presentation, or simply the addition of unnecessary details.
  2. Collections of implementations of algorithms solving a problem are difficult, if not impossible, to find. Some of the algorithms are presented in a relatively obsolete manner, either using old notations or programming languages for which no compilers exist, making it difficult to either implement the algorithm or find an existing implementation.
  3. Little is known about the comparative practical running time performance of the algorithms. The lack of existing implementations in one and the same framework, especially of the older algorithms, makes it difficult to determine the running time characteristics of the algorithms. A software engineer selecting one of the algorithms will usually do so on the basis of the algorithm’s theoretical running time, or simply by guessing.

In this dissertation, a solution to each of the three deficiencies is presented for each of the following three fundamental computing science problems:

  1. Keyword pattern matching in strings. Given a finite non-empty set of keywords (the patterns) and an input string, find the set of all occurrences of a keyword as a substring of the input string.
  2. Finite automata (FA) construction. Given a regular expression, construct a finite automaton which accepts the language denoted by the regular expression.
  3. Deterministic finite automata (DFA) minimization. Given a DFA, construct the unique minimal DFA accepting the same language.

We do not necessarily consider all the known algorithms solving the problems. For example, we restrict ourselves to batch-style algorithms, as opposed to incremental algorithms.

It requires updating given its age (1995), but the work merits mention.
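
As a toy illustration of the first problem above (and nothing more), here is the naive way to find every occurrence of a set of keywords in a string; the dissertation is about the taxonomy of algorithms (Aho-Corasick, Commentz-Walter and friends) that do this efficiently:

```python
# Naive multi-keyword matching: every (position, keyword) pair where the
# keyword occurs as a substring of the text. Fine for a sketch, but the
# point of the surveyed algorithms is to do better than repeated scans.
def keyword_matches(keywords, text):
    matches = []
    for kw in keywords:
        start = text.find(kw)
        while start != -1:
            matches.append((start, kw))
            start = text.find(kw, start + 1)
    return sorted(matches)

print(keyword_matches({"he", "she", "his", "hers"}, "ushers"))
# [(1, 'she'), (2, 'he'), (2, 'hers')]
```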

I first saw this in a tweet by silentbicycle.srec.

ETL: The Dirty Little Secret of Data Science

Filed under: Data Science,ETL — Patrick Durusau @ 3:01 pm

ETL: The Dirty Little Secret of Data Science by Byron Ruth.

From the description:

“There is an adage that given enough data, a data scientist can answer the world’s questions. The untold truth is that the majority of work happens during the ETL and data preprocessing phase. In this talk I discuss Origins, an open source Python library for extracting and mapping structural metadata across heterogenous data stores.”

More than your usual ETL presentation, Byron makes several points of interest to the topic map community:

  • “domain knowledge” is necessary for effective ETL
  • “domain knowledge” changes and fades from dis-use
  • ETL isn’t transparent to consumers of the data it produces; it is a “black box”
  • Data provenance is the answer to transparency, changing domain knowledge and persisting domain knowledge
  • “Provenance is a record that describes the people, institutions, entities, and activities, involved in producing, influencing, or delivering a piece of data or a thing.”
  • Project Origins captures metadata and structures from backends and persists it to Neo4j

Great focus on provenance but given the lack of merging in Neo4j, the collation of information about a common subject, with different names, is going to be a manual process.

Follow @thedevel.

What’s in a Name?

Filed under: Conferences,Names,Subject Identity — Patrick Durusau @ 10:56 am

What’s in a Name?

From the webpage:

What will be covered? The meeting will focus on the role of chemical nomenclature and terminology in open innovation and communication. A discussion of areas of nomenclature and terminology where there are fundamental issues, how computer software helps and hinders, the need for clarity and unambiguous definitions for application to software systems.

How can you contribute? As well as the talks from expert speakers there will be plenty of opportunity for discussion and networking. A record will be made of the meeting, including the discussion, and will be made available initially to those attending the meeting. The detailed programme and names of speakers will be available closer to the date of the meeting.

Date: 21 October 2014

Event Subject(s): Industry & Technology

Venue

The Royal Society of Chemistry
Library
Burlington House
Piccadilly
London
W1J 0BA
United Kingdom

Contact for Event Information

Name: Prof Jeremy Frey

Address:
Chemistry
University of Southampton
United Kingdom

Email: j.g.frey@soton.ac.uk

Now there’s an event worth the hassle of overseas travel during these paranoid times! Alas, I will have to wait for the conference record to be released to non-attendees. The event is a good example of the work going on at the Royal Society of Chemistry.

I first saw this in a tweet by Open PHACTS.

iCloud: Leak for Less

Filed under: Cloud Computing,Cybersecurity,Security — Patrick Durusau @ 10:44 am

Apple rolls out iCloud pricing cuts by Jonathan Vanian.

Jonathan details the new Apple pricing schedule for the iCloud.

Now you can leak your photographs for less!

Cheap storage = Cheap security.

Is there anything about that statement that is unclear?

QPDF – PDF Transformations

Filed under: PDF,Text Mining — Patrick Durusau @ 9:48 am

QPDF – PDF Transformations

From the webpage:

QPDF is a command-line program that does structural, content-preserving transformations on PDF files. It could have been called something like pdf-to-pdf. It also provides many useful capabilities to developers of PDF-producing software or for people who just want to look at the innards of a PDF file to learn more about how they work.

QPDF is capable of creating linearized (also known as web-optimized) files and encrypted files. It is also capable of converting PDF files with object streams (also known as compressed objects) to files with no compressed objects or to generate object streams from files that don’t have them (or even those that already do). QPDF also supports a special mode designed to allow you to edit the content of PDF files in a text editor….

Government agencies often publish information in PDF, which often carries restrictions on copying and printing.

I have briefly tested QPDF and it does take care of copying and printing restrictions. Be aware that QPDF has many other capabilities as well.
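
If you have a directory of such files, a few lines of Python will drive QPDF over all of them. A rough sketch, assuming the qpdf binary is on your PATH (the --decrypt and --linearize options are documented, but check your installed version):

```python
# Strip encryption (and the copy/print restrictions that ride along with it)
# from every PDF in a directory by shelling out to qpdf.
import pathlib
import subprocess

def strip_restrictions(src_dir="pdfs", dst_dir="clean"):
    dst = pathlib.Path(dst_dir)
    dst.mkdir(exist_ok=True)
    for pdf in sorted(pathlib.Path(src_dir).glob("*.pdf")):
        out = dst / pdf.name
        subprocess.run(["qpdf", "--decrypt", "--linearize", str(pdf), str(out)],
                       check=True)
        print("wrote", out)

if __name__ == "__main__":
    strip_restrictions()
```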

Recursive Deep Learning For Natural Language Processing And Computer Vision

Filed under: Deep Learning,Machine Learning,Natural Language Processing — Patrick Durusau @ 5:28 am

Recursive Deep Learning For Natural Language Processing And Computer Vision by Richard Socher.

From the abstract:

As the amount of unstructured text data that humanity produces overall and on the Internet grows, so does the need to intelligently process it and extract different types of knowledge from it. My research goal in this thesis is to develop learning models that can automatically induce representations of human language, in particular its structure and meaning, in order to solve multiple higher level language tasks.

There has been great progress in delivering technologies in natural language processing such as extracting information, sentiment analysis or grammatical analysis. However, solutions are often based on different machine learning models. My goal is the development of general and scalable algorithms that can jointly solve such tasks and learn the necessary intermediate representations of the linguistic units involved. Furthermore, most standard approaches make strong simplifying language assumptions and require well designed feature representations. The models in this thesis address these two shortcomings. They provide effective and general representations for sentences without assuming word order independence. Furthermore, they provide state of the art performance with no, or few, manually designed features.

The new model family introduced in this thesis is summarized under the term Recursive Deep Learning. The models in this family are variations and extensions of unsupervised and supervised recursive neural networks (RNNs) which generalize deep and feature learning ideas to hierarchical structures. The RNN models of this thesis obtain state of the art performance on paraphrase detection, sentiment analysis, relation classification, parsing, image-sentence mapping and knowledge base completion, among other tasks.
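
For readers new to the family, here is a toy numpy sketch of the composition step at the heart of a recursive neural network: two child vectors are concatenated and squashed through a learned matrix to give a parent vector of the same size, applied bottom-up over a parse tree. My illustration, not Socher's code; the thesis models elaborate on this considerably.

```python
import numpy as np

d = 4                                 # word/phrase vector dimensionality
rng = np.random.default_rng(0)
W = rng.standard_normal((d, 2 * d))   # composition matrix (learned in practice)
b = np.zeros(d)

def compose(left, right):
    """Combine two child vectors into a parent vector of the same size."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

# compose "(not (very good))" bottom-up from toy word vectors
very, good, negation = (rng.standard_normal(d) for _ in range(3))
phrase = compose(very, good)
sentence = compose(negation, phrase)
print(sentence)
```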

Socher’s models offer two significant advances:

  • No assumption of word order independence
  • No or few manually designed features

Of the two, I am more partial to eliminating the assumption of word order independence, in part because I see it leading to abandoning the assumption that words have some fixed meaning separate and apart from the other words used to define them.

Or in topic maps parlance, identifying a subject always involves the use of other subjects, which are themselves capable of being identified. Think about it. When was the last time you were called upon to identify a person, object or thing and you uttered an IRI? Never, right?

That certainly works, at least in closed domains, in some cases, but other than simply repeating the string, you have no basis on which to conclude that is the correct IRI. Nor does anyone else have a basis to accept or reject your IRI.

I suppose that is another one of those “simplifying” assumptions. Useful in some cases but not all.

September 9, 2014

OceanColor Web

Filed under: NASA,Oceanography — Patrick Durusau @ 7:36 pm

OceanColor Web

A remarkable source for ocean color data and software for analysis of that data.

From the webpage:

This project creates a variety of established and new ocean color products for evaluation as candidates to become Earth Science Data Records.

Not directly relevant to anything I’m working on but I don’t know what environmental or oceanography projects you are pursuing.

I first saw this in a tweet by Rob Simmon.

PLOS Resources on Ebola

Filed under: Bioinformatics,Open Access,Open Data — Patrick Durusau @ 7:09 pm

PLOS Resources on Ebola by Virginia Barbour and PLOS Collections.

From the post:

The current Ebola outbreak in West Africa probably began in Guinea in 2013, but it was only recognized properly in early 2014 and shows, at the time of writing, no sign of subsiding. The continuous human-to-human transmission of this new outbreak virus has become increasingly worrisome.

Analyses thus far of this outbreak mark it as the most serious in recent years and the effects are already being felt far beyond those who are infected and dying; whole communities in West Africa are suffering because of its negative effects on health care and other infrastructures. Globally, countries far removed from the outbreak are considering their local responses, were Ebola to be imported; and the ripple effects on the normal movement of trade and people are just becoming apparent.

A great collection of PLOS resources on Ebola.

Even usual closed sources are making Ebola information available for free:

Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak (Science DOI: 10.1126/science.1259657) This is the gene sequencing report that establishes that one (1) person ate infected bush meat and is the source of all the following Ebola infections.

So much for needing highly specialized labs to “weaponize” biological agents. One infection is likely to result in > 20,000 deaths. You do the math.

I first saw this in a tweet by Alex Vespignani.

Instructor Hangouts, Landing this Friday

Filed under: Communication,Teaching — Patrick Durusau @ 6:54 pm

Instructor Hangouts, Landing this Friday by Bill Mills.

From the post:

I’m pleased to announce that this Friday, 12 Sept, will be the first offering of a new twice-monthly event hosted by the Mozilla Science Lab: Instructor Hangouts.

There is a growing number of fantastic workshop-oriented software and data education programs out there to spool learners up on the ideas and skills we’d like to share. In our Instructor Hangouts, instructors from all these initiatives will be invited to share ideas, discuss skills and strategies, and most importantly, get to know one another. By being part of a diverse but federated community, we all stand to learn from each other’s experience.

Happening twice every other Friday, at 9 AM and 9 PM Pacific (UTC -8), I’ll be inviting instructors from Software Carpentry, Ladies Learning Code, rOpenSci, PyLadies, the School of Data, Code4Lib and more to spend an hour welcoming new instructors to our ranks, discussing lessons learned from recent workshops, and diving deep on all matters of instruction.

Cool!

If you want to educate others, this is a perfect opportunity to sharpen your skills.

I first saw this in a tweet by Mozilla Science Lab.

Activator Template of the Month: Atomic Scala

Filed under: Programming,Scala — Patrick Durusau @ 6:27 pm

Activator Template of the Month: Atomic Scala by Dick Wall.

From the post:

As readers of the Typesafe newsletter may know, every month we promote an Activator template that embodies qualities we look for in tutorials and topics. This month’s template is a return to first principles of Activator as an experimentation and discovery tool. While the selection may be below your level, gentle reader, I will bet as an existing Scala developer, at least someone has asked you at some point how to go about learning Scala from the ground up, or even better, how to go about learning to program.

The Atomic Scala Examples activator template, authored by Bruce Eckel (one of the book authors) is a companion to the book but is useful in its own right (i.e. while I recommend the book, you don’t have to buy it to find the template useful). The template takes each of the examples in the book and provides them in an executable and easily runnable form. If you want to help someone to learn to program, and even better to do so using Scala, here’s your template. You can also download the first 100 pages of the book for free if you want to as well.

Whether this is a technique that works for you or your student(s) won’t be known unless you try.

Enjoy!

I first saw this in a tweet by TypeSafe.

BootCaT: Simple Utilities to Bootstrap Corpora And Terms from the Web

Filed under: Corpora,Natural Language Processing,WWW — Patrick Durusau @ 6:10 pm

BootCaT: Simple Utilities to Bootstrap Corpora And Terms from the Web

From the webpage:

Despite certain obvious drawbacks (e.g. lack of control, sampling, documentation etc.), there is no doubt that the World Wide Web is a mine of language data of unprecedented richness and ease of access.

It is also the only viable source of “disposable” corpora built ad hoc for a specific purpose (e.g. a translation or interpreting task, the compilation of a terminological database, domain-specific machine learning tasks). These corpora are essential resources for language professionals who routinely work with specialized languages, often in areas where neologisms and new terms are introduced at a fast pace and where standard reference corpora have to be complemented by easy-to-construct, focused, up-to-date text collections.

While it is possible to construct a web-based corpus through manual queries and downloads, this process is extremely time-consuming. The time investment is particularly unjustified if the final result is meant to be a single-use corpus.

The command-line scripts included in the BootCaT toolkit implement an iterative procedure to bootstrap specialized corpora and terms from the web, requiring only a list of “seeds” (terms that are expected to be typical of the domain of interest) as input.

In implementing the algorithm, we followed the old UNIX adage that each program should do only one thing, but do it well. Thus, we developed a small, independent tool for each separate subtask of the algorithm.

As a result, BootCaT is extremely modular: one can easily run a subset of the programs, look at intermediate output files, add new tools to the suite, or change one program without having to worry about the others.
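
The first step of that iterative procedure is easy to picture. A sketch (mine, not from the toolkit) of turning a flat list of seeds into random n-term search queries, which the later BootCaT programs would submit, download and clean:

```python
# Turn seed terms into random 3-term query tuples, the starting point of
# the bootstrap loop. The seeds here are placeholders for a domain of interest.
import random

def seed_queries(seeds, tuple_size=3, n_queries=10, seed=42):
    rng = random.Random(seed)
    queries = set()
    while len(queries) < n_queries:
        queries.add(tuple(sorted(rng.sample(seeds, tuple_size))))
    return [" ".join(q) for q in sorted(queries)]

seeds = ["ebola", "outbreak", "transmission", "epidemiology",
         "virus", "quarantine", "vaccine", "hemorrhagic"]
for q in seed_queries(seeds):
    print(q)
```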

Any application following “the old UNIX adage that each program should do only one thing, but do it well” merits serious consideration.

Occurs to me that BootCaT would also be useful for creating small text collections for comparison to each other.

Enjoy!

I first saw this in a tweet by Alyona Medelyan.

The American Yawp [Free History Textbook]

Filed under: History,Texts — Patrick Durusau @ 5:49 pm

The American Yawp [Free History Textbook], Editors: Joseph Locke, University of Houston-Victoria and Ben Wright, Abraham Baldwin Agricultural College.

From the about page:

In an increasingly digital world in which pedagogical trends are de-emphasizing rote learning and professors are increasingly turning toward active-learning exercises, scholars are fleeing traditional textbooks. Yet for those that still yearn for the safe tether of a synthetic text, as either narrative backbone or occasional reference material, The American Yawp offers a free and online, collaboratively built, open American history textbook designed for college-level history courses. Unchecked by profit motives or business models, and free from for-profit educational organizations, The American Yawp is by scholars, for scholars. All contributors—experienced college-level instructors—volunteer their expertise to help democratize the American past for twenty-first century classrooms.

The American Yawp constructs a coherent and accessible narrative from all the best of recent historical scholarship. Without losing sight of politics and power, it incorporates transnational perspectives, integrates diverse voices, recovers narratives of resistance, and explores the complex process of cultural creation. It looks for America in crowded slave cabins, bustling markets, congested tenements, and marbled halls. It navigates between maternity wards, prisons, streets, bars, and boardrooms. Whitman’s America, like ours, cut across the narrow boundaries that strangle many narratives. Balancing academic rigor with popular readability, The American Yawp offers a multi-layered, democratic alternative to the American past.

In “beta” now but worth your time to read, comment and possibly contribute. I skimmed to a couple of events that I remember quite clearly and I can’t say the text (yet) captures the tone of the time.

For example, the Chicago Police Riot in 1968 gets a bare two paragraphs in Chapter 27, The Sixties. In the same chapter, 1967, the long hot summer when the cities burned, was over in a sentence.

I am sure the author(s) of that chapter were trying to keep the text to some reasonable length and avoid the death by details I encountered in my college American history textbook so many years ago.

Still, given the wealth of materials online, written, audio, video, expanding the text and creating exploding sub-themes (topic maps anyone?) on particular subjects would vastly enhance this project.

PS: If you want a small flavor of what could be incorporated via hyperlinks, see: http://abbiehoffman.org/ and the documents, such as FBI documents, at that site.

A document classifier for medicinal chemistry publications trained on the ChEMBL corpus

A document classifier for medicinal chemistry publications trained on the ChEMBL corpus by George Papadatos, et al. (Journal of Cheminformatics 2014, 6:40)

Abstract:

Background

The large increase in the number of scientific publications has fuelled a need for semi- and fully automated text mining approaches in order to assist in the triage process, both for individual scientists and also for larger-scale data extraction and curation into public databases. Here, we introduce a document classifier, which is able to successfully distinguish between publications that are ‘ChEMBL-like’ (i.e. related to small molecule drug discovery and likely to contain quantitative bioactivity data) and those that are not. The unprecedented size of the medicinal chemistry literature collection, coupled with the advantage of manual curation and mapping to chemistry and biology make the ChEMBL corpus a unique resource for text mining.

Results

The method has been implemented as a data protocol/workflow for both Pipeline Pilot (version 8.5) and KNIME (version 2.9) respectively. Both workflows and models are freely available at: ftp://ftp.ebi.ac.uk/pub/databases/chembl/text-mining. These can be readily modified to include additional keyword constraints to further focus searches.

Conclusions

Large-scale machine learning document classification was shown to be very robust and flexible for this particular application, as illustrated in four distinct text-mining-based use cases. The models are readily available on two data workflow platforms, which we believe will allow the majority of the scientific community to apply them to their own data.

While the abstract mentions “the triage process,” it fails to capture the main goal of this paper:

…the main goal of our project diverges from the goal of the tools mentioned. We aim to meet the following criteria: ranking and prioritising the relevant literature using a fast and high performance algorithm, with a generic methodology applicable to other domains and not necessarily related to chemistry and drug discovery. In this regard, we present a method that builds upon the manually collated and curated ChEMBL document corpus, in order to train a Bag-of-Words (BoW) document classifier.

In more detail, we have employed two established classification methods, namely Naïve Bayesian (NB) and Random Forest (RF) approaches [12]-[14]. The resulting classification score, henceforth referred to as ‘ChEMBL-likeness’, is used to prioritise relevant documents for data extraction and curation during the triage process.

In other words, the focus of this paper is a classifier to help prioritize curation of papers. I take that as being different from classifiers used at other stages or for other purposes in the curation process.
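
If you want to see the shape of such a classifier without Pipeline Pilot or KNIME, here is a minimal scikit-learn sketch of a Bag-of-Words model with Naive Bayes. The training snippets and labels are placeholders; the paper trains on the curated ChEMBL corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "IC50 values were determined for a series of kinase inhibitors",
    "binding affinity of the novel antagonists was measured",
    "we survey the history of the department's outreach programme",
    "a qualitative study of laboratory safety training",
]
train_labels = [1, 1, 0, 0]   # 1 = 'ChEMBL-like', 0 = not

clf = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
clf.fit(train_texts, train_labels)

new = ["potency of the inhibitors was assessed in a dose-response assay"]
print(clf.predict(new), clf.predict_proba(new))  # a crude 'ChEMBL-likeness'
```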

I first saw this in a tweet by ChemConnector.

September 8, 2014

Python-ZPar – Python Wrapper for ZPAR

Filed under: Chinese,Language,Natural Language Processing,Parsers — Patrick Durusau @ 7:05 pm

Python-ZPar – Python Wrapper for ZPAR by Nitin Madnani.

From the webpage:

python-zpar is a python wrapper around the ZPar parser. ZPar was written by Yue Zhang while he was at Oxford University. According to its home page: ZPar is a statistical natural language parser, which performs syntactic analysis tasks including word segmentation, part-of-speech tagging and parsing. ZPar supports multiple languages and multiple grammar formalisms. ZPar has been most heavily developed for Chinese and English, while it provides generic support for other languages. ZPar is fast, processing above 50 sentences per second using the standard Penn Treebank (Wall Street Journal) data.

I wrote python-zpar since I needed a fast and efficient parser for my NLP work which is primarily done in Python and not C++. I wanted to be able to use this parser directly from Python without having to create a bunch of files and running them through subprocesses. python-zpar not only provides a simple python wrapper but also provides an XML-RPC ZPar server to make batch-processing of large files easier.

python-zpar uses ctypes, a very cool foreign function library bundled with Python that allows calling functions in C DLLs or shared libraries directly.

Just in case you are looking for a language parser for Chinese or English.

It is only a matter of time before commercial opportunities force greater attention to non-English languages. Forewarned is forearmed.

Visualizing Website Pathing With Network Graphs

Filed under: Graphs,Networks,R,Visualization — Patrick Durusau @ 6:54 pm

Visualizing Website Pathing With Network Graphs by Randy Zwitch.

From the post:

Last week, version 1.4 of RSiteCatalyst was released, and now it’s possible to get site pathing information directly within R. Now, it’s easy to create impressive looking network graphs from your Adobe Analytics data using RSiteCatalyst and d3Network. In this blog post, I will cover simple and force-directed network graphs, which show the pairwise representation between pages. In a follow-up blog post, I will show how to visualize longer paths using Sankey diagrams, also from the d3Network package.

Great technical details and examples but also worth the read for:

I’m not going to lie, all three of these diagrams are hard to interpret. Like wordclouds, network graphs can often be visually interesting, yet difficult to ascertain any concrete information. Network graphs also have the tendency to reinforce what you already know (you or someone you know designed your website, you should already have a feel for its structure!).

Randy does spot some patterns but working out what those patterns “mean” remains for further investigation.

Hairball graph visualizations can be a starting point for the hard work that extracts actionable intelligence.
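
If you want to play with the same pairwise idea outside of R, a small Python sketch with networkx will get you a weighted, directed page-to-page graph to poke at. The transition counts below are made up; the post itself uses RSiteCatalyst and d3Network.

```python
import networkx as nx

# (previous page, next page, number of observed transitions)
transitions = [
    ("home", "blog", 120),
    ("home", "about", 45),
    ("blog", "post-1", 80),
    ("blog", "post-2", 60),
    ("post-1", "contact", 10),
]

G = nx.DiGraph()
for src, dst, n in transitions:
    G.add_edge(src, dst, weight=n)

# heaviest edges first: a rough stand-in for the visual inspection step
for src, dst, n in sorted(G.edges(data="weight"), key=lambda e: e[2], reverse=True):
    print(f"{src} -> {dst}: {n}")
```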

Speakers, Clojure/conj 2014 Washington, D.C. Nov 20-22

Filed under: Clojure,Conferences,Uncategorized — Patrick Durusau @ 6:39 pm

Speakers, Clojure/conj 2014 Washington, D.C. Nov 20-22

Hyperlinks for authors point to Twitter profile pages; the title of each paper follows:

Jeanine Adkisson Variants are Not Unions

Bozhidar Batsov The evolution of the Emacs tooling for Clojure

Lucas Cavalcanti Exploring Four Hidden Superpowers of Datomic

Colin Fleming Cursive: a different type of IDE

Julian Gamble Applying the paradigms of core.async in ClojureScript

Brian Goetz Keynote

Paul deGrandis Unlocking Data-Driven Systems

Nathan Herzing Helping voters with Pedestal, Datomic, Om and core.async

Rich Hickey Transducers

Ashton Kemerling Generative Integration Tests.

Michał Marczyk Persistent Data Structures for Special Occasions

Steve Miner Generating Generators

Zach Oakes Making Games at Runtime with Clojure

Anna Pawlicka Om nom nom nom

David Pick Building a Data Pipeline with Clojure and Kafka

Ghadi Shayban JVM Creature Comforts

Chris Shea Helping voters with Pedestal, Datomic, Om and core.async

Zach Tellman Always Be Composing

Glenn Vanderburg Cló: The Algorithms of TeX in Clojure

Edward Wible Exploring Four Hidden Superpowers of Datomic

Steven Yi Developing Music Systems on the JVM with Pink and Score

Abstracts for the papers appear here.

Obviously a great conference to attend but at a minimum, you have a great list of twitter accounts to follow on cutting edge Clojure news!

I first saw this in a tweet by Alex Miller.

Demystifying The Google Knowledge Graph

Filed under: Entities,Google Knowledge Graph,Search Engines,Searching — Patrick Durusau @ 3:28 pm

Demystifying The Google Knowledge Graph by Barbara Starr.

[knowledge graph illustration]

Barbara covers:

  • Explicit vs. Implicit Entities (and how to determine which is which on your webpages)
  • How to improve your chances of being in “the Knowledge Graph” using Schema.org and JSON-LD (a minimal example follows this list).
  • Thinking about “things, not strings.”
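
To make the second point concrete, this is roughly what Schema.org markup in JSON-LD looks like, embedded in a page inside a script element of type application/ld+json. Every value here is hypothetical; the point is the shared vocabulary, not the particulars.

```json
{
  "@context": "http://schema.org",
  "@type": "Event",
  "name": "Example Jazz Quartet Live",
  "startDate": "2014-10-25T20:00",
  "location": {
    "@type": "Place",
    "name": "Example Concert Hall",
    "address": "123 Main Street, Springfield"
  },
  "offers": {
    "@type": "Offer",
    "url": "http://example.com/tickets/12345",
    "price": "25.00",
    "priceCurrency": "USD"
  }
}
```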

Is there something special about “events?” I remember the early Semantic Web motivating examples being the scheduling of tennis matches between colleagues. The examples here are of sporting and music events.

If your users don’t know how to use TicketMaster, repeating delivery of that data on your site isn’t going to help them.

On the other hand, this is a good reminder to extract from Schema.org all the “types” that would be useful for my blog.

PS: A “string” doesn’t become a “thing” simply because it has a longer token. Having an agreed upon “longer token” from a vocabulary such as Schema.org does provide more precise identification than an unadorned “string.”

Having said that, the power of having several key/value pairs and a declaration of which ones must, may or must not match, should be readily obvious. Particularly when those keys and values may themselves be collections of key/value pairs.

Shoothill GaugeMap

Filed under: Mapping,Maps — Patrick Durusau @ 12:47 pm

Shoothill GaugeMap

From the about page:

The Shoothill GaugeMap is the first interactive map with live river level data from over 2,400 Environment Agency and Natural Resources Wales river level gauges in England and Wales.

The extensive network of river level gauges across England and Wales covers all the major rivers as well as many smaller rivers, streams and brooks. The data displayed on each of the river level gauges on GaugeMap is recorded at 15 minute intervals by the Environment Agency and Natural Resources Wales.

For more information on how to use GaugeMap, please see the help file.

If you live in England or Wales and are concerned about potential flooding, this may be the map for you!

The map reports data from 2400 river level gauges and you can follow individual gauges via Twitter.

I first saw this in a tweet by Rod Plummer.

In case you are interested in other river gauge information:

RiverGauges.com USA only but excludes most of Texas, Georgia, Florida, and most of the Eastern seaboard. Not sure why. Has historical and current data.

RiverApp USA and Europe, over 1,000 rivers. Can’t really comment since I don’t have a smart phone. (Contact me for a smail address if you want to donate a recent smart phone.)

Bringing chemical synthesis to the masses

Filed under: Chemistry,Crowd Sourcing — Patrick Durusau @ 12:27 pm

Bringing chemical synthesis to the masses by Michael Gross.

From the post:

You too can create thousands of new compounds and screen them for a desired activity. That is the promise of a novel approach to building chemical libraries, which only requires simple building blocks in water, without any additional reagents or sample preparation.

Jeffrey Bode from ETH Zurich and his co-worker Yi-Lin Huang took inspiration both from nature’s non-ribosomal peptide synthesis and from click chemistry. Nature uses specialised non-ribosomal enzymes to create a number of unusual peptides outside the normal paths of protein biosynthesis including, for instance, pharmaceutically relevant peptides like the antibiotic vancomycin. Bode and Huang have now produced these sorts of compounds without cells or enzymes, simply relying on the right chemistry.

Given the simplicity of the process and the absence of toxic reagents and by-products, Bode anticipates that it could even be widely used by non-chemists. ‘Our idea is to provide a quick way to make bioactive molecules just by mixing the components in water,’ Bode explains. ‘We would like to use this as a platform for chemistry that anyone can do, including scientists in other fields, high school students and farmers. Anyone could prepare libraries in a few hours with a micropipette, explore different combinations of building blocks and culture conditions along with simple assays to find novel molecules.’

Bode either wasn’t a humanities major or he missed the class on keeping lay people away from routine tasks. Everyone knows that routine tasks, like reading manuscripts, must be reserved for graduate students under the fiction that only an “expert” can read non-printed material.

To be fair, there are manuscript characters or usages that require an expert opinion but those can be quickly isolated by statistical analysis of disagreement between different readers. Assuming effective transcription interfaces for manuscripts and a large enough body of readers.

That would reduce the number of personal fiefdoms built on access to particular manuscripts but that prospect finds me untroubled.

You can imagine the naming issues that will ensue from widespread chemical synthesis by the masses. But there is too much to be discovered to be miserly with the means of discovery or the dissemination of those results.

Accelerate Machine Learning with cuDNN Deep Neural Network Library

Filed under: GPU,Neural Networks,NVIDIA — Patrick Durusau @ 10:23 am

Accelerate Machine Learning with the cuDNN Deep Neural Network Library by Larry Brown.

From the post:

Introducing cuDNN

NVIDIA cuDNN is a GPU-accelerated library of primitives for DNNs. It provides tuned implementations of routines that arise frequently in DNN applications, such as:

  • convolution
  • pooling
  • softmax
  • neuron activations, including:
    • Sigmoid
    • Rectified linear (ReLU)
    • Hyperbolic tangent (TANH)

Of course these functions all support the usual forward and backward passes. cuDNN’s convolution routines aim for performance competitive with the fastest GEMM-based (matrix multiply) implementations of such routines while using significantly less memory.

cuDNN features customizable data layouts, supporting flexible dimension ordering, striding and subregions for the 4D tensors used as inputs and outputs to all of its routines. This flexibility allows easy integration into any neural net implementation and avoids the input/output transposition steps sometimes necessary with GEMM-based convolutions.

cuDNN is thread safe, and offers a context-based API that allows for easy multithreading and (optional) interoperability with CUDA streams. This allows the developer to explicitly control the library setup when using multiple host threads and multiple GPUs, and ensure that a particular GPU device is always used in a particular host thread (for example).

cuDNN allows DNN developers to easily harness state-of-the-art performance and focus on their application and the machine learning questions, without having to write custom code. cuDNN works on Windows or Linux OSes, and across the full range of NVIDIA GPUs, from low-power embedded GPUs like Tegra K1 to high-end server GPUs like Tesla K40. When a developer leverages cuDNN, they can rest assured of reliable high performance on current and future NVIDIA GPUs, and benefit from new GPU features and capabilities in the future.
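
For reference, the neuron activations named above are just these functions, written out here in plain numpy. This is only the math; cuDNN's value is that it supplies tuned GPU kernels for them (and for convolution, pooling and softmax) behind its C API.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def tanh(x):
    return np.tanh(x)

def softmax(x):
    e = np.exp(x - np.max(x))   # shift for numerical stability
    return e / e.sum()

x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
for f in (sigmoid, relu, tanh, softmax):
    print(f.__name__, f(x))
```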

I didn’t quote the background and promotional material on machine learning or deep neural networks (DNN’s), assuming that if you are interested at all, you will read the original post to pick up that material. Attention has been paid to making cuDNN “easy” to use. “Easy” is a relative term but I think you will appreciate the effort.

BTW, cuDNN is free for any purpose but does require you to have a registered CUDA developer account. If you are already a registered CUDA developer or after you are, see: http://developer.nvidia.com/cuDNN

Caffe, a deep learning framework, has support for cuDNN in its current development branch.

I first saw this in a tweet by Mark Harris.

September 7, 2014

Lgram

Filed under: Linguistics,N-Grams — Patrick Durusau @ 6:51 pm

Lgram: A memory-efficient ngram builder

From the webpage:

Lgram is a cross-platform tool for calculating ngrams in a memory-efficient manner. The current crop of n-gram tools has non-constant memory usage, so ngrams cannot be computed for large input texts. Given the prevalence of large texts in computational and corpus linguistics, this deficit is problematic. Lgram has constant memory usage so it can compute ngrams on arbitrarily sized input texts. Lgram achieves constant memory usage by periodically syncing the computed ngrams to an sqlite database stored on disk.

Lgram was written by Edward J. L. Bell at Lancaster University and funded by UCREL. The project was initiated by Dr Paul Rayson.
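
This is not Lgram's code, just a sketch of the idea as described: keep only a bounded in-memory table of counts, flush partial counts to SQLite as the limit is reached, and let the database aggregate at the end, so memory stays roughly constant however large the input text.

```python
import sqlite3

def count_ngrams(tokens, n=2, flush_every=100000, db_path="ngrams.db"):
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS partial (gram TEXT, freq INTEGER)")
    buf = {}

    def flush():
        con.executemany("INSERT INTO partial VALUES (?, ?)", buf.items())
        con.commit()
        buf.clear()

    for i in range(len(tokens) - n + 1):
        gram = " ".join(tokens[i:i + n])
        buf[gram] = buf.get(gram, 0) + 1
        if len(buf) >= flush_every:
            flush()
    flush()

    return con.execute(
        "SELECT gram, SUM(freq) AS freq FROM partial "
        "GROUP BY gram ORDER BY freq DESC LIMIT 10").fetchall()

print(count_ngrams("to be or not to be , that is the question".split()))
```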

Not recent (2011) but new to me. Enjoy!

I first saw this in a tweet by Christopher Phipps.

NLTK 3.0 Is Out!

Filed under: NLTK,Python — Patrick Durusau @ 6:45 pm

NLTK 3.0

The online book has been updated: http://www.nltk.org/book/

Porting your code to NLTK 3.0
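
A quick smoke test for a fresh NLTK 3.0 install, as a sketch; the two download calls pull the tokenizer and tagger models on first use, and the exact resource names can vary between NLTK releases.

```python
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("NLTK 3.0 is out, and the online book has been updated.")
print(nltk.pos_tag(tokens))
```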

Enjoy!
