Another Word For It – Patrick Durusau on Topic Maps and Semantic Diversity

December 11, 2013

Semantic Web Rides Into the Sunset

Filed under: CSV,Data,XQuery — Patrick Durusau @ 8:16 pm

W3C’s Semantic Web Activity Folds Into New Data Activity by Jennifer Zaino.

From the post:

The World Wide Web Consortium has headline news today: The Semantic Web, as well as eGovernment, Activities are being merged and superseded by the Data Activity, where Phil Archer serves as Lead. Two new workgroups also have been chartered: CSV on the Web and Data on the Web Best Practices.

The new CSV on the Web Working Group is an important step in that direction, following on the heels of efforts such as R2RML. It’s about providing metadata about CSV files, such as column headings, data types, and annotations, and, with it, making it easily possible to convert CSV into RDF (or other formats), easing data integration. “The working group will define a metadata vocabulary and then a protocol for how to link data to metadata (presumably using HTTP Link headers) or embed the metadata directly. Since the links between data and metadata can work in either direction, the data can come from an API that returns tabular data just as easily as it can a static file,” says Archer. “It doesn’t take much imagination to string together a tool chain that allows you to run SPARQL queries against ’5 Star Data’ that’s actually published as a CSV exported from a spreadsheet.”

The Data on the Web Best Practices working group, he explains, will not define any new technologies but will guide data publishers (government, research scientists, cultural heritage organizations) to better use the Web as a data platform. Additionally, the Data Activity, as well as the new Digital Publishing Activity that will be led by former Semantic Web Activity Lead Ivan Herman, are now in a new domain called the Information and Knowledge Domain (INK), led by Ralph Swick.
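
To make Archer’s “link data to metadata” sketch concrete, here is a speculative illustration in Python. The URL, the rel value (“describedby”) and the metadata shape are all my assumptions; the working group had not defined any of them when this was written.

    # Speculative sketch only: the metadata vocabulary and the link relation
    # were not yet defined, so the rel value and metadata keys are guesses.
    import csv
    import io
    import requests

    resp = requests.get("https://example.org/populations.csv")  # hypothetical URL

    # requests exposes parsed Link headers as resp.links, keyed by rel.
    meta_url = resp.links.get("describedby", {}).get("url")
    if meta_url:
        metadata = requests.get(meta_url).json()
        print("column metadata:", metadata.get("columns"))

    rows = list(csv.DictReader(io.StringIO(resp.text)))
    print("first row:", rows[0] if rows else None)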

I will spare you all the tedious justification by Phil Archer of the Semantic Web venture.

The W3C is also the home of XSLT, XPath, XQuery, and other standards that require no defense or justification.

Maybe we will all get lucky and the CSV on the Web and Data on the Web Best Practices working groups will prove successful at the W3C.

Neo4j Cypher Refcard 2.0

Filed under: Cypher,Graphs,Neo4j — Patrick Durusau @ 5:54 pm

Neo4j Cypher Refcard 2.0

From the webpage:

Key principles and capabilities of Cypher are as follows:

  • Cypher matches patterns of nodes and relationships in the graph, to extract information or modify the data.
  • Cypher has the concept of identifiers which denote named, bound elements and parameters.
  • Cypher can create, update, and remove nodes, relationships, labels, and properties.
  • Cypher manages indexes and constraints.

You can try Cypher snippets live in the Neo4j Console at console.neo4j.org or read the full Cypher documentation at docs.neo4j.org. For live graph models using Cypher check out GraphGist.
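
If you want to run Refcard snippets outside the console, Neo4j 2.0’s transactional HTTP endpoint takes Cypher as plain JSON. A minimal sketch, assuming a default local install with no authentication:

    # Send one parameterized Cypher statement to a local Neo4j 2.0 server.
    # The {name} parameter syntax matches the 2.0-era endpoint; newer servers
    # use $name and a different endpoint, so adjust for your version.
    import requests

    payload = {"statements": [{
        "statement": "MERGE (p:Person {name: {name}}) RETURN p",
        "parameters": {"name": "Emil"},
    }]}

    resp = requests.post("http://localhost:7474/db/data/transaction/commit",
                         json=payload)
    print(resp.json())   # {'results': [...], 'errors': []}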

If you plan on entering the Neo4j GraphGist December Challenge, you are probably going to need this Refcard.

I first saw this in a tweet by Peter Neubauer.

December 10, 2013

ThisPlusThat.me: [Topic Vectors?]

Filed under: Search Algorithms,Search Engines,Searching,Vectors — Patrick Durusau @ 7:30 pm

ThisPlusThat.me: A Search Engine That Lets You ‘Add’ Words as Vectors by Christopher Moody.

From the post:

Natural language isn’t that great for searching. When you type a search query into Google, you miss out on a wide spectrum of human concepts and human emotions. Queried words have to be present in the web page, and then those pages are ranked according to the number of inbound and outbound links. That’s great for filtering out the cruft on the internet — and there’s a lot of that out there. What it doesn’t do is understand the relationships between words and understand the similarities or dissimilarities.

That’s where ThisPlusThat.me comes in — a search site I built to experiment with the word2vec algorithm recently released by Google. word2vec allows you to add and subtract concepts as if they were vectors, and get out sensible, and interesting results. I applied it to the Wikipedia corpus, and in doing so, tried creating an interactive search site that would allow users to put word2vec through its paces.

For example, word2vec allows you to evaluate a query like King – Man + Woman and get the result Queen. This means you can do some totally new searches.

… (examples omitted)

word2vec is a type of distributed word representation algorithm that trains a neural network in order to assign a vector to every word. Each of the dimensions in the vector tries to encapsulate some property of the word. Crudely speaking, one dimension could encode that man, woman, king and queen are all ‘people,’ whereas other dimensions could encode associations or dissociations with ‘royalty’ and ‘gender’. These traits are learned by trying to predict the context in a sentence and learning from correct and incorrect guesses.

Precisely!!!

😉

Doing it with word2vec requires large training sets of data. No doubt a useful venture if you are seeking to discover or document the word vectors in a domain.
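
For concreteness, the King – Man + Woman query looks like this with the gensim implementation of word2vec. This is a toy sketch: a three-sentence corpus obviously won’t reproduce Wikipedia-scale results, and newer gensim versions spell the parameters differently (vector_size, model.wv.most_similar).

    # Toy sketch of word2vec vector arithmetic with gensim (older API).
    from gensim.models import Word2Vec

    sentences = [
        ["the", "king", "rules", "the", "kingdom"],
        ["the", "queen", "rules", "the", "kingdom"],
        ["a", "man", "and", "a", "woman", "walk"],
    ]

    model = Word2Vec(sentences, min_count=1, size=50, seed=42)

    # "King - Man + Woman" as arithmetic over the learned vectors.
    print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=3))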

But what if you wanted to declare vectors for words?

And then run word2vec (or something very similar) across the declared vectors.

Think along the lines of a topic map construct that has a “word” property with a non-null value. All the properties that follow are key/value pairs representing the positive and negative dimensions that give that word meaning.

Associations are collections of vector sums that identify subjects that take part in an association.

If we do all addressing by vector sums, we lose the need to track and collect system identifiers.
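
A minimal sketch of the idea. Every dimension name and value below is invented for illustration; the point is only that declared key/value dimensions support the same vector arithmetic and that a vector sum can serve as an address.

    # Hypothetical "declared" word vectors: each dimension is a key/value pair
    # asserted by an author rather than learned from a corpus.
    import numpy as np

    DIMENSIONS = ["person", "royalty", "gender"]   # invented for illustration

    declared = {
        "king":  {"person": 1.0, "royalty": 1.0, "gender": -1.0},
        "queen": {"person": 1.0, "royalty": 1.0, "gender": 1.0},
        "man":   {"person": 1.0, "royalty": 0.0, "gender": -1.0},
        "woman": {"person": 1.0, "royalty": 0.0, "gender": 1.0},
    }

    def vec(word):
        return np.array([declared[word].get(d, 0.0) for d in DIMENSIONS])

    # A vector-sum "address" built from declared vectors.
    address = vec("king") - vec("man") + vec("woman")

    def nearest(target):
        # Resolve an address to the closest declared word (cosine similarity).
        scores = {w: float(np.dot(vec(w), target) /
                           (np.linalg.norm(vec(w)) * np.linalg.norm(target)))
                  for w in declared}
        return max(scores, key=scores.get)

    print(nearest(address))   # -> "queen"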

I think this could have legs.

Comments?

PS: For efficiency reasons, I suspect we should allow storage of computed vector sum(s) on a construct. But that would not prohibit another analysis reaching a different vector sum for different purposes.

Supercomputing on the cheap with Parallella

Filed under: HPC,Parallel Programming,Parallela,Parallelism,Supercomputing — Patrick Durusau @ 5:29 pm

Supercomputing on the cheap with Parallella by Federico Lucifredi.

From the post:

Packing impressive supercomputing power inside a small credit card-sized board running Ubuntu, Adapteva’s $99 ARM-based Parallella system includes the unique Epiphany numerical accelerator that promises to unleash industrial strength parallel processing on the desktop at a rock-bottom price. The Massachusetts-based startup recently ran a successfully funded Kickstarter campaign and gained widespread attention only to run into a few roadblocks along the way. Now, with their setbacks behind them, Adapteva is slated to deliver its first units mid-December 2013, with volume shipping in the following months.

What makes the Parallella board so exciting is that it breaks new ground: imagine an Open Source Hardware board, powered by just a few Watts of juice, delivering 90 GFLOPS of number crunching. Combine this with the possibility of clustering multiple boards, and suddenly the picture of an exceedingly affordable desktop supercomputer emerges.

This review looks in-depth at a pre-release prototype board (so-called Generation Zero, a development run of 50 units), giving you a pretty complete overview of what the finished board will look like.

Whether you participate in this aspect of the computing revolution or not, you will be impacted by it.

The more successful Parallella and similar efforts become at bringing supercomputing to the desktop, the more pressure there will be on cloud computing providers to match those capabilities at lower prices.

Another point of impact will be non-production experimentation with parallel processing, which may, like Thomas Edison, discover (or re-discover) 10,000 ways that don’t work but also the one that far exceeds anyone’s expectations.

That is to say that supercomputing will become cheap enough to tolerate frequent failure while experimenting with it.

What would you like to invent for supercomputing?

Paginated Collections with Ember.js + Solr + Rails

Filed under: Interface Research/Design,Solr — Patrick Durusau @ 5:11 pm

Paginated Collections with Ember.js + Solr + Rails by Eduardo Figarola.

From the post:

This time, I would like to show you how to add a simple pagination helper to your Ember.js application.

For this example, I will be using Rails + Solr for the backend and Ember.js as my frontend framework.

I am doing this with Rails and Solr, but you can do it using other backend frameworks, as long as the JSON response resembles what we have here:
….

I mention this just on the off-chance that you will encounter users requesting pagination.

I’m not sure anything beyond page 1 and page 2 is needed in most cases.

I remember reading a study of query behavior on PubMed: you had better have a disease that appears in the first two pages of results.

Anywhere beyond the first two pages, well, your family’s best hope is that you have life insurance.

If a client asks for more than two pages of results, I would suggest monitoring search query behavior for, say, six months.

Just to give them an idea of what anything beyond page two is really accomplishing.

HyperDex 1.0 Release

Filed under: Hashing,HyperDex,Hyperspace — Patrick Durusau @ 4:46 pm

HyperDex 1.0 Release

From the webpage:

We are proud to announce HyperDex 1.0.0. With this official release, we pass the 1.0 development milestone. Key features of this release are:

  • High Performance: HyperDex is fast. It outperforms MongoDB and Cassandra on industry-standard benchmarks by a factor of 2X or more.
  • Advanced Functionality: With the Warp add-on, HyperDex offers multi-key transactions that span multiple objects with ACID guarantees.
  • Strong Consistency: HyperDex ensures that every GET returns the result of the latest PUT.
  • Fault Tolerance: HyperDex automatically replicates data to tolerate a configurable number of failures.

  • Scalable: HyperDex automatically redistributes data to make use of new resources as you add more nodes to your cluster.

HyperDex runs on 64-bit Linux (Ubuntu, Debian, Fedora, Centos) and OS X. Binary packages for Debian 7, Ubuntu 12.04-13.10, Fedora 18-20, and CentOS 6 are available from the Downloads page[1], as well as source tarballs for other Linux platforms.

This release provides bindings for C, C++, Python, Java, Ruby, and Go.
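
Following the quick start, a first session with the Python binding looks roughly like this. It is a sketch: the coordinator address, the “phonebook” space and its “phone” attribute are assumptions, so check the HyperDex documentation for the space definition and the current API.

    # Rough sketch based on the HyperDex quick start; define the space first
    # with the admin tools described in the reference manual.
    import hyperdex.client

    c = hyperdex.client.Client("127.0.0.1", 1982)

    c.put("phonebook", "jsmith", {"phone": 6075551024})
    print(c.get("phonebook", "jsmith"))        # strong consistency: sees the PUT

    for obj in c.search("phonebook", {"phone": 6075551024}):
        print(obj)                             # search on a secondary attribute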

If that sounds good to you, drop by the Get HyperDex page.

See also: HyperDex Reference Manual v1.0.dev by Robert Escriva, Bernard Wong, and Emin Gün Sirer.

For the real story, see Papers and read HyperDex: A Distributed, Searchable Key-Value Store by Robert Escriva, Bernard Wong and Emin Gün Sirer.

The multidimensional aspects of HyperDex resemble recent efforts to move beyond surface tokens, otherwise known as words.

How to analyze 100 million images for $624

Filed under: Hadoop,Image Processing,OpenCV — Patrick Durusau @ 3:47 pm

How to analyze 100 million images for $624 by Pete Warden.

From the post:

Jetpac is building a modern version of Yelp, using big data rather than user reviews. People are taking more than a billion photos every single day, and many of these are shared publicly on social networks. We analyze these pictures to discover what they can tell us about bars, restaurants, hotels, and other venues around the world — spotting hipster favorites by the number of mustaches, for example.

[photo omitted]

Treating large numbers of photos as data, rather than just content to display to the user, is a pretty new idea. Traditionally it’s been prohibitively expensive to store and process image data, and not many developers are familiar with both modern big data techniques and computer vision. That meant we had to cut a path through some thick underbrush to get a system working, but the good news is that the free-falling price of commodity servers makes running it incredibly cheap.

I use m1.xlarge servers on Amazon EC2, which are beefy enough to process two million Instagram-sized photos a day, and only cost $12.48! I’ve used some open source frameworks to distribute the work in a completely scalable way, so this works out to $624 for a 50-machine cluster that can process 100 million pictures in 24 hours. That’s just 0.000624 cents per photo! (I seriously do not have enough exclamation points for how mind-blowingly exciting this is.)
….

There are a couple of other components that are necessary to reach the same results as Pete.

See HIPI for processing photos on Hadoop, OpenCV for the computer vision side, and the rest of Pete’s article for some very helpful tips.
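
Not Jetpac’s pipeline, but as a taste of treating photos as data, here is a minimal OpenCV sketch that counts faces in a single image with one of the stock Haar cascades. The image path is hypothetical and the cascade file ships with OpenCV.

    # Count faces in one photo with OpenCV's stock frontal-face Haar cascade.
    # Point the cascade path at the haarcascade_frontalface_default.xml file
    # installed with your OpenCV build; "photo.jpg" is a hypothetical input.
    import cv2

    cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")
    img = cv2.imread("photo.jpg")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    print("faces found:", len(faces))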

Statistics, Data Mining, and Machine Learning in Astronomy:…

Filed under: Astroinformatics,Data Mining,Machine Learning,Statistics — Patrick Durusau @ 3:26 pm

Statistics, Data Mining, and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data by Željko Ivezic, Andrew J. Connolly, Jacob T VanderPlas, Alexander Gray.

From the Amazon page:

As telescopes, detectors, and computers grow ever more powerful, the volume of data at the disposal of astronomers and astrophysicists will enter the petabyte domain, providing accurate measurements for billions of celestial objects. This book provides a comprehensive and accessible introduction to the cutting-edge statistical methods needed to efficiently analyze complex data sets from astronomical surveys such as the Panoramic Survey Telescope and Rapid Response System, the Dark Energy Survey, and the upcoming Large Synoptic Survey Telescope. It serves as a practical handbook for graduate students and advanced undergraduates in physics and astronomy, and as an indispensable reference for researchers.

Statistics, Data Mining, and Machine Learning in Astronomy presents a wealth of practical analysis problems, evaluates techniques for solving them, and explains how to use various approaches for different types and sizes of data sets. For all applications described in the book, Python code and example data sets are provided. The supporting data sets have been carefully selected from contemporary astronomical surveys (for example, the Sloan Digital Sky Survey) and are easy to download and use. The accompanying Python code is publicly available, well documented, and follows uniform coding standards. Together, the data sets and code enable readers to reproduce all the figures and examples, evaluate the methods, and adapt them to their own fields of interest.

  • Describes the most useful statistical and data-mining methods for extracting knowledge from huge and complex astronomical data sets
  • Features real-world data sets from contemporary astronomical surveys
  • Uses a freely available Python codebase throughout
  • Ideal for students and working astronomers

Still in pre-release but if you want to order the Kindle version (or hardback) to be sent to me, I’ll be sure to put it on my list of items to blog about in 2014!

Or your favorite book on graphs, data analysis, etc, for that matter. 😉

Command Line One Liners

Filed under: Linux OS,Programming — Patrick Durusau @ 3:16 pm

Command Line One Liners by Arturo Herrero.

From the webpage:

After my blog post about command line one-liners, many people want to contribute their own commands.

What one-liner do you want to contribute for the holiday season?

AstroML:… [0.2 release]

Filed under: Astroinformatics — Patrick Durusau @ 2:44 pm

AstroML: Machine Learning and Data Mining for Astronomy.

astroML 0.2 was released in November. Source on Github.

Introduction to astroML received the CIDU 2012 best paper award.

From the webpage:

AstroML is a Python module for machine learning and data mining built on numpy, scipy, scikit-learn, and matplotlib, and distributed under the 3-clause BSD license. It contains a growing library of statistical and machine learning routines for analyzing astronomical data in python, loaders for several open astronomical datasets, and a large suite of examples of analyzing and visualizing astronomical datasets.

The goal of astroML is to provide a community repository for fast Python implementations of common tools and routines used for statistical data analysis in astronomy and astrophysics, and to provide a uniform and easy-to-use interface to freely available astronomical datasets. We hope this package will be useful to researchers and students of astronomy. The astroML project was started in 2012 to accompany the book Statistics, Data Mining, and Machine Learning in Astronomy by Zeljko Ivezic, Andrew Connolly, Jacob VanderPlas, and Alex Gray, published by Princeton University Press. The table of contents is available here (pdf), or you can view the book on Amazon.

Version 0.2 has improved documentation and examples.
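
Getting data in is meant to be a one-liner per survey. A minimal sketch (the loader name comes from the astroML documentation, so verify it against the version you install; the catalog downloads and caches on first use):

    # Pull one of the bundled SDSS catalogs and inspect it. The loader name is
    # taken from the astroML docs -- verify against your installed version.
    from astroML.datasets import fetch_dr7_quasar

    data = fetch_dr7_quasar()               # downloads and caches on first call
    print(data.shape, data.dtype.names)     # record array: inspect the columns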

Looking forward to the further development of this package!

BTW, be aware that data mining skills, save for domain knowledge, are largely transferable.

What’s Not There: The Odd-Lot Bias in TAQ Data

Filed under: Data,Finance Services — Patrick Durusau @ 2:07 pm

What’s Not There: The Odd-Lot Bias in TAQ Data by Maureen O’Hara, Chen Yao, and Mao Ye.

Abstract:

We investigate the systematic bias that arises from the exclusion of trades for less than 100 shares from TAQ data. In our sample, we find that the median number of missing trades per stock is 19%, but for some stocks missing trades are as high as 66% of total transactions. Missing trades are more pervasive for stocks with higher prices, lower liquidity, higher levels of information asymmetry and when volatility is low. We show that odd lot trades contribute 30 % of price discovery and trades of 100 shares contribute another 50%, consistent with informed traders splitting orders into odd-lots and smaller trade sizes. The truncation of odd-lot trades leads to a significant bias for empirical measures such as order imbalance, challenges the literature using trade size to proxy individual trades, and biases measures of individual sentiment. Because odd-lot trades are more likely to arise from high frequency traders, we argue their exclusion from TAQ and the consolidated tape raises important regulatory issues.

TAQ = Trade and Quote data.

Amazing what you can find if you go looking for it. O’Hara and friends find that missing trades can be as much as 66% of the total transactions for some stocks.

The really big news is that, prompted by this academic paper, US regulators required disclosure of this hidden data starting on December 9, 2013.

For access, see the Daily TAQ, where you will find the raw data for $1,500 per year for one user.

Despite its importance to the public, I don’t know of any time-delayed public archive of trade data.

Format specifications and sample data are available for:

  • Daily Trades File: Every trade reported to the consolidated tape, from all CTA participants. Each trade identifies the time, exchange, security, volume, price, sale condition, and more.
  • Daily TAQ Master File (Beta): (specification only)
  • Daily TAQ Master File: All master securities information in NYSE-listed and non-listed stocks, including Primary Exchange Indicator
  • Daily Quote and Trade Admin Message File: All Limit-up/Limit-down Price Band messages published on the CTA and UTP trade and quote feeds. The LULD trial is scheduled to go live with phase 1 on April 8, 2013.
  • Daily NBBO File: An addendum to the Daily Quotes file, containing continuous National Best Bid and Offer updates and consolidated trades and quotes for all listed and non-listed issues.
  • Daily Quotes File: Every quote reported to the consolidated tape, from all CTA participants. Each quote identifies the time, exchange, security, bid/ask volumes, bid/ask prices, NBBO indicator, and more.
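
Once you have a Daily Trades file in hand, checking the odd-lot share yourself takes a few lines of pandas. This is a hypothetical sketch: the file name, delimiter, and column names are guesses, so read the format specification first.

    # Fraction of each symbol's trades that are odd lots (under 100 shares).
    # File name, delimiter, and column names are assumptions -- consult the
    # Daily Trades format specification for the real layout.
    import pandas as pd

    trades = pd.read_csv("daily_trades_20131209.psv", sep="|")

    trades["odd_lot"] = trades["Trade Volume"] < 100
    odd_share = (trades.groupby("Symbol")["odd_lot"]
                       .mean()
                       .sort_values(ascending=False))

    print(odd_share.head(10))   # symbols where odd lots dominate the activity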

Merging financial data with other data (property transactions/ownership, marriage/divorce, and other activities) is a topic map activity.

Reverse Entity Recognition? (Scrubbing)

Filed under: Entities,Entity Resolution,Privacy — Patrick Durusau @ 12:51 pm

Improving privacy with language technologies by Rob Munro.

From the post:

One downside of the kind of technologies that we build at Idibon is that they can be used to compromise people’s privacy and, by extension, their safety. Any technology can be used for positive and negative purposes and as engineers we have a responsibility to ensure that what we create is for a better world.

For language technologies, the most negative application, by far, is eavesdropping: discovering information about people by monitoring their online communications and using that information in ways that harm the individuals. This can be something as direct and targeted as exposing the identities of at-risk individuals in a war-zone or it can be the broad expansion of government surveillance. The engineers at many technology companies announced their opposition to the latter with a loud, unified call today to reform government surveillance.

One way that privacy can be compromised at scale is the use of technology known as “named entity recognition”, which identifies the names of people, places, organizations, and other types of real-world entities in text. Given millions of sentences of text, named entity recognition can extract the names and addresses of everybody in the data in just a few seconds. But the same technology that can be used to uncover personally identifying information (PII) can also be used to remove the personally identifying information from the text. This is known as anonymizing or simply “scrubbing”.

Rob agrees that entity recognition can invade your personal privacy, but points out it can also protect your privacy.

You may think your “handle” on one or more networks provides privacy but it would not take much data to disappoint most people.

Entity recognition software can scrub data to remove “tells” that may identify you from it.
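
A minimal scrubbing pass, using NLTK’s off-the-shelf chunker rather than anything Idibon-specific. A stock chunker will miss plenty that a production system would catch, and you need the standard NLTK models downloaded first.

    # Redact PERSON entities with NLTK's stock tokenizer, tagger, and chunker
    # (download punkt, the POS tagger, maxent_ne_chunker, and words first).
    import nltk

    def scrub_persons(text):
        tokens = nltk.word_tokenize(text)
        tree = nltk.ne_chunk(nltk.pos_tag(tokens))
        out = []
        for node in tree:
            if isinstance(node, nltk.Tree):
                if node.label() == "PERSON":
                    out.append("[REDACTED]")
                else:                       # keep non-person entities as-is
                    out.append(" ".join(word for word, tag in node.leaves()))
            else:                           # plain (word, tag) pair
                out.append(node[0])
        return " ".join(out)

    print(scrub_persons("Rob Munro wrote about privacy and language technology."))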

How much scrubbing is necessary depends on the data and the consequences of discovery.

Entity recognition is usually thought of as recognizing names and places, but it could just as easily be content analysis that recognizes a particular author.

That would require more sophisticated “scrubbing” than entity recognition can support.

December 9, 2013

Building Client-side Search Applications with Solr

Filed under: Lucene,Search Interface,Searching,Solr — Patrick Durusau @ 7:46 pm

Building Client-side Search Applications with Solr by Daniel Beach.

Description:

Solr is a powerful search engine, but creating a custom user interface can be daunting. In this fast paced session I will present an overview of how to implement a client-side search application using Solr. Using open-source frameworks like SpyGlass (to be released in September) can be a powerful way to jumpstart your development by giving you out-of-the box results views with support for faceting, autocomplete, and detail views. During this talk I will also demonstrate how we have built and deployed lightweight applications that are able to be performant under large user loads, with minimal server resources.
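
The server side of such an application is usually just Solr’s JSON response writer. A minimal faceted query looks something like this (the core name and facet field are hypothetical):

    # The kind of query a client-side search app sends to Solr; the core name
    # ("patents") and facet field ("assignee") are hypothetical.
    import requests

    params = {
        "q": "solar cell",
        "wt": "json",
        "rows": 10,
        "facet": "true",
        "facet.field": "assignee",
    }
    resp = requests.get("http://localhost:8983/solr/patents/select", params=params)
    body = resp.json()

    print(body["response"]["numFound"], "hits")
    print(body["facet_counts"]["facet_fields"]["assignee"][:10])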

If you need a compelling reason to watch this video, check out:

Global Patent Search Network.

What is the Global Patent Search Network?

As a result of cooperative effort between the United States Patent and Trademark Office (USPTO) and State Intellectual Property Office (SIPO) of the People’s Republic of China, Chinese patent documentation is now available for search and retrieval from the USPTO website via the Global Patent Search Network. This tool will enable the user to search Chinese patent documents in the English or Chinese language. The data available include fulltext Chinese patents and machine translations. Also available are full document images of Chinese patents which are considered the authoritative Chinese patent document. Users can search documents including published applications, granted patents and utility models from 1985 to 2012.

Something over four (4) million patents.

Try the site, then watch the video.

Software mentioned: Spyglass, Ember.js.

Introducing Luwak,…

Filed under: Java,Lucene,Searching — Patrick Durusau @ 5:04 pm

Introducing Luwak, a library for high-performance stored queries by Charlie Hull.

From the post:

A few weeks ago we spoke in Dublin at Lucene Revolution 2013 on our work in the media monitoring sector for various clients including Gorkana and Australian Associated Press. These organisations handle a huge number (sometimes hundreds of thousands) of news articles every day and need to apply tens of thousands of stored expressions to each one, which would be extremely inefficient if done with standard search engine libraries. We’ve developed a much more efficient way to achieve the same result, by pre-filtering the expressions before they’re even applied: effectively we index the expressions and use the news article itself as a query, which led to the presentation title ‘Turning Search Upside Down’.

We’re pleased to announce the core of this process, a Java library we’ve called Luwak, is now available as open source software for your own projects. Here’s how you might use it:

That may sound odd, using the article as the query, but be aware that Charlie reports “speeds of up to 70,000 stored queries applied to an article in around a second on modest hardware.”

Perhaps not “big data speed” but certainly enough speed to get your attention.

Charlie mentions in his Dublin slides that Luwak could be used to “Add metadata to items based on their content.”

That is one use case, but creating topics/associations out of content would be another.

Domino

Filed under: Cloud Computing,Interface Research/Design,R — Patrick Durusau @ 2:11 pm

San Francisco startup takes on collaborative Data Science from The R Backpages 2 by Joseph Rickert.

From the post:

Domino, a San Francisco based startup, is inviting users to sign up to beta test its vision of online, Data Science collaboration. The site is really pretty slick, and the vision of cloud computing infrastructure integrated with an easy to use collaboration interface and automatic code revisioning is compelling. Moreover, it is delightfully easy to get started with Domino. After filling out the new account form, a well thought out series of screens walks the new user through downloading the client software, running a script (R, MatLab or Python) and viewing the results online. The domino software creates a quick-start directory on your PC where it looks for scripts to run. After the installation is complete it is just a matter of firing up a command window to run scripts in the cloud with:
….

Great review by Joseph on Domino and its use on a PC.

Persuaded me to do an install on my local box:

Installation on Ubuntu 12.04

  • Get a Domino Account
  • Download/Save the domino-install-unix.sh file to a convenient directory. (Just shy of 20MB.)
  • chmod 744 domino-install-unix.sh
  • ./domino-install-unix.sh
  • If you aren’t root, just ignore the symlink question. It’s a bug, but the install will continue happily. Tech support promptly reported that it will be fixed.
  • BTW, installing from a shell window requires opening a new shell window afterwards to pick up the modified path that includes the domino executable.
  • Follow the QuickStart, Steps 3, 4, and 5.
  • Step six of the QuickStart seems to be unnecessary. As the owner of the job, I was set to get email notification anyway.
  • Steps seven and eight of the QuickStart require no elaboration.

BTW, tech support was quick and on point in response to my questions about the installation script.

I have only run the demo scripts at this point but Domino looks like an excellent resource for R users and a great model for bringing the cloud to your desktop.

Leveraging software a user already knows to seamlessly deliver greater capabilities has to be a winning combination.

ROpenSci

Filed under: Data,Dublin Core,R,Science — Patrick Durusau @ 1:00 pm

ROpenSci

From the webpage:

At rOpenSci we are creating packages that allow access to data repositories through the R statistical programming environment that is already a familiar part of the workflow of many scientists. We hope that our tools will not only facilitate drawing data into an environment where it can readily be manipulated, but also one in which those analyses and methods can be easily shared, replicated, and extended by other researchers. While all the pieces for connecting researchers with these data sources exist as disparate entities, our efforts will provide a unified framework that will quickly connect researchers to open data.

More than twenty (20) R packages are available today!

Great for data mining your favorite science data repository, but that isn’t the only reason I mention them.

One of the issues for topic maps has always been how to produce the grist for a topic map mill. There is a lot of data and production isn’t a thrilling task. 😉

But what if we could automate that production, at least to a degree?

The search functions in Treebase offer several examples of where auto-generation of semantics would benefit both the data set and potential users.

In Treebase: An R package for discovery, access and manipulation of online phylogenies, Carl Boettiger and Duncan Temple Lang point out that Treebase has search functions for “author” and “subject.”

Err, but Dublin Core 1.1 refers to authors as “creators.” And “subject,” for Treebase means: “Matches in the subject terms.”

The ACM would say “keywords,” as would many others, instead of “subject.”

Not a great semantic speed bump,* but one that, if left unnoticed, will result in poorer, not richer, search results.

What if, for an R package like Treebase, a user could request what is identified by a field?

That is, in addition to the fields being returned, one or more key/value pairs would be returned for each field, defining what is identified by that field.

For example, for “author” an --iden switch could return:

Author            Semantics
Creator           http://purl.org/dc/elements/1.1/creator
Author/Creator    http://catalog.loc.gov/help/author-keyword.htm

and so on, perhaps even including identifiers in other languages.
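
Since no such switch exists, here is a purely hypothetical sketch of what a client honoring an --iden request might return. None of the function or field names below are part of Treebase or any rOpenSci package.

    # Purely hypothetical: a search client that, on request, returns key/value
    # identifier documentation alongside each field it uses.
    from pprint import pprint

    FIELD_IDENTIFIERS = {
        "author": {
            "Creator": "http://purl.org/dc/elements/1.1/creator",
            "Author/Creator": "http://catalog.loc.gov/help/author-keyword.htm",
        },
        "subject": {
            "Keywords": "http://purl.org/dc/elements/1.1/subject",
        },
    }

    def search(author=None, iden=False):
        # Stub result standing in for a real repository query.
        results = [{"author": "Boettiger, C.", "subject": "phylogenetics"}]
        if not iden:
            return results
        return {"results": results,
                "identifiers": {field: FIELD_IDENTIFIERS.get(field, {})
                                for field in results[0]}}

    pprint(search(author="Boettiger", iden=True))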

While this step only addresses what a field identifies, it would be a first step towards documenting identifiers that could be used over and over again to improve access to scientific data.

Future changes, and we know there will be future changes, are accommodated by simply appending to the currently documented identifiers.

Document identifier mappings once, Reuse identifier mappings many times.

PS: The mapping I suggest above is a blind mapping: no information is given about “why” I thought the alternatives given were alternatives to the main entry “author.”

Blind mappings are sufficient for many cases but are terribly insufficient for others. Biological taxonomies, for example, do change and capturing what characteristics underlie a particular mapping may be important in terms of looking forwards or backwards from some point in time in the development of a taxonomy.

* I note for your amusement that Wikipedia offers “vertical deflection traffic calming devices,” as a class that includes “speed bump, speed hump, speed cushion, and speed table.”

Like many Library of Congress subject headings, “vertical deflection traffic calming devices” doesn’t really jump to mind when doing a search for “speed bump.” 😉

Highlighting text in text mining

Filed under: R,Text Mining — Patrick Durusau @ 10:35 am

Highlighting text in text mining by Scott Chamberlain.

From the post:

rplos is an R package to facilitate easy search and full-text retrieval from all Public Library of Science (PLOS) articles, and we have a little feature which we aren't sure is useful or not. I don't actually do any text-mining for my research, so perhaps text-mining folks can give some feedback.

You can quickly get a lot of results back using rplos, so perhaps it is useful to quickly browse what you got. What better tool than a browser to browse? Enter highplos and highbrow. highplos uses the Solr capabilities of the PLOS search API, and lets you get back a string with the term you searched for highlighted (by default with <em> tag for italics).

The rplos package has various metric and retrieval functions in addition to its main search function.

A product of the ROpenSci project.

Quandl exceeds 7.8 million datasets!

Filed under: Data,Dataset,Marketing — Patrick Durusau @ 9:01 am

From The R Backpages 2 by Joseph Rickert.

From the post:

Quandl continues its mission to seek out and make available the world’s financial and econometric data. Recently added data sets include:

That’s a big jump since our last post when Quandl broke 5 million datasets! (April 30, 2013)

Any thoughts on how many of these datasets have semantic mapping data to facilitate their re-use and/or combination with other datasets?

Selling the mapping data might be a tough sell because the customer still has to make intelligent use of it.

Selling mapped data, on the other hand, that is, offering consolidation of specified data sets on a daily, weekly, or monthly basis, might be a different story.

Something to think about.

PS: Do remember that a documented mapping for any dataset at Quandl will work for that same dataset elsewhere. So you won’t be re-discovering the mapping every time a request comes in for that dataset.

Not a “…butts in seats…” approach but then you probably aren’t a prime contractor.

December 8, 2013

Why Relationships are cool…

Filed under: Graph Databases,Graphs,OrientDB — Patrick Durusau @ 8:56 pm

Why Relationships are cool but the “JOIN” sucks by Luca Garulli.

I have been trying to avoid graph “intro” slides and presentations.

There are only so many times you can stand to hear “…all the world is a graph…” as though that’s news. To anyone.

This presentation by Luca is different from the usual introduction to graphs presentation.

Most of my readers won’t learn anything new but it may bump them into thinking of new ways to advocate the use of graphs.

By the way, Luca is responsible for OrientDB.

OrientDB version 1.6.1 was released on November 20, 2013, so if you haven’t looked at OrientDB in a while, now might be the time.

Updating OpenStreetMap…

Filed under: Maps,OpenStreetMap — Patrick Durusau @ 8:38 pm

Updating OpenStreetMap with the latest US road data by Eric Fisher.

From the post:

We can now pull the most current US government index of all roads directly into OpenStreetMap for tracing. Just go to OpenStreetMap.org, click Edit, and choose the “New & Misaligned TIGER Roads” option from the layer menu. “TIGER” is the name of the US road database managed by the Census Bureau. The TIGER layer will reveal in yellow any roads that have been corrected in or added to TIGER since 2006 and that have not also been corrected in OpenStreetMap. Zoom in on any yellow road to see how TIGER now maps it, verify it against the aerial imagery, and correct it in OpenStreetMap.

This could be very useful.

For planning protest, retreat, escape routes and such. 😉

Advances in Neural Information Processing Systems 26

Advances in Neural Information Processing Systems 26

The NIPS 2013 conference ended today.

All of the NIPS 2013 papers were posted today.

I count three hundred and sixty (360) papers.

From the NIPS Foundation homepage:

The Foundation: The Neural Information Processing Systems (NIPS) Foundation is a non-profit corporation whose purpose is to foster the exchange of research on neural information processing systems in their biological, technological, mathematical, and theoretical aspects. Neural information processing is a field which benefits from a combined view of biological, physical, mathematical, and computational sciences.

The primary focus of the NIPS Foundation is the presentation of a continuing series of professional meetings known as the Neural Information Processing Systems Conference, held over the years at various locations in the United States, Canada and Spain.

Enjoy the proceedings collection!

I first saw this in a tweet by Benoit Maison.

Mapping the open web using GeoJSON

Filed under: Geo Analytics,Geographic Data,Geographic Information Retrieval,JSON,NSA — Patrick Durusau @ 5:59 pm

Mapping the open web using GeoJSON by Sean Gillies.

From the post:

GeoJSON is an open format for encoding information about geographic features using JSON. It has much in common with older GIS formats, but also a few new twists: GeoJSON is a text format, has a flexible schema, and is specified in a single HTML page. The specification is informed by standards such as OGC Simple Features and Web Feature Service and streamlines them to suit the way web developers actually build software today.

Promoted by GitHub and used in the Twitter API, GeoJSON has become a big deal in the open web. We are huge fans of the little format that could. GeoJSON suits the web and suits us very well; it plays a major part in our libraries, services, and products.
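
If you have not seen the format, a complete GeoJSON Feature is small enough to write by hand. Built here as a plain Python dict (the venue is made up; coordinates are longitude first, per the spec):

    # A minimal GeoJSON Feature, serialized with the standard library.
    import json

    feature = {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [-77.0365, 38.8977]},
        "properties": {"name": "Example venue"},
    }
    print(json.dumps(feature, indent=2))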

A short but useful review of why GeoJSON is important to MapBox and why it should be important to you.

A must read if you are interested in geo-locating data of interest to your users to maps.

Sean mentions that Github promotes GeoJSON but I’m curious if the NSA uses/promotes it as well? 😉

Neo4j GraphGist December Challenge

Filed under: Contest,Graphs,Neo4j — Patrick Durusau @ 5:50 pm

Neo4j GraphGist December Challenge

Meetup Slides say: Deadline for entry is January 31st (2014). I mention that because the webpage still says Dec 31, 2013.

From the webpage:

This time we want you to look into these 10 categories and provide us with really easy to understand and still insightful Graph Use-Cases: Do not take the example keywords literally, you know your domain much better than we do!

  • Education – Schools, Universities, Courses, Planning, Management etc
  • Finance – Loans, Risks, Fraud
  • Life Science – Biology, Genetics, Drug research, Medicine, Doctors, Referrals
  • Manufacturing – production line management, supply chain, parts list, product lines
  • Sports – Football, Baseball, Olympics, Public Sports
  • Resources – Energy Market, Consumption, Resource exploration, Green Energy, Climate Modeling
  • Retail – Recommendations, Product categories, Price Management, Seasons, Collections
  • Telecommunication – Infrastructure, Authorization, Planning, Impact
  • Transport – Shipping, Logistics, Flights, Cruises, Road/Train optimizations, Schedules
  • Advanced Graph Gists – for those of you that run outside of the competition anyway, give your best 🙂

Prizes:

We want to offer in each of our 10 categories Amazon gift-cards valued:

  1. Winner: 300 USD
  2. Second: 150 USD
  3. Third: 50 USD
  4. Every participant gets a special GraphGist t-shirt too.

In addition to the resources at the webpage, you may find the AsciiDoc Cheatsheet helpful.

The meetup video where the GraphGist was announced.

Easy to understand graph use cases should not be too difficult.

Easy to solve graph use cases, that may be another matter. 😉

BayesDB

Filed under: Bayesian Data Analysis,Database — Patrick Durusau @ 5:14 pm

BayesDB (Alpha 0.1.0 Release)

From the webpage:

BayesDB, a Bayesian database table, lets users query the probable implications of their data as easily as a SQL database lets them query the data itself. Using the built-in Bayesian Query Language (BQL), users with no statistics training can solve basic data science problems, such as detecting predictive relationships between variables, inferring missing values, simulating probable observations, and identifying statistically similar database entries.

BayesDB is suitable for analyzing complex, heterogeneous data tables with up to tens of thousands of rows and hundreds of variables. No preprocessing or parameter adjustment is required, though experts can override BayesDB’s default assumptions when appropriate.

BayesDB’s inferences are based in part on CrossCat, a new, nonparametric Bayesian machine learning method, that automatically estimates the full joint distribution behind arbitrary data tables.

Now there’s an interesting idea!

Not sure if it is a good idea but it certainly is an interesting one.
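
To give you the flavor of BQL, here is the sort of session the release describes. The statement syntax is paraphrased from the announcement and may not match the alpha exactly, so check the BayesDB documentation before relying on it.

    # The flavor of the Bayesian Query Language; statement syntax paraphrased
    # from the BayesDB release notes, so treat it as an approximation.
    BQL_SESSION = """
    CREATE BTABLE employees FROM employees.csv;
    ANALYZE employees FOR 100 ITERATIONS;
    INFER salary FROM employees WITH CONFIDENCE 0.9;
    SIMULATE age, salary FROM employees TIMES 10;
    ESTIMATE PAIRWISE DEPENDENCE PROBABILITY FROM employees;
    """
    print(BQL_SESSION)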

December 7, 2013

Recommender Systems Course from GroupLens

Filed under: CS Lectures,Recommendation — Patrick Durusau @ 5:26 pm

Recommender Systems Course from GroupLens by Danny Bickson.

From the post:

I got the following course link from my colleague Tim Muss. The GroupLens research group (Univ. of Minnesota) has released a Coursera course about recommender systems. Joseph Konstan and Michael Ekstrand are lecturing. Any reader of my blog who has an elephant memory will recall I wrote about the Lenskit project two years ago, when I interviewed Michael Ekstrand.

Would you agree that recommendation involves subject recognition?

At a minimum, recognition of the subject to be recommended and the subject of a particular user’s preference.

I ask because the key to topic map “merging” isn’t ontological correctness but “correctness” in the eyes of a particular user.

What other standard would I use?

Large-Scale Machine Learning and Graphs

Filed under: GraphChi,GraphLab,Graphs,Python — Patrick Durusau @ 5:10 pm

Large-Scale Machine Learning and Graphs by Carlos Guestrin.

The presentation starts with a history of the evolution of GraphLab, which is interesting in and of itself.

Carlos then goes beyond a history lesson and gives a glimpse of a very exciting future.

Such as: installing GraphLab with Python, using Python for local development, running the same Python with GraphLab in the cloud.

Thought that might catch your eye.

Something to remember when people talk about scaling graph analysis.

If you are interested in seeing one possible future of graph processing today, not some day, check out: GraphLab Notebook (Beta).

BTW, Carlos mentions a technique called “think like a vertex” which involves distributing vertexes across machines rather than splitting graphs on edges.

Seems to me that would work to scale the processing of topic maps by splitting topics as well. Once “merging” has occurred on different machines, then “merge” the relevant topics back together across machines.
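
If “think like a vertex” is unfamiliar, here is a toy, single-machine illustration (generic, not the GraphLab API): every vertex runs the same small program, seeing only its neighbors’ state, once per superstep.

    # Toy vertex-centric PageRank: each vertex program reads only its
    # in-neighbors' state and writes its own. Generic sketch, not GraphLab code.
    edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
    vertices = sorted({v for edge in edges for v in edge})
    out_nbrs = {v: [d for s, d in edges if s == v] for v in vertices}
    in_nbrs = {v: [s for s, d in edges if d == v] for v in vertices}

    rank = {v: 1.0 / len(vertices) for v in vertices}
    for _ in range(20):                              # synchronous supersteps
        new_rank = {}
        for v in vertices:                           # the per-vertex program
            incoming = sum(rank[u] / len(out_nbrs[u]) for u in in_nbrs[v])
            new_rank[v] = 0.15 / len(vertices) + 0.85 * incoming
        rank = new_rank

    print(rank)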

The Society of Mind

Filed under: Artificial Intelligence,Machine Learning — Patrick Durusau @ 2:40 pm

The Society of Mind by Marvin Minsky.

From the Prologue:

This book tries to explain how minds work. How can intelligence emerge from nonintelligence? To answer that, we’ll show that you can build a mind from many little parts, each mindless by itself.

I’ll call Society of Mind this scheme in which each mind is made of many smaller processes. These we’ll call agents. Each mental agent by itself can only do some simple thing that needs no mind or thought at all. Yet when we join these agents in societies — in certain very special ways — this leads to true intelligence.

There’s nothing very technical in this book. It, too, is a society — of many small ideas. Each by itself is only common sense, yet when we join enough of them we can explain the strangest mysteries of mind. One trouble is that these ideas have lots of cross-connections. My explanations rarely go in neat, straight lines from start to end. I wish I could have lined them up so that you could climb straight to the top, by mental stair-steps, one by one. Instead they’re tied in tangled webs.

Perhaps the fault is actually mine, for failing to find a tidy base of neatly ordered principles. But I’m inclined to lay the blame upon the nature of the mind: much of its power seems to stem from just the messy ways its agents cross-connect. If so, that complication can’t be helped; it’s only what we must expect from evolution’s countless tricks.

What can we do when things are hard to describe? We start by sketching out the roughest shapes to serve as scaffolds for the rest; it doesn’t matter very much if some of those forms turn out partially wrong. Next, draw details to give these skeletons more lifelike flesh. Last, in the final filling-in, discard whichever first ideas no longer fit.

That’s what we do in real life, with puzzles that seem very hard. It’s much the same for shattered pots as for the cogs of great machines. Until you’ve seen some of the rest, you can’t make sense of any part.

All 270 essays in 30 chapters of Minsky’s 1988 book by the same name.

To be read critically.

It is dated but a good representative of a time in artificial intelligence.

I first saw this in Nat Torkington’s Five Short Links for 6 December 2013.

Free GIS Data

Filed under: Data,GIS,Mapping — Patrick Durusau @ 2:13 pm

Free GIS Data by Robin Wilson.

Over 300 GIS data sets. As of 7 December 2013, last updated 6 December 2013.

A very wide ranging collection of “free” GIS data.

Robin recommends you check the licenses of individual data sets. The meaning of “free” varies from person to person.

If you discover “free” GIS resources not listed on Robin’s page, drop him a note.

I first saw this in Pete Warden’s Five Short Links for November 30, 2013.

Think Tank Review

Filed under: Data,EU,Government — Patrick Durusau @ 11:47 am

Think Tank Review by Central Library of the General Secretariat of the EU Council.

The title could mean a number of things so when I saw it at Full Text Reports, I followed it.

From the first page:

Welcome to issue 8 of the Think Tank Review compiled by the Council Library.* It references papers published in October 2013. As usual, we provide the link to the full text and a short abstract.

The current Review and past issues can be downloaded from the Intranet of the General Secretariat of the Council or requested to the Library.

A couple of technical points: the Think Tank Review will soon be made available – together with other bibliographic and research products from the Library – on our informal blog at http://www.councillibrary.wordpress.com. A Beta version is already online for you to comment.

More broadly, in the next months we will be looking for ways to disseminate the contents of the Review in a more sophisticated way than the current – admittedly spartan – collection of links cast in a pdf format. We will look at issues such as indexing, full text search, long-term digital preservation, ease of retrieval and readability on various devices. Ideas from our small but faithful community of readers are welcome. You can reach us at central.library@consilium.europa.eu.

I’m not a policy wonk so scanning the titles didn’t excite me, but it might excite you or (more importantly) one of your clients.

It seemed like an odd enough resource that you may not encounter it by chance.

December 6, 2013

Analysis of PubMed search results using R

Filed under: Bioinformatics,PubMed,R — Patrick Durusau @ 8:35 pm

Analysis of PubMed search results using R by Pilar Cacheiro.

From the post:

Looking for information about meta-analysis in R (subject for an upcoming post, as it has become a popular practice to analyze data from different Genome Wide Association studies) I came across this tutorial from The R User Conference 2013 – I couldn’t make it this time, even though it was held so close, maybe Los Angeles next year…

Back to the topic at hand, that is how I found out about the RISmed package which is meant to retrieve information from PubMed. It looked really interesting because, as you may imagine, this is one of the most used resources in my daily routine.

Its use is quite straightforward. First, you define the query and download data from the database (be careful about your IP being blocked from accessing NCBI in the case of large jobs!). Then, you might use the information to look for trends on a topic of interest, extract specific information from abstracts, get descriptives,…

Pilar does a great job introducing RISmed and pointing to additional sources for more examples and discussion of the package.
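
The post is about the R package, but the same define-a-query-then-fetch workflow is available in Python through Biopython’s Entrez module. A sketch; set a real contact email and mind the same IP-blocking caveat.

    # Equivalent workflow with Biopython's Entrez module.
    from Bio import Entrez

    Entrez.email = "you@example.org"    # NCBI asks for a contact address

    handle = Entrez.esearch(db="pubmed", term="meta-analysis[Title]", retmax=50)
    ids = Entrez.read(handle)["IdList"]

    summaries = Entrez.read(Entrez.esummary(db="pubmed", id=",".join(ids)))
    for record in summaries[:5]:
        print(record["PubDate"], record["Title"])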

Meta-analysis is great but you could also be selling the results of your queries to PubMed.

After all, they would be logging your IP address, not that of your client.

Some people prefer more anonymity than others and are willing to pay for that privilege.
