Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 13, 2013

JSNetworkX

Filed under: D3,Graphs,Javascript,Networks,NetworkX — Patrick Durusau @ 12:39 pm

JSNetworkX

A port of the NetworkX graph library to JavaScript

JSNetworkX is a port of the popular Python graph library NetworkX. Let's describe it in their words:

NetworkX is a Python language software package for the creation, manipulation, and study of the structure, dynamics, and function of complex networks.

With NetworkX you can load and store networks in standard and nonstandard data formats, generate many types of random and classic networks, analyze network structure, build network models, design new network algorithms, draw networks, and much more.

Github.

Wiki.

Looks like an easy way to include graph representations of topic maps in a web page.
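
To get a feel for the API, here is a minimal sketch using the original Python NetworkX, whose interface JSNetworkX aims to mirror (the graph data is invented for illustration):

    import networkx as nx

    # A tiny graph of topics and associations (hypothetical data).
    g = nx.Graph()
    g.add_edge("topic:lucene", "topic:solr", label="related-to")
    g.add_edge("topic:solr", "topic:elasticsearch", label="competes-with")

    # The kinds of structural queries JSNetworkX brings to the browser.
    print(g.number_of_nodes())                # 3
    print(list(g.neighbors("topic:solr")))    # both neighbors
    print(nx.degree_centrality(g))            # normalized degree per node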

I suspect you will be seeing more of this in the not too distant future.

I first saw this in a tweet by Christophe Viau.

eSpatial launches free edition of mapping software

Filed under: Geographic Data,GIS,Mapping,Maps — Patrick Durusau @ 12:22 pm

eSpatial launches free edition of mapping software

From the post:

eSpatial, a leading provider of powerful mapping software, today announced the launch of a free edition of their flagship mapping software, also called eSpatial.

eSpatial mapping software lets users convert spreadsheet data into map form, with just a few clicks. This visualization provides immediate insights into market trends and challenges.

The new free edition of eSpatial is available to anyone who signs up for an account at www.espatial.com. Once logged on, users can create maps from their existing data and then post them on websites as interactive maps.

Since it launched last year, eSpatial has made strong inroads into the sales mapping and territory mapping software market, especially in the United States.

Paid editions (including Basic, Pro and Team) of the application with greater functionality – including the ability to handle increased amounts of data, reporting and sharing options – start at $399 for an annual subscription.

www.espatial.com

Just started playing with this but it could be radically cool!

For example, what if you mapped a particular congressional district and then mapped the donations to its representative's campaign by zip code?

I need to read the manual and find some data to import.

BTW, high marks for one of the easiest registrations I have ever encountered.

Inferring Social Rank in…

Filed under: Networks,Probabilistic Models,Social Networks — Patrick Durusau @ 4:05 am

Inferring Social Rank in an Old Assyrian Trade Network by David Bamman, Adam Anderson, Noah A. Smith.

Abstract:

We present work in jointly inferring the unique individuals as well as their social rank within a collection of letters from an Old Assyrian trade colony in Kültepe, Turkey, settled by merchants from the ancient city of Assur for approximately 200 years between 1950-1750 BCE, the height of the Middle Bronze Age. Using a probabilistic latent-variable model, we leverage pairwise social differences between names in cuneiform tablets to infer a single underlying social order that best explains the data we observe. Evaluating our output with published judgments by domain experts suggests that our method may be used for building informed hypotheses that are driven by data, and that may offer promising avenues for directed research by Assyriologists.

An example of how digitization of ancient texts enables research other than text searching.

Inferring identity and social rank may be instructive for creation of topic maps from both ancient and modern data sources.

I first saw this in a tweet by Stefano Bertolo.

D3 Alternative Documentation

Filed under: D3,Graphics,Visualization — Patrick Durusau @ 3:41 am

D3 Alternative Documentation

Documentation generated with a “TypeScript Ambient Source File Documentation Generator.”

dtsdoc is a documentation generator for TypeScript ambient source files (.d.ts). You can also use markdown to document your code.

dtsdoc runs on a Web browser or node.js.

I first saw this in a tweet by Christophe Viau.

March 12, 2013

Fast Data Gets A Jump On Big Data

Filed under: BigData,Decision Making,Intelligence — Patrick Durusau @ 2:59 pm

Fast Data Gets A Jump On Big Data by Hasan Rizvi.

The title reminded me of a post by Sam Hunting that asked: “How come we’ve got Big Data and not Good Data?”

Now “big data” is to give way to “fast data.”

From the post:

Today, both IT and business users alike are facing business scenarios where they need better information to differentiate, innovate, and radically transform their business.

In many cases, that transformation is being enabled by a move to “Big Data.” Organizations are increasingly collecting vast quantities of real-time data from a variety of sources, from online social media data to highly-granular transactional data to data from embedded sensors. Once collected, users or businesses are mining the data for meaningful patterns that can be used to drive business decisions or actions.

Big Data uses specialized technologies (like Hadoop and NoSQL) to process vast amounts of information in bulk. But most of the focus on Big Data so far has been on situations where the data being managed is basically fixed—it’s already been collected and stored in a Big Data database.

This is where Fast Data comes in. Fast Data is a complementary approach to Big Data for managing large quantities of “in-flight” data that helps organizations get a jump on those business-critical decisions. Fast Data is the continuous access and processing of events and data in real-time for the purposes of gaining instant awareness and instant action. Fast Data can leverage Big Data sources, but it also adds a real-time component of being able to take action on events and information before they even enter a Big Data system.

Sorry Sam, “good data” misses out again.

Data isn’t the deciding factor in human decision making, instant or otherwise; see Thinking, Fast and Slow by Daniel Kahneman.

Supplying decision makers with good data and sufficient time to consider it is the route to better decision making.

Of course, that leaves time to discover the poor quality of data provided by fast/big data delivery mechanisms.

Elasticsearch and Joining

Filed under: ElasticSearch,Joins,Lucene — Patrick Durusau @ 2:40 pm

Elasticsearch and Joining by Felix Hürlimann.

From the post:

With the success of elasticsearch, people, including us, start to explore the possibilities and mightiness of the system, including border cases the underlying core, Lucene, was never originally intended or optimized for. One of the many requests that comes up pretty quickly is the wish for joining data across types or indexes, similar to an SQL join clause that combines records from two or more tables in a database. Unfortunately full join support is not (yet?) available out of the box. But there are some possibilities and some attempts to solve parts of the issue. This post is about summarizing some of the ideas in this field.

To illustrate the different ideas, let’s work with the following example: we would like to index documents and comments with a one to many relationship between them. Each comment has an author and we would like to answer the question: Give me all documents that match a certain query and a specific author has commented on it.

The latest beta release of Elasticsearch is described as:

If you have more complex requirements for joins, a new feature introduced in the latest beta release may help you. It allows for a kind of join by looking up filter terms in another index or type. This then allows, e.g., queries like ‘Show me all comments from documents that relate to this document and the author is John Doe’.

The “looking up” in a different index or type sounds quite interesting.
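
Based on that description, a request using the terms lookup filter might look roughly like the sketch below. The index, type and field names are invented; check the release notes for the exact syntax.

    import json
    import requests

    # Find documents matching a query whose "commenters" field contains
    # any name listed in a field of another document (terms lookup).
    query = {
        "query": {
            "filtered": {
                "query": {"match": {"body": "topic maps"}},
                "filter": {
                    "terms": {
                        "commenters": {
                            "index": "authors",  # where to look up the terms
                            "type": "author",
                            "id": "john-doe",    # document holding the terms
                            "path": "names"      # field inside that document
                        }
                    }
                }
            }
        }
    }

    resp = requests.post("http://localhost:9200/docs/_search",
                         data=json.dumps(query))
    print(resp.json())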

Have you looked at the new beta of Elasticsearch?

…Apache HBase REST Interface, Part 1

Filed under: Cloudera,HBase — Patrick Durusau @ 2:20 pm

How-to: Use the Apache HBase REST Interface, Part 1 by Jesse Anderson.

From the post:

There are various ways to access and interact with Apache HBase. The Java API provides the most functionality, but many people want to use HBase without Java.

There are two main approaches for doing that: One is the Thrift interface, which is the faster and more lightweight of the two options. The other way to access HBase is using the REST interface, which uses HTTP verbs to perform an action, giving developers a wide choice of languages and programs to use.

This series of how-to’s will discuss the REST interface and provide Python code samples for accessing it. The first post will cover HBase REST, some Python caveats, and table administration. The second post will explain how to insert multiple rows at a time using XML and JSON. The third post will show how to get multiple rows using XML and JSON. The full code samples can be found on my GitHub account.
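
As a taste of what the series covers, here is a hedged sketch of reading one row over the REST interface, assuming a REST server on localhost:8080 and an invented table name. Note that values come back base64-encoded in the JSON payload.

    import base64
    import requests

    BASE = "http://localhost:8080"  # assumed HBase REST server

    resp = requests.get(BASE + "/mytable/row1",
                        headers={"Accept": "application/json"})
    for row in resp.json()["Row"]:
        for cell in row["Cell"]:
            # Column names and values are base64-encoded by the server.
            print(base64.b64decode(cell["column"]),
                  base64.b64decode(cell["$"]))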

Post also has a reminder about HBaseCon 2013 (June 13, San Francisco).

Data Catalog Vocabulary (DCAT) [Last Call ends 08 April 2013]

Filed under: DCAT,RDF,Vocabularies — Patrick Durusau @ 2:01 pm

Data Catalog Vocabulary (DCAT)

Abstract:

DCAT is an RDF vocabulary designed to facilitate interoperability between data catalogs published on the Web. This document defines the schema and provides examples for its use.

By using DCAT to describe datasets in data catalogs, publishers increase discoverability and enable applications easily to consume metadata from multiple catalogs. It further enables decentralized publishing of catalogs and facilitates federated dataset search across sites. Aggregated DCAT metadata can serve as a manifest file to facilitate digital preservation.
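
If you have not seen DCAT in the wild, a minimal catalog entry looks something like the following (dataset details invented; parsed here with rdflib just to show it is plain RDF):

    from rdflib import Graph

    TTL = """
    @prefix dcat: <http://www.w3.org/ns/dcat#> .
    @prefix dct:  <http://purl.org/dc/terms/> .

    <http://example.org/catalog> a dcat:Catalog ;
        dcat:dataset <http://example.org/dataset/1> .

    <http://example.org/dataset/1> a dcat:Dataset ;
        dct:title "Campaign donations by zip code" ;
        dcat:keyword "donations", "geography" .
    """

    g = Graph()
    g.parse(data=TTL, format="turtle")
    print(len(g), "triples")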

If you have comments, now would be a good time to finish them up for submission.

I first saw this in a tweet by Sandro Hawke.

RDF Data Cube Vocabulary [Last Call ends 08 April 2013]

Filed under: Data Cubes,RDF,RDF Data Cube Vocabulary,Statistics — Patrick Durusau @ 1:57 pm

RDF Data Cube Vocabulary

Abstract:

There are many situations where it would be useful to be able to publish multi-dimensional data, such as statistics, on the web in such a way that it can be linked to related data sets and concepts. The Data Cube vocabulary provides a means to do this using the W3C RDF (Resource Description Framework) standard. The model underpinning the Data Cube vocabulary is compatible with the cube model that underlies SDMX (Statistical Data and Metadata eXchange), an ISO standard for exchanging and sharing statistical data and metadata among organizations. The Data Cube vocabulary is a core foundation which supports extension vocabularies to enable publication of other aspects of statistical data flows or other multi-dimensional data sets.
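
The core idea is that each cell of a statistical table becomes a qb:Observation with explicit dimensions and measures. A hand-rolled example (vocabulary terms from the draft, data invented):

    from rdflib import Graph

    TTL = """
    @prefix qb: <http://purl.org/linked-data/cube#> .
    @prefix ex: <http://example.org/> .

    ex:obs1 a qb:Observation ;
        qb:dataSet   ex:unemployment ;
        ex:refArea   ex:Georgia ;    # dimension
        ex:refPeriod "2012" ;        # dimension
        ex:rate      7.2 .           # measure
    """

    g = Graph()
    g.parse(data=TTL, format="turtle")
    print(len(g), "triples")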

If you have comments, now would be a good time to finish them up for submission.

I first saw this in a tweet by Sandro Hawke.

SPIN: SPARQL Inferencing Notation

Filed under: RDF,SPARQL — Patrick Durusau @ 1:49 pm

SPIN: SPARQL Inferencing Notation

From the webpage:

SPIN is a W3C Member Submission that has become the de-facto industry standard to represent SPARQL rules and constraints on Semantic Web models. SPIN also provides meta-modeling capabilities that allow users to define their own SPARQL functions and query templates. Finally, SPIN includes a ready to use library of common functions.

SPIN in Five Slides.

In case you encounter SPARQL rules and constraints.

I first saw this in a tweet by Stian Danenbarger.

Apache Lucene/Solr 4.2 Out!

Filed under: Lucene,Solr — Patrick Durusau @ 1:40 pm

Apache Lucene 4.2

Download

Changes

Apache Solr 4.2

Download

Changes

See the Lucene homepage for a summary of the 4.2 changes in Lucene and Solr.

Warning: Reference good only until the next Lucene/Solr release. 😉

The Mythical WITH (Neo4j’s Cypher query language)

Filed under: Cypher,Neo4j — Patrick Durusau @ 1:17 pm

The Mythical WITH (Neo4j’s Cypher query language) by Wes Freeman.

From the post:

Coming from SQL, I found Cypher a quick learn. The match was new, and patterns were new, but everything else seemed to fit well with SQL concepts. Except with, the way to build a sort of sub-query–it seemed hard to wrap my head around. So, what really happens behind the scenes with a with clause in your query? How does it work? It turns out, almost any complex query ends up needing a with in it, but let’s start with a simple example.

After reading this post, you will be waiting for part 2!

Very good introduction to with in Cypher!
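
If you want to experiment while reading, here is the flavor of a with query, posted to the REST Cypher endpoint Neo4j shipped at the time (node labels and relationship types are invented):

    import json
    import requests

    # WITH pipes one query part into the next, letting a WHERE clause
    # filter on an aggregate computed upstream.
    CYPHER = """
    START doc=node(*)
    MATCH doc<-[:COMMENTED_ON]-comment
    WITH doc, count(comment) AS comments
    WHERE comments > 3
    RETURN doc, comments
    """

    resp = requests.post("http://localhost:7474/db/data/cypher",
                         data=json.dumps({"query": CYPHER}),
                         headers={"Content-Type": "application/json"})
    print(resp.json())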

BTW, as an added bonus, Wes blogs about chess as well.

prefix.cc

Filed under: Namespace,RDF,RDFa — Patrick Durusau @ 10:35 am

prefix.cc: namespace lookup for RDF developers

From the about page:

The intention of this service is to simplify a common task in the work of RDF developers: remembering and looking up URI prefixes.

You can look up prefixes from the search box on the homepage, or directly by typing URLs into your browser bar, such as http://prefix.cc/foaf or http://prefix.cc/foaf,dc,owl.ttl.

New prefix mappings can be added by anyone. If multiple conflicting URIs are submitted for the same namespace, visitors can vote for the one they consider the best. You are only allowed one vote or namespace submission per day.
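
Programmatic lookup works too, using the URL pattern from the quote above:

    import requests

    # Fetch prefix declarations for several prefixes at once, as Turtle.
    resp = requests.get("http://prefix.cc/foaf,dc,owl.ttl")
    print(resp.text)  # @prefix foaf: <http://xmlns.com/foaf/0.1/> . ...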

For n3, xml, rdfa, sparql, the result interface shows the URI prefixes in use.

But if there is more than one URI prefix, different URI prefixes appear with each example.

Don’t multiple, conflicting URI prefixes seem problematic to you?

Sinking Data to Neo4j from Hadoop with Cascading

Filed under: Cascading,Hadoop,Neo4j — Patrick Durusau @ 10:18 am

Sinking Data to Neo4j from Hadoop with Cascading by Paul Ingles.

From the post:

Recently, I worked with a colleague (Paul Lam, aka @Quantisan) on building a connector library to let Cascading interoperate with Neo4j: cascading.neo4j. Paul had been experimenting with Neo4j and Cypher to explore our data through graphs and we wanted an easy way to flow our existing data on Hadoop into Neo4j.

The data processing pipeline we’ve been growing at uSwitch.com is built around Cascalog, Hive, Hadoop and Kafka.

Once the data has been aggregated and stored, a lot of our ETL is performed upon Cascalog and, by extension, Cascading. Querying/analysis is a mix of Cascalog and Hive. This layer is built upon our long-term data storage system: Hadoop; this, all combined, lets us store high-resolution data immutably at a much lower cost than uSwitch’s previous platform.

As Paul notes later in his post, this isn’t a fast solution, about 20,000 nodes a second.

But if that fits your requirements, could be a good place to start.

A Concise Course in Algebraic Topology

Filed under: Algebra,Mathematics,Topology — Patrick Durusau @ 10:05 am

A Concise Course in Algebraic Topology by J.P. May. (PDF)

From the introduction:

The first year graduate program in mathematics at the University of Chicago consists of three three-quarter courses, in analysis, algebra, and topology. The first two quarters of the topology sequence focus on manifold theory and differential geometry, including differential forms and, usually, a glimpse of de Rham cohomology. The third quarter focuses on algebraic topology. I have been teaching the third quarter off and on since around 1970. Before that, the topologists, including me, thought that it would be impossible to squeeze a serious introduction to algebraic topology into a one quarter course, but we were overruled by the analysts and algebraists, who felt that it was unacceptable for graduate students to obtain their PhDs without having some contact with algebraic topology.

This raises a conundrum. A large number of students at Chicago go into topology, algebraic and geometric. The introductory course should lay the foundations for their later work, but it should also be viable as an introduction to the subject suitable for those going into other branches of mathematics. These notes reflect my efforts to organize the foundations of algebraic topology in a way that caters to both pedagogical goals. There are evident defects from both points of view. A treatment more closely attuned to the needs of algebraic geometers and analysts would include Čech cohomology on the one hand and de Rham cohomology and perhaps Morse homology on the other. A treatment more closely attuned to the needs of algebraic topologists would include spectral sequences and an array of calculations with them. In the end, the overriding pedagogical goal has been the introduction of basic ideas and methods of thought.

Tough sledding, but insights like those found in the GraphLab project require a deeper than usual understanding of the issues at hand.

I first saw this in a tweet by Topology Fact.

March 11, 2013

GraphLab: A Distributed Abstraction…

Filed under: Graph Partitioning,GraphLab,Graphs,Networks — Patrick Durusau @ 7:32 pm

GraphLab: A Distributed Abstraction for Machine Learning in the Cloud by Carlos Guestrin. (video)

Take away line: “How does a vertex think?”
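
To make that line concrete, here is a toy gather-apply-scatter (GAS) PageRank step written the way a GraphLab vertex “thinks”. Plain Python for illustration, not the GraphLab API:

    def gather(neighbor_rank, neighbor_out_degree):
        # Each in-neighbor contributes its share of rank.
        return neighbor_rank / neighbor_out_degree

    def apply(gathered_sum, damping=0.85):
        # Fold the gathered contributions into a new rank.
        return (1 - damping) + damping * gathered_sum

    def scatter(old_rank, new_rank, tolerance=1e-4):
        # Reschedule out-neighbors only if the rank moved enough.
        return abs(new_rank - old_rank) > tolerance

    # One update: two in-neighbors with ranks 1.0 and 0.5 and
    # out-degrees 4 and 2 (numbers invented).
    new_rank = apply(gather(1.0, 4) + gather(0.5, 2))
    print(new_rank, scatter(1.0, new_rank))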

Deeply impressive presentation and performance numbers!

Resources:

GraphLab 2.1: http://graphlab.org

GraphChi 0.1: http://graphchi.org

Slides from the talk.

This needs to be very high on your graph reading/following list.

I first saw this at: SFBayACM talk: GraphLab framework for Machine Learning in the Cloud.

Reco4j

Filed under: Graphs,Neo4j,Recommendation — Patrick Durusau @ 7:26 pm

Reco4j

Reco4j is an open source project that aims at developing a recommendation framework based on graph data sources. We chose graph databases for several reasons. They are NoSQL databases that are “schemaless”. This means that it is possible to extend the basic data structure with intermediate information, i.e. similarity values between items and so on. Moreover, since all information is expressed with properties, nodes and relations, the recommendation process can be customized to work on every graph.
Indeed Reco4j can be used on every graph where “user” and “item” are represented by nodes and the preferences are modelled as relationships between them.

The current implementation leverages Neo4j as the first example of a graph database integrated in our framework.

The main features of Reco4j are:

  1. Performance: leveraging the graph database and storing information in it for future retrieval, it produces fast recommendations even after a system restart;
  2. Use of network structure: integrating the simple recommendation algorithms with (social) network analysis;
  3. General purpose: it can be used with preexisting databases;
  4. Customizability: by editing the properties file, the recommender framework can be adapted to the current graph structure and use several types of recommendation algorithms;
  5. Ready for Cloud, leveraging on the graph database cloud features the recommendation process can be splitted on several nodes.

Just in case you don’t like the recommendations you get from Amazon. 😉

BTW, “splitted” is an archaic past tense form of split. (According to Merriam-Webster.)

Say rather “…the recommendation process can be split onto several nodes.”

Studying PubMed usages in the field…

Filed under: Curation,Interface Research/Design,Searching,Usability,Users — Patrick Durusau @ 4:24 pm

Studying PubMed usages in the field for complex problem solving: Implications for tool design by Barbara Mirel, Jennifer Steiner Tonks, Jean Song, Fan Meng, Weijian Xuan, Rafiqa Ameziane. (Mirel, B., Tonks, J. S., Song, J., Meng, F., Xuan, W. and Ameziane, R. (2013), Studying PubMed usages in the field for complex problem solving: Implications for tool design. J. Am. Soc. Inf. Sci.. doi: 10.1002/asi.22796)

Abstract:

Many recent studies on MEDLINE-based information seeking have shed light on scientists’ behaviors and associated tool innovations that may improve efficiency and effectiveness. Few, if any, studies, however, examine scientists’ problem-solving uses of PubMed in actual contexts of work and corresponding needs for better tool support. Addressing this gap, we conducted a field study of novice scientists (14 upper-level undergraduate majors in molecular biology) as they engaged in a problem-solving activity with PubMed in a laboratory setting. Findings reveal many common stages and patterns of information seeking across users as well as variations, especially variations in cognitive search styles. Based on these findings, we suggest tool improvements that both confirm and qualify many results found in other recent studies. Our findings highlight the need to use results from context-rich studies to inform decisions in tool design about when to offer improved features to users.

From the introduction:

For example, our findings confirm that additional conceptual information integrated into retrieved results could expedite getting to relevance. Yet—as a qualification—evidence from our field cases suggests that presentations of this information need to be strategically apportioned and staged or they may inadvertently become counterproductive due to cognitive overload.

Curated data raises its ugly head, again.

Topic maps curate data and search results.

Search engines don’t curate data or search results.

How important is it for your doctor to find the right answers? In a timely manner?

The Annotation-enriched non-redundant patent sequence databases [Curation vs. Search]

Filed under: Bioinformatics,Biomedical,Marketing,Medical Informatics,Patents,Topic Maps — Patrick Durusau @ 2:01 pm

The Annotation-enriched non-redundant patent sequence databases by Weizhong Li, Bartosz Kondratowicz, Hamish McWilliam, Stephane Nauche and Rodrigo Lopez.

Not a really promising title, is it? 😉 The reason I cite it here is that, by curation, the database is “non-redundant.”

Try searching for some of these sequences at the USPTO and compare the results.

The power of curation will be immediately obvious.

Abstract:

The EMBL-European Bioinformatics Institute (EMBL-EBI) offers public access to patent sequence data, providing a valuable service to the intellectual property and scientific communities. The non-redundant (NR) patent sequence databases comprise two-level nucleotide and protein sequence clusters (NRNL1, NRNL2, NRPL1 and NRPL2) based on sequence identity (level-1) and patent family (level-2). Annotation from the source entries in these databases is merged and enhanced with additional information from the patent literature and biological context. Corrections in patent publication numbers, kind-codes and patent equivalents significantly improve the data quality. Data are available through various user interfaces including web browser, downloads via FTP, SRS, Dbfetch and EBI-Search. Sequence similarity/homology searches against the databases are available using BLAST, FASTA and PSI-Search. In this article, we describe the data collection and annotation and also outline major changes and improvements introduced since 2009. Apart from data growth, these changes include additional annotation for singleton clusters, the identifier versioning for tracking entry change and the entry mappings between the two-level databases.

Database URL: http://www.ebi.ac.uk/patentdata/nr/

Topic maps are curated data. Which one do you prefer?

Microsoft Goes After 3 Big Data Myths

Filed under: BigData,Design — Patrick Durusau @ 2:00 pm

Microsoft Goes After 3 Big Data Myths by Jeff Bertolucci.

It’s Jeff’s coverage of the second myth that I want to mention:

The second myth, Microsoft said, pertains to the looming data scientist shortage: Enterprises can’t find enough qualified big data gurus to pull insights from unstructured information sources, such as social media feeds and machine sensor data.

“While it is true that the industry needs more data scientists, it is equally true that most organizations are equipped with the employees they need today to help them gather the valuable insights from their data that will better their business,” writes Kelly.

In other words, big data tools and apps can save the day. Microsoft’s argument ties in with the so-called democratization of data movement. Popular tools, such as Excel with the Data Explorer add-in, allow end users to perform BI analysis without having to pester IT for help.

Isn’t that similar to the difference between being able to use MS Word and being an author?

I know lots of people who can do one but not the other.

The danger in the Microsoft argument: staff on payroll performing poorly at BI analysis isn’t a line item in the budget. Lost opportunities never are.

On the other hand, getting competent help that uses Microsoft or other data analytic tools is a line item in the budget.

Managers may be tempted in budget-conscious times to opt for the no-budget-line-item option.

Consider that carefully; the opportunities you lose may be your own.


Update: A better example: using MS PowerPoint does not make you a presenter. We have all sat through death-by-PowerPoint presentations and will again.

NewGenLib FOSS Library Management System [March 15th, 2013 Webinar]

Filed under: Library,Library software — Patrick Durusau @ 1:28 pm

NewGenLib FOSS Library Management System

From the post:

EIFL-FOSS is organising a free webinar on NewGenLib (NGL), an open-source Library Management System (ILS). The event will take place this coming Friday, March 15th, 2013 at 09.00-10.00 GMT / UK time (10.00-11.00 CET / Rome, Italy). The session is open to anyone to attend but places are limited, so registration is recommended.

NGL, an outcome of collaboration between Verus and Kesavan Institute of Information and Knowledge management, has been implemented in over 30 countries in at least 4 different languages supporting fully international library metadata standards. The software runs on Windows or Linux and is designed to work equally well in one single library as it does across a dispersed network of libraries.

URL for more info: http://www.eifl.net/events/newgenlib-ils-windows-and-linux-free-webina

As you already know, there is no shortage of vendor-based and open source library information systems.

That diversity is an opportunity to show how topic maps can make distinct systems appear as one, while retaining their separate character.

Big Bang Meets Big Data

Filed under: Astroinformatics,BigData — Patrick Durusau @ 1:10 pm

Big Bang Meets Big Data

From the post:

Pretoria, South Africa, March 11, 2013: Square Kilometer Array (SKA) South Africa, a business unit of the country’s National Research Foundation, is joining ASTRON, the Netherlands Institute for Radio Astronomy, and IBM in a four-year collaboration to research extremely fast, but low-power exascale computer systems aimed at developing advanced technologies for handling the massive amount of data that will be produced by the SKA, which is one of the most ambitious science projects ever undertaken.

The SKA is an international effort to build the world’s largest and most sensitive radio telescope, which is to be located in Southern Africa and Australia to help better understand the history of the universe. The project constitutes the ultimate Big Data challenge, and scientists must produce major advances in computing to deal with it. The impact of those advances will be felt far beyond the SKA project-helping to usher in a new era of computing, which IBM calls the era of cognitive systems.

FLICKR IMAGES: http://www.flickr.com/photos/ibm_research_zurich/sets/72157629212636619
VIDEO: https://www.youtube.com/watch?v=zU7KNRpn6co

When the SKA is completed, it will collect Big Data from deep space containing information dating back to the Big Bang more than 13 billion years ago. The aperture arrays and dishes of the SKA will produce 10 times the global internet traffic*, but the power to process all of this data as it is collected far exceeds the capabilities of the current state-of-the-art technology.

Just in case you are interested in “big data” writ large. 😉

There will be legacy data from optical and radio astronomy instruments, to say nothing of the astronomical literature, to curate alongside this data tsunami.

Spatial Orientation and the Brain:… [Uni-Sex Data Navigation?]

Filed under: Interface Research/Design,Navigation,Usability,Users — Patrick Durusau @ 12:05 pm

Spatial Orientation and the Brain: The Effects of Map Reading and Navigation by Rebecca Maxwell.

From the post:

The human brain is a remarkable organ. It has the ability to reason, create, analyze, and process tons of information each day. The brain also gives humans the ability to move around in an environment using an innate sense of direction. This skill is called spatial orientation, and it is especially useful for finding routes in an unfamiliar place, following directions to another person’s house, or making a midnight raid of the refrigerator in the dark. Spatial orientation is crucial for adapting to new environments and getting from one point to another. Without it, people will walk around in endless circles, never being able find which way they want to go.

The brain has a specialized region just for navigating the spatial environment. This structure is called the hippocampus, also known as the map reader of the brain. The hippocampus helps individuals determine where they are, how they got to that particular place, and how to navigate to the next destination. Reading maps and developing navigational skills can affect the brain in beneficial ways. In fact, using orientation and navigational skills often can actually cause the hippocampus and the brain to grow, forming more neural pathways as the number of mental maps increase.

A study by scientists at University College in London found that grey matter in the brains of taxi drivers grew and adapted to help them store detailed mental maps of the city. The drivers underwent MRI scans, and those scans showed that the taxi drivers have larger hippocampi when compared to other people. In addition, the scientists found that the more time the drivers spent on the job, the more the hippocampus changes structurally to accommodate the large amount of navigational experience. Drivers who spent more than forty years in a taxi had more developed hippocampi than those just starting out. The study shows that experience with the spatial environment and navigation can have a direct influence on the brain itself.

However, the use of modern navigational technology and smartphone apps has the potential to harm the brain depending on how it is used in today’s world. Map reading and orienteering are becoming lost arts in the world of global positioning systems and other geospatial technologies. As a result, more and more people are losing the ability to navigate and find their way in unfamiliar terrain. According to the BBC, police in northern Scotland issued an appeal for hikers to learn orienteering skills rather than relying solely on smartphones for navigation. This came after repeated rescues of lost hikers by police in Grampian, one of which included finding fourteen people using mountain rescue teams and a helicopter. The police stated that the growing use of smartphone apps for navigation can lead to trouble because people become too dependent on technology without understanding the tangible world around them.

….

Other studies demonstrate that men and women develop different methods of navigating and orienting themselves to the spatial environment because of differences in roles as hunters and gatherers. This could explain the reason why men get lost in supermarkets while women can find their way around in minutes. Research done at Queen Mary, University of London demonstrated that men are better at finding hidden objects while women are better at remembering where objects are at. In addition, Frank Furedi, a sociology professor at Kent University, states that women are better at making judgment calls while men tend to overcomplicate the most basic navigational tasks.

The use of map reading and navigating skills to explore the spatial environment can benefit the brain and cause certain areas to grow while the use of modern technology for navigation seems to only hinder the brain. No matter which strategy men and women use for navigation, it is important to practice those skills and tune into the environment. While technology is a useful tool, in the end the human brain remains the most sophisticated map reader.

Very interesting post on the impact of GIS systems on the human brain and gender differences in methods of navigation.

Question: Gender differences in navigation are more than folktales, so why do we have uni-sex data navigation interfaces?

Onomastics 2.0 – The Power of Social Co-Occurrences

Filed under: co-occurrence,Names,Onomastics,Subject Identity — Patrick Durusau @ 6:45 am

Onomastics 2.0 – The Power of Social Co-Occurrences by Folke Mitzlaff, Gerd Stumme.

Abstract:

Onomastics is “the science or study of the origin and forms of proper names of persons or places.” [“Onomastics”. Merriam-Webster.com, 2013. this http URL (11 February 2013)]. Especially personal names play an important role in daily life, as all over the world future parents are facing the task of finding a suitable given name for their child. This choice is influenced by different factors, such as the social context, language, cultural background and, in particular, personal taste.

With the rise of the Social Web and its applications, users more and more interact digitally and participate in the creation of heterogeneous, distributed, collaborative data collections. These sources of data also reflect current and new naming trends as well as new emerging interrelations among names.

The present work shows how basic approaches from the field of social network analysis and information retrieval can be applied for discovering relations among names, thus extending Onomastics by data mining techniques. The considered approach starts with building co-occurrence graphs relative to data from the Social Web, respectively for given names and city names. As a main result, correlations between semantically grounded similarities among names (e.g., geographical distance for city names) and structural graph based similarities are observed.

The discovered relations among given names are the foundation of “nameling” [this http URL], a search engine and academic research platform for given names which attracted more than 30,000 users within four months, underpinning the relevance of the proposed methodology.

Interesting work on the co-occurrence of names.
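
The basic mechanics are simple enough to sketch: count how often names appear together, then treat the counts as weighted edges. A toy version with invented records:

    from collections import Counter
    from itertools import combinations

    # Hypothetical records in which given names co-occur.
    records = [
        ["anna", "maria", "sofia"],
        ["anna", "sofia"],
        ["maria", "sofia", "lena"],
    ]

    # Undirected co-occurrence counts, the raw edges of the name graph.
    cooc = Counter()
    for names in records:
        for a, b in combinations(sorted(set(names)), 2):
            cooc[(a, b)] += 1

    print(cooc.most_common(3))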

Chosen names in this case but I wonder if the same would be true for false names?

Are there patterns to false names chosen by actors who are attempting to conceal their identities?

I first saw this in a tweet by Stefano Bertolo.

Learning Hash Functions Using Column Generation

Filed under: Hashing,Indexing,Similarity — Patrick Durusau @ 6:11 am

Learning Hash Functions Using Column Generation by Xi Li, Guosheng Lin, Chunhua Shen, Anton van den Hengel, Anthony Dick.

Abstract:

Fast nearest neighbor searching is becoming an increasingly important tool in solving many large-scale problems. Recently a number of approaches to learning data-dependent hash functions have been developed. In this work, we propose a column generation based method for learning data-dependent hash functions on the basis of proximity comparison information. Given a set of triplets that encode the pairwise proximity comparison information, our method learns hash functions that preserve the relative comparison relationships in the data as well as possible within the large-margin learning framework. The learning procedure is implemented using column generation and hence is named CGHash. At each iteration of the column generation procedure, the best hash function is selected. Unlike most other hashing methods, our method generalizes to new data points naturally; and has a training objective which is convex, thus ensuring that the global optimum can be identified. Experiments demonstrate that the proposed method learns compact binary codes and that its retrieval performance compares favorably with state-of-the-art methods when tested on a few benchmark datasets.

Interesting review of hashing techniques.

Raises the question of customized similarity (read sameness) hashing algorithms for topic maps.
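
CGHash itself is more than a blog snippet, but the underlying idea of similarity-preserving hashing is easy to sketch. Here is random-hyperplane LSH, which maps nearby vectors to binary codes with small Hamming distance (sizes and data invented):

    import numpy as np

    rng = np.random.RandomState(42)

    def lsh_code(x, planes):
        # One bit per hyperplane: which side does x fall on?
        return (planes @ x > 0).astype(int)

    planes = rng.randn(16, 8)       # 16-bit codes for 8-d vectors
    a = rng.randn(8)
    b = a + 0.01 * rng.randn(8)     # a near-duplicate of a

    # Near-duplicates should disagree in few (often zero) bits.
    print(np.sum(lsh_code(a, planes) != lsh_code(b, planes)))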

I first saw this in a tweet by Stefano Bertolo.

March 10, 2013

Why Data Lineage is Your Secret … Weapon [Auditing Topic Maps]

Filed under: Data Quality,Merging,Provenance — Patrick Durusau @ 8:42 pm

Why Data Lineage is Your Secret Data Quality Weapon by Dylan Jones.

From the post:

Data lineage means many things to many people but it essentially refers to provenance – how do you prove where your data comes from?

It’s really a simple exercise. Just pull an imaginary string of data from where the information presents itself, back through the labyrinth of data stores and processing chains, until you can go no further.

I’m constantly amazed by why so few organisations practice sound data lineage management despite having fairly mature data quality or even data governance programs. On a side note, if ever there was a justification for the importance of data lineage management then just take a look at the brand damage caused by the recent European horse meat scandal.

But I digress. Why is data lineage your secret data quality weapon?

The simple answer is that data lineage forces your organisation to address two big issues that become all too apparent:

  • Lack of ownership
  • Lack of formal information chain design

Or to put it into a topic map context, can you trace what topics merged to create the topic you are now viewing?

And if you can’t trace, how can you audit the merging of topics?

And if you can’t audit, how do you determine the reliability of your topic map?

That is reliability in terms of date (freshness), source (reliable or not), evaluation (by screeners), comparison (to other sources), etc.
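
One way to make merges auditable is to never throw the inputs away. A toy sketch (data structures invented):

    # Record lineage when two topics merge, so the merge can be audited.
    def merge(a, b):
        return {
            "names": sorted(set(a["names"]) | set(b["names"])),
            "lineage": [a, b],  # full ancestry, recursively
        }

    t1 = {"names": ["W3C"], "lineage": [], "source": "site-crawl"}
    t2 = {"names": ["World Wide Web Consortium"], "lineage": [],
          "source": "dbpedia"}
    print(merge(t1, t2)["lineage"])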

Same questions apply to all data aggregation systems.

Or as Mrs. Weasley tells Ginny:

“Never trust anything that can think for itself if you can’t see where it keeps its brain.”


Correction: Wesley -> Weasley. We had a minister friend over Sunday and were discussing the former, not the latter. 😉

Cool GSS training video! And cumulative file 1972-2012!

Filed under: Dataset,Social Sciences,Socioeconomic Data — Patrick Durusau @ 8:42 pm

Cool GSS training video! And cumulative file 1972-2012! by Andrew Gelman.

From the post:

Felipe Osorio made the above video to help people use the General Social Survey and R to answer research questions in social science. Go for it!

From the GSS: General Social Survey website:

The General Social Survey (GSS) conducts basic scientific research on the structure and development of American society with a data-collection program designed to both monitor societal change within the United States and to compare the United States to other nations.

The GSS contains a standard ‘core’ of demographic, behavioral, and attitudinal questions, plus topics of special interest. Many of the core questions have remained unchanged since 1972 to facilitate time-trend studies as well as replication of earlier findings. The GSS takes the pulse of America, and is a unique and valuable resource. It has tracked the opinions of Americans over the last four decades.

The information “gap” is becoming more of a matter of skill than access to underlying data.

How would you match the GSS data up to other data sets?

Interviewing Databases???

Filed under: News,Reporting — Patrick Durusau @ 8:42 pm

“We’re going to tell people how to interview databases”: The rise of data (big and small) in journalism

Caroline O’Donovan writes:

Viktor Mayer-Schönberger and Kenneth Cukier published their joint tome on big data this week, Big Data: A Revolution That Will Transform How We Live, Work and Think. Mayer-Schönberger, a professor of Internet governance and regulation at Oxford, and Cukier, the data editor of The Economist, argue that having access to vast amounts of data will soon overwhelm our natural human tendency to look for correlation and causality where there is none. In the near future, we’ll be able to rely on much larger pools of “messy” data rather than small pools of “clean” data to get more accurate answers to our questions.

“We are taking things we never thought of as informational and rendering them in data,” Mayer-Schönberger said in a talk Wednesday at the Berkman Center for Internet & Society at Harvard. “Once we think of it as data, we can organize it and extract new information.”

In their book, Mayer-Schönberger and Cukier give a number of examples of industries that will be changed forever by the new messiness of data. Bradford Cross cofounded FlightCaster.com, which predicted U.S. flight delays using data about flight times and weather patterns. The company was sold in 2011, at which point “Cross turned his sights on another aging industry.” He started Prismatic, one of a number of news aggregators that filters content for users by analyzing data about sharing frequency on social networks and user preferences.

Caroline quotes Cukier on “interviewing databases,” saying:

When we teach journalism in the future, we’re not just going to teach people the fundamentals of how to do an interview, or what a lede paragraph is. We’re going to tell people how to interview databases. And also, just as we train journalists by telling them that sometimes people that we interview are unfaithful and lie, we’re going to have to teach them to be suspicious of the data, because sometimes the data lies, too. You have to bring the same scrutiny as in the analog world — talking to people and observing — to the data as well.

I like the image of interviewing a database.

How many times do you think a database will be asked the same questions by different reporters?

Do you think recording and sharing those answers would save other reporters time and resources?

How about enabling other reporters to ask questions you forgot or didn’t know enough to ask?

If any of that rings a bell, there may be topic maps in your future.

Solr: Custom Ranking with Function Queries

Filed under: Lucene,Ranking,Solr — Patrick Durusau @ 8:42 pm

Solr: Custom Ranking with Function Queries by Sujit Pal.

From the post:

Solr has had support for Function Queries since version 3.1, but before sometime last week, I did not have a use for it. Which is probably why when I would read about Function Queries, they would seem like a nice idea, but not interesting enough to pursue further.

Most people get introduced to Function Queries through the bf parameter in the DisMax Query Parser or through the geodist function in Spatial Search. So far, I haven’t had the opportunity to personally use either feature in a real application. My introduction to Function Queries was through a problem posed to me by one of my coworkers.

The problem was as follows. We want to be able to customize our search results based on what a (logged-in) user tells us about himself or herself via their profile. This could be gender, age, ethnicity and a variety of other things. On the content side, we can annotate the document with various features corresponding to these profile features. For example, we can assign a score to a document that indicates its appeal/information value to males versus females that would correspond to the profile’s gender.

So the idea is that if we know that the profile is male, we should boost the documents that have a high male appeal score and deboost the ones that have a high female appeal score, and vice versa if the profile is female. This idea can be easily extended for multi-category features such as ethnicity as well. In this post, I will describe a possible implementation that uses Function Queries to rerank search results using male/female appeal document scores.
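
Mechanically, this can be as simple as picking a boost function per profile. A hedged sketch against a local Solr (field names invented; a bare field name is itself a valid function query for the bf parameter):

    import requests

    profile_gender = "male"
    boost_field = {"male": "male_appeal",
                   "female": "female_appeal"}[profile_gender]

    params = {
        "q": "retirement planning",
        "defType": "dismax",
        "bf": boost_field,   # adds the field's value to the score
        "fl": "id,score",
        "wt": "json",
    }
    resp = requests.get("http://localhost:8983/solr/select", params=params)
    print(resp.json())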

Does your topic map deliver information based on user characteristics?

Have you re-invented the ranking or are you using an off-the-shelf solution?

Programming Isn’t Math

Filed under: Algebird,Scala,Scalding,Tweets — Patrick Durusau @ 8:41 pm

Programming Isn’t Math by Oscar Boykin.

From the description:

Functional programming has a rich history of drawing from mathematical theory, yet in this highly entertaining talk from the Northeast Scala Symposium, Twitter data scientist Oscar Boykin makes the case that programming is distinct from mathematics. This distinction is healthy and does not mean we can’t leverage many results and concepts from mathematics.

As examples, Oscar will discuss some recent work — algebird, bijection, scalding — and show cases where mathematical purity was both helpful and harmful to developing products at Twitter.

The phrase “…highly entertaining…” may be an understatement.

The type of presentation where you want to start reading new material during the presentation but you are afraid of missing the next gold nugget!

Definitely one to start the week on!

