Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

July 14, 2013

Looking ahead [Exploratory Merging?]

Filed under: Interface Research/Design,Merging,Searching — Patrick Durusau @ 6:31 pm

Looking ahead by Gene Golovchinsky.

From the post:

It is reasonably well-known that people who examine search results often don’t go past the first few hits, perhaps stopping at the “fold” or at the end of the first page. It’s a habit we’ve acquired due to high-quality results to precision-oriented information needs. Google has trained us well.

But this habit may not always be useful when confronted with uncommon, recall-oriented, information needs. That is, when doing research. Looking only at the top few documents places too much trust in the ranking algorithm. In our SIGIR 2013 paper, we investigated what happens when a light-weight preview mechanism gives searchers a glimpse at the distribution of documents — new, re-retrieved but not seen, and seen — in the query they are about to execute.

The preview divides the top 100 documents retrieved by a query into 10 bins, and builds a stacked bar chart that represents the three categories of documents. Each category is represented by a color. New documents are shown in teal, re-retrieved ones in the light blue shade, and documents the searcher has already seen in dark blue. The figures below show some examples:

(…)
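As a rough sketch of the bookkeeping behind such a preview (my own reconstruction, not the authors' code), binning the top 100 results and classifying each against what the searcher has already retrieved or seen might look like this:

# Sketch of a query preview: classify the top 100 results for a pending
# query as new, re-retrieved-but-not-seen, or already seen, then count
# each category in ten bins of ten results apiece.
def preview_bins(ranked_doc_ids, retrieved_before, seen_before, bins=10):
    bin_size = max(1, len(ranked_doc_ids) // bins)
    counts = []
    for b in range(bins):
        chunk = ranked_doc_ids[b * bin_size:(b + 1) * bin_size]
        new = sum(1 for d in chunk
                  if d not in retrieved_before and d not in seen_before)
        rere = sum(1 for d in chunk
                   if d in retrieved_before and d not in seen_before)
        seen = sum(1 for d in chunk if d in seen_before)
        counts.append({"new": new, "re-retrieved": rere, "seen": seen})
    return counts

# A stacked bar chart would plot one bar per entry in counts.
top100 = ["doc%03d" % i for i in range(100)]
print(preview_bins(top100, retrieved_before={"doc001", "doc005"},
                   seen_before={"doc000"}))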

The blog post is great but you really need to read the SIGIR paper in full.

Speaking of exploratory searching, is anyone working on exploratory merging?

That is, where a query containing a searcher's statement of synonymy or polysemy results in exploratory merging of topics?

I am assuming that experts in a particular domain will see merging opportunities that eluded automatic processes.

Seems like a shame to waste their expertise, which could be captured to improve a topic map for future users.


The SIGIR paper:

Looking Ahead: Query Preview in Exploratory Search

Abstract:

Exploratory search is a complex, iterative information seeking activity that involves running multiple queries, finding and examining many documents. We introduced a query preview interface that visualizes the distribution of newly-retrieved and re-retrieved documents prior to showing the detailed query results. When evaluating the preview control with a control condition, we found effects on both people’s information seeking behavior and improved retrieval performance. People spent more time formulating a query and were more likely to explore search results more deeply, retrieved a more diverse set of documents, and found more different relevant documents when using the preview. With more time spent on query formulation, higher quality queries were produced and as consequence the retrieval results improved; both average residual precision and recall was higher with the query preview present.

Sherlock’s Last Case

Filed under: Erlang,Searching,Similarity — Patrick Durusau @ 1:03 pm

Sherlock’s Last Case by Joe Armstrong.

Joe states the Sherlock problem as: given one X and millions of Yi's, "Which Yi is nearer to X?"

For some measure of “nearer,” or as we prefer, similarity.

One solution is given in Programming Erlang: Software for a Concurrent World, 2nd ed., 2013, by Joe Armstrong.

Joe describes two possibly better solutions in this lecture.

Great lecture even if he omits a fundamental weakness in TF-IDF.

From the Wikipedia entry:

Suppose we have a set of English text documents and wish to determine which document is most relevant to the query “the brown cow”. A simple way to start out is by eliminating documents that do not contain all three words “the”, “brown”, and “cow”, but this still leaves many documents. To further distinguish them, we might count the number of times each term occurs in each document and sum them all together; the number of times a term occurs in a document is called its term frequency.

However, because the term “the” is so common, this will tend to incorrectly emphasize documents which happen to use the word “the” more frequently, without giving enough weight to the more meaningful terms “brown” and “cow”. The term “the” is not a good keyword to distinguish relevant and non-relevant documents and terms, unlike the less common words “brown” and “cow”. Hence an inverse document frequency factor is incorporated which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.

For example, TF-IDF would not find a document with “the brown heifer,” for a query of “the brown cow.”
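A minimal TF-IDF scorer (a sketch of the idea, not any particular library's implementation) makes the point concrete:

# Minimal TF-IDF sketch (illustration only). A document about
# "the brown heifer" scores zero for the query term "cow" because
# TF-IDF matches surface terms, not meanings.
import math
from collections import Counter

docs = {
    "d1": "the brown cow grazed in the brown field".split(),
    "d2": "the brown heifer grazed in the field".split(),
}

def tf_idf(term, doc_id):
    tf = Counter(docs[doc_id])[term]
    df = sum(1 for terms in docs.values() if term in terms)
    idf = math.log(len(docs) / df) if df else 0.0
    return tf * idf

def score(query, doc_id):
    return sum(tf_idf(t, doc_id) for t in query.split())

for d in docs:
    print(d, round(score("the brown cow", d), 3))
# "cow" contributes only to d1; d2 scores zero even though a human
# would judge it relevant to the query.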

TF-IDF does not account for relationships between terms, such as synonymy or polysemy.

Juan Ramos states as much in describing the limitations of TF-IDF in: Using TF-IDF to Determine Word Relevance in Document Queries:

Despite its strength, TF-IDF has its limitations. In terms of synonyms, notice that TF-IDF does not make the jump to the relationship between words. Going back to (Berger & Lafferty, 1999), if the user wanted to find information about, say, the word ‘priest’, TF-IDF would not consider documents that might be relevant to the query but instead use the word ‘reverend’. In our own experiment, TF-IDF could not equate the word ‘drug’ with its plural ‘drugs’, categorizing each instead as separate words and slightly decreasing the word’s wd value. For large document collections, this could present an escalating problem.

Ramos cites Information Retrieval as Statistical Translation by Adam Berger and John Lafferty to support his comments on synonymy or polysemy.

Berger and Lafferty treat synonymy and polysemy, issues that TF-IDF misses, as statistical translation problems:

Ultimately, document retrieval systems must be sophisticated enough to handle polysemy and synonymy: to know, for instance, that pontiff and pope are related terms. The field of statistical translation concerns itself with how to mine large text databases to automatically discover such semantic relations. Brown et al. [3, 4] showed, for instance, how a system can learn to associate French terms with their English translations, given only a collection of bilingual French-English sentences. We shall demonstrate how, in a similar fashion, an IR system can, from a collection of documents, automatically learn which terms are related and exploit these relations to better find and rank the documents it returns to the user.

Merging powered by the results of statistical translation?

The Berger and Lafferty paper is more than a decade old so I will be running the research forward.

July 13, 2013

Working examples for the ‘Graph Databases’ book

Filed under: Graphs,Neo4j — Patrick Durusau @ 4:21 pm

Working examples for the ‘Graph Databases’ book by Joerg Baach.

From the post:

The examples in the ‘Graph Databases’ book don’t work out of the box. I’ve modified them, so that they do work (for chapter 3, that is).

In the version I have, 2013-02-25, the examples in question occur in chapter 4.

But whichever chapter, the corrections are welcome news.

BTW, there are other chapters that probably need the same treatment.

Transforming Log Events into Information

Filed under: ElasticSearch,Functional Programming,Log Analysis — Patrick Durusau @ 4:08 pm

Transforming Log Events into Information by Matthias Nehlsen.

From the post:

Last week I was dealing with an odd behavior of the chat application demo I was running for this article. The issue was timing-related and there were no actual exceptions that would have helped in identifying the problem. How are you going to even notice spikes and pauses in potentially thousands of lines in a logfile? I was upset, mostly with myself for not finding the issue earlier, and I promised myself to find a better tool. I needed a way to transform the raw logging data into useful information so I could first understand and then tackle the problem. In this article I will show what I have put together over the weekend. Part I describes the general approach and applies to any application out there, no matter what language or framework you are using. Part II describes one possible implementation of this approach using Play Framework.

Starting point for transforming selected log events into subjects represented by topics?

Not sure I would want to generate IRIs to identify the events as subjects, particularly since they already have identifiers in the log.

A broader processing model for the TAO should allow for the use of user-defined identifiers.

What is the Latin for: User Beware? 😉
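As a sketch of what I have in mind (field names and structure are mine and purely hypothetical), a log event could carry the logging system's own identifier straight through as the subject identifier, with no minted IRI:

# Sketch: turn a parsed log event into a topic-like record, reusing the
# identifier the logging system already assigned rather than minting an
# IRI. Field names here are hypothetical.
def log_event_to_topic(event):
    return {
        # the log's own id serves as the subject identifier
        "subject_identifiers": [event["event_id"]],
        "type": event.get("level", "INFO"),
        "occurrences": {
            "timestamp": event["timestamp"],
            "message": event["message"],
        },
    }

event = {
    "event_id": "chat-demo-2013-07-08T14:03:22.117Z-0042",
    "level": "WARN",
    "timestamp": "2013-07-08T14:03:22.117Z",
    "message": "frame processing paused for 3200 ms",
}
print(log_event_to_topic(event))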

Fun with Music, Neo4j and Talend

Filed under: Graphs,Neo4j,Talend — Patrick Durusau @ 3:52 pm

Fun with Music, Neo4j and Talend by Rik Van Bruggen.

From the post:

Many of you know that I am a big fan of Belgian beers. But of course I have a number of other hobbies and passions. One of those being: Music. I have played music, created music (although that seems like a very long time ago) and still listen to new music almost every single day. So when sometime in 2006 I heard about this really cool music site called Last.fm, I was one of the early adopters to try use it. So: a good 7 years later and 50k+ scrobbles later, I have quite a bit of data about my musical habits.

On top of that, I have a couple of friends that have been using Last.fm as well. So this got me thinking. What if I was somehow able to get that last.fm data into neo4j, and start “walking the graph”? I am sure that must give me some interesting new musical insights… It almost feels like a “recommendation graph for music” … Let’s see where this brings us.

Usual graph story but made more interesting by the use of Talend ETL tools.

Good opportunity to become familiar with Talend if you don’t know the tools already.

JChemInf volumes as single PDFs

Filed under: Cheminformatics,Serendipity — Patrick Durusau @ 2:52 pm

JChemInf volumes as single PDFs by Egon Willighagen.

From the post:

One of the advantages of a print journal is that you are effectively forced to look at papers which may not have received your attention in the first place. Online journals do not provide such functionality, and you’re stuck with the table of contents, and never see that cool figure from that paper with the boring title.

Of course, the problem is artificial. We have pdftk and we can make PDF of issues, or in the present example, of complete volumes. Handy, I’d say. It saves you from many, many downloads and forces you to scan through all pages. Anyway, I wanted to scan the full JChemInf volumes, and rather have one PDF per volume. So, I created them. And you can get them too. The journal is Open Access after all (CC-BY).

(…)

Egon has links to the Journal of Cheminformatics (as complete volumes), vols. 1 – 4.

He also has a good point about print journals increasing the potential for a chance encounter with unexpected information.

Personalization of search results is a step away from serendipity.

Thoughts on how to step back towards serendipity?

ggplot2 Choropleth of Supreme Court Decisions: A Tutorial

Filed under: Ggplot2,Law,Law - Sources — Patrick Durusau @ 1:34 pm

ggplot2 Choropleth of Supreme Court Decisions: A Tutorial

From the post:

I don't do much GIS but I like to. It's rather enjoyable and involves a tremendous skill set. Often you will find your self grabbing data sets from some site, scraping, data cleaning and reshaping, and graphing. On the ride home from work yesterday I heard an NPR talk about the Supreme Court decisions being very close with this court. This got me wondering if there is a data base with this information and the journey began. This tutorial is purely exploratory but you will learn to:

  1. Grab .zip files from a data base and read into R
  2. Clean data
  3. Reshape data with reshape2
  4. Merge data sets
  5. Plot a choropleth map in ggplot2
  6. Arrange several grid plots with gridExtra

I'm lazy and like a good challenge. I challenged myself to not manually open a file so I downloaded Biobase from bioconductor to open the pdf files for the codebook. Also I used my own package qdap because it had some functions I like and I'm used to using them. This blog post was created in the dev. version of the reports package using the wordpress_rmd template.

Good R practice and an interesting view of Supreme Court cases.

Discovering User’s Models (Instead of Selling One)

Filed under: Cultural Anthropology,Users — Patrick Durusau @ 1:07 pm

Cultural Anthropology/Anthropological Methods (wikibook)

From the homepage:

Ethnography is a qualitative research method used in social sciences like Anthropology where researchers immerse themselves in other cultures for the purpose of recording information about their lifestyle for comparative research.

The built-in semantics of the TAO model (actually of the TMDM) have been discussed recently. Capturing the semantic models of our users is more important than imposing a default model on their data.

How would you react to someone who was trying to sell you a service on the basis that your model for data is obviously inferior to what they are offering?

Not the start of a great sales pitch?

But that is what the Semantic Web and Topic Maps have been pushing. Abandon your current model! Salvation is just a new model away!

Hardly.

I don’t dislike the TAO model. We need a model to start the conversation about the user’s model.

But does every user of topic maps have to march in lock-step with the built-in semantics of the TMDM or can they fashion their own semantics?

A sales pitch that starts, "We can help you capture your data model for preservation/migration and add new capabilities to your existing infrastructure," is a lot less threatening.

What do you think?

Strategies for Effective Teaching…

Filed under: Education,Teaching — Patrick Durusau @ 12:39 pm

Strategies for Effective Teaching: A Handbook for Teaching Assistants – University of Wisconsin – Madison College of Engineering.

From the foreword:

We help our students understand engineering concepts and go beyond the knowledge level to higher levels of thinking. We help them to apply, analyze, and synthesize, to create new knowledge, and solve new problems. So, too, as teachers, we need to recognize our challenge to go beyond knowledge about effective teaching. We need to apply these strategies, analyze what works, and take action to modify or synthesize our learnings to help our students learn in a way that works for us as individuals and teams of teachers.

The learning community consists of both students and teachers. Students benefit from effective teaching and learning strategies inside and outside the classroom. This Handbook focuses on teaching strategies you can use in the classroom to foster effective learning.

Helping students learn is our challenge as teachers. Identifying effective teaching strategies, therefore, is our challenge as we both assess the effectiveness of our current teaching style and consider innovative ways to improve our teaching to match our students’ learning styles.

I mention this as a resource for anyone who is trying to educate others, students, clients or a more general audience about topic maps.

Methods in Biostatistics I [Is Your World Black-or-White?]

Filed under: Biostatistics,Mathematics,Statistics — Patrick Durusau @ 12:24 pm

Methods in Biostatistics I, Johns Hopkins School of Public Health.

From the webpage:

Presents fundamental concepts in applied probability, exploratory data analysis, and statistical inference, focusing on probability and analysis of one and two samples. Topics include discrete and continuous probability models; expectation and variance; central limit theorem; inference, including hypothesis testing and confidence for means, proportions, and counts; maximum likelihood estimation; sample size determinations; elementary non-parametric methods; graphical displays; and data transformations.

If you want more choices than black-or-white for modeling your world, statistics are a required starting point.

July 12, 2013

Rapid hadoop development with progressive testing

Filed under: Hadoop,MapReduce — Patrick Durusau @ 3:45 pm

Rapid hadoop development with progressive testing by Abe Gong.

From the post:

Debugging Hadoop jobs can be a huge pain. The cycle time is slow, and error messages are often uninformative — especially if you’re using Hadoop streaming, or working on EMR.

I once found myself trying to debug a job that took a full six hours to fail. It took more than a week — a whole week! — to find and fix the problem. Of course, I was doing other things at the same time, but the need to constantly check up on the status of the job was a huge drain on my energy and productivity. It was a Very Bad Week.

[Image: crushed by elephant]

Painful experiences like this have taught me to follow a test-driven approach to hadoop development. Whenever I’m working on a new hadoop-based data pipe, my goal is to isolate six distinct kinds of problems that arise in hadoop development.

(…)

See Abe’s post for the six steps and suggestions for how to do them.

Reformatted a bit with local tool preferences, Abe’s list will make a nice quick reference for Hadoop development.

Introducing Morphlines:…

Filed under: Cloudera,ETL,Hadoop,Morphlines — Patrick Durusau @ 3:07 pm

Introducing Morphlines: The Easy Way to Build and Integrate ETL Apps for Hadoop by Wolfgang Hoschek.

From the post:

Cloudera Morphlines is a new open source framework that reduces the time and skills necessary to integrate, build, and change Hadoop processing applications that extract, transform, and load data into Apache Solr, Apache HBase, HDFS, enterprise data warehouses, or analytic online dashboards. If you want to integrate, build, or facilitate transformation pipelines without programming and without substantial MapReduce skills, and get the job done with a minimum amount of fuss and support costs, this post gets you started.

A “morphline” is a rich configuration file that makes it easy to define a transformation chain that consumes any kind of data from any kind of data source, processes the data, and loads the results into a Hadoop component. It replaces Java programming with simple configuration steps, and correspondingly reduces the cost and integration effort associated with developing, maintaining, or integrating custom ETL projects.

Morphlines is a library, embeddable in any Java codebase. A morphline is an in-memory container of transformation commands. Commands are plugins to a morphline that perform tasks such as loading, parsing, transforming, or otherwise processing a single record. A record is an in-memory data structure of name-value pairs with optional blob attachments or POJO attachments. The framework is extensible and integrates existing functionality and third-party systems in a simple and straightforward manner.

The Morphlines library was developed as part of Cloudera Search. It powers a variety of ETL data flows from Apache Flume and MapReduce into Solr. Flume covers the real time case, whereas MapReduce covers the batch processing case.

Since the launch of Cloudera Search, Morphlines development has graduated into the Cloudera Development Kit (CDK) in order to make the technology accessible to a wider range of users, contributors, integrators, and products beyond Search. The CDK is a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem (and hence a perfect home for Morphlines). The CDK is hosted on GitHub and encourages involvement by the community.

(…)

The sidebar promises: Morphlines replaces Java programming with simple configuration steps, reducing the cost and effort of doing custom ETL.

Sound great!

But how do I search one or more morphlines for the semantics of the records/fields that are being processed or the semantics of that processing?

If I want to save “cost and effort,” shouldn’t I be able to search for existing morphlines that have transformed particular records/fields?

True, morphlines have “#” comments but that seems like a poor way to document transformations.

How would you test for field documentation?

Or make sure transformations of particular fields always use the same semantics?

Ponder those questions while you are reading:

Cloudera Morphlines Reference Guide

and,

Syntax – HOCON github page.

If we don’t capture semantics at the point of authoring, subsequent searches are mechanized guessing.

Aggregation Module – Phase 1 – Functional Design (ElasticSearch #3300)

Filed under: Aggregation,ElasticSearch,Merging,Search Engines,Topic Maps — Patrick Durusau @ 2:47 pm

Aggregation Module – Phase 1 – Functional Design (ElasticSearch Issue #3300)

From the post:

The new aggregations module is due to elasticsearch 1.0 release, and aims to serve as the next generation replacement for the functionality we currently refer to as “faceting”. Facets, currently provide a great way to aggregate data within a document set context. This context is defined by the executed query in combination with the different levels of filters that are defined (filtered queries, top level filters, and facet level filters). Although powerful as is, the current facets implementation was not designed from ground up to support complex aggregations and thus limited. The main problem with the current implementation stem in the fact that they are hard coded to work on one level and that the different types of facets (which account for the different types of aggregations we support) cannot be mixed and matched dynamically at query time. It is not possible to compose facets out of other facet and the user is effectively bound to the top level aggregations that we defined and nothing more than that.

The goal with the new aggregations module is to break the barriers the current facet implementation put in place. The new name (“Aggregations”) also indicate the intention here – a generic yet extremely powerful framework for defining aggregations – any type of aggregation. The idea here is to have each aggregation defined as a “standalone” aggregation that can perform its task within any context (as a top level aggregation or embedded within other aggregations that can potentially narrow its computation scope). We would like to take all the knowledge and experience we’ve gained over the years working with facets and apply it when building the new framework.

(…)

If you have been following the discussion about “what would we do differently with topic maps” in the XTM group at LinkedIn, this will be of interest.

What is an aggregation if it is not a selection of items matching some criteria, which you can then “merge” together for presentation to a user?

Or “merge” together for further querying?

That is inconsistent with the imperative programming model of the TMDM, but it has the potential to open up distributed and parallel processing of topic maps.

Same paradigm but with greater capabilities.
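A toy sketch of that reading of "aggregation," in plain Python rather than the ElasticSearch DSL: select items by some criteria, bucket them on a key, and merge each bucket into a single result that could itself be queried again.

# Toy sketch: aggregation as select-then-merge. Items matching a filter
# are bucketed on a key and each bucket is merged into one record, which
# could feed another round of aggregation or querying.
from collections import defaultdict

items = [
    {"subject": "riak", "tag": "nosql", "hits": 3},
    {"subject": "riak", "tag": "erlang", "hits": 2},
    {"subject": "neo4j", "tag": "graph", "hits": 5},
]

def aggregate(items, match, key):
    buckets = defaultdict(list)
    for item in filter(match, items):
        buckets[item[key]].append(item)
    # "merge" each bucket: union the tags, sum the counts
    return {
        k: {"subject": k,
            "tags": sorted({i["tag"] for i in group}),
            "hits": sum(i["hits"] for i in group)}
        for k, group in buckets.items()
    }

print(aggregate(items, match=lambda i: i["hits"] > 1, key="subject"))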

Riak 1.4 – More Install Notes on Ubuntu 12.04 (precise)

Filed under: Erlang,Riak — Patrick Durusau @ 1:37 pm

Following up on yesterday’s post on installing Riak 1.4 with some minor nits.

Open File Limits

The Open Files Limit leaves the reader dangling with:

However, what most needs to be changed is the per-user open files limit. This requires editing /etc/security/limits.conf, which you’ll need superuser access to change. If you installed Riak or Riak Search from a binary package, add lines for the riak user like so, substituting your desired hard and soft limits:

(next paragraph)

Suggest:

riak soft nofile 65536
riak hard nofile 65536

Tab separated values in /etc/security/limits.conf.

The same page also suggests an open file value of 50384 if you are starting Riak with init scripts. I don't know the reason for the difference, but 50384 occurs only once in Linux examples, so while it may work, I am starting with the higher value.

Performance Tuning

I followed the directions at Linux Performance Tuning, but suggest you also add:

# Added by
# Network tuning parameters for Riak 1.4
# As per: http://docs.basho.com/riak/1.3.1/cookbooks/Linux-Performance-Tuning/

both here and for your changes to limits.conf.

Puts others on notice of the reason for the settings and points to documentation.

Enter the same type of note for your setting of the noatime flag in /etc/fstab (under Mounts and Scheduler in Linux Performance Tuning).

On reboot, check your settings with:

ulimit -a

I was going to do the Riak Fast Track today but got distracted with configuration issues with Ruby, RVM, KDE and the viewer for Riak docs.

Look for Fast Track notes over the weekend.

Hadoop Summit 2013

Filed under: Hadoop,MapReduce — Patrick Durusau @ 8:55 am

Hadoop Summit 2013

Videos and slides from Hadoop Summit 2013!

Forty-two (42) presentations on day one and forty-one (41) on day two.

Just this week I got news that ISO is hunting down “rogue” copies of ISO standards, written by volunteers, that aren’t behind its paywall.

While others, like the presenters at the Hadoop Summit 2013, are sharing their knowledge in hopes of creating more knowledge.

Which group do you think will be relevant in a technology driven future?

July 11, 2013

Who reads Outlook.com Portal Mail? (Sender, Recipient(s) and the NSA)

Filed under: Cybersecurity,NSA,Security — Patrick Durusau @ 6:23 pm

Revealed: how Microsoft handed the NSA access to encrypted messages by Glenn Greenwald, Ewen MacAskill, Laura Poitras, Spencer Ackerman and Dominic Rushe in today’s issue of the Guardian.

From the story:

Microsoft has collaborated closely with US intelligence services to allow users’ communications to be intercepted, including helping the National Security Agency to circumvent the company’s own encryption, according to top-secret documents obtained by the Guardian.

The news goes down hill from there. See the full story for all the gory details.

Not that encryption in and of itself is any type of security.

After all, the NSA has all sorts of encryption, physical security, etc. and it was defeated by a USB thumb drive.

Makes you wonder if anyone with a USB thumb drive loaded false data into the NSA computers?

What is your means of tracking data provenance?

DC Conference Swag

Filed under: Mapping,Maps — Patrick Durusau @ 2:58 pm

Sketching D.C. Crime Data With R by Matt Stiles.

From the post:

A car burglar last week nabbed a radio from our car, prompting me to think (once again) about crime in Washington, D.C., where I live.

I wanted to know if certain crimes were more common in particular neighborhoods, so I downloaded a list of every serious crime in 2012 from the city’s data portal. The data contained about 35,000 reported incidents of homicides, thefts, assaults, etc., with fields listing the date, time and neighborhood associated with each case.

I used the statistical programming language R, which is great for quickly creating small multiples to examine data, to make some rough visual sketches.

First, since we’re talking about cars, the first grid shows thefts from vehicles, by hour and “advisory neighborhood commission“. These commissions are the small groups of officials who represent their respective D.C. neighborhoods on issues like real estate development and alcohol sales, among other things. (I live in Brookland, which is governed by ANC 5B). You can find your ANC here.

(…)

Matt charts a variety of crimes in DC and is sure to get your attention.

Occurs to me that a map of DC, color coded by crime and time of day, would make great swag for conference tote bags.

With the subway system marked with “Do Not Exit” here signs.

Bayesian Methods for Hackers

Bayesian Methods for Hackers by a community of authors!

From the readme:

The Bayesian method is the natural approach to inference, yet it is hidden from readers behind chapters of slow, mathematical analysis. The typical text on Bayesian inference involves two to three chapters on probability theory, then enters what Bayesian inference is. Unfortunately, due to mathematical intractability of most Bayesian models, the reader is only shown simple, artificial examples. This can leave the user with a so-what feeling about Bayesian inference. In fact, this was the author’s own prior opinion.

After some recent success of Bayesian methods in machine-learning competitions, I decided to investigate the subject again. Even with my mathematical background, it took me three straight-days of reading examples and trying to put the pieces together to understand the methods. There was simply not enough literature bridging theory to practice. The problem with my misunderstanding was the disconnect between Bayesian mathematics and probabilistic programming. That being said, I suffered then so the reader would not have to now. This book attempts to bridge the gap.

If Bayesian inference is the destination, then mathematical analysis is a particular path towards it. On the other hand, computing power is cheap enough that we can afford to take an alternate route via probabilistic programming. The latter path is much more useful, as it denies the necessity of mathematical intervention at each step, that is, we remove often-intractable mathematical analysis as a prerequisite to Bayesian inference. Simply put, this latter computational path proceeds via small intermediate jumps from beginning to end, whereas the first path proceeds by enormous leaps, often landing far away from our target. Furthermore, without a strong mathematical background, the analysis required by the first path cannot even take place.

Bayesian Methods for Hackers is designed as an introduction to Bayesian inference from a computational/understanding-first, and mathematics-second, point of view. Of course as an introductory book, we can only leave it at that: an introductory book. For the mathematically trained, they may cure the curiosity this text generates with other texts designed with mathematical analysis in mind. For the enthusiast with less mathematical-background, or one who is not interested in the mathematics but simply the practice of Bayesian methods, this text should be sufficient and entertaining.

(…)

Useful in case all the knowledge you want to put in a topic map is far from certain. 😉
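In the computational spirit the authors describe, here is a tiny example of my own (not from the book): the posterior for a coin's bias falls out of a brute-force grid approximation, with no integral solved by hand.

# Tiny illustration of computation-first Bayesian inference (not from the
# book): posterior over a coin's bias p after seeing 7 heads in 10 flips,
# via grid approximation instead of analysis.
grid = [i / 100 for i in range(1, 100)]        # candidate values of p
prior = [1.0 for _ in grid]                    # flat prior
likelihood = [p**7 * (1 - p)**3 for p in grid] # 7 heads, 3 tails
unnorm = [pr * li for pr, li in zip(prior, likelihood)]
total = sum(unnorm)
posterior = [u / total for u in unnorm]

best = grid[posterior.index(max(posterior))]
print("posterior mode approx:", best)          # close to 0.7
mean = sum(p * w for p, w in zip(grid, posterior))
print("posterior mean approx:", round(mean, 3))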

Riak 1.4 – Install Notes on Ubuntu 12.04 (precise)

Filed under: Erlang,Riak — Patrick Durusau @ 12:57 pm

While installing Riak 1.4 I encountered some issues and thought writing down the answers might help someone else.

Following the instructions for Installing From Apt-Get, when I reached:

sudo apt-get install riak

I got this message:

Failed to fetch
http://apt.basho.com/pool/precise/main/riak_1.4.0-1_amd64.deb Size mismatch
E: Unable to fetch some archives, maybe run apt-get update or try with
--fix-missing?

Not a problem with the Riak 1.4 distribution but an error with Ubuntu.

Correct as follows:

sudo aptitude clean

(rtn)

Then:

sudo aptitude update

(rtn)

close, restart Linux

Cleans the apt cache and then the install was successful.

Post Installation Notes:

Basho suggests to start Riak with:

riak start

My results:

Unable to access /var/run/riak, permission denied, run script as root

Use:

sudo riak start

I then read:

sudo riak start
!!!!
!!!! WARNING: ulimit -n is 1024; 4096 is the recommended minimum.
!!!!

The ulimit warning is not unexpected and solutions are documented at: Open Files Limit.

As soon as I finish this session, I am going to create the file /etc/default/riak and its contents will be:

ulimit -n 65536

The file needs to be created as root.

May as well follow the instructions for “Enable PAM Based Limits for Debian & Ubuntu” in the Open Files document as well. Requires a reboot.

The rest of the tests of the node went well until I got to:

riak-admin diag

The documentation notes:

Make the recommended changes from the command output to ensure optimal node operation.

I was running in an Emacs shell so capturing the output was easy:

riak-admin diag
[critical] vm.swappiness is 60, should be no more than 0
[critical] net.core.wmem_default is 229376, should be at least 8388608
[critical] net.core.rmem_default is 229376, should be at least 8388608
[critical] net.core.wmem_max is 131071, should be at least 8388608
[critical] net.core.rmem_max is 131071, should be at least 8388608
[critical] net.core.netdev_max_backlog is 1000, should be at least 10000
[critical] net.core.somaxconn is 128, should be at least 4000
[critical] net.ipv4.tcp_max_syn_backlog is 2048, should be at least 40000
[critical] net.ipv4.tcp_fin_timeout is 60, should be no more than 15
[critical] net.ipv4.tcp_tw_reuse is 0, should be 1
[warning] The following preflists do not satisfy the n_val:
[[{0,
'riak@127.0.0.1'},
{22835963083295358096932575511191922182123945984,
'riak@127.0.0.1'},
{45671926166590716193865151022383844364247891968,
'riak@127.0.0.1'}],
[approx. 376 lines omitted]
[{1438665674247607560106752257205091097473808596992,
'riak@127.0.0.1'},
{0,
'riak@127.0.0.1'},
{22835963083295358096932575511191922182123945984,
'riak@127.0.0.1'}]]
[notice] Data directory /var/lib/riak/bitcask is not mounted with 'noatime'. Please remount its disk with the 'noatime' flag to improve performance.

The first block of messages:

[critical] vm.swappiness is 60, should be no more than 0
[critical] net.core.wmem_default is 229376, should be at least 8388608
[critical] net.core.rmem_default is 229376, should be at least 8388608
[critical] net.core.wmem_max is 131071, should be at least 8388608
[critical] net.core.rmem_max is 131071, should be at least 8388608
[critical] net.core.netdev_max_backlog is 1000, should be at least 10000
[critical] net.core.somaxconn is 128, should be at least 4000
[critical] net.ipv4.tcp_max_syn_backlog is 2048, should be at least 40000
[critical] net.ipv4.tcp_fin_timeout is 60, should be no more than 15
[critical] net.ipv4.tcp_tw_reuse is 0, should be 1

are network tuning issues.

Basho answers the “how to correct?” question at Linux Performance Tuning but there is no link from the Post Installation Notes.
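For the record, the corresponding /etc/sysctl.conf entries, with values taken straight from the "should be" column of the diag output above (apply with sudo sysctl -p, and check Basho's tuning guide for current recommendations):

# Riak network tuning, values from riak-admin diag / Basho's tuning guide
vm.swappiness = 0
net.core.wmem_default = 8388608
net.core.rmem_default = 8388608
net.core.wmem_max = 8388608
net.core.rmem_max = 8388608
net.core.netdev_max_backlog = 10000
net.core.somaxconn = 4000
net.ipv4.tcp_max_syn_backlog = 40000
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_tw_reuse = 1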

The next block of messages:

[warning] The following preflists do not satisfy the n_val:
[[{0,
'riak@127.0.0.1'},
{22835963083295358096932575511191922182123945984,
'riak@127.0.0.1'},
{45671926166590716193865151022383844364247891968,
'riak@127.0.0.1'}],
[approx. 376 lines omitted]
[{1438665674247607560106752257205091097473808596992,
'riak@127.0.0.1'},
{0,
'riak@127.0.0.1'},
{22835963083295358096932575511191922182123945984,
'riak@127.0.0.1'}]]

is a known issue: N Value – Preflist Message is Vague.

From the issue, the message means: “these preflists have more than one replica on the same node.”

Not surprising since I am running on one physical node and not in production.

The Riak Fast Track has you create four nodes on one physical node as a development environment. So I'm going to ignore the "preflists" warning in this context.

The last message:

[notice] Data directory /var/lib/riak/bitcask is not mounted with 'noatime'. Please remount its disk with the 'noatime' flag to improve performance.

is resolved under “Mounts and Scheduler” in the Linux Performance Tuning document.

I am going to make all the system changes, reboot and start on the The Riak Fast Track tomorrow.

PS: In case you are wondering what this has to do with topic maps, ask yourself what characteristics you would want in a distributed topic map system?

July 10, 2013

Data visualization: ambiguity as a fellow traveler

Filed under: Ambiguity,Uncertainty,Visualization — Patrick Durusau @ 4:48 pm

Data visualization: ambiguity as a fellow traveler by Vivien Marx. (Nature Methods 10, 613–615 (2013) doi:10.1038/nmeth.2530)

From the article:

Data from an experiment may appear rock solid. Upon further examination, the data may morph into something much less firm. A knee-jerk reaction to this conundrum may be to try and hide uncertain scientific results, which are unloved fellow travelers of science. After all, words can afford ambiguity, but with visuals, “we are damned to be concrete,” says Bang Wong, who is the creative director of the Broad Institute of MIT and Harvard. The alternative is to face the ambiguity head-on through visual means.

Color or color gradients in heat maps, for example, often show degrees of data uncertainty and are, at their core, visual and statistical expressions. “Talking about uncertainty is talking about statistics,” says Martin Krzywinski, whose daily task is data visualization at the Genome Sciences Centre at the British Columbia Cancer Agency.

Statistically driven displays such as box plots can work for displaying uncertainty, but most visualizations use more ad hoc methods such as transparency or blur. Error bars are also an option, but it is difficult to convey information clearly with them, he says. “It’s likely that if something as simple as error bars is misunderstood, anything more complex will be too,” Krzywinski says.

I don’t hear “ambiguity” and “uncertainty” as the same thing.

The duck/rabbit image you will remember from Sperberg-McQueen’s presentations is ambiguous, but not uncertain.

[Image: the duck/rabbit illusion]

Granting that “uncertainty” and its visualization is a difficult task but let’s not compound the task by confusing it with ambiguity.

The uncertainty issue in this article echoes Steve Pepper’s concern over binary choices for type under the current TMDM. Either a topic, for example, is of a particular type or not. There isn’t any room for uncertainty.

The article has a number of suggestions on visualizing uncertainty that I think you may find helpful.

I first saw this at: Visualizing uncertainty still unsolved problem by Nathan Yau.

Riak 1.4 Hits the Street!

Filed under: Erlang,Riak — Patrick Durusau @ 4:23 pm

Well, they actually said: Basho Announces Availability of Riak 1.4.

From the post:

We are excited to announce the launch of Riak 1.4. With this release, we have added in more functionality and addressed some common requests that we hear from customers. In addition, there are a few features available in technical preview that you can begin testing and will be fully rolled out in the 2.0 launch later this year.

The new features and updates in Riak 1.4 include:

  • Secondary Indexing Improvements: Query results are now sorted and paginated, offering developers much richer semantics
  • Introducing Counters in Riak: Counters, Riak’s first distributed data type, provide automatic conflict resolution after a network partition
  • Simplified Cluster Management With Riak Control: New capabilities in Riak’s GUI-based administration tool improve the cluster management page for preparing and applying changes to the cluster
  • Reduced Object Storage Overhead: Values and associated metadata are stored and transmitted using a more compact format, reducing disk and network overhead
  • Hinted Handoff Progress Reporting: Makes operating the cluster, identifying and troubleshooting issues, and monitoring the cluster simpler
  • Improved Backpressure: Riak responds with an overload message if a vnode has too many messages in queue

Plus performance and management enhancements for the enterprise crowd.

Download Riak 1.4: http://docs.basho.com/riak/latest/downloads/

Code at: Github.com/basho

Live webcast: "What's New in Riak 1.4" on July 12th.

That’s this coming Friday.

Naming Conventions for Naming Things

Filed under: Names,Semantics — Patrick Durusau @ 3:36 pm

Naming Conventions for Naming Things by David Loshin.

From the post:

In a recent email exchange with a colleague, I have been discussing two aspects of metadata: naming conventions and taxonomies. Just as a reminder, “taxonomy” refers to the practice of organization and classification, and in this context it refers to the ways that concepts are defined and how the real-world things referred to by those concepts are logically grouped together. After pondering the email thread, which was in reference to documenting code lists and organizing the codes within particular classes, I was reminded of a selection from Lewis Carroll’s book Through the Looking Glass, at the point where the White Knight is leaving Alice in her continued journey to become a queen.

At that point, the White Knight proposes to sing Alice a song to comfort her as he leaves, and in this segment they discuss the song he plans to share:

Any of you who have been following the discussion of “default semantics” in the XTM group at LinkedIn should appreciate this post.

Your default semantics are very unlikely to be my default semantics.

What I find hard to believe is that prior different semantics are acknowledged in one breath and then a uniform semantic is proposed in the next.

Seems to me that prior semantic diversity is a good sign that today we have semantic diversity. A semantic diversity that will continue into an unlimited number of tomorrows.

Yes?

If so, shouldn’t we empower users to choose their own semantics? As opposed to ours?

Visualizing Web Scale Geographic Data…

Filed under: Geographic Data,Geography,Graphics,Visualization — Patrick Durusau @ 2:22 pm

Visualizing Web Scale Geographic Data in the Browser in Real Time: A Meta Tutorial by Sean Murphy.

From the post:

Visualizing geographic data is a task many of us face in our jobs as data scientists. Often, we must visualize vast amounts of data (tens of thousands to millions of data points) and we need to do so in the browser in real time to ensure the widest-possible audience for our efforts and we often want to do this leveraging free and/or open software.

Luckily for us, Google offered a series of fascinating talks at this year’s (2013) IO that show one particular way of solving this problem. Even better, Google discusses all aspects of this problem: from cleaning the data at scale using legacy C++ code to providing low latency yet web-scale data storage and, finally, to rendering efficiently in the browser. Not surprisingly, Google’s approach highly leverages **alot** of Google’s technology stack but we won’t hold that against them.

(…)

Sean sets the background for two presentations:

All the Ships in the World: Visualizing Data with Google Cloud and Maps (36 minutes)

and,

Google Maps + HTML5 + Spatial Data Visualization: A Love Story (60 minutes) (source code: https://github.com/brendankenny)

Both are well worth your time.

Data Sharing and Management Snafu in 3 Short Acts

Filed under: Archives,Astroinformatics,Open Access,Open Data — Patrick Durusau @ 1:43 pm

As you may suspect, my concerns about the video (linked below) center on preserving the semantics of the field names Sam1, Sam2, and Sam3, as well as the field names that will be generated by the requesting researcher.

I found this video embedded in: A call for open access to all data used in AJ and ApJ articles by Kelle Cruz.

From the post:

I don’t fully understand it, but I know the Astronomical Journal (AJ) and Astrophysical Journal (ApJ) are different than many other journals: They are run by the American Astronomical Society (AAS) and not by a for-profit publisher. That means that the AAS Council and the members (the people actually producing and reading the science) have a lot of control over how the journals are run. In a recent President’s Column, the AAS President, David Helfand proposed a radical, yet obvious, idea for propelling our field into the realm of data sharing and open access: require all journal articles to be accompanied by the data on which the conclusions are based.

We are a data-rich—and data-driven—field [and] I am advocating [that authors provide] a link in articles to the data that underlies a paper’s conclusions…In my view, the time has come—and the technological resources are available—to make the conclusion of every ApJ or AJ article fully reproducible by publishing the data that underlie that conclusion. It would be an important step toward enhancing and sharing our scientific understanding of the universe.

Kelle points out several reasons why existing efforts are insufficient to meet the sharing and archiving needs of the astronomical community.

Suggested reading if you are concerned with astronomical data or archives more generally.

July 9, 2013

Graph-based Approach to Automatic Taxonomy Generation (GraBTax)

Filed under: Graphs,Taxonomy,Topic Maps,XML — Patrick Durusau @ 7:43 pm

Graph-based Approach to Automatic Taxonomy Generation (GraBTax) by Pucktada Treeratpituk, Madian Khabsa, C. Lee Giles.

Abstract:

We propose a novel graph-based approach for constructing concept hierarchy from a large text corpus. Our algorithm, GraBTax, incorporates both statistical co-occurrences and lexical similarity in optimizing the structure of the taxonomy. To automatically generate topic-dependent taxonomies from a large text corpus, GraBTax first extracts topical terms and their relationships from the corpus. The algorithm then constructs a weighted graph representing topics and their associations. A graph partitioning algorithm is then used to recursively partition the topic graph into a taxonomy. For evaluation, we apply GraBTax to articles, primarily computer science, in the CiteSeerX digital library and search engine. The quality of the resulting concept hierarchy is assessed by both human judges and comparison with Wikipedia categories.

Interesting work.

For example:

Unfortunately, existing taxonomies for concepts in computer science such as ODP categories and the ACM Classification System are unsuitable as a gold standard. ODP categories are too broad and do not contain the majority of concepts produced by our algorithm. For instance, there are no sub-concepts for "Semantic Web" in ODP. Also some portions of ODP categories under computer science are not computer science related concepts, especially at the lower level. For example, the concepts under "Neural Networks" are Books, People, Companies, Publications, FAQs, Help and Tutorials, etc. The ACM Classification System has similar drawbacks, where its categories are too broad for comparison.

Makes me curious if comparing the topics extracted from articles would consistently map to the broad categories assigned by the ACM.

Also instructive for the use of graphs, which admit to no pre-determined data structure.

I say that because of an on-going discussion about alternative data models for topic maps.

As you know, I don’t think topic maps have only one data model, not even my own.

The model you construct with your topic map should meet your needs, not mine.

Graphs are a good example of interchangeable information artifacts despite no one being able to constrain the graphs of others.

XML is another, although it gets overlooked from time to time.

PS: The authors don’t say but I am assuming that ODP = Open Directory Project.

The Last Mile

Filed under: Graphs,Neo4j — Patrick Durusau @ 6:50 pm

The Last Mile by Max De Marzi.

From the post:

The “last mile” is a term used in the telecommunications industry that refers to delivering connectivity to the customers that will actually be using the system. In the sense of Graph Databases, it refers to how well the end user can extract value and insight from the graph. We’ve already seen an example of this concept with Graph Search, allowing a user to express their requests in natural language. Today we’ll see another example. We’ll be taking advantage of the features of Neo4j 2.0 to make this work, so be sure to have read the previous post on the matter.

We’re going to be using VisualSearch.js made by Samuel Clay of NewsBlur. VisualSearch.js enhances ordinary search boxes with the ability to autocomplete faceted search queries. It is quite easy to customize and there is an annotated walkthrough of the options available. You can see what it does in the image below, or click it to try their demo.

Graphs are ok.

Storing data in graphs is better.

Useful retrieval of data from graphs is the best. 😉

Max does his usual excellent job of illustrating useful retrieval of information from a Neo4j graph.

His use of labels does remind me of a post I need to finish.

How To Unlock Business Value from your Big Data with Hadoop

Filed under: BigData,Hadoop,Marketing,Topic Maps — Patrick Durusau @ 6:36 pm

How To Unlock Business Value from your Big Data with Hadoop by Jim Walker.

From the post:

By now, you’re probably well aware of what Hadoop does: low-cost processing of huge amounts of data. But more importantly, what can Hadoop do for you?

We work with many customers across many industries with many different specific data challenges, but in talking to so many customers, we are also able to see patterns emerge on certain types of data and the value that could bring to a business.

We love to share these kinds of insights, so we built a series of video tutorials covering some of those scenarios:

The tutorials cover social media, server logs, clickstream data, geolocation data, and others.

This is a brilliant marketing move.

Hadoop may be the greatest invention since sliced bread but if it isn’t shown to help you, what good is it?

These tutorials answer that question for several different areas of potential customer interest.

We should do something very similar for topic maps.

Something that focuses on a known need or interest of customers.

The Blur Project: Marrying Hadoop with Lucene

Filed under: Hadoop,Lucene — Patrick Durusau @ 3:40 pm

The Blur Project: Marrying Hadoop with Lucene by Aaron McCurry.

From the post:

Blur is an Apache Incubator project that provides distributed search functionality on top of Apache Hadoop, Apache Lucene, Apache ZooKeeper, and Apache Thrift. When I started building Blur three years ago, there wasn’t a search solution that had a solid integration with the Hadoop ecosystem. Our initial needs were to be able to index our data using MapReduce, store indexes in HDFS, and serve those indexes from clusters of commodity servers while remaining fault tolerant. Blur was built specifically for Hadoop — taking scalability, redundancy, and performance into consideration from the very start — while leveraging all the great features that already exist in the Hadoop stack.

(…)

Blur was initially released on Github as an Apache Licensed project and was then accepted into the Apache Incubator project in July 2012, with Patrick Hunt as its champion. Since then, Blur as a software project has matured and become much more stable. One of the major milestones over the past year has been the upgrade to Lucene 4, which has brought many new features and massive performance gains.

Recently there has been some interest in folding some of Blur’s code (HDFSDirectory and BlockCache) back into the Lucene project for others to utilize. This is an exciting development that legitimizes some of the approaches that we have taken to date. We are in conversations with some members of the Lucene community, such as Mark Miller, to figure out how we can best work together to benefit both the fledgling Blur project as well as the much larger and more well known/used Lucene project.

Blur’s community is small but growing. Our project goals are to continue to grow our community and graduate from the Incubator project. Our technical goals are to continue to add features that perform well at scale while maintaining the fault tolerance that is required of any modern distributed system.

We welcome your contributions at http://incubator.apache.org/blur/!

Another exciting Apache project that needs contributors!

Friend Recommendations using MapReduce

Filed under: Hadoop,MapReduce,Recommendation — Patrick Durusau @ 3:26 pm

Friend Recommendations using MapReduce by John Berryman.

From the post:

So Jonathan, one of our interns this summer, asked an interesting question today about MapReduce. He said, “Let’s say you download the entire data set of who’s following who from Twitter. Can you use MapReduce to make recommendations about who any particular individual should follow?” And as Jonathan’s mentor this summer, and as one of the OpenSource Connections MapReduce experts I dutifully said, “uuuhhhhh…”

And then in a stroke of genius … I found a way to stall for time. "Well, young Padawan," I said to Jonathan, "first you must more precisely define your problem… and only then will the answer be revealed to you." And then darn it if he didn't ask me what I meant! Left with no viable alternatives, I squeezed my brain real hard, and this is what came out:

This is a post to work through carefully while waiting for the second post to drop!

Particularly the custom partitioning, grouping and sorting in MapReduce.
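While waiting, here is one common formulation of the problem sketched as plain Python map and reduce stages (my framing, not necessarily the one John settles on): recommend accounts that are followed by many of the accounts you already follow.

# Sketch of a "people you may know" pass in map/reduce style (plain
# Python, no Hadoop): for each user, count candidate accounts followed
# by the accounts that user already follows, skipping existing follows.
from collections import defaultdict

follows = {                      # user -> set of accounts they follow
    "alice": {"bob", "carol"},
    "bob": {"carol", "dave"},
    "carol": {"dave", "erin"},
}

def map_phase(follows):
    # emit (user, candidate) for every two-hop path user -> f -> candidate
    for user, followed in follows.items():
        for f in followed:
            for candidate in follows.get(f, ()):
                if candidate != user and candidate not in followed:
                    yield user, candidate

def reduce_phase(pairs):
    counts = defaultdict(lambda: defaultdict(int))
    for user, candidate in pairs:       # grouping by key, as the
        counts[user][candidate] += 1    # shuffle/sort phase would
    return {u: sorted(c.items(), key=lambda kv: -kv[1])
            for u, c in counts.items()}

print(reduce_phase(map_phase(follows)))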

…Apache HBase REST Interface, Part 3

Filed under: Cloudera,HBase — Patrick Durusau @ 1:51 pm

How-to: Use the Apache HBase REST Interface, Part 3 by Jesse Anderson.

From the post:

This how-to is the third in a series that explores the use of the Apache HBase REST interface. Part 1 covered HBase REST fundamentals, some Python caveats, and table administration. Part 2 showed you how to insert multiple rows simultaneously using XML and JSON. Part 3 below will show how to get multiple rows using XML and JSON.

Jesse is an instructor with Cloudera University. I checked but Cloudera doesn’t offer a way to search for courses by instructor. 🙁

I will drop them a note.
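For flavor, a hedged sketch of fetching multiple rows over the HBase REST interface with Python's requests, using a scanner. Endpoint details are paraphrased from the HBase REST documentation rather than from Jesse's post; the host, table name, and batch size are placeholders:

# Hedged sketch: fetch multiple rows through the HBase REST interface by
# creating a scanner and reading it back as JSON. Host, table name, and
# batch size are placeholders; consult the HBase REST docs and the post
# for authoritative details. Keys and values come back base64-encoded.
import base64
import requests

BASE = "http://localhost:8080"          # HBase REST server (placeholder)
TABLE = "messagestable"                 # placeholder table name

# 1. create a scanner; the Location header points at the scanner resource
resp = requests.post(BASE + "/" + TABLE + "/scanner",
                     data='<Scanner batch="10"/>',
                     headers={"Content-Type": "text/xml"})
scanner_url = resp.headers["Location"]

# 2. read batches of rows as JSON until the scanner is exhausted (204)
while True:
    batch = requests.get(scanner_url, headers={"Accept": "application/json"})
    if batch.status_code == 204:
        break
    for row in batch.json()["Row"]:
        print(base64.b64decode(row["key"]))

# 3. clean up the scanner resource
requests.delete(scanner_url)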
