Another Word For It – Patrick Durusau on Topic Maps and Semantic Diversity

December 13, 2011

Ontology Matching 2011

Filed under: Identification,Identity,Ontology — Patrick Durusau @ 9:54 pm

Ontology Matching 2011

Proceedings of the 6th International Workshop on Ontology Matching (OM-2011)

From the conference website:

Ontology matching is a key interoperability enabler for the Semantic Web, as well as a useful tactic in some classical data integration tasks dealing with the semantic heterogeneity problem. It takes the ontologies as input and determines as output an alignment, that is, a set of correspondences between the semantically related entities of those ontologies. These correspondences can be used for various tasks, such as ontology merging, data translation, query answering or navigation on the web of data. Thus, matching ontologies enables the knowledge and data expressed in the matched ontologies to interoperate.


The workshop has three goals:

  • To bring together leaders from academia, industry and user institutions to assess how academic advances are addressing real-world requirements. The workshop will strive to improve academic awareness of industrial and final user needs, and therefore direct research towards those needs. Simultaneously, the workshop will serve to inform industry and user representatives about existing research efforts that may meet their requirements. The workshop will also investigate how the ontology matching technology is going to evolve.
  • To conduct an extensive and rigorous evaluation of ontology matching approaches through the OAEI (Ontology Alignment Evaluation Initiative) 2011 campaign. The particular focus of this year’s OAEI campaign is on real-world specific matching tasks involving, e.g., open linked data and biomedical ontologies. Therefore, the ontology matching evaluation initiative itself will provide a solid ground for discussion of how well the current approaches are meeting business needs.
  • To examine similarities and differences from database schema matching, which has received decades of attention but is just beginning to transition to mainstream tools.

An excellent set of papers and posters.

While I was writing this post, I realized that had the papers been described as matching subject identifications by similarity measures, I would have felt completely differently about them.

Isn’t that odd?

Question: Do you agree/disagree that mapping ontologies is different from mapping subject identifications? Why/why not?

py2neo 0.99

Filed under: Neo4j,Python — Patrick Durusau @ 9:53 pm

py2neo 0.99

Python binding to Neo4j.
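
Just to give a flavor of working with Neo4j from Python, here is a minimal sketch. Caveat: it uses the later Graph/Node API rather than the 0.99 release linked above, and the connection URI and credentials are placeholders for your own setup:

from py2neo import Graph, Node, Relationship

# Placeholder connection details; adjust to your local Neo4j instance.
graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

alice = Node("Person", name="Alice")
bob = Node("Person", name="Bob")

# Creates both nodes and the KNOWS relationship in a single call.
graph.create(Relationship(alice, "KNOWS", bob))

# Ask Neo4j what it knows, here the name property of every Person node.
for record in graph.run("MATCH (p:Person) RETURN p.name AS name"):
    print(record["name"])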

Not news but I wanted to create a post so I would not lose track of it.

BTW, while you are there, take a look at the Package documentation, then choose Indices.

Have you ever wondered which of those “identifiers” are the same as other identifiers and which ones are different?

Just curious.

Seems like that would be a really nice programming resource. Or evaluation tool.

UMBEL Services, Part 1: Overview

Filed under: Ontology,Open Semantic Framework,SPARQL — Patrick Durusau @ 9:52 pm

UMBEL Services, Part 1: Overview

From the post:

UMBEL, the Upper Mapping and Binding Exchange Layer, is an upper ontology of about 28,000 reference concepts and a vocabulary designed for domain ontologies and ontology mapping [1]. When we first released UMBEL in mid-2008 it was accompanied by a number of Web services and a SPARQL endpoint, and general APIs. In fact, these were the first Web services developed for release by Structured Dynamics. They were the prototypes for what later became the structWSF Web services framework, which incorporated many lessons learned and better practices.

By the time that the structWSF framework had evolved with many additions to comprise the Open Semantic Framework (OSF), those original UMBEL Web services had become quite dated. Thus, upon the last major update to UMBEL to version 1.0 back in February of this year, we removed these dated services.

Like what I earlier mentioned about the cobbler’s children being the last to get new shoes, it has taken us a bit to upgrade the UMBEL services. However, I am pleased to announce we have now completed the transition of UMBEL’s earlier services to use the OSF framework, and specifically the structWSF platform-independent services. As a result, there are both upgraded existing services and some exciting new ones. We will now be using UMBEL as one of our showcases for these expanding OSF features. We will be elaborating upon these features throughout this series, some parts of which will appear on Fred Giasson’s blog.

In this first part, we provide a broad overview of the new UMBEL OSF implementation. We also begin to foretell some of the parts to come that will describe some of these features in more detail.

There are three more parts that follow this one.

If you have the time, I am interested in your take on this resource.

A lot of time and effort has gone into making this a useful site, so what parts do you like best/least? What would you change?

More to follow on this one.

Which search engine when?

Filed under: Search Engines,Search Interface,Searching — Patrick Durusau @ 9:51 pm

Which search engine when?

A listing of search engines in the following categories:

  • keyword search
  • index or directory based
  • multi or meta search engines
  • visual results
  • category
  • blended results

There are fifty-three (53) entries, so there is plenty to choose from if you are bored with your current search “experience.”

Not to mention learning about different ways to present search results to users.

BTW, if you run across a blog mentioning that AllPlus was listed in two separate categories, like this one, realize that SearchLion was also listed in two separate categories.

Search engines are an important topic for topic mappers because they are one of the places where semantic impedance and the lack of useful organization of information are a major time sink for all users.

Getting 400,000 “hits” is just a curiosity; getting 402 “hits” in a document archive, as I did this morning, is a considerable amount of content but a manageable one.

No, it wasn’t a topic map that I was searching, but the results may well find their way into a topic map.

Orev: The Apache OpenRelevance Viewer

Filed under: Crowd Sourcing,Natural Language Processing,Relevance — Patrick Durusau @ 9:50 pm

Orev: The Apache OpenRelevance Viewer

From the webpage:

The OpenRelevance project is an Apache project, aimed at making materials for doing relevance testing for information retrieval (IR), Machine Learning and Natural Language Processing (NLP). Think TREC, but open-source.

These materials require a lot of managing work and many human hours to be put into collecting corpora and topics, and then judging them. Without going into too many details here about the actual process, it essentially means crowd-sourcing a lot of work, and that is assuming the OpenRelevance project had the proper tools to offer the people recruited for the work.

Having no such tool, the Viewer – Orev – is meant for being exactly that, and so to minimize the overhead required from both the project managers and the people who will be doing the actual work. By providing nice and easy facilities to add new Topics and Corpora, and to feed documents into a corpus, it will make it very easy to manage the surrounding infrastructure. And with a nice web UI to be judging documents with, the work of the recruits is going to be very easy to grok.

It focuses on the judging of documents, but that is a common level of granularity for relevance work these days.

I don’t know of anything more granular but if you find such a tool, please sing out!

Making Sense of Microposts

Filed under: Conferences,Tweets — Patrick Durusau @ 9:49 pm

Making Sense of Microposts (#MSM2012) – Big things come in small packages

In connection with World Wide Web 2012.

Important dates:

  • Submission of Abstracts (mandatory): 03 Feb 2012
  • Paper Submission deadline: 06 Feb 2012
  • Notification of acceptance: 06 Mar 2012*
  • Camera-ready deadline: 23 Mar 2012
  • Workshop program issued: 08 Mar 2012
  • Proceedings published (CEUR): 31 Mar 2012
  • Workshop – 16 Apr 2012 (Registration open to all)

(all deadlines 23:59 Hawaii Time)

From the post:

With the appearance and expansion of Twitter, Facebook Like, Foursquare, and similar low-effort publishing services, the effort required to participate on the Web is getting lower and lower. The high-end technology user and developer and the ordinary end user of ubiquitous, personal technology, such as the smart phone, contribute diverse information to the Web as part of informal and semi-formal communication and social activity. We refer to such small user input as ‘microposts’: these range from ‘checkin’ at a location on a geo-social networking platform, through to a status update on a social networking site. Online social media platforms are now very often the portal of choice for the modern technology user accustomed to sharing public-interest information. They are, increasingly, an alternative carrier to traditional media, as seen in their role in the Arab Spring and crises such as the 2011 Japan earthquake. Online social activity has also witnessed the blurring of the lines between private lives and the semi-public online social world, opening a new window into the analysis of human behaviour, implicit knowledge, and adaptation to and adoption of technology.

The challenge of developing novel methods for processing the enormous streams of heterogeneous, disparate micropost data in intelligent ways and producing valuable outputs, that may be used on a wide variety of devices and end uses, is more important than ever before. Google+ is one of the better-known new services, whose aim is to bootstrap microposts in order to more effectively tailor search results to a user’s social graph and profile.

This workshop will examine, broadly:

  • information extraction and leveraging of semantics from microposts, with a focus on novel methods for handling the particular challenges due to enforced brevity of expression;
  • making use of the collective knowledge encoded in microposts’ semantics in innovative ways;
  • social and enterprise studies that guide the design of appealing and usable new systems based on this type of data, by leveraging Semantic Web technologies.

This workshop is unique in its interdisciplinary nature, targeting both Computer Science and the Social Sciences, to help also to break down the barriers to optimal use of Semantic Web data and technologies. The workshop will focus on both the computational means to handle microposts and the study of microposts, in order to identify the motivational aspects that drive the creation and consumption of such data.

Is tailoring of search results to “…a user’s social graph and profile” a good or bad thing? We all exist in self-imposed mono-cultures in which “other” viewpoints are allowed in carefully measured amounts. How would you gauge what we are missing?

Tiered Storage Approaches to Big Data:…

Filed under: Archives,Data,Storage — Patrick Durusau @ 9:47 pm

Tiered Storage Approaches to Big Data: Why look to the Cloud when you’re working with Galaxies?

Event Date: 12/15/2011 02:00 PM Eastern Standard Time

From the email:

The ability for organizations to keep up with the growth of Big Data in industries like satellite imagery, genomics, oil and gas, and media and entertainment has strained many storage environments. Though storage device costs continue to be driven down, corporations and research institutions have to look to setting up tiered storage environments to deal with increasing power and cooling costs and shrinking data center footprint of storing all this big data.

NASA’s Earth Observing System Data and Information Management (EOSDIS) is arguably a poster child when looking at large image file ingest and archive. Responsible for processing, archiving, and distributing Earth science satellite data (e.g., land, ocean and atmosphere data products), NASA EOSDIS handles hundreds of millions of satellite image data files averaging roughly from 7 MB to 40 MB in size and totaling over 3PB of data.

Discover long-term data tiering, archival, and data protection strategies for handling large files using a product like Quantum’s StorNext data management solution and similar solutions from a panel of three experts. Hear how NASA EOSDIS handles its data workflow and long term archival across four sites in North America and makes this data freely available to scientists.

Think of this as a starting point to learn some of the “lingo” in this area and perhaps hear some good stories about data and NASA.

Some questions to think about during the presentation/discussion:

How do you effectively access information after not only the terminology but the world view of a discipline has changed?

What do you have to know about the data and its storage?

How do the products discussed address those questions?

From datasets to algorithms in R

Filed under: R,Similarity — Patrick Durusau @ 10:43 am

From datasets to algorithms in R by John Johnson.

From the post:

Many statistical algorithms are taught and implemented in terms of linear algebra. Statistical packages often borrow heavily from optimized linear algebra libraries such as LINPACK, LAPACK, or BLAS. When implementing these algorithms in systems such as Octave or MATLAB, it is up to you to translate the data from the use case terms (factors, categories, numerical variables) into matrices.

In R, much of the heavy lifting is done for you through the formula interface. Formulas resemble y ~ x1 + x2 + …, and are defined in relation to a data.frame….

Interesting to consider whether R would be a useful language for exploring similarity measures. After all, in Analysis of Amphibian Biodiversity Data I pointed out work that reviewed forty-six (46) similarity measures. I suspect that is a small percentage of all similarity measures. I remember a report that said (in an astronomy context) that more than 100 algorithms/data models for data integration appear every month.

Obviously a rough guess/estimate but one that should give us pause in terms of being too wedded to one measure of similarity or another.

Suggestions of existing collections of similarity measures? Either in literature or code?

Thinking it would be instructive to throw some of the open government data sources against similarity measures.
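
To make that concrete, here is a small sketch (standard library only) showing that three common measures applied to the same pairs of strings can disagree, which is part of why surveys turn up dozens of them:

from collections import Counter
from math import sqrt

def tokens(s):
    return s.lower().split()

def jaccard(a, b):
    A, B = set(tokens(a)), set(tokens(b))
    return len(A & B) / len(A | B)

def dice(a, b):
    A, B = set(tokens(a)), set(tokens(b))
    return 2 * len(A & B) / (len(A) + len(B))

def cosine(a, b):
    A, B = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(A[t] * B[t] for t in A)
    norm = sqrt(sum(v * v for v in A.values())) * sqrt(sum(v * v for v in B.values()))
    return dot / norm

s1 = "department of health and human services"
s2 = "health and human services department"
s3 = "department of the interior"

# Each measure scores the near-duplicate pair and the unrelated pair differently.
for name, f in [("jaccard", jaccard), ("dice", dice), ("cosine", cosine)]:
    print(name, round(f(s1, s2), 3), round(f(s1, s3), 3))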

December 12, 2011

Semantic Data Integration For Free With IO Informatics’ Knowledge Explorer Personal Edition

Filed under: Uncategorized — Patrick Durusau @ 10:24 pm

Semantic Data Integration For Free With IO Informatics’ Knowledge Explorer Personal Edition

From the post:

Bioinformatics software provider IO Informatics recently released its free Knowledge Explorer Personal Edition. Version 3.6 of the Personal Edition can handle most of what Knowledge Explorer Professional 3.6, launched in October, can, but it does all its work in memory without direct connectivity to a back-end database.

“In particular, a lot of the strengths of Knowledge Explorer have to do with modeling data as RDF and then testing queries, visualizing and browsing the data to see that you have the ontologies and data mappings you need for your integration and application requirements.” says Robert Stanley, IO Informatics president and CEO. The Personal version is aimed at academic experts focused on data integration and semantic data modeling, as well as personal power users in life sciences and other data-intensive industries, or anyone who wants to learn the tool in anticipation of leveraging their enterprise data sets for collaboration and integration projects.

The latest Knowledge Explorer 3.6 feature set extends the thesaurus application in the product, so that users can bring in additional thesauri and vocabularies, as well as the user interaction options for importing, merging and modifying ontologies. For the Pro edition, IO Informatics has also been working with database vendors to increase query speed and loading.

I am not sure what we did collectively to merit presents so early in the holiday season but I won’t spend a lot of time worrying about it.

Particularly interested in the “…additional thesauri and vocabularies…” aspect of the software. In part because it isn’t that big a step to add in a topic map, which could help provide context and other factors to better enable integration of information.

Oh, and from further down on the webpage:

Stanley sees a number of potential applications for those who might like to try the Personal version for integrating and modeling smaller data sets. “Maybe a customer has a number of reports on protein expression experiments and lot of clinical data associated with that, including healthcare records and various report spreadsheets, and they must integrate those to do some research for themselves or their internal customers,” he says, as one example. “You can do that even using the Personal version to create a well integrated, semantically formatted file.”

Sure, and when researchers move on, how do their successors maintain those integrations? Inquiring minds want to know. What do we do about semantic rot?

Slides for the NIPS 2011 tutorial

Filed under: Graphical Models — Patrick Durusau @ 10:23 pm

Slides for the NIPS 2011 tutorial by Alex Smola.

From the post:

The slides for the 2011 NIPS tutorial on Graphical Models for the Internet are online. Lots of stuff on parallelization, applications to user modeling, content recommendation, and content analysis here.

Very cool! Wish I could have seen the tutorial!

Read slowly and carefully!

NLM Plus

Filed under: Bioinformatics,Biomedical,Search Algorithms,Search Engines — Patrick Durusau @ 10:22 pm

NLM Plus

From the webpage:

NLMplus is an award winning Semantic Search Engine and Biomedical Knowledge Base application that showcases a variety of natural language processing tools to provide an improved level of access to the vast collection of biomedical data and services of the National Library of Medicine.

Utilizing its proprietary Web Knowledge Base, WebLib LLC can apply the universal search and semantic technology solutions demonstrated by NLMplus to libraries, businesses, and research organizations in all domains of science and technology and Web applications

Any medical librarians in the audience? Or ones you can forward this post to?

Curious what professional researchers make of NLM Plus? I don’t have the domain expertise to evaluate it.

Thanks!

Extracting data from the Facebook social graph with expressor, a Tutorial

Filed under: Expressor,Facebook,Social Graphs — Patrick Durusau @ 10:21 pm

Extracting data from the Facebook social graph with expressor, a Tutorial by Michael Tarallo.

From the post:

In my last article, Enterprise Application Integration with Social Networking Data, I describe how social networking sites, such as Facebook and Twitter, provide APIs to communicate with the various components available in these applications. One in particular, is their “social graph” API which enables software developers to create programs that can interface with the many “objects” stored within these graphs.

In this article, I will briefly review the Facebook social graph and provide a simple tutorial with an expressor downloadable project. I will cover how expressor can extract data using the Facebook graph API and flatten it by using the provided reusable Datascript Module. I will also demonstrate how to add new user defined attributes to the expressor Dataflow so one can customize the output needed.

Looks interesting.

Seems appropriate after starting today’s posts with work on the ODP files.

As you know, I am not a big fan of ETL but it has been a survivor. And if the folks who are signing off on the design want ETL, maybe it isn’t all that weird. 😉
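
Independent of expressor, if you have never poked at the graph API directly, a raw call is just an HTTPS GET that returns JSON for an object in the social graph. A minimal sketch; the access token is a placeholder you would obtain from Facebook:

import requests

ACCESS_TOKEN = "YOUR_TOKEN_HERE"   # hypothetical; most objects require one

resp = requests.get(
    "https://graph.facebook.com/me",
    params={"access_token": ACCESS_TOKEN, "fields": "id,name"},
)
print(resp.json())   # e.g. {"id": "...", "name": "..."}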

Nutch Tutorial: Supplemental II

Filed under: Nutch,Search Engines,Searching — Patrick Durusau @ 10:20 pm

This continues Nutch Tutorial: Supplemental.

I am getting a consistent error from:

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5

and have posted to the solr list, although my post has yet to appear on the list. More on that in my next post.

I wanted to take a quick detour into 3.2 Using Individual Commands for Whole-Web Crawling as it has some problematic advice in it.

First, Open Directory Project data can be downloaded from How to Get ODP Data. (Always nice to have a link, I think they call them hyperlinks.)

Second, as of last week, the content.rdf.u8.gz file is 295,831,712 bytes. Something about that number should warn you that simply running gunzip on this file may not be the best idea.

A better one?

Run: gunzip -l (the switch is lowercase “l” as in Larry) which delivers the following information:

compressed size: size of the compressed file
uncompressed size: size of the uncompressed file
ratio: compression ratio (0.0% if unknown)
uncompressed_name: name of the uncompressed file

Or, in this case:

gunzip -l content.rdf.u8.gz
compressed uncompressed ratio uncompressed_name
295831712 1975769200 85.0% content.rdf.u8

Yeah, that leading 1 under uncompressed is in the billions column. So just a tad shy of 2 GB of data.

Not everyone who is keeping up with search technology wants to burn a couple of GB of drive space (plus the time to write and re-read it) on a scratch copy, even though large drives are becoming more common.

What got my attention was the lack of documentation of the file size or of the problems such a download could cause casual experimenters.

But if we are going to work with this file, let’s do so without decompressing it to disk.

Since the tutorial only extracts URLs, I am taking that as the initial requirement although we will talk about more sophisticated requirements in just a bit.

On a *nix system it is possible to feed the output of one command to another command through what are called pipes. My thinking in this case was to use gunzip not to write a decompressed copy to disk, but to stream the decompressed content to another command that would extract the URLs. After I got that part working, I sorted and then deduped the URL set.

Here is the command, with [step] numbers that you should remove before trying to run it:

[1]gunzip -c content.rdf.u8.gz [2]| [3]grep -o 'http://[^"]*' [2]| [4]sort [2]| [5]uniq [6]> [7]dmoz.urls

  1. gunzip -c content.rdf.u8.gz – With the -c switch, gunzip does not change the original file but streams the uncompressed content to standard out. This is our starting point for dealing with files we would rather not expand on disk.
  2. | – This is the pipe character that moves the output of one command to be used by another. The shell command in this case has three (3) pipe commands.
  3. grep -o 'http://[^"]*' – With the -o switch, grep prints only the “matched” parts of a matching line (grep normally prints the entire line), with each match on its own line. The 'http://[^"]*' is a regular expression that matches text starting with http:// and continuing with any character other than a double quote; when a double quote is reached, the match is complete and that part prints. Note the “*” quantifier, which allows any number of characters up to the closing double quote. The entire expression is wrapped in single quotes because it contains a double quote character.
  4. sort – The result of #3 is piped into #4, where it is sorted. The sort is necessary because of the next command in the pipe.
  5. uniq – The sorted result is delivered to the uniq command which deletes any duplicate URLs. A requirement for the uniq command is that the duplicates be located next to each other, hence the sort command.
  6. > – Redirects the output of the uniq command to a file.
  7. dmoz.urls – The file name for the results.

The results were as follows:

  • dmoz.urls = 130,279,429 bytes – Remember the projected expansion of the original was 1,975,769,200 bytes, or 1,845,489,771 bytes larger.
  • dmoz.urls.gz = 27,832,013 bytes – The original was 295,831,712 bytes, or 267,999,699 bytes larger.
  • unique urls – 3,838,759 (I have no way to compare that to the original)

Note that it wasn’t necessary to process the RDF in order to extract a set of URLs for seeding a search engine.
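
For anyone who would rather stay in one language end to end, the same streaming idea looks like this in Python. A sketch, not a drop-in replacement for the pipeline above, and it assumes you can hold a few million URLs in memory:

import gzip
import re

URL = re.compile(rb'http://[^"]*')

urls = set()
# gzip.open decompresses on the fly; the full RDF never touches the disk.
with gzip.open("content.rdf.u8.gz", "rb") as f:
    for line in f:
        urls.update(URL.findall(line))

with open("dmoz.urls", "wb") as out:
    for u in sorted(urls):       # sorted to match the sort | uniq output
        out.write(u + b"\n")

print(len(urls), "unique URLs")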

Murray Altheim made several very good suggestions with regard to Java libraries and code for this task. Those don’t appear here but will appear in a command line tool for this dataset that allows the user to choose categories of websites to be extracted for seeding a search engine.

All that is preparatory to a command line tool for creating a topic map from a selected part of this data set and then enhancing it with search engine results.

Apologies for getting off track on the Nutch tutorial. There are some issues that remain to be addressed, typos and the like, which I will take up in the next post on this subject.

December 11, 2011

tokenising the visible english text of common crawl

Filed under: Cloud Computing,Dataset,Natural Language Processing — Patrick Durusau @ 10:20 pm

tokenising the visible english text of common crawl by Mat Kelcey.

From the post:

Common crawl is a publically available 30TB web crawl taken between September 2009 and September 2010. As a small project I decided to extract and tokenised the visible text of the web pages in this dataset. All the code to do this is on github.

Well, 30TB of data, that certainly sounds like a small project. 😉

What small amount of data are you using for your next project?

Graphs in Statistical Analysis

Filed under: Graphics,Graphs — Patrick Durusau @ 10:20 pm

Graphs in Statistical Analysis By Ajay Ohri.

From the post:

One of the seminal papers establishing the importance of data visualization (as it is now called) was the 1973 paper by F J Anscombe in http://www.sjsu.edu/faculty/gerstman/StatPrimer/anscombe1973.pdf.

If you haven’t read the Anscombe paper, or don’t remember it (I can’t remember which it is for me), take the time after you read this post. You will be glad you did.
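
If you want the punch line in code form: the quartet is four small datasets with nearly identical summary statistics that look nothing alike when plotted. A quick sketch, assuming seaborn (which ships a copy of the quartet as a sample dataset):

import seaborn as sns

df = sns.load_dataset("anscombe")   # columns: dataset, x, y

for name, group in df.groupby("dataset"):
    print(name,
          "mean_x=%.2f" % group.x.mean(),
          "mean_y=%.2f" % group.y.mean(),
          "corr=%.3f" % group.x.corr(group.y))

# All four groups print essentially the same numbers; only plotting them,
# e.g. sns.lmplot(data=df, x="x", y="y", col="dataset"), shows the differences.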

Klout Search Powered by ElasticSearch, Scala, Play Framework and Akka

Filed under: Social Media,Social Networks — Patrick Durusau @ 9:24 pm

Klout Search Powered by ElasticSearch, Scala, Play Framework and Akka

From the post:

At Klout, we love data and as Dave Mariani, Klout’s VP of Engineering, stated in his latest blog post, we’ve got lots of it! Klout currently uses Hadoop to crunch large volumes of data but what do we do with that data? You already know about the Klout score, but I want to talk about a new feature I’m extremely excited about — search!

Problem at Hand

I just want to start off by saying, search is hard! Yet, the requirements were pretty simple: we needed to create a robust solution that would allow us to search across all scored Klout users. Did I mention it had to be fast? Everyone likes to go fast! The problem is that 100 Million People have Klout (and that was this past September—an eternity in Social Media time) which means our search solution had to scale, scale horizontally.
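
The post is light on query details, but for readers who have not used ElasticSearch, a user search boils down to a JSON query against its REST API. A minimal sketch, not Klout’s actual setup; the index name (“users”) and field (“name”) are made up for illustration:

import json
import requests

query = {
    "query": {"match": {"name": "patrick"}},
    "size": 10,
}

resp = requests.post("http://localhost:9200/users/_search",
                     headers={"Content-Type": "application/json"},
                     data=json.dumps(query))

# Each hit carries a relevance score and the stored document.
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("name"))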

Well, more of a “testimonial” as the Wizard of Oz would say, but the numbers are serious enough to merit further investigation.

Although I must admit that social networking sites are spreading faster than, well, faster than some social contagions.

Unless someone is joining multiple times for each one, for spamming purposes, I suspect some consolidation is in the not too distant future. What happens to all the links, etc., at the services that go away?

Just curious.

Installing HBase over HDFS on a Single Ubuntu Box

Filed under: HBase — Patrick Durusau @ 9:23 pm

Installing HBase over HDFS on a Single Ubuntu Box

From the post:

I faced some issues making HBase run over HDFS on my Ubuntu box. This is a informal step-by-step guide from setting up HDFS to running HBase on a single Ubuntu machine.

I am going to be doing this fairly soon so let me know if this sounds about right. 😉

If I get to it before you do, I will return the favor.

The Coron System

Filed under: Associations,Data Mining,Software — Patrick Durusau @ 9:23 pm

The Coron System

From the overview:

Coron is a domain and platform independent, multi-purposed data mining toolkit, which incorporates not only a rich collection of data mining algorithms, but also allows a number of auxiliary operations. To the best of our knowledge, a data mining toolkit designed specifically for itemset extraction and association rule generation like Coron does not exist elsewhere. Coron also provides support for preparing and filtering data, and for interpreting the extracted units of knowledge.

In our case, the extracted knowledge units are mainly association rules. At the present time, finding association rules is one of the most important tasks in data mining. Association rules allow one to reveal “hidden” relationships in a dataset. Finding association rules requires first the extraction of frequent itemsets.

Currently, there exist several freely available data mining algorithms and tools. For instance, the goal of the FIMI workshops is to develop more and more efficient algorithms in three categories: (1) frequent itemsets (FI) extraction, (2) frequent closed itemsets (FCI) extraction, and (3) maximal frequent itemsets (MFI) extraction. However, they tend to overlook one thing: the motivation to look for these itemsets. After having found them, what can be done with them? Extracting FIs, FCIs, or MFIs only is not enough to generate really useful association rules. The FIMI algorithms may be very efficient, but they are not always suitable for our needs. Furthermore, these algorithms are independent, i.e. they are not grouped together in a unified software platform. We also did experiments with other toolkits, like Weka. Weka covers a wide range of machine learning tasks, but it is not really suitable for finding association rules. The reason is that it provides only one algorithm for this task, the Apriori algorithm. Apriori finds FIs only, and is not efficient for large, dense datasets.

Because of all these reasons, we decided to group the most important algorithms into a software toolkit that is aimed at data mining. We also decided to build a methodology and a platform that implements this methodology in its entirety. Another advantage of the platform is that it includes the auxiliary operations that are often missing in the implementations of single algorithms, like filtering and pre-processing the dataset, or post-processing the found association rules. Of course, the usage of the methodology and the platform is not narrowed to one kind of dataset only, i.e. they can be generalized to arbitrary datasets.

I found this too late in the weekend to do more than report it.

I have spent most of the weekend trying to avoid expanding a file to approximately 2 GB before parsing it. More on that saga later this week.

Anyway, Coron looks/sounds quite interesting.

Anyone using it that cares to comment on it?
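
In the meantime, to make the itemset and rule vocabulary in the overview concrete, here is a tiny self-contained sketch on toy market-basket data (standard library only). Coron and the FIMI algorithms do the same job at scale, with far better algorithms than brute-force counting:

from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]
min_support = 3  # an itemset is "frequent" if it occurs in at least 3 transactions

counts = Counter()
for t in transactions:
    for size in (1, 2):
        for itemset in combinations(sorted(t), size):
            counts[itemset] += 1

frequent = {k: v for k, v in counts.items() if v >= min_support}
print("frequent itemsets:", frequent)

# One association rule: bread -> milk,
# confidence = support({bread, milk}) / support({bread})
confidence = counts[("bread", "milk")] / counts[("bread",)]
print("confidence(bread -> milk) = %.2f" % confidence)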

Lost in Complexity

Filed under: Graphics,Visualization — Patrick Durusau @ 9:22 pm

Lost in Complexity

A lesson that complex graphics may outstrip the verbal abilities of the author. Be careful!

Visualization of Prosper.com’s Loan Data Part I of II….

Filed under: Finance Services,Visualization — Patrick Durusau @ 9:22 pm

Visualization of Prosper.com’s Loan Data Part I of II – Compare and Contrast with Lending Club

From the post:

Due to the positive feedback received on this post I thought I would re-create the analysis on another peer-to-peer lending dataset, courtesy of Prosper.com. You can access the Prosper Marketplace data via an API or by simply downloading XML files that are updated nightly http://www.prosper.com/tools/.

Interesting work both for data analysis as well as visualization.

Finance data and financial markets are all the rage these days, mostly because the rationally self-interested managed to trash them so badly. I thought this might be a good starting point for any topic mapping activities in the area.

Lifting the veil on my “system”

Filed under: Research Methods — Patrick Durusau @ 8:41 pm

Lifting the veil on my “system” by Meredith Farkas.

From the post:

I am a huge fan of research log and research process reflection assignments. Because research is a means to an end (the paper) and because people are often doing it in a rush, there is little reflection on process. What worked? What didn’t? What can I take from this experience for the next time I have to do something similar? Because this reflection is not usually written into the curriculum, students don’t learn enough from their mistakes or even the good things they did. Having a research log helps students become better researchers in the future and, most importantly, helps them to develop a “system” that works for them.

I definitely remember the many years that I did not have a system for research and writing. Most reference librarians have probably encountered a frantic student who realizes just before his/her paper is due that s/he can’t track down some of the sources they need to cite. Yeah, that was me (though I would have been too embarrassed to come to the reference desk). I probably never followed the same path twice and wasted a lot of time doing things over again because I wasn’t organized. Looking back, I wish a nice librarian had provided an session for me on developing a system for finding, organizing, reading and synthesizing information, because I wasted a lot of time and sweat needlessly.

What do you think? Would a topic mapping tool do better? Worse? About the same?

While you are at it, give Meredith some feedback as well.

December 10, 2011

Whentotweet.com – Twitter analytics for the masses

Filed under: Marketing — Patrick Durusau @ 8:10 pm

Whentotweet.com – Twitter analytics for the masses

From the post:

Twitter handles an amazing number of Tweets – over 200 million tweets are sent per day.

We saw that many Twitter users were tweeting interesting content but much of it was lost in the constant stream of tweets.

Whentotweet.com is born

While there were many tools for corporate Twitter users that performed deep analytics and provided insight into their tweets, there were none that answered the most basic question: what time of the day are my followers actually using Twitter?

And so the idea behind Whentotweet was born. In its current form, Whentotweet analyzes when your followers tweet and gives you a personalized recommendation of the best time of day to tweet to reach as many as possible.

I mention this in part so that you may become better at getting your messages about topic maps out over Twitter.

An equally pragmatic reason is that the success of topic maps depends on the identification of use cases that will seem perfectly natural once you suggest them. Take this site/service as an example of meeting a need that is “obvious” once someone pointed it out.

Try it at: www.whentotweet.com

Software as a Religion ( SaaR)

Filed under: Humor — Patrick Durusau @ 8:09 pm

Software as a Religion ( SaaR) by Ajay Ohri

From the post:

The decline of organized religion and debate about such matters in the Western Hemisphere has been co-related to the increase in debates and arguments (again mostly) in the Western Hemisphere on software. Be it the PC vs Mac, the Microsofties vs Open Sourcers, the not so evil Google versus fans of Facebook, considerable activity is now being done by human beings in terms of social interaction on the merit’s and demerit’s of each software bundle. Perhaps for the first time in human history these interactions are being captured digitally on medium (that is hopefully longer lasting than papyrus).

I like that. Ontologies, folksonomies, taxonomies, Cyc, SUMO, RDF, OWL, topic maps, Description Logic, Formal Logic, Half-way Logic, Graphs, Existential Graphs, Essential Graphs, BCS Graphs, etc., all go unmentioned! Woefully partial listing of religious debates.

Not real sure where the author gets “…hopefully longer lasting than papyrus.” Fairly “recent” specimens are on the order of 4,000 years old. Some texts written in clay are a couple of thousand years older than that. (Allowing for differences in calendars and episodic destruction of entire civilizations.)

What’s your religious flavor?

Exploring Hadoop OutputFormat

Filed under: Hadoop — Patrick Durusau @ 8:06 pm

Exploring Hadoop OutputFormat by Jim.Blomo.

From the post:

Hadoop is often used as a part in a larger ecosystem of data processing. Hadoop’s sweet spot, batch processing large amounts of data, can best be put to use by integrating it with other systems. At a high level, Hadoop ingests input files, streams the contents through custom transformations (the Map-Reduce steps), and writes output files back to disk. Last month InfoQ showed how to gain finer control over the first step, ingestion of input files via the InputFormat class. In this article, we’ll discuss how to customize the final step, writing the output files. OutputFormats let you easily interoperate with other systems by writing the result of a MapReduce job in formats readable by other applications. To demonstrate the usefulness of OutputFormats, we’ll discuss two examples: how to split up the result of a job into different directories, and how to write files for a service providing fast key-value lookups.

One more set of tools to add to your Hadoop toolbox!

Discover Knowledge Paths

Filed under: Education,Training — Patrick Durusau @ 8:05 pm

Discover Knowledge Paths

Have you seen the “Knowledge Paths” at IBM developerWorks?

I don’t know if it is “new” or if the logo next to a page where I was reading happened to catch my eye. Looking at the “paths” by their dates, it looks like early October 2011 when it was rolled out. Does anyone know differently?

It doesn’t look real promising at first but you have to drill down to find the goodies.

For example, I chose “Open Source Skills,” which led to:

Open source development with Eclipse: Master the basics
Learn the basics and get started working with Eclipse, an extensible open source development platform.

OK, but it isn’t clear what I am about to find when I follow “Open source development with Eclipse: Master the basics”:

1. Learn about the Eclipse platform
2. Install and use Eclipse
3. Migrate to Eclipse from other environments
4. Debug with Eclipse
5. Combine Eclipse with other tools

12 Reads, 8 Practice, 1 Watch, 1 Download.

IBM needs to distinguish this material from other developerWorks content. The articles are all great, but this is supposed to be something different.

It could be as simple as:

Open source development with Eclipse: Master the basics
12 Reads, 8 Practice, 1 Watch, 1 Download

So the reader knows this isn’t your average read along with the author sort of resource.

And while I did not look at the others closely, consistency in the presentation of the paths would help: that is, all paths having read/practice/watch/download counts (or some other common structure) so that readers have an expectation of the content from one path to the next. Think of the Java paths that Sun pioneered as an example.

Oh, and do have someone review the naming of the paths. “Querying XML from Java Applications” and its description don’t mention XQuery at all. Something like: “XQuery: Bending Data (and XML) to Your Will” would be much better.

A good start that could become a lodestone for training materials for designers and engineers. Particularly if sufficient guidance is given on creation and maintenance of content to make it attractive for third party content developers.

An alternative to having to hunt down partial, dated and not always accurate guidance about open source projects from mailing lists and blogs.

Scheduling in Hadoop

Filed under: Hadoop — Patrick Durusau @ 8:04 pm

Scheduling in Hadoop: An introduction to the pluggable scheduler framework, by M. Tim Jones, Consultant Engineer and independent author.

Summary:

Hadoop implements the ability for pluggable schedulers that assign resources to jobs. However, as we know from traditional scheduling, not all algorithms are the same, and efficiency is workload and cluster dependent. Get to know Hadoop scheduling, and explore two of the algorithms available today: fair scheduling and capacity scheduling. Also, learn how these algorithms are tuned and in what scenarios they’re relevant.

Not all topic maps are going to need Hadoop but enough will to make knowing the internals of Hadoop a real plus!

Understanding and Visualizing Solr Explain Information

Filed under: Solr,Visualization — Patrick Durusau @ 8:03 pm

Understanding and Visualizing Solr Explain Information by Rafal Kuc.

From the description:

This talk and presentation by Rafal Kuc, a DZone MVB, is about how to use, understand and visualize Solr ‘explain’ information—essential output from Solr that lets you better tune and debug your search application. In the talk, I’ll show the free software that is in development right now, that visualize Solr ‘explain’ information, such as how the score of the documents were counted, from what it is taken, how it was counted, which tokens mattered the most, and so on.

Session slides are also available.

Be forewarned, the proposed “explain” application has pie charts. 😉
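
If you want to see the raw “explain” output that a visualizer like this consumes, Solr will include it when you add debugQuery=true to a request. A minimal sketch; the core name (“collection1”) and query are placeholders:

import requests

params = {
    "q": "title:visualization",
    "wt": "json",
    "debugQuery": "true",
}
resp = requests.get("http://localhost:8983/solr/collection1/select", params=params)

# Maps each document id to the per-document score breakdown the tool charts.
for doc_id, explanation in resp.json()["debug"]["explain"].items():
    print(doc_id)
    print(explanation)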

Wakanda js.everywhere()

Filed under: Javascript,Wakanda — Patrick Durusau @ 8:02 pm

Wakanda js.everywhere() by Alexandre Morgaut.

Really do wish I had seen this one.

Guess what well known representation appears on slide 6?

😉

It is a worthy, if unattainable goal.

Still, enjoy the slides and Wakanda.

Fluentd: the missing log collector

Filed under: Fluentd,Flume,Log Analysis — Patrick Durusau @ 8:01 pm

Fluentd: the missing log collector

From the post:

The Problems

The fundamental problem with logs is that they are usually stored in files although they are best represented as streams (by Adam Wiggins, CTO at Heroku). Traditionally, they have been dumped into text-based files and collected by rsync in hourly or daily fashion. With today’s web/mobile applications, this creates two problems.

Problem 1: Need Ad-Hoc Parsing

The text-based logs have their own format, and the analytics engineer needs to write a dedicated parser for each format. However, You are a DATA SCIENTIST, NOT A PARSER GENERATOR, right? 🙂

Problem 2: Lacks Freshness

The logs lag. The realtime analysis of user behavior makes feature iterations a lot faster. A nimbler A/B testing will help you differentiate your service from competitors.

This is where Fluentd comes in. We believe Fluentd solves all issues of scalable log collection by getting rid of files, and turns logs into true semi-structured data streams.

If you are interested in log file processing, take a look at Fluentd and compare it to the competition.
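
To see what “logs as structured streams” looks like from the application side, here is a minimal sketch using the fluent-logger Python package, assuming a local Fluentd instance listening on its default port (24224):

from fluent import sender
from fluent import event

sender.setup("webapp", host="localhost", port=24224)

# Instead of formatting a text line and writing a parser later, emit the
# record as key/value pairs; Fluentd routes it by tag ("webapp.purchase").
event.Event("purchase", {
    "user_id": 42,
    "item": "book",
    "price_cents": 1999,
})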

As for logs as streams, I think the “file view” of most data, logs or not, isn’t helpful. What does it matter to me if the graphs for a document are being generated in real time by a server and updated in my document? Or that a select bibliography is being updated so that readers get late-breaking research in a fast developing field?

The “fixed text” of a document is a view based upon the production means for documents. When those production means change, so should our view of documents.

December 9, 2011

dmoz – open directory project

Filed under: Search Data,Search Engines — Patrick Durusau @ 8:25 pm

dmoz – open directory project

This came up in the discussion of the Nutch Tutorial and I thought it might be helpful to have an entry on the site.

It is a collection of hand-edited resources which as of today claims:

4,952,266 sites – 92,824 editors – over 1,008,717 categories

The information you will find under the “help” menu item will be very valuable as you learn to make use of the data files from this source.

