Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

May 5, 2012

The Graphical Logic of C. S. Peirce

Filed under: Peirce — Patrick Durusau @ 6:57 pm

The Graphical Logic of C. S. Peirce by J. Jay Zeman.

Zeman’s 1964 dissertation on C.S. Peirce.

Other materials collected by Zeman on Peirce at: Peirce Contents.

You will hear Peirce’s name bandied about a good bit in discussions of semiotics, semantics, logic, etc.

Zeman wrote before Peirce became “popular.”

Big Data Analytics with R and Hadoop

Filed under: BigData,Hadoop,R — Patrick Durusau @ 6:56 pm

Big Data Analytics with R and Hadoop by David Smith.

From the post:

The open-source RHadoop project makes it easier to extract data from Hadoop for analysis with R, and to run R within the nodes of the Hadoop cluster — essentially, to transform Hadoop into a massively-parallel statistical computing cluster based on R. In yesterday’s webinar (the replay of which is embedded below), Data scientist and RHadoop project lead Antonio Piccolboni introduced Hadoop and explained how to write map-reduce statements in the R language to drive the Hadoop cluster.

Something to catch up on over the weekend.
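If you want the shape of the map-reduce step before watching the webinar, here is a minimal word-count sketch using Hadoop Streaming with Python. It is a stand-in for illustration only, not the RHadoop API (which expresses the same mapper/reducer pair in R), and the jar path and input/output directories below are assumptions about a typical Hadoop install.

    #!/usr/bin/env python
    # mapper.py: read lines from stdin, emit one "word<TAB>1" pair per word.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word.lower(), 1))

    #!/usr/bin/env python
    # reducer.py: sum counts per word; Hadoop Streaming delivers keys grouped and sorted.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

A typical (assumed) invocation would look something like: hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -input /data/in -output /data/out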

BTW, do you know the difference between “massively-parallel” and “parallel?” I would think the “Connection Machine” was “massively-parallel” for its time but that was really specialized hardware. Does “massively” mean anything now or is it just a holdover/marketing term?

Building a Web-Based Legislative Editor

Filed under: Editor,Law — Patrick Durusau @ 6:56 pm

Building a Web-Based Legislative Editor by Grant Vergottini.

From the post:

I built the legislative drafting tool used by the Office of the Legislative Counsel in California. It was a long and arduous process and took several years to complete. Issues like redlining, page & line numbers, and the complexities of tables turned an effort that looks quite simple on the surface into a very difficult task. We used XMetaL as the base tool and customized it from there, developing what has to be the most sophisticated implementation of XMetaL out there. We even had to have a special API added to XMetaL to allow us to drive the change tracking mechanism to support the very specialized redlining needs one finds in legislation.

…With HTML5, it is now possible to build a full fledged browser-based legislative editor. For the past few months I have been building a prototype legislative editor in HTML5 that uses Akoma Ntoso as its XML schema. The results have been most gratifying. Certainly, building such an editor is no easy task. Having been working in this subject for 10 years now I have all the issues well internalized and can navigate the difficulties that arise. But I have come a long way towards achieving the holy grail of legislative editors – a web-based, standards-based, browser-neutral solution.

Not even out in beta yet, but a promising report from someone who knows the ins and outs of legislative editors.

Why is that relevant for topic maps?

A web-based editor could (though not necessarily will) lead to custom editors configured for the workflows that produce topic map work products.

If you think about it, we interact with workflows by recognizing subjects and then taking action based on the subjects we recognize.

Not a big step for software to record which subjects we have recognized, while our machinery silently adds identifiers, updates indexes of associations and performs other tasks.

PS: I originally saw this mentioned at the Legal Informatics blog.

HCIR 2012 Symposium

Filed under: Conferences,HCIR — Patrick Durusau @ 6:56 pm

HCIR 2012 Symposium

Important Dates:

  • Submission deadline (position and research papers): Sunday, July 29
  • HCIR Challenge:
    • Request access to corpus: Friday, June 1
    • Freeze system and submit brief description: Friday, August 31
    • Submit videos or screenshots demonstrating systems on example tasks: Friday, September 14
    • Live demonstrations at symposium: October 4-5
  • Notification date for position and research papers: Thursday, September 6
  • Final versions of accepted papers due: Sunday, September 16
  • Presentations and poster session at symposium: October 4-5

Gene Golovchinsky writes:

We are happy to announce that the 2012 Human-Computer Information Retrieval Symposium (HCIR 2012) will be held in Cambridge, Massachusetts October 4 – 5, 2012. The HCIR series of workshops has provided a venue for discussion of ongoing research on a range of topics related to interactive information retrieval, including interaction techniques, evaluation, models and algorithms for information retrieval, visual design, user modeling, etc. The focus of these meetings has been to bring together people from industry and academia for short presentations and in-depth discussion. Attendance has grown steadily since the first meeting, and as a result this year we have decided to modify the structure of the meeting to accommodate the increasing demand for participation.

To this end, this year’s event has been expanded to two days to allow more time for presentations and for discussion. In addition to the position papers and challenge reports from previous years, we are introducing a new submission category, the archival paper. Archival papers will be peer-reviewed to a rigorous standard comparable to first-tier conference submissions, and the accepted papers will be published on arXiv.org and indexed in the ACM Digital Library.

It’s Massachusetts in October (think Fall colors) and it sounds like a great conference.

Announcing Apache Hive 0.9.0

Filed under: Hive,NoSQL — Patrick Durusau @ 6:55 pm

Announcing Apache Hive 0.9.0 by Carl Steinbach.

From the post:

This past Monday marked the official release of Apache Hive 0.9.0. Users interested in taking this release of Hive for a spin can download a copy from the Apache archive site. The following post is a quick summary of new features and improvements users can expect to find in this update of the popular data warehousing system for Hadoop.

The 0.9.0 release continues the trend of extending Hive’s SQL support. Hive now understands the BETWEEN operator and the NULL-safe equality operator, plus several new user defined functions (UDF) have now been added. New UDFs include printf(), sort_array(), and java_method(). Also, the concat_ws() function has been modified to support input parameters consisting of arrays of strings.

This Hive release also includes several significant improvements to the query compiler and execution engine. HIVE-2642 improved Hive’s ability to optimize UNION queries, HIVE-2881 made the map-side JOIN algorithm more efficient, and Hive’s ability to generate optimized execution plans for queries that contain multiple GROUP BY clauses was significantly improved in HIVE-2621.
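To get a concrete feel for the new SQL support, here are a few illustrative HiveQL statements (held in a Python list for convenience) exercising the features named above. The “events” table and its columns are invented for illustration; the operators and functions are the ones the release notes describe.

    # Hypothetical queries exercising the Hive 0.9.0 additions described above.
    # The "events" table and its columns are made up; "tags" is assumed to be
    # an array<string> column.
    new_feature_queries = [
        # BETWEEN operator
        "SELECT id FROM events WHERE score BETWEEN 10 AND 20",
        # NULL-safe equality operator: NULL <=> NULL evaluates to true
        "SELECT id FROM events WHERE referrer <=> campaign",
        # New UDFs printf() and sort_array()
        "SELECT printf('%05d', id), sort_array(tags) FROM events",
        # concat_ws() now accepts an array of strings
        "SELECT concat_ws(',', tags) FROM events",
    ]
    for q in new_feature_queries:
        print(q)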

The database world just keeps getting better!

DARPA system to blend AI, machine learning to understand mountain of text

Filed under: Artificial Intelligence,Machine Learning — Patrick Durusau @ 6:55 pm

DARPA system to blend AI, machine learning to understand mountain of text

From the post:

The Defense Advanced Research Projects Agency (DARPA) will detail later this month the union of advanced technologies from the artificial intelligence, computational linguistics, machine learning, and natural-language fields it hopes to bring together to build an automated system that will let analysts and others better grasp meanings from large volumes of text documents.

From DARPA: “Automated, deep natural-language understanding technology may hold a solution for more efficiently processing text information. When processed at its most basic level without ingrained cultural filters, language offers the key to understanding connections in text that might not be readily apparent to humans. Sophisticated artificial intelligence of this nature has the potential to enable defense analysts to efficiently investigate orders of magnitude more documents so they can discover implicitly expressed, actionable information contained within them.”

DARPA is holding a proposers day, May 16, 2012 in Arlington, VA, on the Deep Exploration and Filtering of Text (DEFT) project.

I won’t be attending but am interested in what you learn about the project.

What has me curious is this: assuming DEFT is successful, how do they intend to capture the insights of analysts who describe the data and their conclusions differently? Particularly over time, or from the perspective of different intelligence agencies? Or document the trails a particular analyst has followed through a mountain of data? Those seem like important issues as well.

Issues that are uniquely suited for subject-centric approaches like topic maps.

Avengers, Assembled (and Visualized) – Part 1

Filed under: Graphics,Visualization — Patrick Durusau @ 6:55 pm

Avengers, Assembled (and Visualized) – Part 1

Jer writes:

This post is about comics. It’s also about superheroes, robots, Norse gods, shrinking men, and women made of light – so it makes sense that it was inspired in the first place by a 10 year-old.

Last week, I was pointed by Santiago Ortiz to this excellent chart made by Theo Zaballos, in which he plots the relative interestingness in Avengers characters from the animated series, over time. It’s a fantastic example of the power of visualization to help us understand things – or, put another way, the power of building systems to think about systems. It’s also a reminder that visualization doesn’t always need to be pitted against huge, world-changing tasks – it can be useful in exploring small, fun, even seemingly frivolous things.

I started reading comics in 1985 (coincidentally, when I was 10). For years, I’d visit the comic shop every Wednesday, and pick up a stack of titles – and The Avengers was a real mainstay on my list. I was always more of a reader than a collector; my longboxes were full of dog-eared issues from incomplete series, which I revisited over and over again until the stories imprinted themselves in my brain.

There’s a huge storehouse of mythology, cultural touchstones, and real historical events contained in the pages of the 570 issues of the Avengers.

Inspired by Theo, and using comicvine.com’s API, I’ve put together a few datasets and some tools that I can use to visually explore some of this leotarded history.

I finally had to stop looking at various Avenger stuff and write this post. 😉

Very addictive visualization and a good illustration that you can practice/learn visualization skills with data sets that interest you.

Or that you find entertaining.

Are there any similar (dissimilar?) data sets that you would like to suggest?

PS: I like the “Part 1” in the title. Promises more to come.

Context models and out-of-context objects

Filed under: Context,Context Models — Patrick Durusau @ 6:55 pm

Context models and out-of-context objects by Myung Jin Choi, Antonio Torralba, and Alan S. Willsky.

Abstract:

The context of an image encapsulates rich information about how natural scenes and objects are related to each other. Such contextual information has the potential to enable a coherent understanding of natural scenes and images. However, context models have been evaluated mostly based on the improvement of object recognition performance even though it is only one of many ways to exploit contextual information. In this paper, we present a new scene understanding problem for evaluating and applying context models. We are interested in finding scenes and objects that are “out-of-context”. Detecting “out-of-context” objects and scenes is challenging because context violations can be detected only if the relationships between objects are carefully and precisely modeled. To address this problem, we evaluate different sources of context information, and present a graphical model that combines these sources. We show that physical support relationships between objects can provide useful contextual information for both object recognition and out-of-context detection.

The authors distinguish object recognition in surveillance video from recognition in still photographs, the latter being the subject of the investigation here. A “snapshot,” if you will.

Subjects in digital media, assuming you don’t have the authoring data stream, exist in “snapshots” of a sort, don’t they?

To start with, they are bound up in a digital artifact which, among other things, lives in a file system, with a last-modified date, among many other files.

There may be more “context” for subjects in digital files than appears at first blush. Will have to give that some thought.

May 4, 2012

Titles from Springer collection cover wide range of disciplines on Apple’s iBookstore

Filed under: Books,Data,Springer — Patrick Durusau @ 3:44 pm

Titles from Springer collection cover wide range of disciplines on Apple’s iBookstore

From the post:

Springer Science+Business Media now offers one of the largest scientific, technical and medical (STM) book collections on the iBookstore with more than 20,000 individual Springer titles. Cornerstone works in disciplines like mathematics, medicine and engineering are now available, along with selections in other fields such as business and economics. Titles include the Springer Handbook of Nanotechnology, Pattern Recognition and Machine Learning, Bergey’s Manual of Systematic Bacteriology and the highly regarded book series Graduate Texts in Mathematics.

Springer is currently undertaking an exhaustive effort to digitize all of its books dating back to the mid-nineteenth century. By making most of its entire collection – both new and archived titles – available through its SpringerLink platform, Springer offers STM researchers far more opportunities than ever to obtain and apply content.

Gee, do you think the nomenclature has changed between the mid-nineteenth century and now? Just a bit? To say nothing of differences across languages.

Prime topic map territory, both for traditional build-and-sell versions and for topic trails through the literature.

Will have to check to see how far back the current Springer API goes.

Dempsy – a New Real-time Framework for Processing BigData

Filed under: Akka,Apache S4,Dempsy,Erlang,Esper,HStreaming,Storm,Streambase — Patrick Durusau @ 3:43 pm

Dempsy – a New Real-time Framework for Processing BigData by Boris Lublinsky.

From the post:

Real time processing of BigData seems to be one of the hottest topics today. Nokia has just released a new open-source project – Dempsy. Dempsy is comparable to Storm, Esper, Streambase, HStreaming and Apache S4. The code is released under the Apache 2 license.

Dempsy is meant to solve the problem of processing large amounts of "near real time" stream data with the lowest lag possible; problems where latency is more important than "guaranteed delivery." This class of problems includes use cases such as:

  • Real time monitoring of large distributed systems
  • Processing complete rich streams of social networking data
  • Real time analytics on log information generated from widely distributed systems
  • Statistical analytics on real-time vehicle traffic information on a global basis

The important properties of Dempsy are:

  • It is Distributed. That is to say a Dempsy application can run on multiple JVMs on multiple physical machines.
  • It is Elastic. That is, it is relatively simple to scale an application to more (or fewer) nodes. This does not require code or configuration changes but is done by dynamic insertion or removal of processing nodes.
  • It implements Message Processing. Dempsy is based on message passing. It moves messages between Message processors, which act on the messages to perform simple atomic operations such as enrichment, transformation, etc. In general, an application is intended to be broken down into more, smaller, simpler processors rather than fewer large, complex processors.
  • It is a Framework. It is not an application container like a J2EE container, nor a simple library. Instead, like the Spring Framework, it is a collection of patterns, the libraries to enable those patterns, and the interfaces one must implement to use those libraries to implement the patterns.

Dempsy’s programming model is based on message processors communicating via messages and resembles a distributed actor framework. While not strictly speaking an actor framework in the sense of Erlang or Akka actors, where actors explicitly direct messages to other actors, Dempsy’s Message Processors are "actor like POJOs" similar to Processor Elements in S4 and to some extent Bolts in Storm. Message processors are similar to actors in that they operate on a single message at a time, and need not deal with concurrency directly. Unlike actors, Message Processors are also relieved of the need to know the destination(s) for their output messages, as this is handled internally by Dempsy based on the message properties.

In short Dempsy is a framework to enable the decomposing of a large class of message processing problems into flows of messages between relatively simple processing units implemented as POJOs. 

The Dempsy Tutorial contains more information.
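The message-processor idea is easier to see in code than in prose. Here is a rough sketch, in Python rather than Java, of the pattern the post describes: stateful, actor-like processors addressed by a message key, each handling one message at a time, with routing done by the surrounding framework rather than by the processors themselves. This illustrates the pattern only; it is not the Dempsy API.

    # Sketch of the message-processor pattern described above (not the Dempsy API).
    # The Dispatcher plays the framework's role: it extracts a key from each
    # message and routes it to the processor instance dedicated to that key.

    class WordCounter:
        """Actor-like processor: one message at a time, local state, no routing logic."""
        def __init__(self, key):
            self.key = key
            self.count = 0

        def handle(self, message):
            self.count += 1
            # In a real framework the result would be emitted as a new message.
            return (self.key, self.count)

    class Dispatcher:
        def __init__(self, processor_factory, key_fn):
            self.processors = {}
            self.processor_factory = processor_factory
            self.key_fn = key_fn

        def dispatch(self, message):
            key = self.key_fn(message)
            if key not in self.processors:
                self.processors[key] = self.processor_factory(key)
            return self.processors[key].handle(message)

    dispatcher = Dispatcher(WordCounter, key_fn=lambda msg: msg["word"])
    for w in ["graph", "map", "graph"]:
        print(dispatcher.dispatch({"word": w}))  # ('graph', 1) ('map', 1) ('graph', 2)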

See the post for an interview with Dempsy’s creator, NAVTEQ Fellow Jim Carroll.

Will the “age of data” mean that applications and their code will also be viewed and processed as data? The capabilities you have are those you request for a particular data set? Would like to see topic maps on the leading (and not dragging) edge of that change.

Choosing reference levels

Filed under: Graphics,Humor — Patrick Durusau @ 3:43 pm

Choosing reference levels

Amusing graphics for you to remember the next time you are designing the perfect graphic.

Would not hurt to get someone new to the subject/graphic to take a look.

How do you compare two text classifiers?

Filed under: Classification,Classifier — Patrick Durusau @ 3:43 pm

How do you compare two text classifiers?

Tony Russell-Rose writes:

I need to compare two text classifiers – one human, one machine. They are assigning multiple tags from an ontology. We have an initial corpus of ~700 records tagged by both classifiers. The goal is to measure the ‘value added’ by the human. However, we don’t yet have any ground truth data (i.e. agreed annotations).

Any ideas on how best to approach this problem in a commercial environment (i.e. quickly, simply, with minimum fuss), or indeed what’s possible?

I thought of measuring the absolute delta between the two profiles (regardless of polarity) to give a ceiling on the value added, and/or comparing the profile of tags added by each human coder against the centroid to give a crude measure of inter-coder agreement (and hence difficulty of the task). But neither really measures the ‘value added’ that I’m looking for, so I’m sure there must be better solutions.

Suggestions, anyone? Or is this as far as we can go without ground truth data?
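One quick, low-fuss starting point, offered as a sketch rather than an answer to the ‘value added’ question: compute per-record agreement between the two tag sets, for example Jaccard overlap, and look at its distribution. The records and tags below are invented for illustration.

    # Per-record Jaccard agreement between two multi-label taggers (toy data).
    def jaccard(a, b):
        a, b = set(a), set(b)
        if not a and not b:
            return 1.0
        return len(a & b) / float(len(a | b))

    human   = {"rec1": {"law", "privacy"}, "rec2": {"finance"}}
    machine = {"rec1": {"law"},            "rec2": {"finance", "tax"}}

    scores = {rec: jaccard(human[rec], machine[rec]) for rec in human}
    print(scores)                                 # per-record agreement
    print(sum(scores.values()) / len(scores))     # mean agreement across records

Low agreement on a record does not by itself say which tagger added value, but it does flag the records where a human adjudication pass would be most informative.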

Some useful comments have been made. Do you have others?

PS: I wrote at Tony’s blog in a comment:

Tony,

The ‘value added’ by human taggers concept is unclear. The tagging in both cases is the result of humans adding semantics: once through the rules for the machine tagger and once via the “human” taggers.

Can you say a bit more about what you see as a separate ‘value added’ by the human taggers?

What do you think? Is Tony’s question clear enough?

Bridging the Data Science Gap (DataKind)

Filed under: Data,Data Analysis,Data Science,Data Without Borders,DataKind — Patrick Durusau @ 3:43 pm

Bridging the Data Science Gap

From the post:

Data Without Borders connects data scientists with social organizations to maximize their impact.

Data scientists want to contribute to the public good. Social organizations often boast large caches of data but neither the resources nor the skills to glean insights from them. In the worst case scenario, the information becomes data exhaust, lost to neglect, lack of space, or outdated formats. Jake Porway, Data Without Borders [DataKind] founder and The New York Times data scientist, explored how to bridge this gap during the second Big Data for the Public Good seminar, hosted by Code for America and sponsored by Greenplum, a division of EMC.

Code for America founder Jennifer Pahlka opened the seminar with an appeal to the data practitioners in the room to volunteer for social organizations and civic coding projects. She pointed to hackathons such as the ones organized during the nationwide event Code Across America as being examples of the emergence of a new kind of “third place”, referencing sociologist Ray Oldenburg’s theory that the health of a civic society depends upon shared public spaces that are neither home nor work. Hackathons, civic action networks like the recently announced Code for America Brigade, and social organizations are all tangible third spaces where data scientists can connect with community while contributing to the public good.

These principles are core to the Data Without Borders [DataKind] mission. “Anytime there’s a process, there’s data,” Porway emphasized to the audience. Yet much of what is generated is lost, particularly in the third world, where a great amount of information goes unrecorded. In some cases, the social organizations that often operate on shoestring budgets may not even appreciate the value of what they’re losing. Meanwhile, many data scientists working in the private sector want to contribute their skills for the social good in their off-time. “On the one hand, we have a group of people who are really good at looking at data, really good at analyzing things, but don’t have a lot of social outputs for it,” Porway said. “On the other hand, we have social organizations that are surrounded by data and are trying to do really good things for the world but don’t have anybody to look at it.”

The surplus of free work to be done is endless, but I thought you might find this interesting.

Data Without Borders has changed its name to DataKind: Facebook page, @datakind on Twitter.

Good opportunity to show off your topic mapping skills!

Machine Learning in Python Has Never Been Easier!

Filed under: Machine Learning,Python — Patrick Durusau @ 3:41 pm

Machine Learning in Python Has Never Been Easier!

From the post:

At BigML we believe that over the next few years automated, data-driven decisions and data-driven applications are going to change the world. In fact, we think it will be the biggest shift in business efficiency since the dawn of the office calculator, when individuals had “Computer” listed as the title on their business card. We want to help people rapidly and easily create predictive models using their datasets, no matter what size they are. Our easy-to-use, public API is a great step in that direction but a few bindings for popular languages is obviously a big bonus.

Thus, we are very happy to announce an open source Python binding to BigML.io, the BigML REST API. You can find it and fork it at Github.

The BigML Python module makes it extremely easy to programmatically manage BigML sources, datasets, models and predictions. The snippet below sketches how you can create a source, dataset, model and then a prediction for a new object.
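The snippet itself is not reproduced here, but the flow the post describes looks roughly like the following with the BigML Python bindings. Treat it as a sketch: credentials are read from the environment, and the CSV path and input field names are placeholders.

    # Rough sketch of the source -> dataset -> model -> prediction flow
    # described in the post, using the BigML Python bindings.
    from bigml.api import BigML

    api = BigML()  # expects BIGML_USERNAME and BIGML_API_KEY in the environment

    source = api.create_source("./data/iris.csv")    # upload raw data
    dataset = api.create_dataset(source)             # columns plus summaries
    model = api.create_model(dataset)                # build a predictive model
    prediction = api.create_prediction(
        model, {"sepal length": 5.0, "sepal width": 2.5})  # score a new object

    print(prediction)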

The “business efficiency” argument sounds like the “paperless office” to me.

Certainly we will be able to do different, interesting, and quite possibly useful things with machine learning and data. That we will become more “efficient” is a separate question. By what measure?

If you look at scholarship from the 19th century, where people lacked many of the time saving devices of today, you will find authors who published hundreds of books, not articles, books. And not short books either. Were they more “efficient” than we are?

Rather than promise “efficiency,” promote machine learning as a means to do a particular task and do it well. If there is interest in the task and/or the result, that will be sufficient without all the superlatives.

Your Random Numbers – Getting Started with Processing and Data Visualization

Filed under: Graphics,Processing,Visualization — Patrick Durusau @ 3:41 pm

Your Random Numbers – Getting Started with Processing and Data Visualization

Jer writes:

Over the last year or so, I’ve spent almost as much time thinking about how to teach data visualization as I’ve spent working with data. I’ve been a teacher for 10 years – for better or for worse this means that as I learn new techniques and concepts, I’m usually thinking about pedagogy at the same time. Lately, I’ve also become convinced that this massive ‘open data’ movement that we are currently in the midst of is sorely lacking in educational components. The amount of available data, I think, is quickly outpacing our ability to use it in useful and novel ways. How can basic data visualization techniques be taught in an easy, engaging manner?

This post, then, is a first sketch of what a lesson plan for teaching Processing and data visualization might look like. I’m going to start from scratch, work through some examples, and (hopefully) make some interesting stuff. One of the nice things, I think, about this process, is that we’re going to start with fresh, new data – I’m not sure what kind of things we’re going to find once we start to get our hands dirty. This is what is really exciting about data visualization; the chance to find answers to your own, possibly novel questions.

Were I able to teach topic maps so well they would be about to take over the world!

See what you think.

Processing & Twitter

Filed under: Graphics,Processing,Tweets,Visualization — Patrick Durusau @ 3:41 pm

Processing & Twitter

Jer writes:

** Since I first released this tutorial in 2009, it has received thousands of views and has hopefully helped some of you get started with building projects incorporating Twitter with Processing. In late 2010, Twitter changed the way that authorization works, so I’ve updated the tutorial to get it inline with the new Twitter API functionality.

Accessing information from the Twitter API with Processing is (reasonably) easy. A few people have sent me e-mails asking how it all works, so I thought I’d write a very quick tutorial to get everyone up on their feet.

We don’t need to know too much about how the Twitter API functions, because someone has put together a very useful Java library to do all of the dirty work for us. It’s called twitter4j, and you can download it here. We’ll be using this in the first step of the building section of this tutorial.

A nice introduction to Twitter (an information stream) and Processing (a visualization language).

Both of which may find their way into your topic maps.

Snooze

Filed under: Cloud Computing,Snooze — Patrick Durusau @ 3:41 pm

Snooze

From the website:

Snooze is an open-source scalable, autonomic, and energy-efficient virtual machine (VM) management framework for private clouds. It allows users to build compute infrastructures from virtualized resources and manage a large number of VMs. Snooze is one of the core results of Eugen Feller’s PhD thesis under the supervision of Dr. Christine Morin at the INRIA MYRIADS project-team. The prototype is now used within the MYRIADS project-team in various cloud computing research activities.

For scalability Snooze employs a self-organizing hierarchical architecture and performs distributed VM management. Particularly, VM management is achieved by multiple managers, with each manager being in charge of a subset of nodes. In addition, fault tolerance is provided at all levels of the hierarchy. Finally, VM monitoring and live migration is integrated into the framework and a generic VM scheduler exists to facilitate the development of advanced VM placement algorithms. Last but not least once idle, servers are automatically transitioned into the system administrator specified power-state (e.g. suspend) to save energy. To facilitate idle times Snooze integrates dynamic VM relocation and consolidation.

Just in case you need to build a private cloud for your topic map or want to work on application of topic maps to a cloud and its components.

PS: Do note the additional subject identified by the string “snooze.”

Machine See, Machine Do

Filed under: Games,Machine Learning,Multimedia,Music,Music Retrieval — Patrick Durusau @ 3:40 pm

While we wait for maid service robots, there is news that computers can be trained as human mimics for labeling multimedia resources. Game-powered machine learning reports success with game-based training for music labeling.

The authors, Luke Barrington, Douglas Turnbull, and Gert Lanckriet, neatly summarize music labeling as a problem of volume:

…Pandora, a popular Internet radio service, employs musicologists to annotate songs with a fixed vocabulary of about five hundred tags. Pandora then creates personalized music playlists by finding songs that share a large number of tags with a user-specified seed song. After 10 y of effort by up to 50 full time musicologists, less than 1 million songs have been manually annotated (5), representing less than 5% of the current iTunes catalog.

A problem that extends to the “…7 billion images are uploaded to Facebook each month (1), YouTube users upload 24 h of video content per minute….”

The authors created www.HerdIt.org to:

… investigate and answer two important questions. First, we demonstrate that the collective wisdom of Herd It’s crowd of nonexperts can train machine learning algorithms as well as expert annotations by paid musicologists. In addition, our approach offers distinct advantages over training based on static expert annotations: it is cost-effective, scalable, and has the flexibility to model demographic and temporal changes in the semantics of music. Second, we show that integrating Herd It in an active learning loop trains accurate tag models more effectively; i.e., with less human effort, compared to a passive approach.

The approach promises an augmentation (not replacement) of human judgement with regard to classification of music. An augmentation that would enable human judgement to reach further across the musical corpus than ever before:

…while a human-only approach requires the same labeling effort for the first song as for the millionth, our game-powered machine learning solution needs only a small, reliable training set before all future examples can be labeled automatically, improving efficiency and cost by orders of magnitude. Tagging a new song takes 4 s on a modern CPU: in just a week, eight parallel processors could tag 1 million songs or annotate Pandora’s complete song collection, which required a decade of effort from dozens of trained musicologists.
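The quoted throughput claim checks out with back-of-the-envelope arithmetic:

    # 1,000,000 songs at 4 seconds per song, spread across 8 processors.
    seconds_per_song = 4
    processors = 8
    songs = 10**6

    days = songs * seconds_per_song / float(processors) / 86400.0
    print(round(days, 1))  # about 5.8 days, i.e. under a week, as claimed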

A promising technique for IR with regard to multimedia resources.

What I wonder about is the extension of the technique, games designed to train machine learning for:

  • e-discovery in legal proceedings
  • “tagging” or indexing if you will, text resources
  • vocabulary expansion for searching
  • contexts for semantic matching
  • etc.

A first person shooter game that annotates the New York Times archives would be really cool!

May 3, 2012

Neo4j 1.8.M01 Release – Vindeln Vy

Filed under: Cypher,Neo4j — Patrick Durusau @ 6:24 pm

Neo4j 1.8.M01 Release – Vindeln Vy

From the post:

Neo4j 1.8 has an eye for expansive views, painting a picture with data and hanging it on the web. In this first milestone release, artful work on the Cypher query language is complemented with live views in the Neo4j documentation.

Take a few minutes to read the interview with “Lead Cypherologist” (their words, not mine) Andrés Taylor. Sets high expectations for the future of Cypher!

Then jump to the download page!

😉

Argumentation 2012

Filed under: Conferences,Law,Semantic Diversity — Patrick Durusau @ 6:24 pm

Argumentation 2012: International Conference on Alternative Methods of Argumentation in Law


  • 07-09-2012 Full paper submission deadline
  • 21-09-2012 Notice of acceptance deadline
  • 12-10-2012 Paper camera-ready deadline
  • 26-10-2012 Main event, Masaryk University in Brno, Czech Republic

From the listing of topics for papers, semantic diversity is going to run riot at this conference.

Checking around the website, I was disappointed that the papers from Argumentation 2011 are not online.

TH*: Scalable Distributed Trie Hashing

Filed under: Hashing,Tries — Patrick Durusau @ 6:24 pm

TH*: Scalable Distributed Trie Hashing by Aridj Mohamed and Zegour Djamel Eddine.

In today’s world of computers, dealing with huge amounts of data is not unusual. The need to distribute this data in order to increase its availability and the performance of accessing it is more urgent than ever. For these reasons it is necessary to develop scalable distributed data structures. In this paper we propose TH*, a distributed variant of the Trie Hashing data structure. First we propose THsw, a new version of TH without the Nil node in the digital tree (trie); then this version is adapted to a multicomputer environment. The simulation results reveal that TH* is scalable in the sense that it grows gracefully, one bucket at a time, to a large number of servers. TH* also offers good storage space utilization and high query efficiency, especially for ordering operations.
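For anyone meeting tries for the first time, a minimal, purely local sketch of the underlying structure may help; TH* layers hashing and distribution across servers on top of this basic idea.

    # Minimal trie: insert strings and test membership.
    class TrieNode:
        def __init__(self):
            self.children = {}      # char -> TrieNode
            self.terminal = False   # True if a key ends at this node

    class Trie:
        def __init__(self):
            self.root = TrieNode()

        def insert(self, key):
            node = self.root
            for ch in key:
                node = node.children.setdefault(ch, TrieNode())
            node.terminal = True

        def contains(self, key):
            node = self.root
            for ch in key:
                node = node.children.get(ch)
                if node is None:
                    return False
            return node.terminal

    t = Trie()
    for word in ["topic", "topics", "trie"]:
        t.insert(word)
    print(t.contains("topic"), t.contains("top"))  # True, then False ("top" is only a prefix)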

I ran across this article today on tries, which dates from 2010 (original publication date).

Can anyone point me to a recent survey of literature on tries?

Thanks!

Parallel Sets for categorical data, D3 port

Filed under: D3,Graphics,Parallel Sets,Visualization — Patrick Durusau @ 6:23 pm

Parallel Sets for categorical data, D3 port

If you are working with categorical data, you need to become familiar with parallel sets.

Starting here isn’t a bad idea.

Very impressive graphics/visualization.

20 More Reasons You Need Topic Maps

Filed under: Identification,Identifiers,Identity,Marketing,Topic Maps — Patrick Durusau @ 6:23 pm

Well, Ed Lindsey did call his column 20 Common Data Errors and Variation, but when you see the PNG of the 20 errors, here, you will agree my title works better (for topic maps anyway).

Not only that, but Ed’s opening paragraphs work for identifying a subject by more than one attribute (although this is “subject” in the police sense of the word):

A good friend of mine’s husband is a sergeant on the Chicago police force. Recently a crime was committed and a witness insisted that the perpetrator was a woman with blond hair, about five nine, weighing 160 pounds. She was wearing a gray pinstriped business suit with an Armani scarf and carrying a Gucci handbag.

So what does this sergeant have to do? Start looking at the women of Chicago. He only needs the women. Actually, he would start with women with blond hair (but judging from my daughter’s constant change of hair color he might skip that attribute). So he might start with women in a certain height range and in a certain weight group. He would bring those women in to the station for questioning.

As it turns out, when they finally arrested the woman at her son’s soccer game, she had brown hair, was 5’5″ tall and weighed 120 pounds. She was wearing an Oklahoma University sweatshirt, jeans and sneakers. When the original witness saw her she said yes, that’s the same woman. It turns out she was wearing four-inch heels and the pantsuit made her look bigger.

So what can we learn from this episode that has to do with matching? Well, the first thing we need to understand is that each of the attributes of the witness can be used in matching the suspect, and then immediately we must also recognize that not all the attributes that the witness gave the sergeant were extremely accurate. So later on when we start talking about matching, we will use the term fuzzy matching. This means that when you look at an address, there could be a number of different types of errors in the address from one system that are not identical to an address in another system. Figure 1 shows a number of the common errors that can happen.

So, there you have it: 20 more reasons to use topic maps, a lesson on identifying a subject and proof that yes, a pinstriped pantsuit can make you look bigger.

Look what I found: two amazing charts

Filed under: Graphics,Graphs,Visualization — Patrick Durusau @ 6:23 pm

Look what I found: two amazing charts

Kaiser Fung points to all too rare an occasion – two finely done charts – in different styles – same subject.

Can you point to a topic map visualization that is this clean and compelling?

Something to think about when displaying data.

Geeks like the springy graph stuff.

Take a walk around the next corporate office you visit.

How many people are using springy graph displays for their work flow?

Write that number down.

After a 3 month period, total up the number of people outside your office you saw using springy graph displays.

What is your day job?

😉

Don’t get me wrong, I think graph data structures are the next big wave.

The question in my mind is how effectively can the average office dweller use graphs, displayed as graphs, for their work flow?

I have no doubt a word processor could be written with a graph data structure and even a graph display. It would be hard to persuade me to try to sell that in a business market.

To be effective, like Kaiser’s examples, a chart (or software) has to appeal to more people than just you.

Hyperbolic lots

Filed under: Humor,Language,Machine Learning — Patrick Durusau @ 6:23 pm

Hyperbolic lots by Ben Zimmer.

From the post:

For the past couple of years, Google has provided automatic captioning for all YouTube videos, using a speech-recognition system similar to the one that creates transcriptions for Google Voice messages. It’s certainly a boon to the deaf and hearing-impaired. But as with Google’s other ventures in natural language processing (notably Google Translate), this is imperfect technology that is gradually becoming less imperfect over time. In the meantime, however, the imperfections can be quite entertaining.

I gave the auto-captioning an admittedly unfair challenge: the multilingual trailer that Michael Erard put together for his latest book, Babel No More: The Search for the World’s Most Extraordinary Language Learners. The trailer features a story from the book told by speakers of a variety of languages (including me), and Erard originally set it up as a contest to see who could identify the most languages. If you go to the original video on YouTube, you can enable the auto-captioning by clicking on the “CC” and selecting “Transcribe Audio” from the menu.

The transcription does a decent job with Erard’s English introduction, though I enjoyed the interpretation of “hyperpolyglots” — the subject of the book — as “hyperbolic lots.” Hyperpolyglot (evidently coined by Dick Hudson) isn’t a word you’ll find in any dictionary, and it’s not that frequent online, so it’s highly unlikely the speech-to-text system could have figured it out. But the real fun begins with the speakers of other languages.

You will find this amusing.

Ben notes the imperfections are becoming fewer.

Curious, since languages are living, social constructs, at what point do we measure the number of “imperfections?”

Or should I say from whose perspective do we measure the number of “imperfections?”

Or should we use both of those measures and others?

Lock-Free Algorithms: How Intel X86_64 Processors and Their Memory Model Works

Filed under: Algorithms,Lock-Free Algorithms — Patrick Durusau @ 6:23 pm

Lock-Free Algorithms: How Intel X86_64 Processors and Their Memory Model Works

Alex Popescu has links to both presentations and other resources by Martin Thompson on how to write lock-free algorithms.

Just in case you are interested in performance for your semantic/topic map applications.

Harvard as Tipping Point

Filed under: Education,Harvard,MIT — Patrick Durusau @ 6:22 pm

Harvard University made IT news twice this week:

$60 Million Venture To Bring Harvard, MIT Online For The Masses


The new nonprofit venture, dubbed edx, pours a combined $60 million of foundation and endowment capital into the open-source learning platform first developed and announced by MIT earlier this year as MITx.

Edx’s offerings are very different from the long-form lecture videos currently available as “open courseware” from MIT and other universities. Eventually, edx will offer a full slate of courses in all disciplines, created with faculty at MIT and Harvard, using a simple format of short videos and exercises graded largely by computer; students interact on a wiki and message board, as well as on Facebook groups, with peers substituting for TAs. The research arm of the project will continue to develop new tools using machine learning, robotics, and crowdsourcing that allow grading and evaluation of essays, circuit designs, and other types of exercises without endless hours by professors or TAs. Although edx is nonprofit and the courses are free, Agarwal envisions bringing the project to sustainability by one day charging students for official certificates of completion.

Harvard Library to faculty: we’re going broke unless you go open access

Henry sez, “Harvard Library’s Faculty Advisory Council is telling faculty that it’s financially ‘untenable’ for the university to keep on paying extortionate access fees for academic journals. It’s suggesting that faculty make their research publicly available, switch to publishing in open access journals and consider resigning from the boards of journals that don’t allow open access.”

The avalanche of flagship education and open content has begun.

Arguments about online content/delivery not being “good enough” will no longer carry any weight, or not much.

The opponents of online content/delivery, who made those arguments, will fight to preserve systems that benefited themselves and a few others. They will be routed soon enough and their fate is not my concern.

Information systems to meet the needs of the coming generation of world wide scholars, on the other hand, should be the concern of us all.

Giving People The Finger

Filed under: Interface Research/Design,Navigation — Patrick Durusau @ 6:22 pm

“Giving people the finger” is how I would headline:

In a paper published in the peer-reviewed journal Perception, researchers at the universities of Exeter and Lincoln showed that biological cues like an outstretched index finger or a pair of eyes looking to one side affect people’s attention even when they are irrelevant to the task at hand. Abstract directional symbols like pointed arrows or the written words “left” and “right” do not have the same effect. (Pointing a Finger Works Much Better Than Using Pointed Arrows)

I don’t have access to the article but the post reports:

“Interestingly, it was only the cues which were biological — the eye gaze and finger pointing cues — which had this effect,” said Prof. Hodgson, Professor of Cognitive Neuroscience in the School of Psychology at the University of Lincoln. “Road sign arrows and words “left” and “right” had no influence at all. What’s more, the eyes and fingers seemed to affect the participants’ reaction times even when the images were flashed on the screen for only a tenth of a second.”

The authors suggest that the reason that these biological signals may be particularly good at directing attention is because they are used by humans and some other species as forms of non-verbal communication: Where someone is looking or pointing indicates to others not only what they are paying attention to, but also what they might be feeling or what they might be planning on doing next.

I think the commonly quoted figure for the origins of language/symbol manipulation is about 100,000 years ago. Use of biological cues, pointing, eye movement, is far older. That’s off the top of my head so feel free to throw in citations (for or against).

There would be a learning curve in collaboration to use this for UIs. The abstract in question reads:

Pointing with the eyes or the finger occurs frequently in social interaction to indicate direction of attention and one’s intentions. Research with a voluntary saccade task (where saccade direction is instructed by the colour of a fixation point) suggested that gaze cues automatically activate the oculomotor system, but non-biological cues, like arrows, do not. However, other work has failed to support the claim that gaze cues are special. In the current research we introduced biological and non-biological cues into the anti-saccade task, using a range of stimulus onset asynchronies (SOAs). The anti-saccade task recruits both top-down and bottom-up attentional mechanisms, as occurs in naturalistic saccadic behaviour. In experiment 1 gaze, but not arrows, facilitated saccadic reaction times (SRTs) in the opposite direction to the cues over all SOAs, whereas in experiment 2 directional word cues had no effect on saccades. In experiment 3 finger pointing cues caused reduced SRTs in the opposite direction to the cues at short SOAs. These findings suggest that biological cues automatically recruit the oculomotor system whereas non-biological cues do not. Furthermore, the anti-saccade task set appears to facilitate saccadic responses in the opposite direction to the cues. (Giving subjects the eye and showing them the finger: Socio-biological cues and saccade generation in the anti-saccade task)

May 2, 2012

Müsli Ingredient Network: How Germans like to Eat their Breakfast

Filed under: Graphics,Visualization — Patrick Durusau @ 3:52 pm

Müsli Ingredient Network: How Germans like to Eat their Breakfast

From Information Aesthetics:

Müsli Ingredient Network [stefaner.eu] is a graphic meant for the print medium that shows how the customers of the German start-up MyMuesli tend to combine different müsli ingredients together.

Although one of the smaller projects designed by “truth and beauty operator” Moritz Stefaner, it still offers a small set of information graphics, such as a straight-forward radial network visualization, in which the ingredients are grouped by category, such as base mueslis, fruit, nuts, sweets, and so on. An additional “surprise factor” matrix visualization provides a more helpful view in revealing the links between the ingredients. Here, the circles represent link strengths between pairs of ingredients, while the saturation and darkness of a circle indicates the “unexpectedness” of that respective link strength.

From a topic map perspective, instructive on the visualization of associations between ingredients, both expected and “unexpected.” (I am not sure what is meant by “unexpectedness” or how you would measure it.)
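One plausible reading, and it is only a guess on my part rather than Stefaner’s documented method, is that “unexpectedness” compares how often two ingredients are chosen together with how often they would co-occur if choices were independent; the further the ratio (lift) is from 1, the more “unexpected” the pairing. The counts below are invented.

    # A guess at one way to score "unexpectedness": observed co-occurrence
    # versus what independence would predict (lift). Counts are made up.
    n_orders = 1000
    count_a  = 300    # orders containing ingredient A
    count_b  = 200    # orders containing ingredient B
    count_ab = 90     # orders containing both

    p_a  = count_a  / float(n_orders)
    p_b  = count_b  / float(n_orders)
    p_ab = count_ab / float(n_orders)

    expected = p_a * p_b      # co-occurrence rate expected under independence
    lift = p_ab / expected    # > 1: together more often than expected
    print("expected %.2f, lift %.2f" % (expected, lift))  # expected 0.06, lift 1.50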

From a personal perspective, I realized there are things, what Germans like for breakfast for example, that I have never been curious about. I remain incurious on that score. 😉 Good graphic though.

12 Ways to Increase Throughput by 32X and Reduce Latency by 20X

Filed under: Java,Messaging,Performance — Patrick Durusau @ 3:31 pm

12 Ways to Increase Throughput by 32X and Reduce Latency by 20X

From the post:

Martin Thompson, a high-performance technology geek, has written an awesome post, Fun with my-Channels Nirvana and Azul Zing. In it Martin shows the process and techniques he used to take an existing messaging product, written in Java, and increase throughput by 32X and reduce latency by 20X. The article is very well written with lots of interesting details that make it well worth reading.

You might want to start with the High Scalability summary before tackling the “real thing.”

Of interest to subject-centric applications that rely on messaging. And anyone interested in performance for the sheer pleasure of it.

