Archive for April, 2012

Why Every NoSQL Deployment Should Be Paired with Hadoop (webinar)

Monday, April 30th, 2012

Why Every NoSQL Deployment Should Be Paired with Hadoop (webinar)

May 9, 2012 at 10am Pacific

From the webinar registration page:

In this webinar you will hear from Dr. Amr Awadallah, Co-Founder and CTO of Cloudera and James Phillips, Co-Founder and Senior VP of Products at Couchbase.

Frequently the terms NoSQL and Big Data are conflated – many view them as synonyms. It’s understandable – both technologies eschew the relational data model and spread data across clusters of servers, versus relational database technology which favors centralized computing. But the “problems” these technologies address are quite different. Hadoop, the Big Data poster child, is focused on data analysis – gleaning insights from large volumes of data. NoSQL databases are transactional systems – delivering high-performance, cost-effective data management for modern real-time web and mobile applications; this is the Big User problem. Of course, if you have a lot of users, you are probably going to generate a lot of data. IDC estimates that more than 1.8 trillion gigabytes of information was created in 2011 and that this number will double every two years. The proliferation of user-generated data from interactive web and mobile applications is a key contributor to this growth. In this webinar, we will explore why every NoSQL deployment should be paired with a Big Data analytics solution.

In this session you will learn:

  • Why NoSQL and Big Data are similar, but different
  • The categories of NoSQL systems, and the types of applications for which they are best suited
  • How Couchbase and Cloudera’s Distribution Including Apache Hadoop can be used together to build better applications
  • Real-world use cases where NoSQL and Hadoop technologies work in concert

Have you ever wanted to suggest a survey to Gartner or the technology desk at the Wall Street Journal?

Asking c-suite types at Fortune 500 firms the following questions among others:

  • Is there a difference between NoSQL and Big Data?
  • What percentage of software projects failed at your company last year?

Could go a long way to explaining the persistent and high failure rate of software projects.

Catch the webinar. Always the chance you will learn how to communicate with c-suite types. Maybe.

A Federated Search Approach to Facilitate Systematic Literature Review in Software Engineering

Monday, April 30th, 2012

A Federated Search Approach to Facilitate Systematic Literature Review in Software Engineering by Mohammad Ghafari ; Mortaza Saleh ; Touraj Ebrahimi.

Abstract:

To impact industry, researchers developing technologies in academia need to provide tangible evidence of the advantages of using them. Nowadays, Systematic Literature Review (SLR) has become a prominent methodology in evidence-based researches. Although adopting SLR in software engineering does not go far in practice, it has been resulted in valuable researches and is going to be more common. However, digital libraries and scientific databases as the best research resources do not provide enough mechanism for SLRs especially in software engineering. On the other hand, any loss of data may change the SLR results and leads to research bias. Accordingly, the search process and evidence collection in SLR is a critical point. This paper provides some tips to enhance the SLR process. The main contribution of this work is presenting a federated search tool which provides an automatic integrated search mechanism in well known Software Engineering databases. Results of case study show that this approach not only reduces required time to do SLR and facilitate its search process, but also improves its reliability and results in the increasing trend to use SLRs.

MongoDB, Python, Synonyms, ACM, IEEEXplore, ScienceDirect, do I have your attention yet?

The authors’ federated search strategy will improve Systematic Literature Reviews.

What interests me is the potential for the results of those searches to improve future searches, as the experience of domain expert after domain expert accumulates in the results of federated searches.
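To make "federated search" concrete, here is a minimal sketch in Python. It is not the authors' tool: the source names and the stub search functions are hypothetical stand-ins for real digital libraries, and duplicates are detected by nothing more than a normalized title.

```python
def federated_search(query, sources):
    """Run one query against every source and merge results, dropping duplicates."""
    merged = {}
    for name, search in sources.items():
        for record in search(query):
            # Assumption: a lower-cased, whitespace-normalized title identifies a paper.
            key = " ".join(record["title"].lower().split())
            entry = merged.setdefault(key, {"title": record["title"], "found_in": []})
            entry["found_in"].append(name)
    return list(merged.values())


# Hypothetical sources: each is just a function from a query to a list of records.
sources = {
    "lib_a": lambda q: [{"title": "Graph  Search"}, {"title": "Topic Maps"}],
    "lib_b": lambda q: [{"title": "graph search"}],
}
results = federated_search("graph", sources)
```

Recording where each result was found ("found_in") is the sort of accumulated by-product that could feed later searches.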

Important work in a rapidly developing area.

What is Umlaut anyway?

Monday, April 30th, 2012

What is Umlaut anyway?

From the webpage:

Umlaut is software for libraries (you know the kind with books), which deals with advertising services for specific known citations. It runs as a Ruby on Rails application via an engine gem.

Umlaut could be called an ‘open source front-end for a link resolver’ — Umlaut accepts requests in OpenURL format, but has no knowledge base of its own, it can be used as a front-end for an existing knowledge base. (Currently SFX, but other plugins can be written).

And that describes Umlaut’s historical origin and one of its prime use cases. But in using and further developing Umlaut, I’ve come to realize that it has a more general purpose, as a new kind of infrastructural component.

Better, although a bit buzzword laden:

Umlaut is a just-in-time aggregator of “last mile” specific citation services, taking input as OpenURL, and providing an HTML UI as well as an api suite for embedding Umlaut services in other applications.

(In truth, that’s just a generalization of what your OpenURL Link Resolver does now, but considered from a different more flexible vantage).

Reading under Last Mile, Specific Citation I find:

Umlaut is not concerned with the search/discovery part of user research. Umlaut’s role begins when a particular item has been identified, with a citation in machine-accessible form (i.e., title, author, journal, page number, etc., all in separate elements).

Umlaut’s role is to provide the user with services that apply to the item of interest. Services provided by the hosting institution, licensed by the hosting institution, or free services the hosting institution wishes to advertise/recommend to its users.

Umlaut strives to supply links that take the user in as few clicks as possible to the service listed, without ever listing ‘blind links’ that you first have to click on to find out whether they are available. Umlaut pre-checks things when necessary to only list services, with any needed contextual info, such that the user knows what they get when they click on it. Save the time of the user.
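A citation "in machine-accessible form" might look like the following. This is a minimal Python sketch, not Umlaut code: it assumes the request arrives as an OpenURL key/value query string with "rft."-prefixed citation fields, and the request itself is made up for illustration.

```python
from urllib.parse import parse_qs


def citation_from_openurl(query):
    """Pull the citation elements out of an OpenURL key/value query string."""
    fields = parse_qs(query)
    # Keep only the "rft." (referent) keys, taking the first value of each.
    return {
        key[len("rft."):]: values[0]
        for key, values in fields.items()
        if key.startswith("rft.")
    }


# Hypothetical request; journal, title and author are invented.
query = (
    "url_ver=Z39.88-2004"
    "&rft.genre=article"
    "&rft.jtitle=Journal+of+Examples"
    "&rft.atitle=An+Example+Article"
    "&rft.aulast=Doe"
    "&rft.spage=42"
)
citation = citation_from_openurl(query)
```

Each element ends up under its own key, which is the starting point for looking up services that apply to that one item.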

Starts with a particular subject (née item) and maps known services to it.

Although links to subscriber services are unlikely to be interchangeable, links to public domain resources or those with public identifiers would be interchangeable. Potential for a mapping syntax? Or transmission of the “discovery” of such resources?

Lucene’s TokenStreams are actually graphs!

Monday, April 30th, 2012

Lucene’s TokenStreams are actually graphs!

Mike McCandless starts:

Lucene’s TokenStream class produces the sequence of tokens to be indexed for a document’s fields. The API is an iterator: you call incrementToken to advance to the next token, and then query specific attributes to obtain the details for that token. For example, CharTermAttribute holds the text of the token; OffsetAttribute has the character start and end offset into the original string corresponding to this token, for highlighting purposes. There are a number of standard token attributes, and some tokenizers add their own attributes.

He continues to illustrate the creation of graphs using SynonymFilter and discusses other aspects of graph creation from tokenstreams.

Including where the production of graphs needs to be added and issues in the current implementation.
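To see why a token sequence can be read as a graph, here is a small Python model of the idea. It is not Lucene's Java API. It assumes each token carries a position increment and a position length (my names, pos_inc and pos_len), and turns the sequence into arcs between numbered positions.

```python
from typing import NamedTuple


class Token(NamedTuple):
    term: str
    pos_inc: int  # how far the start node advances from the previous token
    pos_len: int  # how many positions the token spans


def token_graph(tokens):
    """Turn a token sequence into (from_node, to_node, term) arcs."""
    arcs = []
    start = -1
    for tok in tokens:
        start += tok.pos_inc
        arcs.append((start, start + tok.pos_len, tok.term))
    return arcs


# "dns" is injected as a synonym at the same start position as "domain"
# (pos_inc 0) and spans all three words (pos_len 3).
tokens = [
    Token("domain", 1, 1),
    Token("dns", 0, 3),
    Token("name", 1, 1),
    Token("system", 1, 1),
]
arcs = token_graph(tokens)
```

The result has two paths from node 0 to node 3: one through "domain", "name", "system" and one through the single arc "dns".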

If you see any of the Neo4j folks at the graph meetup in Chicago later today, you might want to mention Mike’s post to them.

NSF, NIH to Hold Webinar on Big Data Solicitation

Monday, April 30th, 2012

NSF, NIH to Hold Webinar on Big Data Solicitation by Erwin Gianchandani.

Guidance on BIGDATA Solicitation

<= $25 Million Webinar: Tuesday, May 8th, from 11am to 12pm ET. Registration closes 11:59pm PDT on Monday, May 7th.

From the post:

Late last month, the Administration unveiled a $200 million Big Data R&D Initiative, committing new funding to improve “our ability to extract knowledge and insights from large and complex collections of digital data.” The initiative includes a joint solicitation by the National Science Foundation (NSF) and National Institutes of Health (NIH), providing up to $25 million for Core Techniques and Technologies for Advancing Big Data Science and Engineering (BIGDATA). Now NSF and NIH have announced a webinar “to describe the goals and focus of the BIGDATA solicitation, help investigators understand its scope, and answer any questions potential Principal Investigators (PIs) may have.” The webinar will take place next week — on Tuesday, May 8th, from 11am to 12pm ET.

So, how clever are you really?

(The post has links to other materials you probably need to read before the webinar.)

See California kills by Wildlife Services

Monday, April 30th, 2012

See California kills by Wildlife Services

From the post:

Wildlife Services is a little-known federal agency of the Department of Agriculture charged with managing wildlife, particularly the intersection between humans — ranchers and farmers — and animals.

This map shows where Wildlife Services made the most kills of three commonly-killed animals — beavers, coyotes and bears. The charts below show the type of method used to kill those animals.

You can select beavers, coyotes, or bears, with other display options.

There appears to be no merging on other names for beavers, coyotes or bears, or on the means of their, ah, control.

A good illustration that sometimes a minimal amount of merging is sufficient for the task at hand.

Mapping locations of control activities onto a map with changeable views is sufficient.

Readers aren’t expecting links into scientific/foreign literature where mapping of identifiers would be an issue.

Good illustrations, including maps, have a purpose.

So should your topic map and its merging.

Text Analytics: Yesterday, Today and Tomorrow

Monday, April 30th, 2012

Text Analytics: Yesterday, Today and Tomorrow

Another Tony Russell-Rose post that I ran across over the weekend:

Here’s something I’ve been meaning to share for a while: the slides for a talk entitled “Text Analytics: Yesterday, Today and Tomorrow”, co-authored with colleagues Vladimir Zelevinsky and Michael Ferretti. In this we outline some of the key challenges in text analytics, describe some of Endeca’s current research in this area, examine the current state of the text analytics market and explore some of the prospects for the future.

I was amused to read on slide 40:

Solutions still not standardized

Users differ in their views of the world of texts, solutions, data, formats, data structures, and analysis.

Anyone offering a “standardized” solution is selling their view of the world.

As a user/potential customer, I am rather attached to my view of the world. You?

Geoff – Easy Graph Data

Monday, April 30th, 2012

Geoff – Easy Graph Data

From the webpage:

Geoff is a declarative notation for representing graph data within concise human-readable text, designed specifically with Neo4j in mind. The format has been built to allow independent subgraphs to be represented outside of a graph database environment in such a way that they may be stored, transmitted and imported easily and efficiently. The basic elements which make up the Geoff format – subgraphs, rules and descriptors – are well defined but there exist several container representations which serve different purposes; commonly, either delimited text or a form of JSON is used.

Updated documentation and development of Geoff.

Topic mappers will be particularly interested in: Indexes and Merging.

Rabbithole, the Neo4j REPL console

Monday, April 30th, 2012

Rabbithole, the Neo4j REPL console

Over the last few days the Neo4j community team worked on the initial iteration for an interactive Neo4j tutorial.

The first result we are proud to publish is a sharable console that runs an in-memory Neo4j instance in a web-session.

It supports Cypher queries of the graph and Geoff for importing and modifying the graph data. The graph itself and the cypher results are visualized in an overlay using d3.js.

You can easily get a link to share your current graph content and even tweet it.

For the web application we use the minimal Spark web-framework.

The app is deployed to Heroku and available on github.

Deeply impressive!

If interactive use of Neo4j, including sharing your graphs, interests you, you need to get involved now with Rabbithole!

Streaming REST API – Interview with Michael Hunger [Neo4j]

Monday, April 30th, 2012

Streaming REST API – Interview with Michael Hunger [Neo4j]

Andreas Kollegger writes:

Recently, Michael Hunger blogged about his lab work to use streaming in Neo4j’s REST interface. On lab days, everyone on the Neo4j team gets to bump the priority of any engineering work that had been lingering in a background thread. I chatted with Michael about his work with streaming.

ABK: What inspired you to focus on streaming for Neo4j?
MH: Because it is a major aspect for Neo4j to behave as performant as possible, especially with so many languages / stacks connecting via the REST API. The existing approach is several orders of magnitude slower than embedded [note: Neo4j is embeddable on the JVM] and not just one as was originally envisioned.

ABK: What do you mean by “streaming” in this context, is this http streaming?
MH: Yes, it is http streaming combined with json streaming and having the internal calls to Neo4j generate lazy results (Iterables) instead of pulling all results from the db in one go. So writing to the stream will advance the database operations (or their “cursors”). This applies to: indexing, cypher, and traversals.

The difference in approaches:

the streaming took 10 seconds to return a complete result transferring between 8 to 15 MB/s for 130MB of data. The normal non-streaming result took 1 minute, 8 seconds to provide the same result and a Heap of 2GB.
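The lazy-results idea is easy to illustrate outside Neo4j. Below is a minimal Python sketch, with a generator standing in for the database cursor. The function names are mine, not Neo4j's, and no HTTP is involved: the point is only that writing the next chunk is what pulls the next row.

```python
import json


def stream_rows(rows):
    """Yield a JSON array piece by piece, pulling one row at a time."""
    yield "["
    for i, row in enumerate(rows):
        if i:
            yield ","
        yield json.dumps(row)
    yield "]"


def slow_query(n, log):
    """Stand-in for a lazy database cursor; logs when each row is produced."""
    for i in range(n):
        log.append(i)
        yield {"id": i}


log = []
chunks = stream_rows(slow_query(3, log))
opening = next(chunks)  # the opening bracket arrives before any row is fetched
```

At this point log is still empty. Rows are produced only as the consumer keeps reading chunks, so the full result never has to sit in memory at once.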

The interview and Michael’s post should be on your reading list for this week!

Prostitutes Appeal to Pope: Text Analytics applied to Search

Sunday, April 29th, 2012

Prostitutes Appeal to Pope: Text Analytics applied to Search by Tony Russell-Rose.

It is hard for me to visit Tony’s site and not come away with several posts he has written that I want to mention. Today was no different.

Here is a sampling of what Tony talks about in this post:

Consider the following newspaper headlines, all of which appeared unambiguous to the original writer:

  • DRUNK GETS NINE YEARS IN VIOLIN CASE
  • PROSTITUTES APPEAL TO POPE
  • STOLEN PAINTING FOUND BY TREE
  • RED TAPE HOLDS UP NEW BRIDGE
  • DEER KILL 300,000
  • RESIDENTS CAN DROP OFF TREES
  • INCLUDE CHILDREN WHEN BAKING COOKIES
  • MINERS REFUSE TO WORK AFTER DEATH

Although humorous, they illustrate much of the ambiguity in natural language, and just how much pragmatic and linguistic knowledge must be employed by NLP tools to function accurately.

A very informative and highly amusing post.

What better way to start the week?

Enjoy!

46 Research APIs: DataUnison, Mendeley, LexisNexis and Zotero

Sunday, April 29th, 2012

46 Research APIs: DataUnison, Mendeley, LexisNexis and Zotero by Wendell Santos.

From the post:

Our API directory now includes 46 research APIs. The newest is the Globus Online Transfer API. The most popular, in terms of mashups, is the Mendeley API. We list 3 Mendeley mashups. Below you’ll find some more stats from the directory, including the entire list of research APIs.

I did see an API that accepts Greek strings and returns Latin transliteration. Oh, doesn’t interest you. 😉

There are a number of bibliography, search and related tools.

I am sure you will find something to enhance an academic application of topic maps.

HBase Real-time Analytics & Rollbacks via Append-based Updates

Sunday, April 29th, 2012

HBase Real-time Analytics & Rollbacks via Append-based Updates by Alex Baranau.

From the post:

In this part 1 of a 3-part post series we’ll describe how we use HBase at Sematext for real-time analytics and how we can perform data rollbacks by using an append-only updates approach.

Some bits of this topic were already covered in Deferring Processing Updates to Increase HBase Write Performance and some were briefly presented at BerlinBuzzwords 2011 (video). We will also talk about some of the ideas below during HBaseCon-2012 in late May (see Real-time Analytics with HBase). The approach described in this post is used in our production systems (SPM & SA) and the implementation was open-sourced as HBaseHUT project.

Problem we are Solving

While HDFS & MapReduce are designed for massive batch processing and with the idea of data being immutable (write once, read many times), HBase includes support for additional operations such as real-time and random read/write/delete access to data records. HBase performs its basic job very well, but there are times when developers have to think at a higher level about how to utilize HBase capabilities for specific use-cases. HBase is a great tool with good core functionality and implementation, but it does require one to do some thinking to ensure this core functionality is used properly and optimally. The use-case we’ll be working with in this post is a typical data analytics system where:

  • new data are continuously streaming in
  • data are processed and stored in HBase, usually as time-series data
  • processed data are served to users who can navigate through most recent data as well as dig deep into historical data

Although the above points frame the use-case relatively narrowly, the approach and its implementation that we’ll describe here are really more general and applicable to a number of other systems, too. The basic issues we want to solve are the following:

  • increase record update throughput. Ideally, despite high volume of incoming data, changes can be applied in real-time. Usually, due to the limitations of the “normal HBase update”, which requires Get+Put operations, updates are applied using batch-processing approach (e.g. as MapReduce jobs). This, of course, is anything but real-time: incoming data is not immediately seen. It is seen only after it has been processed.
  • ability to roll back changes in the served data. Human errors or any other issues should not permanently corrupt data that system serves.
  • ability to fetch data interactively (i.e. fast enough for impatient humans). When one navigates through a small amount of recent data, as well as when selected time interval spans years, the retrieval should be fast.

Here is what we consider an “update”:

  • addition of a new record if no record with the same key exists
  • update of an existing record with a particular key
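A rough picture of what "append-only updates" buys you, sketched in Python. This is a toy in-memory model, not HBase and not the HBaseHUT API. The class, the batch ids and the counter semantics are all my assumptions.

```python
from collections import defaultdict


class AppendOnlyCounters:
    """Toy store: updates are appended as deltas and merged only on read."""

    def __init__(self):
        self._log = defaultdict(list)  # key -> [(batch_id, delta), ...]

    def append(self, key, delta, batch_id):
        # No read-before-write (no Get+Put): just record the change.
        self._log[key].append((batch_id, delta))

    def read(self, key):
        # Merge all appended deltas at read time.
        return sum(delta for _, delta in self._log.get(key, []))

    def rollback(self, batch_id):
        # Drop every delta written by a bad batch.
        for key in self._log:
            self._log[key] = [(b, d) for b, d in self._log[key] if b != batch_id]


store = AppendOnlyCounters()
store.append("page:home", 5, batch_id="b1")
store.append("page:home", 7, batch_id="b2")
before = store.read("page:home")  # 5 + 7
store.rollback("b2")
after = store.read("page:home")   # only the b1 delta remains
```

Because nothing is overwritten, an "addition" and an "update" are the same operation, and undoing a batch is a matter of ignoring its entries.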

See anything familiar? Anything that resembles your use cases?

The proffered solution may not fit your use case(s) but this is an example of exploring a solution. Not fitting a problem to a solution. Not the same thing.

HBase Real-time Analytics & Rollbacks via Append-based Updates Part 2 is available. Solution uses HBaseHUT. Really informative graphics in part 2 as well.

Very interested in seeing Part 3!

Movement in Manhattan: Mapping the Speed and Direction of Twitter Users

Sunday, April 29th, 2012

Movement in Manhattan: Mapping the Speed and Direction of Twitter Users

Information Aesthetics writes:

Inspired by the animated wind map that was posted a little while ago, professional programmer Jeff Clark has explored how people move about in a city. The result, titled Movement in Manhattan [neoformix.com], visualizes the speed and direction of Twitter users in Manhattan, New York.

The visualization is based on a large collection of geo-located tweets that were sent in a 4-hour time-window by the same users. These tweets were used as samples that together construct a vector field representing the average flow of people within a specific area. Particles, representing people, were released at locations where actual tweets were recorded and their subsequent movement was determined by the flow field.
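The flow field is simple to approximate. A minimal Python sketch follows, assuming each sample is a pair of positions for the same user (start and end) on a plain x/y grid. The cell size and the one-step-per-cell advection are my simplifications, not Jeff Clark's method.

```python
from collections import defaultdict


def flow_field(moves, cell=1.0):
    """Average displacement per grid cell from (x0, y0, x1, y1) samples."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for x0, y0, x1, y1 in moves:
        s = sums[(int(x0 // cell), int(y0 // cell))]
        s[0] += x1 - x0
        s[1] += y1 - y0
        s[2] += 1
    return {k: (sx / n, sy / n) for k, (sx, sy, n) in sums.items()}


def advect(point, field, steps, cell=1.0):
    """Move a particle through the field, one cell-average step at a time."""
    x, y = point
    path = [(x, y)]
    for _ in range(steps):
        dx, dy = field.get((int(x // cell), int(y // cell)), (0.0, 0.0))
        x, y = x + dx, y + dy
        path.append((x, y))
    return path


# Made-up samples: two eastward moves in cell (0, 0), one northward in (1, 0).
moves = [(0.25, 0.25, 1.25, 0.25), (0.5, 0.75, 1.5, 0.75), (1.5, 0.5, 1.5, 1.5)]
field = flow_field(moves)
path = advect((0.5, 0.5), field, steps=2)
```

A particle released in the first cell drifts east, then north, following the averaged movement of the samples.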

Interesting in its own right but combined with other data:

  • events, natural and/or man-made
  • location/movement of authorities
  • location/movement of other groups
  • location/movement of civilians
  • etc.

it could be part of a real-time tactical display.

The advantage of a topic map being that the type and range of data isn’t hard wired in.

Once you are “in country,” wherever you define that to be, here, there, etc., your information feeds can fit your situation. And change when your situation changes.

Sans long development cycles, contract negotiations and the usual mid-level management antics.

All of which is amusing except when you are the one being shelled.

Legal Entity Identifier – Preparing for the Inevitable

Sunday, April 29th, 2012

Legal Entity Identifier – Preparing for the Inevitable by Peter Ku.

From the post:

Most of the buzz around the water cooler for those responsible for enterprise reference data in financial services has been around the recent G20 meeting in Switzerland on the details of the proposed Legal Entity Identifier (LEI). The LEI is designed to help regulators manage and monitor systemic risk in the financial markets by creating a unique ID to recognize legal entities/counterparties shared by the global financial companies and government regulators. Agreement to adoption is expected to be decided at the G20 leaders’ summit coming up in June in Mexico as regulators decide the details as to the administration, implementation and enforcement of the standard. Will the new LEI solve the issues that led to the recent financial crisis?

Looking back at history, this is not the first time the financial industry has attempted to create a unique ID system for legal entities, remember the Data Universal Numbering System (DUNS) identifier as an example? What is different from the past is that the new LEI standard is set at a global vs. regional level which had caused past attempts to fail. Unfortunately, the LEI standard will not replace existing IDs that firms deal with every day. Instead, it creates further challenges requiring companies to map existing IDs to the new LEI, reconciling naming differences, maintain legal hierarchy relationships between parent and subsidiary entities from ongoing corporate actions, and also link it to the securities and loans to the legal entities.

….

While many within the industry are waiting to see what the regulators decide in June, existing issues related to the quality, consistency, and delivery of counterparty reference data and the downstream impact on managing risk needs to be dealt with regardless if LEI is passed. In the same report, I shared the challenges firms will face incorporating the LEI including:

  • Accessing, reconciling, and relating existing counterparty information and IDs to the new LEI
  • Effectively identifying and resolving data quality issues from external and internal systems
  • Accurately identifying legal hierarchy relationships which LEI will not maintain in its first instantiation.
  • Cross referencing legal entities with financial and securities instruments
  • Extending both counterparty and securities instruments to downstream front, mid, and back office systems.

As a topic map person, do any of these issues sound familiar to you?

In particular creating a new identifier to solve problems with resolving multiple “old” ones?

Being mindful that all data systems are capable of and/or contain errors, intentional (dishonest) and otherwise.

Presuming perfect records, and perfect data in those records, not only guarantees failure, but avenues for abuse.
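Mapping "old" identifiers to a new one is, at bottom, a merging problem. A minimal Python sketch using union-find follows. The identifiers are invented, and treating every asserted link as true is exactly the "perfect data" assumption warned about above, so a real system would need provenance and review on each link.

```python
def merge_identifiers(links):
    """Group identifiers that have been asserted to name the same entity."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    groups = {}
    for ident in parent:
        groups.setdefault(find(ident), set()).add(ident)
    return list(groups.values())


# Hypothetical identifiers; none of these are real DUNS numbers or LEIs.
links = [
    ("DUNS:123", "LEI:ABC"),
    ("INTERNAL:77", "DUNS:123"),
    ("DUNS:999", "LEI:XYZ"),
]
groups = merge_identifiers(links)
```

Note that the new LEI is just one more member of each group. It does not make the old identifiers, or the errors in the links, go away.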

Peter cites resources you will need to read.

Visualization Olympian

Sunday, April 29th, 2012

I thought the Representing the First 4,000,000 Decimals of Pi in a Single Image was awesome.

I backed up the URL to TWO-N.com to get an email to suggest a “find next” for the Pi visualization.

There I found forty-five (45) visualizations that you will need to see to appreciate.

Not billable time to a particular client/project.

Some experiences sharpen your talents without being billable time.

Representing the First 4,000,000 Decimals of Pi in a Single Image

Sunday, April 29th, 2012

Representing the First 4,000,000 Decimals of Pi in a Single Image

From the post:

The online visualization titled “3.1415926535897932384626…” [two-n.com] by design studio TWO-N represents the first 4,000,000 decimals of the number Pi within a single image.

Each unique digit of Pi corresponds to a specific color, and is rendered as a 1×1 pixel dot. The result is a long, random-looking pixel carpet image. Next to a dedicated slider that allows up/down scrolling through the resulting image, one can also search for the first occurrences of any specific decimal combination.

A tribute to the art of visualization!

The browse for up to six (6) digit strings is addictive! Be careful!

How would you search for unique six (6) digit strings in the first 4,000,000 decimals of the number Pi?

Or you could use it as a bar bet on unique occurrences. (I know one but I am saving that to win a cup of coffee off of Lars.)
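One way to answer the question, sketched in Python: count every k-digit window and keep those seen exactly once. The sketch assumes you already have the decimals as a string. Only the first 30 are included here, so the six-digit answer for 4,000,000 decimals is left for you to run.

```python
from collections import Counter


def unique_substrings(digits, k):
    """Return the k-digit strings that occur exactly once, in order of first appearance."""
    counts = Counter(digits[i:i + k] for i in range(len(digits) - k + 1))
    return [s for s, n in counts.items() if n == 1]


def first_occurrence(digits, pattern):
    """0-based offset of the first match among the decimals, or -1 if absent."""
    return digits.find(pattern)


# First 30 decimals of Pi, enough to try the functions on.
decimals = "141592653589793238462643383279"
offset = first_occurrence(decimals, "26535")
```

With six digits there are only a million possible strings, so a single pass with a counter over four million windows is well within reach of a laptop.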

Text Analytics Summit Europe – highlights and reflections

Sunday, April 29th, 2012

Text Analytics Summit Europe – highlights and reflections by Tony Russell-Rose.

Earlier this week I had the privilege of attending the Text Analytics Summit Europe at the Royal Garden Hotel in Kensington. Some of you may of course recognise this hotel as the base for Justin Bieber’s recent visit to London, but sadly (or is that fortunately?) he didn’t join us. Next time, maybe…

Ranking reasons to attend:

  • #1 Text Analytics Summit Europe – meet other attendees, presentations
  • #2 Kensington Gardens and Hyde Park (been there, it is more impressive than you can imagine)
  • #N +1 Justin Bieber being in London (or any other location)

I was disappointed by the lack of links to slides or videos of the presentations.

Tony’s post does have pointers to people and resources you may have missed.

Question: Do you think “text analytics” and “data mining” are different? If so, how?

Semantically Diverse Christenings

Sunday, April 29th, 2012

Mark Liberman in Neutral Xi_b^star, Xi(b)^{*0}, Ξb*0, whatever at Language Log reports semantically diverse christenings of the same new subatomic particle.

I count eight or nine distinct names in Liberman’s report.

How many do you see?

This is just days after its discovery at CERN.

Largely in the scientific literature. (It will get far worse if you include non-technical literature. Is non-technical literature/discussion relevant?)

Question for science librarians:

How many names for this new subatomic particle will you use in searches?
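However many names you settle on, the mechanical part is trivial. A minimal Python sketch that folds the variants into one query, using only the three spellings from Liberman's title. The quoted-phrase OR syntax is an assumption, since search interfaces differ.

```python
def expand_query(names):
    """Join every known name into one quoted OR query, dropping repeats."""
    unique = list(dict.fromkeys(n.strip() for n in names if n.strip()))
    return " OR ".join('"{}"'.format(n) for n in unique)


names = ["Xi_b^star", "Xi(b)^{*0}", "Ξb*0", "Xi_b^star"]
query = expand_query(names)
```

The hard part is not the query but knowing which names belong on the list.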

Data Journalism Handbook

Sunday, April 29th, 2012

Data Journalism Handbook

From the website:

The Data Journalism Handbook is a free, open source reference book for anyone interested in the emerging field of data journalism.

It was born at a 48 hour workshop at MozFest 2011 in London. It subsequently spilled over into an international, collaborative effort involving dozens of data journalism’s leading advocates and best practitioners – including from the Australian Broadcasting Corporation, the BBC, the Chicago Tribune, Deutsche Welle, the Guardian, the Financial Times, Helsingin Sanomat, La Nacion, the New York Times, ProPublica, the Washington Post, the Texas Tribune, Verdens Gang, Wales Online, Zeit Online and many others.

Superlatives fail to describe the Data Journalism Handbook.

Pick a section, any section, to be delighted, informed, and amazed.

Front Matter

Introduction

In The Newsroom

Case studies

Getting Data

Understanding data

Delivering Data

City Dashboard: Aggregating All Spatial Data for Cities in the UK

Saturday, April 28th, 2012

City Dashboard: Aggregating All Spatial Data for Cities in the UK

You need to try this out for yourself before reading the rest of this post.

Go ahead, I’ll wait…, …, …, ok.

To some extent this “aggregation” may reflect on the sort of questions we ask users about topic maps.

It’s possible to aggregate data about any number of things. But even if you could, would you want to?

Take the “aggregation” for Birmingham, UK, this evening. One of the components informed me a choir director was arrested for rape. Concerns the choir director a good bit but why would it interest me?

Isn’t that the problem of aggregation? The definition of “useful” aggregation varies from person to person, even task to task.

Try London while you are at the site. There is a Slightly Unhappier/Significantly Unhappier “Mood” indicator. It has what turns out to be a “count down” timer for the next reset on the indicator.

I thought the changing count reflected people becoming more and more unhappy.

Looked like London was going to “flatline” while I was watching. 😉

Fortunately turned out to not be the case.

There are dangers to personalization but aggregation without relevance just pumps up the noise.

Not sure that helps either.

Suggestions?

COCOON 2012

Saturday, April 28th, 2012

COCOON 2012 18th Annual International Computing and Combinatorics Conference

Sydney, Australia – August 20-22, 2012.

So you don’t have to check your calendar, Balisage ends on August 10th, leaving you plenty of time to make COCOON 2012.

If this listing of papers doesn’t motivate you to attend, it should at least give you some ideas to be thinking about in your work with graphs and topic maps.

keeptheweb#OPEN

Saturday, April 28th, 2012

keeptheweb#OPEN

Have you seen this?

Leaving the obvious politics to one side, what interests me is the ability to add comments to legislation.

Thinking of it in the context of standards work, particularly for topic maps.

Standardized mappings for taxonomies sounds to me like a useful topic map type activity. Having the ability to comment and process comments on drafts in a public fashion sounds good to me.

Comments?

Scalability of Topic Map Systems

Saturday, April 28th, 2012

Scalability of Topic Map Systems, thesis by Marcel Hoyer.

Abstract:

The purpose of this thesis was to find approaches solving major performance and scalability issues for Topic Maps-related data access and the merging process. Especially regarding the management of multiple, heterogeneous topic maps with different sizes and structures. Hence the scope of the research was mainly focused on the Maiana web application with its underlying MaJorToM and TMQL4J back-end.

In the first instance the actual problems were determined by profiling the application runtime, creating benchmarks and discussing the current architecture of the Maiana stack. By presenting different distribution technologies afterwards the issues around a single-process instance, slow data access and concurrent request handling were investigated to determine possible solutions. Next to technological aspects (i. e. frameworks or applications) this discussion included fundamental reflection of design patterns for distributed environments that indicated requirements for changes in the use of the Topic Maps API and data flow between components. With the development of the JSON Topic Maps Query Result format and simple query-focused interfaces the essential concept for an prototypical implementation was established. To concentrate on scalability for query processing basic principles and benefits of message-oriented middleware were presented. Those were used in combination with previous results to create a distributed Topic Maps query service and to present ideas about optimizing virtual merging of topic maps.

Finally this work gave multiple insights to improve the architecture and performance of Topic Maps-related applications by depicting concrete bottlenecks and providing prototypical implementations that show the feasibility of the approaches. But it also pointed out remaining performance issues in the persisting data layer.

I have just started reading Marcel’s thesis but I am already impressed by the evaluation of Maiana. I am sure this work will be useful in planning options for future topic map stacks.

Commend it to you for reading and discussion, perhaps on the relatively quiet topic map discussion lists?

Agreement Groups in the United States Senate

Saturday, April 28th, 2012

Agreement Groups in the United States Senate by Adrien Friggeri.

From the webpage:

The United States Senate is the upper house of the United States legislature and contrary to the House of Representative which seats are up for election every two years, Senators serve terms of six years each. Those terms are however staggered so that approximately one-third of the Senate is renewed every two years. This means that each pair of consecutive sessions of the Senate share a large number of common Senators.

Being interested in social networks and communities, it was only natural to look at how those Senators were linked one to another and the natural community structure which emerged from their interactions. Thanks to GovTrack.us, we were able to construct for each session of the Senate an agreement graph between Senators.

We then clustered the Senators into overlapping groups of agreement using a new community detection algorithm called C3. This page presents supplemental material to our submission to ESA 2012. We provide a visualization of those groups for the last eight Congresses and discuss various aspects of the results.

From Jack Park, notice of an interesting visualization of the United States Senate.

Makes me wonder what a mapping of donors or special interest groups would look like against this visualization?

Certainly computing resources have developed to the point that visualization, unless the data sets are quite large, need not be static.

That is to say that the age of visual and interactive exploration, not static display, of data sets may be upon us.

I know there has been some work along those lines in bioinformatics but am unaware of examples in political science. That may simply be due to inattention on my part. Suggestions?
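For anyone who wants to experiment, here is a minimal sketch of what an "agreement graph" like the one described in the quoted page might look like in code. It is not the authors' method and it does not implement C3. The function name, the threshold parameter and the three invented senators with their votes are all assumptions made for illustration. An edge simply records the fraction of shared roll calls on which two senators voted the same way.

```python
from itertools import combinations


def agreement_graph(votes, threshold=0.5):
    """Build a weighted agreement graph from roll-call votes.

    votes: dict mapping senator name -> dict of roll-call id -> vote string.
    Returns a dict mapping (senator_a, senator_b) -> agreement fraction,
    keeping only pairs whose agreement is at least `threshold`.
    (Hypothetical helper; names and threshold are illustrative.)
    """
    edges = {}
    for a, b in combinations(sorted(votes), 2):
        # Only compare roll calls on which both senators voted.
        shared = votes[a].keys() & votes[b].keys()
        if not shared:
            continue
        agree = sum(1 for rc in shared if votes[a][rc] == votes[b][rc])
        weight = agree / len(shared)
        if weight >= threshold:
            edges[(a, b)] = weight
    return edges


# Invented sample data, not real voting records.
sample_votes = {
    "Senator A": {"v1": "Yea", "v2": "Nay", "v3": "Yea"},
    "Senator B": {"v1": "Yea", "v2": "Nay", "v3": "Nay"},
    "Senator C": {"v1": "Nay", "v2": "Yea"},
}

edges = agreement_graph(sample_votes)
```

With real data, the same edge dictionary could be handed to a graph library or an interactive front end, and donor or interest-group data could be joined on the senator names.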

…such as the eXtensible Business Reporting Language (XBRL).

Saturday, April 28th, 2012

Now there is a shout-out! Better than Stephen Colbert or Jon Stewart? Possibly, possibly. 😉

Where? The DATA Act, recently passed by the House of Representatives (US), reads in part:

EXISTING DATA REPORTING STANDARDS.—In designating reporting standards under this subsection, the Commission shall, to the extent practicable, incorporate existing nonproprietary standards, such as the eXtensible Business Reporting Language (XBRL). [Title 31, Section 3611(b)(3). Doesn’t really roll off the tongue, does it?]

No guarantees but what do you think the odds are that XBRL will be used by the commission? (That’s what I thought.)

With that in mind:

XBRL

Homepage for XBRL.org and apparently the starting point for all things XBRL. You will find the specifications, taxonomies, best practices and other materials on XBRL.

Enough reading material to keep you busy while waiting for organizations to adopt or to be required to adopt XBRL.

Topic maps are relevant to this transition for several reasons, among others:

  1. Some organizations will have legacy accounting systems that require mapping to XBRL.
  2. Even organizations that have transitioned to XBRL will have legacy data that has not.
  3. Transitions to XBRL by different organizations may not reflect the same underlying semantics.
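The three points above can be sketched in a few lines. This is only an illustration under stated assumptions: the organization names, legacy account codes and the "example:" concept names are invented, and no real XBRL taxonomy or topic map engine is involved. The idea is that each legacy code is treated as an identifier for a subject, and codes that point at the same concept are grouped together, while codes with no mapping are reported rather than silently dropped.

```python
from collections import defaultdict

# Hypothetical mapping from (organization, legacy account code) to a concept.
# The "example:" prefix marks these as placeholders, not real taxonomy elements.
LEGACY_TO_CONCEPT = {
    ("org_a", "1000"): "example:Cash",
    ("org_a", "1200"): "example:AccountsReceivable",
    ("org_b", "CSH"): "example:Cash",
    ("org_b", "AR-01"): "example:AccountsReceivable",
}


def merge_by_concept(mapping):
    """Group legacy identifiers that refer to the same concept."""
    merged = defaultdict(list)
    for legacy_id, concept in mapping.items():
        merged[concept].append(legacy_id)
    # Sort for stable, readable output.
    return {concept: sorted(ids) for concept, ids in merged.items()}


def unmapped(records, mapping):
    """Return legacy identifiers in `records` that have no mapping yet."""
    return [rec for rec in records if rec not in mapping]


merged = merge_by_concept(LEGACY_TO_CONCEPT)
missing = unmapped([("org_a", "1000"), ("org_b", "INV")], LEGACY_TO_CONCEPT)
```

Point 3 is the hard part: two organizations may both map a code to the same concept name while meaning different things by it, which a lookup table like this cannot detect on its own.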

Workflow for statistical data analysis

Saturday, April 28th, 2012

Workflow for statistical data analysis by Christophe Lalanne.

A short summary of Oliver Kirchkamp’s Workflow of statistical data analysis, which takes the reader from data to paper.

Christophe says a more detailed review is likely to follow but at eighty-six (86) pages, you could read it yourself and make detailed comments as well.

Akaros – an open source operating system for manycore architectures

Saturday, April 28th, 2012

Akaros – an open source operating system for manycore architectures

From the post:

If you are interested in future forward OS designs then you might find Akaros worth a look. It’s an operating system designed for many-core architectures and large-scale SMP systems, with the goals of:

  • Providing better support for parallel and high-performance applications
  • Scaling the operating system to a large number of cores

A more in-depth explanation of the motivation behind Akaros can be found in Improving Per-Node Efficiency in the Datacenter with New OS Abstractions by Barret Rhoden, Kevin Klues, David Zhu, and Eric Brewer.

From the paper abstract:

Traditional operating system abstractions are ill-suited for high performance and parallel applications, especially on large-scale SMP and many-core architectures. We propose four key ideas that help to overcome these limitations. These ideas are built on a philosophy of exposing as much information to applications as possible and giving them the tools necessary to take advantage of that information to run more efficiently. In short, high-performance applications need to be able to peer through layers of virtualization in the software stack to optimize their behavior. We explore abstractions based on these ideas and discuss how we build them in the context of a new operating system called Akaros.

Rather than “layers of virtualization” I would say: “layers of identifiable subjects.” That’s hardly surprising but it has implications for this paper and future successors on the same issue.

Issues of inefficiency aren’t due to a lack of programming talent, as the authors ably demonstrate, but rather the limitations placed upon that talent by the subjects our operating systems identify and permit to be addressed.

The paper is an exercise in identifying different subjects than those identified in contemporary operating systems. That abstraction may assist future researchers in positing different subjects for identification and consequences that flow from identifying different subjects.

First Light – MS Open Tech: Redis on Windows

Saturday, April 28th, 2012

First Light – MS Open Tech: Redis on Windows

Claudio Caldato writes:

The past few weeks have been very busy in our offices as we announced the creation of Microsoft Open Technologies, Inc. Now that the dust has settled it’s time for us to resume our regular cadence in releasing code, and we are happy to share with you the very first deliverable from our new company: a new and significant iteration of our work on Redis on Windows, the open-source, networked, in-memory, key-value data store.

The major improvements in this latest version involve the process of saving data on disk. Redis on Linux uses an OS feature called Fork/Copy On Write. This feature is not available on Windows, so we had to find a way to be able to mimic the same behavior without changing completely the save on disk process so as to avoid any future integration issues with the Redis code.

Excellent news!

BTW, Microsoft Open Technologies has a presence on GitHub. Just the one project (Redis on Windows) but I am sure more will follow.
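If the fork/copy-on-write behavior mentioned in the quote is unfamiliar, the following sketch may help. It is not Redis code (Redis is written in C) and the function names are my own. It only illustrates, on a POSIX system where os.fork is available, why the feature is handy for saving: the child process keeps the view of the data it had at the moment of the fork, even while the parent goes on changing it.

```python
import os


def start_snapshot(data):
    """Fork a child that writes its frozen view of `data` to a pipe.

    Returns (child_pid, read_fd). POSIX only; illustrative helper names.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: sees the data exactly as it was at fork time.
        os.close(read_fd)
        os.write(write_fd, repr(sorted(data.items())).encode())
        os.close(write_fd)
        os._exit(0)  # leave immediately, skipping parent cleanup handlers
    # Parent: keep only the read end of the pipe.
    os.close(write_fd)
    return pid, read_fd


def finish_snapshot(pid, read_fd):
    """Collect the child's output and wait for it to exit."""
    chunks = []
    while True:
        chunk = os.read(read_fd, 4096)
        if not chunk:
            break
        chunks.append(chunk)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return b"".join(chunks).decode()


store = {"counter": 1}
pid, fd = start_snapshot(store)
store["counter"] = 2  # parent keeps changing data after the fork
snapshot = finish_snapshot(pid, fd)
```

The snapshot reflects the value from before the parent's update, which is the behavior the Windows port had to mimic by other means.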

How do I know if my figure is too complicated?

Saturday, April 28th, 2012

How do I know if my figure is too complicated?

From the post:

One of the key things every statistician needs to learn is how to create informative figures and graphs. Sometimes, it is easy to use off-the-shelf plots like barplots, histograms, or if one is truly desperate a pie-chart.

But sometimes the information you are trying to communicate requires the development of a new graphic. I am currently working on a project with a graduate student where the standard illustration are Venn Diagrams – including complicated Venn Diagrams with 5 or 10 circles.

The post goes on to give some good suggestions for evaluating a figure for being “…too complicated.”

One of the comments takes up the refrain:

Simplicity is good, but you shouldn’t sacrifice the possibility of depth of understanding in order to save your readers from the pain of having to think.

True, but when we make a diagram, don’t we usually make it of our current, rather complete understanding? We omit the near misses, torn up drafts, slips of the pen, consequential and otherwise, and present the reader with a polished final product.

Which bears little resemblance to how we got there, but looks impressive now that we have arrived.

I don’t think readers mind thinking, but on the other hand, we should not expect readers to be smarter than we are.

There isn’t any reason, other than habit, that we can’t say: “When I saw graphic X, which is incorrect but it made me think of…” and make our publications more closely resemble how we actually do research.