Archive for July, 2012

Processing Rat Brain Neuronal Signals Using A Hadoop Computing Cluster – Part I

Tuesday, July 31st, 2012

Processing Rat Brain Neuronal Signals Using A Hadoop Computing Cluster – Part I by Jadin C. Jackson, PhD & Bradley S. Rubin, PhD.

From the introduction:

In this three-part series of posts, we will share our experiences tackling a scientific computing challenge that may serve as a useful practical example for those readers considering Hadoop and Hive as an option to meet their growing technical and scientific computing needs. This first part describes some of the background behind our application and the advantages of Hadoop that make it an attractive framework in which to implement our solution. Part II dives into the technical details of the data we aimed to analyze and of our solution. Finally, we wrap up this series in Part III with a description of some of our main results, and most importantly perhaps, a list of things we learned along the way, as well as future possibilities for improvements.


Problem Statement

Prior to starting this work, Jadin had data gathered by himself and by other neuroscience researchers interested in the role of the brain region called the hippocampus. In both rats and humans, this region is responsible for both spatial processing and memory storage and retrieval. For example, as a rat runs a maze, neurons in the hippocampus, each representing a point in space, fire in sequence. When the rat revisits a path and pauses to make decisions about how to proceed, those same neurons fire in similar sequences as the rat considers the previous consequences of taking one path versus another. In addition to this binary-like firing of neurons, brain waves, produced by ensembles of neurons, are present in different frequency bands. These act somewhat like clock signals, and the phase relationships of these signals correlate to specific brain signal pathways that provide input to this sub-region of the hippocampus.

The goal of the underlying neuroscience research is to correlate the physical state of the rat with specific characteristics of the signals coming from the neural circuitry in the hippocampus. Those signal differences reflect the origin of signals to the hippocampus. Signals that arise within the hippocampus indicate actions based on memory input, such as reencountering previously encountered situations. Signals that arise outside the hippocampus correspond to other cognitive processing. In this work, we digitally signal process the individual neuronal signal output and turn it into spectral information related to the brain region of origin for the signal input.

If this doesn’t sound like a topic map related problem on your first read, what would you call the “…brain region of origin for the signal input[?]”

That is if you wanted to say something about it. Or wanted to associate information, oh, I don’t know, captured from a signal processing application with it?

Hmmm, that’s what I thought too.

Besides, it is a good opportunity for you to exercise your Hadoop skills. Never a bad thing to work on the unfamiliar.

Running a UIMA Analysis Engine in a Lucene Analyzer Chain

Tuesday, July 31st, 2012

Running a UIMA Analysis Engine in a Lucene Analyzer Chain by Sujit Pal.

From the post:

Last week, I wrote about a UIMA Aggregate Analysis Engine (AE) that annotates keywords in a body of text, optionally inserting synonyms, using a combination of pattern matching and dictionary lookups. The idea is that this analysis will be done on text on its way into a Lucene index. So this week, I describe the Lucene Analyzer chain that I built around the AE I described last week.

A picture is worth a thousand words, so here is one that shows what I am (or will be soon, in much greater detail) talking about.

[Graphic omitted]

As you can imagine, most of the work happens in the UimaAETokenizer. The tokenizer is a buffering (non-streaming) Tokenizer, i.e., the entire text is read from the Reader and analyzed by the UIMA AE, then individual tokens are returned on successive calls to its incrementToken() method. I decided to use the new (to me) AttributeSource.State object to keep track of the tokenizer’s state between calls to incrementToken() (I found out about it by grokking through the Synonym filter example in the LIA2 book).

After (UIMA) analysis, the annotated tokens are marked as Keyword, and any transformed values for the annotation are set into the SynonymMap (for use by the synonym filter, next in the chain). Text that is not annotated is split up (by punctuation and whitespace) and returned as plain Lucene Term (or CharTerm since Lucene 3.x) tokens. Here is the code for the Tokenizer class.
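The Java Tokenizer itself isn't reproduced here, but the buffering pattern it uses is worth seeing in miniature. Below is a hedged Python sketch of that pattern only, not the actual Lucene/UIMA API: the whole text is analyzed once up front, and tokens are then handed back one per call, the way incrementToken() works. The `naive_analyze` function is a hypothetical stand-in for the UIMA Analysis Engine.

```python
class BufferingTokenizer:
    """Sketch of the buffering (non-streaming) tokenizer pattern: analyze
    the entire text once, then return one token per call."""

    def __init__(self, text, analyze):
        # analyze() stands in for the UIMA AE: it sees the full text and
        # returns (token, is_keyword) pairs.
        self._tokens = analyze(text)
        self._pos = 0

    def increment_token(self):
        # Next buffered token, or None when exhausted (Lucene's
        # incrementToken() signals exhaustion by returning false).
        if self._pos >= len(self._tokens):
            return None
        tok = self._tokens[self._pos]
        self._pos += 1
        return tok

def naive_analyze(text):
    # Hypothetical stand-in analysis: upper-case words become keywords.
    return [(w, w.isupper()) for w in text.split()]

tok = BufferingTokenizer("index UIMA text", naive_analyze)
print(tok.increment_token())  # ('index', False)
```

The point of the pattern is that the expensive analysis runs exactly once per document, while the consumer still sees the incremental, pull-style interface Lucene expects.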

The second of two posts brought to my attention by Jack Park.

Part of my continuing interest in indexing. In part because we know that indexing scales. Seriously scales.

UIMA Analysis Engine for Keyword Recognition and Transformation

Tuesday, July 31st, 2012

UIMA Analysis Engine for Keyword Recognition and Transformation by Sujit Pal.

From the post:

You have probably noticed that I’ve been playing with UIMA lately, perhaps a bit aimlessly. One of my goals with UIMA is to create an Analysis Engine (AE) that I can plug into the front of the Lucene analyzer chain for one of my applications. The AE would detect and mark keywords in the input stream so they would be exempt from stemming by downstream Lucene analyzers.

So a couple of weeks ago, I picked up the bits and pieces of UIMA code that I had written and started to refactor them to form a sequence of primitive AEs that detected keywords in text using pattern and dictionary recognition. Each primitive AE places new KeywordAnnotation objects into an annotation index.

The primitive AEs I came up with are pretty basic, but offer a surprising amount of bang for the buck. There are just two annotators – the PatternAnnotator and DictionaryAnnotator – that do the processing for my primitive AEs listed below. Obviously, more can be added (and will be, eventually) as required.

  • Pattern based keyword recognition
  • Pattern based keyword recognition and transformation
  • Dictionary based keyword recognition, case sensitive
  • Dictionary based keyword recognition and transformation, case sensitive
  • Dictionary based keyword recognition, case insensitive
  • Dictionary based keyword recognition and transformation, case insensitive

The first of two posts that I missed from last year, recently brought to my attention by Jack Park.

The ability to annotate, implying, among other things, the ability to create synonym annotations for keywords.

Data Shaping in Google Refine

Tuesday, July 31st, 2012

Data Shaping in Google Refine by AJ Hirst.

From the post:

One of the things I’ve kept stumbling over in Google Refine is how to use it to reshape a data set, so I had a little play last week and worked out a couple of new (to me) recipes.

The first relates to reshaping data by creating new rows based on columns. For example, suppose we have a data set that has rows relating to Olympics events, and columns relating to Medals, with cell entries detailing the country that won each medal type:
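The reshaping Hirst describes, turning medal columns into rows, is the classic wide-to-long "melt" operation. Here is a hedged pure-Python sketch of it; the column names and medal winners are invented for illustration, not taken from his post:

```python
# Hypothetical wide-format data: one row per event, one column per medal.
wide = [
    {"Event": "100m", "Gold": "USA", "Silver": "JAM", "Bronze": "TTO"},
    {"Event": "200m", "Gold": "JAM", "Silver": "USA", "Bronze": "FRA"},
]

def melt(rows, id_col, value_cols):
    # Emit one long-format row per (id, value-column) pair.
    return [{id_col: row[id_col], "Medal": col, "Country": row[col]}
            for row in rows
            for col in value_cols]

long_rows = melt(wide, "Event", ["Gold", "Silver", "Bronze"])
```

Google Refine reaches the same long format through its transpose operations; the sketch just shows what the reshape itself does to the data.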

A bit practical, but I was in a conversation earlier today about re-shaping a topic map, so “practical” things are on my mind.

With the amount of poorly structured data on the web, you will find this useful.

I first saw this at: Dzone.

WDM 2012 : Special Session on Web Data Matching

Tuesday, July 31st, 2012

Call for Papers of the Special Session on Web Data Matching – WDM 2012

When Nov 21, 2012 – Nov 23, 2012
Where São Carlos, Brazil
Submission Deadline Aug 15, 2012
Notification Due Sep 15, 2012
Final Version Due Sep 30, 2012

From the call for papers:

Under the framework of the 8th International Conference on Next Generation Web Services Practices (NWeSP 2012), 21-23 November 2012 in São Carlos, Brazil


In recent years, research in the area of web mining and web searching has grown rapidly, mainly thanks to the growing complexity of digital data and the huge quantity of new data available every day. A Web user wishing to find information on a particular subject must usually guess the keywords under which that information might be classified by a standard search engine. There are also new approaches, such as various methods for classifying web data based on analysis of unstructured and structured web data and the use of human and social factors. The WDM workshop focuses mainly (but not only) on methods of analysis of web data leading to their classification, and their use to improve user orientation on the Web.

Specific topics of interest

To address the aforementioned aspects of evolution of social networks, the preferred topics for this special session are (but not limited to):

  • Web pattern recognition and matching
  • Web information extraction
  • Web content mining
  • Web genre detection
  • Deep web analysis
  • Relevance and ranking of web data
  • Web search systems and applications
  • Mapping structured and unstructured web data

I realize it is fashionable to sprinkle “web” or “web scale” in papers and calls for papers but is the object of our study really any different?

Does it matter for authorship, genre, entity extraction, data mining, whether the complete texts of Shakespeare are on your local hard drive or some website?

Or to put it another way, should the default starting point be to consider all the data on the Web?

How would you create a lens or filter to enable a user to start with “relevant” resources for a query?

Cypher Query Language and Neo4j [Webinar]

Tuesday, July 31st, 2012

Cypher Query Language and Neo4j [Webinar]

Thursday August 30 10:00 PDT / 19:00 CEST

From the registration page:

The Neo4j graph database is all about relationships. It allows you to model domains of connected data easily. Querying using an imperative API is cumbersome and bloated, so the Neo Technology team decided to develop a query language better suited to querying graph data.

Taking inspiration from SQL, SPARQL and others, and using Scala to implement it, turned out to be a good decision. The parser-combinator library, functional composition and lazy evaluation made it easy to move ahead. Join us to learn about the journey from its inception to a usable tool.

Speaker: Michael Hunger, Community Lead and Head of Spring Integration, Neo Technology.

Take a look at the documentation before the webinar. Look for “Cypher Query Language” in the table of contents.

Political Moneyball

Tuesday, July 31st, 2012

Nathan Yau points out the Wall Street Journal’s “Political Moneyball” visualization in Network of political contributions.

You will probably benefit from starting with Nathan’s comments and then navigating the WSJ visualization.

I like the honesty of the Wall Street Journal. They have chosen a side and yet see humor in its excesses.

Nathan mentions the difficulty with unfamiliar names and organizations.

An example of where topic maps could enable knowledgeable users to gather information together for the benefit of subsequent, less knowledgeable users of the map.

Creating the potential for a collaborative, evolutionary information resource that improves with usage.

Records Labels in cool Neo4j Graph Visualization

Tuesday, July 31st, 2012

Records Labels in cool Neo4j Graph Visualization

From the post:

Corey Farwell presented at the SF Graph Database Meetup in July, where he discussed his app RIAARadar, that lets you search for any album, single or band and see if they are affiliated with the RIAA. While still in alpha stages, the dataset caused an animated discussion. Coincidentally, Farwell’s visualization of his dataset was also shown in another presentation that night, by Mathieu Bastian, the co-founder of Gephi and data scientist at LinkedIn for their InMaps graph visualization tool.

BTW, Corey has taken over the RIAARadar domain and is in the process of rebuilding it.

See: RIAA Radar for ways you can help/contribute.

Vertical Scaling made easy through high-performance actors

Tuesday, July 31st, 2012

Vertical Scaling made easy through high-performance actors

From the webpage:

Vertical scaling is today a major issue when writing server code. Threads and locks are the traditional approach to making full utilization of fat (multi-core) computers, but the result is code that is difficult to maintain and too often does not run much faster than single-threaded code.

Actors make good use of fat computers but tend to be slow as messages are passed between threads. Attempts to optimize actor-based programs result in actors with multiple concerns (loss of modularity) and lots of spaghetti code.

The approach used by JActor is to minimize the messages passed between threads by executing the messages sent to idle actors on the same thread used by the actor which sent the message. Message buffering is used when messages must be sent between threads, and two-way messaging is used for implicit flow control. The result is an approach that is easy to maintain and which, with a bit of care to the architecture, provides extremely high rates of throughput.

On an Intel i7, 250 million messages can be passed between actors in the same JVM per second, several orders of magnitude faster than comparable actor frameworks.
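The core trick, running a message to an idle actor inline on the sending thread and buffering only when the target is busy, can be sketched in a few lines. This is an illustrative Python sketch of the idea, not JActor's actual API:

```python
class Actor:
    """A message sent to an idle actor runs inline on the sender's thread;
    messages sent to a busy actor are buffered and drained by whichever
    thread is already inside it."""

    def __init__(self, handler):
        self.handler = handler
        self.busy = False
        self.mailbox = []

    def send(self, msg):
        self.mailbox.append(msg)
        if self.busy:
            return  # buffered; the active drain loop below will handle it
        self.busy = True
        while self.mailbox:  # drain, including messages arriving mid-handling
            self.handler(self.mailbox.pop(0))
        self.busy = False

log = []

def handler(msg):
    log.append(msg)
    if msg == "start":
        actor.send("follow-up")  # actor is busy here, so this is buffered

actor = Actor(handler)
actor.send("start")  # idle actor: handler runs inline, no thread handoff
```

Because the common case never crosses a thread boundary, there is no context switch or queue contention to pay for; buffering only kicks in when two callers actually collide.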

Hmmm, 250 million messages a second? On the topic map (TM) scale, that’s what?, about 1/4 TM? 😉

Seriously, if you are writing topic map server software, you need to take a look at JActor.

Big Data Machine Learning: Patterns for Predictive Analytics

Tuesday, July 31st, 2012

Big Data Machine Learning: Patterns for Predictive Analytics by Ricky Ho.

A DZone “refcard” and as you might expect, a bit “slim” to cover predictive analytics. Still, printed in full color it would make a nice handout on predictive analytics for a general audience.

What would you add to make a “refcard” on a particular method?

Or for that matter, what would you include to make a “refcard” on popular government resources? Can you name all the fields on the campaign disclosure files? Thought not.

Ignorance by Stuart Firestein; It’s Not Rocket Science by Ben Miller – review

Tuesday, July 31st, 2012

Ignorance by Stuart Firestein; It’s Not Rocket Science by Ben Miller – review by Adam Rutherford

From the review, speaking of “Ignorance” by Stuart Firestein, Adam writes:

Stuart Firestein, a teacher and neuroscientist, has written a splendid and admirably short book about the pleasure of finding things out using the scientific method. He smartly outlines how science works in reality rather than in stereotype. His MacGuffin – the plot device to explore what science is – is ignorance, on which he runs a course at Columbia University in New York. Although the word “science” is derived from the Latin scire (to know), this misrepresents why it is the foundation and deliverer of civilisation. Science is to not know but have a method to find out. It is a way of knowing.

Firestein is also quick to dispel the popular notion of the scientific method, more often than not portrayed as a singular thing enshrined in stone. The scientific method is more of a utility belt for ignorance. Certainly, falsification and inductive reasoning are cornerstones of converting unknowns to knowns. But much published research is not hypothesis-driven, or even experimental, and yet can generate robust knowledge. We also invent, build, take apart, think and simply observe. It is, Firestein says, akin to looking for a black cat in a darkened room, with no guarantee the moggy is even present. But the structure of ignorance is crucial, and not merely blind feline fumbling.

The size of your questions is important, and will be determined by how much you know. Therein lies a conundrum of teaching science. Questions based on pure ignorance can be answered with knowledge. Scientific research has to be born of informed ignorance, otherwise you are not finding new stuff out. Packed with real examples and deep practical knowledge, Ignorance is a thoughtful introduction to the nature of knowing, and the joy of curiosity.

Not to slight “It’s Not Rocket Science,” but I am much more sympathetic to discussions of the “…structure of ignorance…” and how we model those structures.

If you are interested in such arguments, consider the Oxford Handbook of Skepticism. I don’t have a copy (you can fix that if you like) but it is reported to have good coverage of the subject of ignorance.

SharePoint Module 3.2 Hotfix 4 now available

Monday, July 30th, 2012

SharePoint Module 3.2 Hotfix 4 now available

From the post:

A new hotfix package is available for version 3.2 of the TMCore SharePoint Module.

Systems Affected

This hotfix should be applied to any installation of the TMCore SharePoint Module 3.2 downloaded before 30th July 2012. If you downloaded your copy of the software from our site on or after this date, the hotfix is included in the package and you do not need to apply it again.

To determine if your system is affected, check the File Version property of the assembly NetworkedPlanet.SharePoint in the GAC (browse to C:\Windows\ASSEMBLY, locate the NetworkedPlanet.SharePoint assembly, right-click and choose Properties. The File Version can be found on the Version tab above Description and Copyright). This hotfix updates the File Version of the NetworkedPlanet.SharePoint assembly to – if the file version shown is greater than or equal to, then you do not need to apply this hotfix.

I assume of interest mostly to Windows installations.

I don’t know of anyone running MS SharePoint on a Linux-based VM. Do you?

Chaos Monkey released into the wild

Monday, July 30th, 2012

Chaos Monkey released into the wild by Cory Bennett and Ariel Tseitlin

From the post:

We have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient. We are excited to make a long-awaited announcement today that will help others who embrace this approach.

We have written about our Simian Army in the past and we are now proud to announce that the source code for the founding member of the Simian Army, Chaos Monkey, is available to the community.

Do you think your applications can handle a troop of mischievous monkeys loose in your infrastructure? Now you can find out.

What is Chaos Monkey?

Chaos Monkey is a service which runs in the Amazon Web Services (AWS) that seeks out Auto Scaling Groups (ASGs) and terminates instances (virtual machines) per group. The software design is flexible enough to work with other cloud providers or instance groupings and can be enhanced to add that support. The service has a configurable schedule that, by default, runs on non-holiday weekdays between 9am and 3pm. In most cases, we have designed our applications to continue working when an instance goes offline, but in those special cases that they don’t, we want to make sure there are people around to resolve and learn from any problems. With this in mind, Chaos Monkey only runs within a limited set of hours with the intent that engineers will be alert and able to respond.
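The scheduling logic described above, one random victim per Auto Scaling Group, only during working hours when engineers can respond, is easy to sketch. A minimal Python illustration (the function and group names are invented, not Chaos Monkey's real configuration):

```python
import random
from datetime import date, datetime

def should_run(now, holidays=()):
    # Default schedule per the post: non-holiday weekdays, 9am to 3pm.
    return (now.weekday() < 5
            and now.date() not in holidays
            and 9 <= now.hour < 15)

def pick_victims(groups, rng=random):
    # One victim instance per Auto Scaling Group per run.
    return {name: rng.choice(members)
            for name, members in groups.items() if members}

if should_run(datetime(2012, 7, 31, 10, 30)):  # a Tuesday morning
    victims = pick_victims({"web-asg": ["i-01", "i-02", "i-03"]})
```

The real service then terminates each chosen instance through the AWS API; the sketch stops at the selection step.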

At first I was unsure whether Netflix hopes its competitors will run Chaos Monkey or whether they really run it internally. 😉

It certainly is a way to test your infrastructure. And quite possibly a selling point to clients who want more than projected or historical robustness.

Makes me curious, allowing for different infrastructures, how would you stress test a topic map installation?

And do so on a regular basis?

I first saw this at Alex Popescu’s myNoSQL.

Map of connections in the human brain

Monday, July 30th, 2012

Map of connections in the human brain by Nathan Yau.

From the post:

Using a new kind of MRI scanner, scientists at the National Institutes of Health mapped the connections in the human brain, revealing an intricate, grid-like structure.

Nathan creates and points to some of the finest graphics on the Net. This is just an example.

Before you run off to create a grid structure in hardware/software, remember the messaging protocols we can’t name, much less model in the brain.

This is a good step, but a small one.

Big Data: The Good, the Bad and the Ugly

Monday, July 30th, 2012

Not the exact title of the Pew Internet Project‘s latest survey on “big data.”

The project homepage reports: The Future of Big Data by Janna Anderson and Lee Rainie.

You won’t discover that title reading the report.

As a matter of fact, the phrase “future of big data” occurs only twice and never in a title.

The formalities of report writing to one side, the report has quotes for whatever side you care to take in the “big data” debate. Forward it to your PR department.

I get a sense from the report there will be winners and losers in a future that includes “big data.”

But no more so than the integrated circuit brought to an end the short-lived TV repair industry based on vacuum tubes.

I suggest planning on being on the side of, if not one of, the winners in the coming changes wrought by “big data.”

In that regard, the techniques and technologies are changing too rapidly to make long-term bets.

Be flexible and remember that technologies are only as good as they are useful (or thought to be useful by users or governments).

Find religion outside of technology. You will stay on the cutting edge longer.

U.S. Census Bureau Offers Public API for Data Apps

Monday, July 30th, 2012

U.S. Census Bureau Offers Public API for Data Apps by Nick Kolakowski.

From the post:

For any software developers with an urge to play around with demographic or socio-economic data: the U.S. Census Bureau has launched an API for Web and mobile apps that can slice that statistical information in all sorts of nifty ways.

The API draws data from two sets: the 2010 Census (statistics include population, age, sex, and race) and the 2006-2010 American Community Survey (offers information on education, income, occupation, commuting, and more). In theory, developers could use those datasets to analyze housing prices for a particular neighborhood, or gain insights into a city’s employment cycles.

The APIs include no information that could identify an individual. (emphasis added)
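Since the API is queried over plain HTTP, a request is just a URL. A hedged sketch of building one in Python: the endpoint path and variable names here are assumptions based on the announced 2010 Census (SF1) dataset, so check the Census Bureau's API documentation before relying on them.

```python
from urllib.parse import urlencode

# Endpoint path is an assumption -- verify against the Census API docs.
BASE = "https://api.census.gov/data/2010/sf1"

def census_url(variables, geography, api_key):
    # e.g. variables=["P0010001", "NAME"] (total population, area name),
    # geography="state:*" for all states.
    query = urlencode({"key": api_key,
                       "get": ",".join(variables),
                       "for": geography})
    return f"{BASE}?{query}"

url = census_url(["P0010001", "NAME"], "state:*", "YOUR_KEY")
```

Fetching that URL returns JSON rows you can feed straight into whatever analysis or mapping pipeline you have.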

I suppose it should say: “Some assembly required.”

Similar resources at and Google Public Data Explorer.

I first saw this at: Dashboard Insight.

HBase Replication Overview

Monday, July 30th, 2012

HBase Replication Overview by Himanshu Vashishtha.

From the post:

HBase Replication is a way of copying data from one HBase cluster to a different and possibly distant HBase cluster. It works on the principle that the transactions from the originating cluster are pushed to another cluster. In HBase jargon, the cluster doing the push is called the master, and the one receiving the transactions is called the slave. This push of transactions is done asynchronously, and these transactions are batched in a configurable size (default is 64MB). Asynchronous mode incurs minimal overhead on the master, and shipping edits in a batch increases the overall throughput.

This blogpost discusses the possible use cases, underlying architecture and modes of HBase replication as supported in CDH4 (which is based on 0.92). We will discuss Replication configuration, bootstrapping, and fault tolerance in a follow up blogpost.

Use cases

HBase replication supports replicating data across datacenters. This can be used for disaster recovery scenarios, where we can have the slave cluster serve real time traffic in case the master site is down. Since HBase replication is not intended for automatic failover, the act of switching from the master to the slave cluster in order to start serving traffic is done by the user. Afterwards, once the master cluster is up again, one can do a CopyTable job to copy the deltas to the master cluster (by providing the start/stop timestamps) as described in the CopyTable blogpost.

Another replication use case is when a user wants to run load intensive MapReduce jobs on their HBase cluster; one can do so on the slave cluster while bearing a slight performance decrease on the master cluster.
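For the sysadmins in the audience, the knobs mentioned above (enabling replication, the 64MB shipment batch) live in hbase-site.xml. A sketch of what the entries might look like; the property names are assumptions from CDH4-era documentation, so verify them against your HBase version:

```xml
<!-- Sketch only: confirm property names for your HBase release. -->
<property>
  <name>hbase.replication</name>
  <value>true</value> <!-- enable replication on master and slave clusters -->
</property>
<property>
  <name>replication.source.size.capacity</name>
  <value>67108864</value> <!-- batch size for shipped edits, in bytes (64MB) -->
</property>
```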

So there is a non-romantic, sysadmin side to “big data.” I understand, no one ever even speaks unless something has gone wrong with the system. Sysadmins either get no contacts (a good thing) or pages, tweets, emails, phone calls and physical visits from irate users, managers, etc.

This post is a start towards always having the first case, no contacts. Leaves you more time for things that interest sysadmins. I won’t tell if you don’t.

Search Solutions 2012: your opportunity to shape this year’s event

Monday, July 30th, 2012

Search Solutions 2012: your opportunity to shape this year’s event by Tony Russell-Rose.

From the post:

We’re just in the process of drafting the programme for Search Solutions 2012, to be held on November 28-29 at BCS London. As in previous years, we aim to offer a topical selection of presentations, panels and keynote talks by influential industry leaders on novel and emerging applications in search and information retrieval, whilst maintaining the collegiate spirit of a community event. If you’ve never been before, take a look at last year’s programme.

We don’t normally issue a formal “call for papers” as such, but if you’d like to get involved (perhaps as a panellist or speaker) and have an interesting case study or demo to present, then drop me a line. In the meantime, save the date: tutorials day on November 28, main event on November 29.

The 2011 programme page has presentation downloads if you are having trouble making up your mind. 😉

BioExtract Server

Monday, July 30th, 2012

BioExtract Server: data access, analysis, storage and workflow creation

From “About us:”

BioExtract harnesses the power of online informatics tools for creating and customizing workflows. Users can query online sequence data, analyze it using an array of informatics tools (web service and desktop), create and share custom workflows for repeated analysis, and save the resulting data and workflows in standardized reports. This work was initially supported by NSF grant 0090732. Current work is being supported by NSF DBI-0606909.

A great tool for sequence data researchers and a good example of what is possible with other structured data sets.

Much has been made (and rightly so) of the need for and difficulties of processing unstructured data.

But we should not ignore the structured data dumps being released by governments and other groups around the world.

And we should recognize that hosted workflows and processing can make insights into data a matter of skill, rather than ownership of enough hardware.

Neo4j and Bioinformatics [Webinar]

Monday, July 30th, 2012

Neo4j and Bioinformatics [Webinar]

Thursday August 9 10:00 PDT / 19:00 CEST

From the webpage:

The world of data is changing. Big Data and NOSQL are bringing new ways of understanding your data.

This opens a whole new world of possibilities for a wide range of fields, and bioinformatics is no exception. This paradigm provides bioinformaticians with a powerful and intuitive framework to deal with biological data that is naturally interconnected.

Pablo Pareja will give an overview of Bio4j project, and then move to some of its recent applications.

  • BG7: a new system for bacterial genome annotation designed for NGS data
  • MG7: metagenomics + taxonomy integration
  • Evolutionary studies, transcriptional networks, network analysis..
  • Future directions

Speaker: Pablo Pareja, Project Leader of Bio4j

If you are thinking about “scale,” consider the current stats on Bio4j:

The current version of Bio4j includes:

Relationships: 530,642,683

Nodes: 76,071,411

Relationship types: 139

Node types: 38

With room to spare!

Video about a Problem of Inductive Arguments

Sunday, July 29th, 2012

Video about a Problem of Inductive Arguments from Dr. Adam Wyner.

From the post:

A nice cartoon illustration of the problem with inductive arguments in a social context. A video on youtube, so there is an ad popup. Best watched as a loop to appreciate the full point:

Makes me wish I knew how to do animation.

Will make you re-consider the use of induction in your topic map!

AstroPython

Sunday, July 29th, 2012

AstroPython
From the webpage:

The purpose of this web site is to act as a community knowledge base for performing astronomy research with Python. It provides lists of useful resources, a forum for general discussion, advice, or relevant news items, collecting users’ code snippets or scripts, and longer tutorials on specific topics. The topics within these pages are presented in a list view with the ability to sort by date or topic. A traditional “blog” view of the most recently posted topics is visible from the site Home page.

Along with the other astronomy applications I have mentioned this weekend I thought you might find this useful.

Skills with Python, data processing, and subject identification/mapping transfer across disciplines.

OSCON 2012

Sunday, July 29th, 2012

OSCON 2012

Over 4,000 photographs were taken at the MS booth. I wonder how many of them include Doug?

Drop by the OSCON website after you count photos of Doug. Your efforts at topic mapping will improve from the experience.

What you get from counting photos of Doug is unknown. 😉

SAOImage DS9

Sunday, July 29th, 2012

SAOImage DS9

From the webpage:

SAOImage DS9 is an astronomical imaging and data visualization application. DS9 supports FITS images and binary tables, multiple frame buffers, region manipulation, and many scale algorithms and colormaps. It provides for easy communication with external analysis tasks and is highly configurable and extensible via XPA and SAMP.

DS9 is a stand-alone application. It requires no installation or support files. All versions and platforms support a consistent set of GUI and functional capabilities.

DS9 supports advanced features such as 2-D, 3-D and RGB frame buffers, mosaic images, tiling, blinking, geometric markers, colormap manipulation, scaling, arbitrary zoom, cropping, rotation, pan, and a variety of coordinate systems.

The GUI for DS9 is user configurable. GUI elements such as the coordinate display, panner, magnifier, horizontal and vertical graphs, button bar, and color bar can be configured via menus or the command line.

New in Version 7

3-D Data Visualization

Previous versions of SAOImage DS9 would allow users to load 3-D data into traditional 2-D frames, and would allow users to step through successive z-dimension pixel slices of the data cube. To visualize 3-D data in DS9 v. 7.0, a new module, encompassed by the new Frame 3D option, allows users to load and view data cubes in multiple dimensions.

The new module implements a simple ray-trace algorithm. For each pixel on the screen, a ray is projected back into the view volume, based on the current viewing parameters, returning a data value if the ray intersects the FITS data cube. To determine the value returned, there are 2 methods available, Maximum Intensity Projection (MIP) and Average Intensity Projection (AIP). MIP returns the maximum value encountered, AIP returns an average of all values encountered. At this point, normal DS9 operations are applied, such as scaling, clipping and applying a color map.
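The MIP/AIP distinction described above is simple to see in code. A hedged Python sketch of an axis-aligned projection (the real DS9 casts rays through an arbitrarily oriented view volume; this simplification projects straight along z):

```python
def project(cube, mode="MIP"):
    # cube[z][y][x]: for each output pixel (y, x), gather the values along
    # the z ray, then take the maximum (MIP) or the average (AIP).
    depth, h, w = len(cube), len(cube[0]), len(cube[0][0])
    img = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ray = [cube[z][y][x] for z in range(depth)]
            img[y][x] = max(ray) if mode == "MIP" else sum(ray) / depth
    return img
```

As the text notes, scaling, clipping, and the color map are applied to the resulting 2-D image afterwards, just as for ordinary frames.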

Color Tags

The purpose of color tags are to highlight (or hide) certain values of data, regardless of the color map selected. The user creates, edits, and deletes color tags via the GUI. From the color parameters dialog, the user can load, save, and delete all color tags for that frame.

Cropping

DS9 now supports cropping the current image, via the GUI, command line, or XPA/SAMP in both 2-D and 3-D. The user may specify a rectangular region of the image data as a center and width/height in any coordinate system via the Crop Dialog, or can interactively select the region of the image to display by clicking and dragging while in Crop Mode.

I encountered SAOImage DS9 in the links section of an astroinformatics blog.

A good example of a very high-end image/data-cube exploration and processing application.

You are likely to encounter a number of subjects worthy of comment using this application.

Exploring the rationality of some syntactic merging operators (extended version)

Sunday, July 29th, 2012

Exploring the rationality of some syntactic merging operators (extended version) by José Luis Chacón and Ramón Pino Pérez


Most merging operators are defined by semantic methods which have very high computational complexity. In order to have operators with lower computational complexity, some merging operators defined in a syntactical way have been proposed. In this work we define some syntactical merging operators and explore their rationality properties. To do that we constrain the belief bases to be sets of formulas very close to logic programs, and the underlying logic is defined through the forward chaining rule (Modus Ponens). We propose two types of operators: arbitration operators, when the inputs are only two bases, and fusion with integrity constraints operators. We introduce a set of postulates inspired by the LS postulates proposed by Liberatore and Schaerf, and then analyze the first class of operators through these postulates. We also introduce a set of postulates inspired by the KP postulates proposed by Konieczny and Pino Pérez, and then analyze the second class of operators through these postulates.

Another paper on logic based merging.

I created a separate tag, “merging operators,” to distinguish this from the merging we experience with TMDM based topic maps.

The merging here refers to merging of beliefs to form a coherent view of the world.

A topic map, not subject to other constraints, can “merge” data about a subject that leads to different inferences or is even factually contradictory.

Even if logical consistency post-merging isn’t your requirement, this is a profitable paper to read.

I will see what other resources I can find on logic based merging.

Elaborating Intersection and Union Types

Sunday, July 29th, 2012

Elaborating Intersection and Union Types by Joshua Dunfield.


Designing and implementing typed programming languages is hard. Every new type system feature requires extending the metatheory and implementation, which are often complicated and fragile. To ease this process, we would like to provide general mechanisms that subsume many different features.

In modern type systems, parametric polymorphism is fundamental, but intersection polymorphism has gained little traction in programming languages. Most practical intersection type systems have supported only refinement intersections, which increase the expressiveness of types (more precise properties can be checked) without altering the expressiveness of terms; refinement intersections can simply be erased during compilation. In contrast, unrestricted intersections increase the expressiveness of terms, and can be used to encode diverse language features, promising an economy of both theory and implementation.

We describe a foundation for compiling unrestricted intersection and union types: an elaboration type system that generates ordinary lambda-calculus terms. The key feature is a Forsythe-like merge construct. With this construct, not all reductions of the source program preserve types; however, we prove that ordinary call-by-value evaluation of the elaborated program corresponds to a type-preserving evaluation of the source program.

We also describe a prototype implementation and applications of unrestricted intersections and unions: records, operator overloading, and simulating dynamic typing.

Definitely a paper to read if you are interested in merging issues.

I will be mining its citations to provide pointers to more of the literature in this area.

A semantically diverse (from topic maps) effort to address semantic diversity.

Not ironic, but encouraging.

Knowing that around the next paper, conference, footnote or conversation, new semantic riches await.

Building a new Lucene postings format

Sunday, July 29th, 2012

Building a new Lucene postings format by Mike McCandless.

From the post:

As of 4.0 Lucene has switched to a new pluggable codec architecture, giving the application full control over the on-disk format of all index files. We have a nice collection of builtin codec components, and developers can create their own such as this recent example using a Redis back-end to hold updatable fields. This is an important change since it removes the previous sizable barriers to innovating on Lucene’s index formats.

A codec is actually a collection of formats, one for each part of the index. For example, StoredFieldsFormat handles stored fields, NormsFormat handles norms, etc. There are eight formats in total, and a codec could simply be a new mix of pre-existing formats, or perhaps you create your own TermVectorsFormat and otherwise use all the formats from the Lucene40 codec, for example.

Current testing of formats requires that the entire format be specified, which makes errors hard to diagnose.

Mike addresses that problem by creating a layered testing mechanism.

Great stuff!

PS: I think it will also be useful as an educational tool: change the defined formats and test as the changes are made.

Open Services for Lifecycle Collaboration (OSLC)

Sunday, July 29th, 2012

Open Services for Lifecycle Collaboration (OSLC)

This is one of the efforts mentioned in: Linked Data: Esperanto for APIs?.

From the about page:

Open Services for Lifecycle Collaboration (OSLC) is a community of software developers and organizations that is working to standardize the way that software lifecycle tools can share data (for example, requirements, defects, test cases, plans, or code) with one another.

We want to make integrating lifecycle tools a practical reality. (emphasis in original)

That’s a far cry from:

At the very least, however, a generally accepted approach to linking data within applications that make the whole programmable Web concept more accessible to developers of almost every skill level should not be all that far off from here.

It has an ambitious but well-defined scope, which will lend itself to the development and testing of standards for the interchange of information.

Despite semantic diversity, those are tasks that can be identified and that would benefit from standardization.

There is measurable ROI for participants who use the standard in a software lifecycle. They are giving up semantic diversity in exchange for other tangible benefits.

An effort to watch as a possible basis for integrating older software lifecycle tools.

Linked Data: Esperanto for APIs?

Sunday, July 29th, 2012

Michael Vizard writes in: Linked Data to Take Programmable Web to a Higher Level:

The whole concept of a programmable Web may just be too important to rely solely on APIs. That’s the thinking behind a Linked Data Working Group initiative led by the W3C that expects to create a standard for embedding URLs directly within application code to more naturally integrate applications. Backed by vendors such as IBM and EMC, the core idea is to create a more reliable method for integrating applications that more easily scales by not creating unnecessary dependencies on APIs and middleware.

At the moment most of the hopes for a truly programmable Web are tied to an API model that is inherently flawed. That doesn’t necessarily mean that Linked Data approaches will eliminate the need for APIs. But in terms of making the Web a programmable resource, Linked Data represents a significant advance in terms of both simplifying the process of actually integrating data while simultaneously reducing dependencies on cumbersome middleware technologies that are expensive to deploy and manage.

Conceptually, linked data is an obvious idea. But getting everybody to agree on an actual standard is another matter. At the very least, however, a generally accepted approach to linking data within applications that make the whole programmable Web concept more accessible to developers of almost every skill level should not be all that far off from here. (emphasis added)

I am often critical of Linked Data efforts so let’s be clear:

Linked Data, as a semantic identification method, has strengths and weaknesses, just like any other semantic identification method. If it works for your particular application, great!

One of my objections to Linked Data is its near religious promotion as a remedy for semantic diversity. I don’t think a remedy for semantic diversity is possible, nor is it desirable.

The semantic diversity in IT is like the genetic diversity in the plant and animal kingdoms. It is responsible for robustness and innovation.

Not the fault of Linked Data but it is often paired with explanations for the failure of the Semantic Web to thrive.

The first Scientific American “puff piece” on the Semantic Web was more than a decade ago now. We suddenly learn that it wasn’t a failure of user interest, adoption, etc., that defeated the Semantic Web, but a flawed web API model. Cure that and semantic nirvana is just around the corner.

The Semantic Web has failed to thrive because the forces of semantic diversity are more powerful than any effort at semantic sameness.

The history of natural languages and near daily appearance of new programming languages, to say nothing of the changing semantics of both, are evidence for “forces of semantic diversity.”

To paraphrase Johnny Cash, “do we kick against the pricks (semantic diversity)” or build systems that take it into account?

Neo4jPHP Available as a Composer Package

Sunday, July 29th, 2012

Neo4jPHP Available as a Composer Package

Announcement of and brief instructions on Neo4jPHP as a Composer package.

Pass it along to your friends, but to libraries in particular. PHP and Neo4j make a particularly good combination for complex but not “web scale” information environments.