Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

November 27, 2012

SunPy [Choosing Specific Subject Identity Issues]

Filed under: Astroinformatics,Subject Identity,Topic Maps — Patrick Durusau @ 10:57 am

SunPy: A Community Python Library for Solar Physics

From the homepage:

The SunPy project is an effort to create an open-source software library for solar physics using the Python programming language.

As you have seen in your own experience, or read about in my other postings on astronomical data, subject identity issues abound here as elsewhere.

This is another area that may spark someone’s interest in using topic maps to mitigate specific subject identity issues.

“Specific subject identity issues” because the act of mitigation always creates more subjects, which could themselves become sources of subject identity issues. That’s not a problem so long as you choose the issues most important to you.

If and when those other potential subject identity issues become relevant, they can be addressed later. The logic-based approach pretends such issues don’t exist at all. I prefer the incremental approach; it’s less fragile.
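To make that concrete, here is a minimal Python sketch of choosing one identity rule and merging on it while deliberately ignoring the rest. The catalog names, fields and values are invented for illustration; nothing here is SunPy API.

# Hypothetical sketch: reconciling identifiers for the "same" solar event
# reported by two different catalogs. The point is choosing WHICH identity
# issues to mitigate (here: event identity), not solving all of them at once.

catalog_a = [
    {"event_id": "HEK:FL_20121126_001", "class": "M1.2", "peak": "2012-11-26T12:04"},
]
catalog_b = [
    {"flare": "GOES-2012-11-26-1204", "goes_class": "M1.2", "peak_time": "2012-11-26T12:04"},
]

def same_event(a, b):
    """The identity rule we chose to care about: same GOES class, same peak minute."""
    return a["class"] == b["goes_class"] and a["peak"] == b["peak_time"]

# Merge records that our chosen rule says are the same subject.
merged = []
for a in catalog_a:
    for b in catalog_b:
        if same_event(a, b):
            merged.append({"identifiers": [a["event_id"], b["flare"]], **a})

print(merged)

The same_event function is where the choice lives: which subject identity issue we decided to mitigate, and which ones we left for later.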

Neat jQuery Plugins…Customize the Search Box [Assisted TM Authoring?]

Filed under: JQuery,Searching — Patrick Durusau @ 10:41 am

Neat jQuery Plugins That Will Help You Customize The Search Box

From the post:

A search box is an important element of a site and for some websites it is the most used one. It is often neglected even though it is really useful for a visitor, but when it isn’t neglected, it is customized according to the site’s needs with visual elements and really good coding.

They can be simple with just design customizations, but with the help of jQuery plugins they can be quite complex and offer a lot of extra functionality that makes searching easier. Delivering to your visitor what he is searching is really important and you should make this function easier for them.

Useful for customizing a search box, but shouldn’t autocomplete be useful for form-based authoring as well?

Say you know a “name” but don’t remember the subject identifier for that subject. You type in the name and one or more subject identifiers are presented as a pick list for the subject you may have in mind. Choosing a subject identifier displays what is known about that subject in the map, allowing you to self-proof your choice of subject identifier.

I am sure there are other uses for autocomplete in topic map authoring but that is the one that comes to mind.

Others?
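Here is a minimal Python sketch of the pick-list lookup described above. The tiny in-memory “map” and the identifiers are hypothetical; in practice something like this would sit behind a jQuery autocomplete widget as a small JSON endpoint.

# Assumed subject identifiers and topic summaries, purely for illustration.
TOPIC_MAP = {
    "http://example.org/id/john-smith-composer": {
        "names": ["John Smith"], "note": "Composer, b. 1950"},
    "http://example.org/id/john-smith-explorer": {
        "names": ["John Smith", "Captain John Smith"], "note": "Explorer, 1580-1631"},
}

def suggest(typed_name):
    """Return (identifier, summary) pairs whose names match the typed string."""
    typed = typed_name.lower()
    return [(sid, topic["note"])
            for sid, topic in TOPIC_MAP.items()
            if any(typed in name.lower() for name in topic["names"])]

print(suggest("john sm"))
# Both "John Smith" identifiers come back, each with a summary for self-proofing.

The summary returned alongside each identifier is what lets the author self-proof the choice before committing to it.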

Sky Survey Data Lacks Standardization [Heterogeneous Big Data]

Filed under: Astroinformatics,BigData,Heterogeneous Data,Information Retrieval — Patrick Durusau @ 5:51 am

Sky Survey Data Lacks Standardization by Ian Armas Foster.

From the post:

The Sloan Digital Sky Survey is at the forefront of astronomical research, compiling data from observatories around the world in an effort to truly pinpoint where we lie on the universal map. In order to do that, they must aggregate data from several observatories across the world, an intensive data operation.

According to a report written by researchers at UCLA, even though the SDSS is a data intensive astronomical mapping survey, it has yet to lay down a standardized foundation for retrieving and storing scientific data.

Per sdss.org, the first two projects were responsible for observing “a quarter of the sky” and picking out nearly a million galaxies and over 100,000 quasars. The project started at the Apache Point observatory in New Mexico and has since grown to include 25 observatories across the globe. The SDSS gained recognition in 2009 with the Nobel Prize in physics awarded to the advancement of optical fibers and digital imaging detectors (or CCDs) that allowed the project to grow in scale.

The point is that the datasets that the scientists used seemed to be scattered. Some would come about through informal social contacts such as email while others would simply search for necessary datasets on Google. Further, once these datasets were found, there was even an inconsistency in how they were stored before they could be used. However, this may have had to do with the varying sizes of the sets and how quickly the researchers wished to use the data. The entire SDSS dataset consists of over 130 TB, according to the report, and that volume can be slightly unwieldy.

“Large sky surveys, including the SDSS, have significantly shaped research practices in the field of astronomy,” the report concluded. “However, these large data sources have not served to homogenize information retrieval in the field. There is no single, standardized method for discovering, locating, retrieving, and storing astronomy data.”

So, big data isn’t going to be homogeneous big data but heterogeneous big data.

That sounds like an opportunity for topic maps to me.

You?

Adobe CQ5 – OpenCalais Integration [Drupal too!]

Filed under: Content Management System (CMS),Drupal,OpenCalais — Patrick Durusau @ 5:35 am

Adobe CQ5 – OpenCalais Integration by Mateusz Kula.

From the post:

In the massive amount of information available on the Internet it is getting more and more difficult to find relevant and valuable content and categorize it in one way or another. No doubt tagging this overwhelming amount of data is becoming more and more crucial from the SEO and digital marketing point of view as it plays an important role in site positioning and allows end users a keyword search. Problems appear when editors are not scrupulous enough to add tags for new pages, press releases, blogs and tweets and to update them when content significantly changes. The worst case scenario is when there is a CMS filled with a whole bunch of untagged content. Then it may take too much time and resources to catch up with tagging. OpenCalais turns out to be a great solution to such problems and what is more it allows for auto-tagging and can be easily integrated with other services.

An interesting take on integrating OpenCalais with Adobe’s enterprise content management system, CQ5.

Suspect there are topic map authoring lessons here as well.

Rather than seeing topic map editing as always a separate activity, integrating it into content management workflow, automated to the degree possible, could be a move in the right direction.

BTW, there is an OpenCalais module for Drupal, in case you are interested.
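A rough Python sketch of the “tag on save” hook suggested above. The page structure and function names are assumptions, not any particular CMS, and extract_tags() merely stands in for a call to OpenCalais or any other tagging service.

def extract_tags(text):
    # Placeholder for a real tagging-service call (e.g. OpenCalais).
    # Here: crude keyword matching so the sketch runs on its own.
    vocabulary = {"solar", "python", "topic maps", "rdf"}
    return {term for term in vocabulary if term in text.lower()}

def on_save(page):
    """Merge automatically extracted tags with whatever the editor supplied."""
    auto = extract_tags(page["body"])
    page["tags"] = sorted(set(page.get("tags", [])) | auto)
    return page

page = {"title": "SunPy release",
        "body": "A Python library for solar physics.",
        "tags": ["astronomy"]}
print(on_save(page)["tags"])   # ['astronomy', 'python', 'solar']

The editor still gets the last word; the hook just makes sure untagged content never piles up in the first place.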

Linking your resources to the Data Web

Filed under: AGROVOC,Linked Data,RDF — Patrick Durusau @ 4:56 am

First LOD@AIMS Webinar with Tom Baker on “Linking your resources to the Data Web”

4th December 2012 – 16:00 Rome Time

From the post:

The AIMS Metadata Community of Practice is glad to announce the first Linked Open Data @ AIMS webinar entitled Linking your resources to the Data Web. The session will take place on 4th December 2012 – 16:00 Rome Time – and will be presented by Tom Baker, chief information officer (CIO) of the Dublin Core Metadata Initiative (DCMI).

This event is part of the series of webinars Linked Open Data @ AIMS that will take place from December 2012 to February 2013. A total of 6 specialists will talk about Linked Open Data and the Semantic Web to the agricultural information management community. The webinars will be in the 6 languages used on AIMS – English, French, Spanish, Arabic, Chinese and Russian.

The objective of Linked Open Data @ AIMS webinars is to help individuals and organizations to understand better the initiatives related to the Semantic Web that are currently taking place within the AIMS Communities of Practice.


Linking data into the Semantic Web means more than just making data available on a Web server. It means using Web addresses (URIs) in data as names for things; tagging resources using those URIs – for example, URIs for agricultural topics from AGROVOC; and using URIs to point to related resources.

This talk walks through a simple example to show how linking works in practice, illustrating RDF technology with animated graphics. It concludes with a recipe for linking your data: Decide what bits of your data are most important, such as Subject, Author, and Publisher. Use URIs in your data, whenever possible, such as Subject terms from AGROVOC. Then publish your data in RDF on the Web where others can link to it. Simple solutions can be enough to yield good results.

Tom Baker of the Dublin Core Metadata Initiative will be an excellent speaker but when I saw:

Tom Baker on “Linking your resources to the Data Web”

my first thoughts were of another Tom Baker and wondering how he had gotten involved with Linked Data. 😉

In the body of the announcement, a URL identifies the “Tom Baker” in the text as a different “Tom Baker” from the one I was thinking about.

Interesting. It didn’t take Linked Data or RDF to make the distinction, only the <a> element plus an href attribute. Something to think about.
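For the curious, here is a minimal sketch (using Python’s rdflib) of the “recipe” from the announcement: pick the important bits (subject, author, publisher), use URIs for them where you can, then publish as RDF. The URIs below, including the AGROVOC concept, are placeholders chosen for illustration.

from rdflib import Graph, URIRef, Literal, Namespace

DCT = Namespace("http://purl.org/dc/terms/")

g = Graph()
doc = URIRef("http://example.org/reports/soil-survey-2012")

# Assumed AGROVOC concept URI, illustrative only.
g.add((doc, DCT.subject, URIRef("http://aims.fao.org/aos/agrovoc/c_7156")))
g.add((doc, DCT.creator, URIRef("http://example.org/people/jane-doe")))
g.add((doc, DCT.publisher, URIRef("http://example.org/org/agri-institute")))
g.add((doc, DCT.title, Literal("Soil Survey 2012")))

print(g.serialize(format="turtle"))

Four triples, three of them pointing at URIs rather than strings: that is the whole recipe.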

November 26, 2012

More on #sandy social interactions [100K Tweets/GraphInsight]

Filed under: Graphs,Networks,Visualization — Patrick Durusau @ 7:32 pm

More on #sandy social interactions

From the post:

We collected #sandy tweets for a few hours on Tuesday. Dots are Twitter users and connections are retweets. This network connects more than 100,000 users. There are many small disconnected components. The main cluster contains interesting patterns to explore..

Another highly visual post. You need to see the images to get a sense of the exploration of the data.
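For a sense of how such a retweet network gets assembled, here is a hedged Python/networkx sketch: users are nodes and a retweet adds a directed edge from retweeter to original author. The tweet records are invented; with 100,000+ users you would read them from the Twitter API or a dump instead.

import re
import networkx as nx

tweets = [
    {"user": "alice", "text": "RT @bob: #sandy storm surge photos"},
    {"user": "carol", "text": "RT @bob: #sandy storm surge photos"},
    {"user": "dave",  "text": "stay safe everyone #sandy"},
]

G = nx.DiGraph()
for t in tweets:
    m = re.match(r"RT @(\w+):", t["text"])
    if m:
        G.add_edge(t["user"], m.group(1))   # retweeter -> original author
    else:
        G.add_node(t["user"])               # a tweet with no retweet still adds a dot

print(G.number_of_nodes(), G.number_of_edges())     # 4 nodes, 2 edges
print(list(nx.weakly_connected_components(G)))      # main cluster plus a disconnected user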

BigData using Erlang, C and Lisp to Fight the Tsunami of Mobile Data

Filed under: BigData,Erlang,Lisp — Patrick Durusau @ 7:23 pm

BigData using Erlang, C and Lisp to Fight the Tsunami of Mobile Data by Jon Vlachogiannis.

From the post:

BugSense is an error-reporting and quality metrics service that tracks thousands of apps every day. When mobile apps crash, BugSense helps developers pinpoint and fix the problem. The startup delivers first-class service to its customers, which include VMWare, Samsung, Skype and thousands of independent app developers. Tracking more than 200M devices requires fast, fault tolerant and cheap infrastructure.

The last six months, we’ve decided to use our BigData infrastructure, to provide the users with metrics about their apps performance and stability and let them know how the errors affect their user base and revenues.

We knew that our solution should be scalable from day one, because more than 4% of the smartphones out there will start DDOSing us with data.

A number of lessons to consider if you want a system that scales.

Danger! Danger! Will Robinson! Snow Globe!

Filed under: Security,Topic Maps — Patrick Durusau @ 6:46 pm

I can’t decide which of these are the most annoying:

  • Inane carry-on rules for air travel that ban snow globes larger than a tennis ball.
  • TSA staff creating inane rules to give the appearance of activity.
  • That a group as irrelevant to law enforcement or the prevention of terrorism as the TSA exists at all.

During the summer, when I guess no one was looking, the TSA issued rules on snow globes in carry-on luggage. See TSA notes some new carry-on rules for the holidays by Mark Rockwell.

Fondling children is bad enough but now children will be forced to abandon their snow globes at security as well. That will be an experience to remember, being molested and robbed at the same airport.

I keep trying to think of ways that topic maps could be useful in public policy debates (and non-debates like airport security).

Here are my suggestions (please add yours):

  • Airports, dates, names, and news coverage of TSA excesses.
  • Known TSA policies and procedures, like the “behaviour watchers” who have yet to identify a single terrorist. Their watching also keeps pink elephants off planes so it may not be a complete waste.
  • Possible attack vectors on airports and/or aircraft. The focus of the TSA on explosives is almost quaint. That is when they aren’t worried about death rays and other imaginary devices.
  • Identifying TSA employees. It really isn’t the “TSA” that is fondling your child, it’s Mr./Mrs./Ms. *** of *** airport. Let’s put some accountability in place and not let them hide behind the TSA. For that matter, take photos of them as they arrive and leave from the airport. Public areas only.

I am sure I am missing a number of other opportunities to convince policy makers to free us from the charade called the TSA.

What do you suggest?

Shark (Hive on Spark)

Filed under: Shark,Spark — Patrick Durusau @ 4:57 pm

Shark (Hive on Spark)

From the webpage:

Shark is a large-scale data warehouse system for Spark designed to be compatible with Apache Hive. It can answer Hive QL queries up to 70 times faster than Hive without modification to the existing data or queries. Shark supports Hive’s query language, metastore, serialization formats, and user-defined functions.

We released Shark 0.2 on Oct 15, 2012. The new version is much more stable and also features significant performance improvements.

Getting Started

See our documentation on Github to get started. It takes around 5 mins to set up Shark on a single node for a quick spin, and about 20 mins on an Amazon EC2 cluster.

Fast Execution Engine

Shark is built on top of Spark, a data-parallel execution engine that is fast and fault-tolerant. Even if data are on disk, Shark can be noticeably faster than Hive because of the fast execution engine. It avoids the high task launching overhead of Hadoop MapReduce and does not require materializing intermediate data between stages on disk. Thanks to this fast engine, Shark can answer queries in sub-second latency.

They say that imitation is the sincerest form of flattery.

In software, do claims of compatibility with your software mean the same thing?

It isn’t possible to know which database solutions will be around in five years but the rapid emergence of alternative solutions certainly is exciting!

UILLD 2013 — User interaction built on library linked data

Filed under: Interface Research/Design,Library,Linked Data,Usability,Users — Patrick Durusau @ 4:48 pm

UILLD 2013: Workshop on User interaction built on library linked data (UILLD), a pre-conference to the 79th World Library and Information Conference, Jurong Regional Library, Singapore.

Important Dates:

Paper submission deadline: February 28, 2013
Acceptance notification: May 15, 2013
Camera-ready versions of accepted papers: June 30, 2013
Workshop date: August 16, 2013

From the webpage:

The quantity of Linked Data published by libraries is increasing dramatically: Following the lead of the National Library of Sweden (2008), several libraries and library networks have begun to publish authority files and bibliographic information as linked (open) data. However, applications that consume this data are not yet widespread. Particularly, there is a lack of methods for integration of Linked Data from multiple sources and its presentation in appropriate end user interfaces. Existing services tend to build on one or two well integrated datasets – often from the same data supplier – and do not actively use the links provided to other datasets within or outside of the library or cultural heritage sector to provide a better user experience.

CALL FOR PAPERS

The main objective of this workshop/pre-conference is to provide a platform for discussion of deployed services, concepts, and approaches for consuming Linked Data from libraries and other cultural heritage institutions. Special attention will be given to papers presenting working end user interfaces using Linked Data from both cultural heritage institutions (including libraries) and other datasets.

For further information about the workshop, please contact the workshops chairs at uilld2013@gmail.com

In connection with this workshop, see also: IFLA World Library and Information Congress 79th IFLA General Conference and Assembly.

I first saw this in a tweet by Ivan Herman.

Bibliographic Framework as a Web of Data:…

Filed under: BIBFRAME,Library,Linked Data — Patrick Durusau @ 9:53 am

Bibliographic Framework as a Web of Data: Linked Data Model and Supporting Services (PDF)

From the introduction:

The new, proposed model is simply called BIBFRAME, short for Bibliographic Framework. The new model is more than a mere replacement for the library community’s current model/format, MARC. It is the foundation for the future of bibliographic description that happens on, in, and as part of the web and the networked world we live in. It is designed to integrate with and engage in the wider information community while also serving the very specific needs of its maintenance community – libraries and similar memory organizations. It will realize these objectives in several ways:

  1. Differentiate clearly between conceptual content and its physical manifestation(s) (e.g., works and instances)
  2. Focus on unambiguously identifying information entities (e.g., authorities)
  3. Leverage and expose relationships between and among entities

In a web-scale world, it is imperative to be able to cite library data in a way that not only differentiates the conceptual work (a title and author) from the physical details about that work’s manifestation (page numbers, whether it has illustrations) but also clearly identifies entities involved in the creation of a resource (authors, publishers) and the concepts (subjects) associated with a resource. Standard library description practices, at least until now, have focused on creating catalog records that are independently understandable, by aggregating information about the conceptual work and its physical carrier and by relying heavily on the use of lexical strings for identifiers, such as the name of an author. The proposed BIBFRAME model encourages the creation of clearly identified entities and the use of machine-friendly identifiers which lend themselves to machine interpretation for those entities.

An important draft from the Library of Congress on the BIBFRAME proposal.

Please review and comment. (Plus forward to your library friends.)

I first saw this in a tweet by Ivan Herman.
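To see what the work/instance split looks like in data, here is a toy rdflib sketch. BIBFRAME was only a proposal at this point, so the namespace and property names below are placeholders for illustration, not the final vocabulary.

from rdflib import Graph, URIRef, Literal, Namespace

BF = Namespace("http://example.org/bibframe-draft/")   # placeholder namespace

g = Graph()
work = URIRef("http://example.org/works/moby-dick")
instance = URIRef("http://example.org/instances/moby-dick-1851-harper")

# Conceptual content on the work; physical details on the instance.
g.add((work, BF.title, Literal("Moby-Dick")))
g.add((work, BF.creator, URIRef("http://example.org/people/herman-melville")))
g.add((instance, BF.instanceOf, work))
g.add((instance, BF.extent, Literal("635 p.")))

print(g.serialize(format="turtle"))

Note that the author is a URI, not a lexical string: exactly the shift away from “heavy reliance on lexical strings for identifiers” that the draft describes.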

Collaborative biocuration… [Pre-Topic Map Tasks]

Filed under: Authoring Topic Maps,Bioinformatics,Biomedical,Curation,Genomics,Searching — Patrick Durusau @ 9:22 am

Collaborative biocuration—text-mining development task for document prioritization for curation by Thomas C. Wiegers, Allan Peter Davis and Carolyn J. Mattingly. (Database (2012) 2012 : bas037 doi: 10.1093/database/bas037)

Abstract:

The Critical Assessment of Information Extraction systems in Biology (BioCreAtIvE) challenge evaluation is a community-wide effort for evaluating text mining and information extraction systems for the biological domain. The ‘BioCreative Workshop 2012’ subcommittee identified three areas, or tracks, that comprised independent, but complementary aspects of data curation in which they sought community input: literature triage (Track I); curation workflow (Track II) and text mining/natural language processing (NLP) systems (Track III). Track I participants were invited to develop tools or systems that would effectively triage and prioritize articles for curation and present results in a prototype web interface. Training and test datasets were derived from the Comparative Toxicogenomics Database (CTD; http://ctdbase.org) and consisted of manuscripts from which chemical–gene–disease data were manually curated. A total of seven groups participated in Track I. For the triage component, the effectiveness of participant systems was measured by aggregate gene, disease and chemical ‘named-entity recognition’ (NER) across articles; the effectiveness of ‘information retrieval’ (IR) was also measured based on ‘mean average precision’ (MAP). Top recall scores for gene, disease and chemical NER were 49, 65 and 82%, respectively; the top MAP score was 80%. Each participating group also developed a prototype web interface; these interfaces were evaluated based on functionality and ease-of-use by CTD’s biocuration project manager. In this article, we present a detailed description of the challenge and a summary of the results.

The results:

“Top recall scores for gene, disease and chemical NER were 49, 65 and 82%, respectively; the top MAP score was 80%.”

indicate there is plenty of room for improvement. Perhaps even commercially viable improvement.
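For readers who have not met the metric, a minimal Python sketch of “mean average precision” (MAP) over ranked result lists. The relevance flags below are made up; 1 marks a document the curators actually wanted.

def average_precision(relevance):
    """relevance: list of 0/1 flags in ranked order."""
    hits, score = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / rank       # precision at each relevant hit
    return score / hits if hits else 0.0

queries = [
    [1, 0, 1, 0, 0],   # ranked triage results for one query
    [0, 1, 1, 1, 0],
]
mean_ap = sum(average_precision(q) for q in queries) / len(queries)
print(round(mean_ap, 3))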

In hindsight, not talking about how to make a topic map alongside ISO 13250 may have been a mistake. Even admitting there are multiple ways to get there, a technical report outlining one or two of them would have made the process more transparent.

Answering the question “What can you say with a topic map?” with “Anything you want.” was a truthful answer, but not a helpful one.

I should try to crib something from one of those “how to write a research paper” guides. I haven’t looked at one in years but the process is remarkably similar to the one that produces a topic map.

Some of the mechanics are different but the underlying intellectual process is quite similar. Everyone who has been to college (at least everyone my age) had a course that talked about writing research papers. So it should be familiar terminology.

Thoughts/suggestions?

Analyzing the Twitter Conversation and Interest Graphs

Filed under: BigData,Graphs,Tweets,Visualization — Patrick Durusau @ 5:42 am

Analyzing the Twitter Conversation and Interest Graphs by Marti Hearst.

From the post:

For assignment 3, students analyzed and compared a portion of the Twitter “conversation graph” and the “interest graph”. Conversations were found by looking for Twitter “@mentions” and interest graph by looking at the friend/follow graphs for a user (finding friends of friends, taking a k-core analysis, and closing the triangles). The attached document highlights many of the students’ work.

One of the most impressive graphs was made by Achal Soni. He used Java and the Twitter4J library to obtain 3000 tweets for 4 rappers (Drake, Kendrick Lamar, J Cole, and Big Sean). He extracted @mentions from these tweets, and created a graph recording edges between the celebrities and those they were conversing with.

A clever choice of colors makes this network representation work very well.
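The k-core step mentioned in the assignment is easy to see in a small Python/networkx sketch. The edge list is invented; k_core keeps only users with at least k connections inside the retained subgraph.

import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("drake", "fan1"), ("drake", "fan2"), ("fan1", "fan2"),   # a closed triangle
    ("drake", "lurker"),                                      # degree-1 hanger-on
])

core2 = nx.k_core(G, k=2)        # drops "lurker", keeps the triangle
print(sorted(core2.nodes()))     # ['drake', 'fan1', 'fan2']
print(nx.triangles(core2))       # each remaining node sits in one closed triangle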

Climate Data Guide:…

Filed under: Climate Data,Data — Patrick Durusau @ 5:35 am

Climate Data Guide: Climate data strengths, limitations and applications

From the homepage:

Like an insider’s guidebook to an unexplored country, the Climate Data Guide provides the key insights needed to select the data that best align with your goals, including critiques of data sets by experts from the research community. We invite you to learn from their insights and share your own.

There are one hundred and eleven data sets on this site as of today. Some are satellite-based; others come from other sources.

Another resource that you may want to map together with other resources.

Produced by the National Center for Atmospheric Research.

Public FLUXNET Dataset Information

Filed under: Climate Data,Data — Patrick Durusau @ 5:33 am

Public FLUXNET Dataset Information

From the webpage:

Flux and meteorological data, collected world‐wide, are submitted to this central database (www.fluxdata.org). These data are: a) checked for quality; b) gaps are filled; c) value-added products, like ecosystem photosynthesis and respiration, are produced; and d) daily and annual sums, or averages, are computed [Agarwal et al., 2010]. The resulting datasets are available through this site for data synthesis. This page provides information about the FLUXNET synthesis datasets, the sites that contributed data, how to use the datasets, and the synthesis efforts using the datasets.

I encountered this while searching for more information on biological flux data and thought I should pass it along.

If you are interested in climate data, definitely a stop you want to make!

November 25, 2012

FluxMap: visual exploration of flux distributions in biological networks [Import/Graphs]

Filed under: Bioinformatics,Biomedical,Graphs,Networks,Visualization — Patrick Durusau @ 2:29 pm

FluxMap: visual exploration of flux distributions in biological networks.

From the webpage:

FluxMap is an easy to use tool for the advanced visualisation of simulated or measured flux data in biological networks. Flux data import is achieved via a structured template based on intuitive reaction equations. Flux data is mapped onto any network and visualised using edge thickness. Various visualisation options and interaction possibilities enable comparison and visual analysis of complex experimental setups in an interactive way.

Manuals and tutorials here.

Another application that makes it easy to create graphs from data. This one imports spreadsheet-based data.

Wonder why some highly touted commercial graph databases don’t offer the same ease of use?

HIVE: Handy Integration and Visualisation of multimodal Experimental Data

Filed under: Bioinformatics,Biomedical,Graphs,Mapping,Merging,Visualization — Patrick Durusau @ 2:05 pm

HIVE: Handy Integration and Visualisation of multimodal Experimental Data

From the webpage:

HIVE is an Add-on for the VANTED system. VANTED is a graph editor extended for the visualisation and analysis of biological experimental data in context of pathways/networks. HIVE stands for

Handy Integration and Visualisation of multimodal Experimental Data

and extends the functionality of VANTED by adding the handling of volumes and images, together with a workspace approach, allowing one to integrate data of different biological data domains.

You need to see the demo video to appreciate this application!

It offers import of data, mapping rules to merge data from different data sets, easy visualization as a graph and other features.

Did I mention it also has 3-D image techniques as well?

PS: Yes, it is another example of “Who moved my acronym?”

A first failed attempt at Natural Language Processing

Filed under: Natural Language Processing,Requirements — Patrick Durusau @ 1:40 pm

A first failed attempt at Natural Language Processing by Mark Needham

From the post:

One of the things I find fascinating about dating websites is that the profiles of people are almost identical so I thought it would be an interesting exercise to grab some of the free text that people write about themselves and prove the similarity.

I’d been talking to Matt Biddulph about some Natural Language Processing (NLP) stuff he’d been working on and he wrote up a bunch of libraries, articles and books that he’d found useful.

I started out by plugging the text into one of the many NLP libraries that Matt listed with the vague idea that it would come back with something useful.

I’m not sure exactly what I was expecting the result to be but after 5/6 hours of playing around with different libraries I’d got nowhere and parked the problem not really knowing where I’d gone wrong.

Last week I came across a paper titled “That’s What She Said: Double Entendre Identification” whose authors wanted to work out when a sentence could legitimately be followed by the phrase “that’s what she said”.

While the subject matter is a bit risque I found that reading about the way the authors went about solving their problem was very interesting and it allowed me to see some mistakes I’d made.

Vague problem statement

Unfortunately I didn’t do a good job of working out exactly what problem I wanted to solve – my problem statement was too general.

Question: How do you teach people how to create useful problem statements?

Pointers, suggestions?
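For what it’s worth, a minimal sketch of the similarity check Mark set out to do, using scikit-learn: vectorize the free-text profiles with TF-IDF and compare them with cosine similarity. The profile snippets are invented.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

profiles = [
    "I love travel, good food and long walks on the beach.",
    "Love to travel, enjoy good food and walks on the beach.",
    "Compiler engineer, collects vintage synthesizers.",
]

tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(profiles)

# Pairwise similarity: the first two profiles score high, the third does not.
print(cosine_similarity(matrix).round(2))

Note that this only works once the problem statement is narrowed to “measure pairwise textual similarity of profiles,” which is exactly the lesson of the post.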

Fast rule-based bioactivity prediction using associative classification mining

Filed under: Associations,Associative Classification Mining,Classification,Data Mining — Patrick Durusau @ 1:24 pm

Fast rule-based bioactivity prediction using associative classification mining by Pulan Yu and David J Wild. (Journal of Cheminformatics 2012, 4:29 )

Who moved my acronym? continues: ACM = Association for Computing Machinery or associative classification mining.

Abstract:

Relating chemical features to bioactivities is critical in molecular design and is used extensively in lead discovery and optimization process. A variety of techniques from statistics, data mining and machine learning have been applied to this process. In this study, we utilize a collection of methods, called associative classification mining (ACM), which are popular in the data mining community, but so far have not been applied widely in cheminformatics. More specifically, the classification based on predictive association rules (CPAR), classification based on multiple association rules (CMAR) and classification based on association rules (CBA) are employed on three datasets using various descriptor sets. Experimental evaluations on anti-tuberculosis (antiTB), mutagenicity and hERG (the human Ether-a-go-go-Related Gene) blocker datasets show that these three methods are computationally scalable and appropriate for high speed mining. Additionally, they provide comparable accuracy and efficiency to the commonly used Bayesian and support vector machines (SVM) method, and produce highly interpretable models.

An interesting lead on investigating associations in large data sets. Pass those meeting a threshold on for further evaluation?
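A toy Python sketch of that thresholding idea: keep only association rules whose confidence clears a cutoff, then classify with the survivors. The rules, features and threshold are invented, not taken from the paper.

rules = [
    ({"ring_count>3", "logP>5"}, "hERG blocker", 0.84),
    ({"nitro_group"},            "mutagenic",    0.61),
    ({"logP>5"},                 "hERG blocker", 0.42),
]

THRESHOLD = 0.6
kept = [r for r in rules if r[2] >= THRESHOLD]   # pass these on for evaluation

def classify(features):
    """Apply the highest-confidence surviving rule whose antecedent matches."""
    matches = [(label, conf) for antecedent, label, conf in kept
               if antecedent <= features]
    return max(matches, key=lambda m: m[1]) if matches else ("unclassified", 0.0)

print(classify({"ring_count>3", "logP>5", "aromatic"}))   # ('hERG blocker', 0.84)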

Code Maven and programming for teens [TMs for pre-teens/teens?]

Filed under: Games,Interface Research/Design,Programming,Teaching — Patrick Durusau @ 1:01 pm

Code Maven and programming for teens by Greg Linden.

From the post:

I recently launched Code Maven from Crunchzilla. It helps teens learn a little about what they can do if they learn more about programming.

A lot of teens are curious about programming these days, but don’t end up doing any. And, it’s true, if you are a teen who wants to learn programming, you either have to use tutorials, books, and classes made for adults (which have a heavy focus on syntax and are slow to let you do anything) or high level tools that let you build games but teach a specialized programming language you can’t use anywhere else. Maybe something else might be useful to help more teens get started and get interested.

Code Maven lets teens learn a little about how to program, starting with basic concepts such as loops then rapidly getting into fractals, animation, physics, and games. In every lesson, all the code is there — in some cases, a complete physics engine with gravity, frame rate, friction, and other code you can modify — and it is all live Javascript, so the impact of any change is immediate. It’s a fun way to explore what programming can do.

Code Maven is a curious blend of a game and a tutorial. Like a tutorial, it’s step-by-step, and there’s not-too-big, not-too-small challenges at each step. Like a game, it’s fun, addictive, and experimentation can yield exciting (and often very cool) results. I hope you and your friends like it. Please try Code Maven, tell your friends about it, and, if you have suggestions or feedback, please e-mail me at maven@crunchzilla.com

Greg is also responsible for Code Monster, appropriate for introducing programming to kids 9-14. Code Maven targets teens 13-18, plus adults.

Curious if you know of other projects of this type?

Suspect it is effective in part because of the immediate feedback. Not to mention effective authoring/creation of the interface!

Something you should share with others.

Reminds me of the reason OS vendors almost give away academic software. If a student knows “your” system and not another, which one has the easier learning curve when they leave school?

What does that suggest to you about promoting a semantic technology like topic maps?

Infinite Jukebox plays your favorite songs forever

Filed under: Interface Research/Design,Music,Navigation,Similarity — Patrick Durusau @ 11:51 am

Infinite Jukebox plays your favorite songs forever by Nathan Yau.

From the post:

You know those songs that you love so much that you cry because they’re over? Well, cry no more with the Infinite Jukebox by Paul Lamere. Inspired by Infinite Gangnam Style, the Infinite Jukebox lets you upload a song, and it’ll figure out how to cut the beats and piece them back together for a version of that song that goes forever.

Requires advanced web audio, so you need to fire up a late version of Chrome or Safari. (I am on Ubuntu so I can’t tell you about IE. In a VM?)

I tried it with Metallica’s Unforgiven.

Very impressive, although that assessment will vary based on your taste in music.

It would make an interesting interface for exploring textual features: calculate features, then navigate automatically with some pseudo-randomness, so you encounter data or text you would not otherwise have seen.
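A toy Python sketch of that idea: score pairwise similarity between text chunks, then hop to the next chunk with probability weighted by similarity, so the path is guided but never fully determined. The chunks and the similarity measure are placeholders.

import random

chunks = {
    "intro":   "topic maps merge subjects from different vocabularies",
    "astro":   "sky survey data lacks standard vocabularies for subjects",
    "cooking": "slow roasting brings out flavor in root vegetables",
}

def similarity(a, b):
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)        # Jaccard word overlap

def next_chunk(current):
    others = [k for k in chunks if k != current]
    # Small floor keeps genuinely surprising hops possible.
    weights = [similarity(chunks[current], chunks[k]) + 0.05 for k in others]
    return random.choices(others, weights=weights, k=1)[0]

print(next_chunk("intro"))   # usually "astro", occasionally "cooking"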

Many would argue we navigate with intention and rational purpose, but to be honest, that’s comfort analysis. It’s an explanation we use to compliment ourselves. (see Thinking, Fast and Slow) Research suggests decision making is complex and almost entirely non-rational.

Graham’s Guide to Learning Scala

Filed under: Programming,Scala — Patrick Durusau @ 11:21 am

Graham’s Guide to Learning Scala by Graham Lee.

From the post:

It’s a pretty widely-accepted view that, as a programmer, learning new languages is a Good Idea™ . Most people with more than one language under their belt would say that learning new languages broadens your mind in ways that will positively affect the way you work, even if you never use that language again.

With the Christmas holidays coming up and many people likely to take some time off work, this end of the year presents a great opportunity to take some time out from your week-to-week programming grind and do some learning.

With that in mind, I present “Graham’s Guide to Learning Scala”. There are many, many resources on the web for learning about Scala. In fact, I think there’s probably too many! It would be quite easy to start in the wrong place and quickly get discouraged.

So this is not yet another resource to add to the pile. Rather, this is a guided course through what I believe are some of the best resources for learning Scala, and in an order that I think will help a complete newbie pick it up quickly but without feeling overwhelmed.

And, best of all, it has 9 Steps!

As Graham says, the holidays are coming up.

One way to avoid nosey family members, ravenous cousins and in-laws, and almost off-key singing (you would have to know the key to be off-key) is to spend some quality time with your laptop.

Graham offers a good selection of resources to fill a week, either now or at some other down time of the year.

Complexification: Is ElasticSearch Making a Case for a Google Search Solution?

Filed under: ElasticSearch,Interface Research/Design,Search Interface,Searching — Patrick Durusau @ 10:15 am

Complexification: Is ElasticSearch Making a Case for a Google Search Solution? by Stephen Arnold.

From the post:

I don’t have any dealings with Google, the GOOG, or Googzilla (a word I coined in the years before the installation of the predator skeleton on the wizard zone campus). In the briefings I once endured about the GSA (Google speak for the Google Search Appliance), I recall three business principles imparted to me; to wit:

  1. Search is far too complicated. The Google business proposition was and is that the GSA and other Googley things are easy to install, maintain, use, and love.
  2. Information technology people in organizations can often be like a stuck brake on a sports car. The institutionalized approach to enterprise software drags down the performance of the organization information technology is supposed to serve.
  3. The enterprise search vendors are behind the curve.

Now the assertions from the 2004 salad days of Google are only partially correct today. As everyone with a colleague under 25 years of age knows, Google is the go to solution for information. A number of large companies have embraced Google’s all-knowing, paternalistic approach to digital information. However, others—many others, in fact—have not.

I won’t repeat Stephen’s barbs at ElasticSearch but his point applies to search interfaces and approaches in general.

Is your search application driving business towards simpler solutions? (If the simpler solution isn’t yours, isn’t that the wrong direction?)

STAR: ultrafast universal RNA-seq aligner

Filed under: Bioinformatics,Genomics,String Matching — Patrick Durusau @ 9:32 am

STAR: ultrafast universal RNA-seq aligner by Stephen Turner.

From the post:

There’s a new kid on the block for RNA-seq alignment.

Dobin, Alexander, et al. “STAR: ultrafast universal RNA-seq aligner.” Bioinformatics (2012).

Aligning RNA-seq data is challenging because reads can overlap splice junctions. Many other RNA-seq alignment algorithms (e.g. Tophat) are built on top of DNA sequence aligners. STAR (Spliced Transcripts Alignment to a Reference) is a standalone RNA-seq alignment algorithm that uses uncompressed suffix arrays and a mapping algorithm similar to those used in large-scale genome alignment tools to align RNA-seq reads to a genomic reference. STAR is over 50 times faster than any other previously published RNA-seq aligner, and outperforms other aligners in both sensitivity and specificity using both simulated and real (replicated) RNA-seq data.

I had a brief exchange of comments with Lars Marius Garshol on string matching recently. Another example of a string processing approach you may adapt to different circumstances.
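For a pocket-sized illustration of the suffix-array idea behind aligners like STAR: build a sorted array of suffixes of the reference, then binary-search it to locate a read. Real aligners handle splicing, mismatches and vastly larger genomes; this Python sketch only shows the core lookup.

reference = "ACGTACGGACGT"
# Suffix array: start positions of all suffixes, in lexicographic order.
suffix_array = sorted(range(len(reference)), key=lambda i: reference[i:])

def locate(read):
    """Return reference positions where `read` occurs exactly."""
    lo, hi = 0, len(suffix_array)
    while lo < hi:                              # binary search for first suffix >= read
        mid = (lo + hi) // 2
        if reference[suffix_array[mid]:] < read:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    for i in suffix_array[lo:]:                 # matching suffixes are contiguous
        if not reference[i:].startswith(read):
            break
        hits.append(i)
    return sorted(hits)

print(locate("ACG"))   # [0, 4, 8]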

Designing for Consumer Search Behaviour [Descriptive vs. Prescriptive]

Filed under: Interface Research/Design,Search Behavior,Usability,Users — Patrick Durusau @ 9:24 am

Designing for Consumer Search Behaviour by Tony Russell-Rose.

From the post:

A short while ago I posted the slides to my talk at HCIR 2012 on Designing for Consumer Search Behaviour. Finally, as promised, here is the associated paper, which is co-authored with Stephann Makri (and is available as a pdf in the proceedings). This paper takes the ideas and concepts introduced in A Model of Consumer Search Behaviour and explores their practical design implications. As always, comments and feedback welcome :)

ABSTRACT

In order to design better search experiences, we need to understand the complexities of human information-seeking behaviour. In this paper, we propose a model of information behavior based on the needs of users of consumer-oriented websites and search applications. The model consists of a set of search modes users employ to satisfy their information search and discovery goals. We present design suggestions for how each of these modes can be supported in existing interactive systems, focusing in particular on those that have been supported in interesting or novel ways.

Tony uses nine (9) categories to classify consumer search behavior:

1. Locate….

2. Verify….

3. Monitor….

4. Compare….

5. Comprehend….

6. Explore….

7. Analyze….

8. Evaluate….

9. Synthesize….

The details will help you become a better search interface designer, so see Tony’s post for more on each category.

My point is that his nine categories are based on observation of, and research on, consumer behaviour. A descriptive approach to consumer search behaviour. Not a prescriptive approach to consumer search behaviour.

In some ideal world, perhaps consumers would understand why X is a better approach than Y, but attracting users is done in the present world, not an ideal one.

Think of it this way:

Every time an interface requires training of or explanation to a consumer, you have lost a percentage of the potential audience share. Some you may recover but a certain percentage is lost forever.

Ready to go through your latest interface, pencil and paper in hand to add up the training/explanation points?

November 24, 2012

Futures in literature from the past

Filed under: Graphics,Time,Timelines,Visualization — Patrick Durusau @ 7:58 pm

Futures in literature from the past by Nathan Yau.

Another very graphic post that merits your attention, in part because of the visualization and Nathan’s suggestions about it. How would you recast the data?

But in a topic map context, how would you represent past projections about the future, both when that future is the present and against other projected futures?

I ask because the “Dark Ages” weren’t called that at the time. And in fact, they were a fairly lively time of invention and innovation.

The term was coined in the Renaissance to distinguish their “enlightened” civilization from the “dark” times between them and the fall of the Roman Empire.

It is an old trick but none the less effective for being an old one.

Recent political elections offered a number of examples that will be recognized as such in the fullness of time.

GraphChi visual toolkit – or understanding your data

Filed under: D3,GraphChi,Graphs,Visualization — Patrick Durusau @ 7:46 pm

GraphChi visual toolkit – or understanding your data by Danny Bickson.

Danny walks through using GraphChi to visualize the Orange d4d data set (cell phone usage).

Easy instructions and heavy on interesting graphics so you need to read the original post.

Very cool!

Tools for Data-Intensive Astronomy – a VO Community Day [Webcast]

Filed under: Astroinformatics,BigData — Patrick Durusau @ 7:38 pm

Tools for Data-Intensive Astronomy – a VO Community Day in Baltimore, MD

Thursday, November 29, 2012
10AM-2PM
Location: Bahcall Auditorium, Space Telescope Science Institute

From the post:

Experts from the VAO will demonstrate tools and services for data-intensive astronomy in the context of a range of science use cases and tutorials including:

  • Data discovery and access
  • Catalog cross comparison
  • Constructing and modeling spectral energy distributions
  • Time series analysis tools
  • Distributed database queries
  • …and more

In the morning we will be showing a number of demonstrations of VO science applications and tools. Lunch will be provided for all participants and there will be informal discussions and Q&A over lunch. Afterwards, from ~12:45 to 2:00pm, there will be some hands-on time with some typical science use cases. You are welcome to bring your laptop and try things out for yourself.

Register now at usvao.org/voday@baltimore

This event will also be webcast live:

For video see: https://webcast.stsci.edu/webcast/
For audio: 1 877-951-4490, Passcode is 4015008

And I thought I would have to miss it because of distance!

Thank goodness for webcasts!

It should have a warning label:

Warning: This webcast contains new or different ideas, which may result in questioning of current ideas or even having new ones. Viewer discretion is advised.

NuoDB [Everything in NuoDB is an Atom]

Filed under: NoSQL,NuoDB,P2P — Patrick Durusau @ 7:09 pm

NuoDB

I last wrote about NuoDB in February of 2012, when it was still in private beta.

You can now download a free community edition with a limit of two nodes for the usual platforms.

The “under the hood” page reads (in part):

Everything in NuoDB is an Atom

Under the hood, NuoDB is an asynchronous, decentralized, peer-to-peer database. The NuoDB system is also object-oriented. Objects in NuoDB know how to perform various actions that create specific behaviors in the overall database. And at the heart of every object in NuoDB is the Atom. An Atom in NuoDB is like a single bird in a flock.

Atoms are self-describing objects (data and metadata) that together comprise the database. Everything in the NuoDB database is an Atom, including the schema, the indexes, and even the underlying data. For example, each table is an Atom that describes the metadata for the table and can reference other Atoms; such as Atoms that describe ranges of records in the table and their versions.

Atoms are Powerful

Atoms are intelligent, powerful, self-describing objects that together form the NuoDB database. Atoms know how to perform many actions, like these:

  • Atoms know how to make copies of themselves.
  • Atoms keep all copies of themselves up to date.
  • Atoms can broadcast messages. Atoms listen for events and changes from other Atoms.
  • Atoms can request data from other Atoms.
  • Atoms can serialize themselves to persistent storage.
  • Atoms can retrieve data from storage.

The Atoms are the Database

Everything in the database is an Atom, and the Atoms are the database. The Atoms work in concert to form both the Transaction (or Compute) Tier, and the Storage Tier.

A NuoDB Transaction Engine is a process that executes the SQL layer and is comprised completely of Atoms. The Transaction Engine operates on Atoms, listens for changes, and communicates changes with other Transaction Engines in the database.

A NuoDB Storage Manager is simply a special kind of Transaction Engine that allows Atoms to serialize themselves to permanent storage (such as a local disk or Amazon S3, for example).

A NuoDB database can be as simple as a single Transaction Engine and a single Storage Manager, or can be as complex as tens of Transaction Engines and Storage Managers distributed across dozens of computer hosts.

Some wag, in a report that reminded me to look at NuoDB again, was whining about how NuoDB would perform in query-intensive environments. I guess downloading a free copy to see was too much effort.

Of course, you would have to define a “query intensive” environment, and that would be no easy task. Lots of users with simple queries? (Define “lots” and “simple.”)

Just a suspicion as I wait for my download URL to arrive: “query” in an atom-based system may not have the same internal processes as a traditional relational database. Or perhaps not entirely.

For example, what if the notion of “retrieval” from a location in memory is no longer operative? That is, a query could be composed of atoms that begin messaging as they are composed, and so receive information before the user reaches the end of a query string.

And more than that, query atoms that occur frequently could be persisted so the creation cost is not incurred in subsequent queries.

Hard to say without knowing more about it but it definitely should be on your short list of products to watch.
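Here is a toy Python sketch of the atom idea as I read the description above: self-describing objects that broadcast changes to listeners and can serialize themselves. This is purely illustrative guesswork on my part, not NuoDB’s actual implementation.

import json

class Atom:
    def __init__(self, kind, data):
        self.kind, self.data = kind, data
        self.listeners = []            # other atoms interested in changes

    def subscribe(self, other):
        self.listeners.append(other)

    def update(self, **changes):
        self.data.update(changes)
        for atom in self.listeners:    # broadcast the change
            atom.on_change(self, changes)

    def on_change(self, source, changes):
        print(f"{self.kind} atom saw {source.kind} change: {changes}")

    def serialize(self):
        """Write the atom to 'storage' (here, just JSON text)."""
        return json.dumps({"kind": self.kind, "data": self.data})

table = Atom("table", {"name": "orders"})
index = Atom("index", {"on": "orders.id"})
table.subscribe(index)
table.update(rows=42)          # the index atom hears about it immediately
print(table.serialize())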

A thrift to CQL3 upgrade guide [Cassandra Query Language]

Filed under: Cassandra,Query Language — Patrick Durusau @ 2:43 pm

A thrift to CQL3 upgrade guide by Sylvain Lebresne.

From the post:

CQL3 (the Cassandra Query Language) provides a new API to work with Cassandra. Where the legacy thrift API exposes the internal storage structure of Cassandra pretty much directly, CQL3 provides a thin abstraction layer over this internal structure. This is A Good Thing as it allows hiding from the API a number of distracting and useless implementation details (such as range ghosts) and allows to provide native syntaxes for common encodings/idioms (like the CQL3 collections as we’ll discuss below), instead of letting each client or client library reimplement them in their own, different and thus incompatible, way. However, the fact that CQL3 provides a thin layer of abstraction means that thrift users will have to understand the basics of this abstraction if they want to move existing application to CQL3. This is what this post tries to address. It explains how to translate thrift to CQL3. In doing so, this post also explains the basics of the implementation of CQL3 and can thus be of interest for those that want to understand that.

But before getting to the crux of the matter, let us have a word about when one should use CQL3. As described above, we believe that CQL3 is a simpler and overall better API for Cassandra than the thrift API is. Therefore, new projects/applications are encouraged to use CQL3 (though remember that CQL3 is not final yet, and so this statement will only be fully valid with Cassandra 1.2). But the thrift API is not going anywhere. Existing applications do not have to upgrade to CQL3. Internally, both CQL3 and thrift use the same storage engine, so all future improvements to this engine will impact both of them equally. Thus, this guide is for those that 1) have an existing application using thrift and 2) want to port it to CQL3.

Finally, let us remark that CQL3 does not claim to fundamentally change how to model applications for Cassandra. The main modeling principles are the same as they always have been: efficient modeling is still based on collocating data that are accessed together through denormalization and the use of the ordering the storage engine provides, and is thus largely driven by the queries. If anything, CQL3 claims to make it easier to model along those principles by providing a simpler and more intuitive syntax to implement a number of idioms that this kind of modeling requires.

If you are using Cassandra, definitely the time to sit up and take notice. CQL3 is coming.

I first saw this at Alex Popescu’s myNoSQL.
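For a taste of what the post describes, a hedged sketch of CQL3 statements wrapped in Python. The schema is invented, and the DataStax cassandra-driver shown here is an assumption on my part (it postdates this post); with a live cluster the guarded block at the end would run the statements.

CREATE = """
CREATE TABLE users (
    user_id text PRIMARY KEY,
    name text,
    emails set<text>   -- a CQL3 collection replacing a hand-rolled thrift idiom
)
"""

INSERT = "INSERT INTO users (user_id, name, emails) VALUES (%s, %s, %s)"
SELECT = "SELECT name, emails FROM users WHERE user_id = %s"

if __name__ == "__main__":
    # Requires a running Cassandra node and `pip install cassandra-driver`.
    from cassandra.cluster import Cluster
    session = Cluster(["127.0.0.1"]).connect("demo_keyspace")
    session.execute(CREATE)
    session.execute(INSERT, ("pd", "Patrick", {"pd@example.org"}))
    print(session.execute(SELECT, ("pd",)).one())

The set<text> column is the kind of native collection the post contrasts with the thrift habit of encoding such idioms by hand in column names.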

