Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

February 11, 2012

Weave open-source data visualization offers power, flexibility

Filed under: Government Data,Weave — Patrick Durusau @ 7:41 pm

Weave open-source data visualization offers power, flexibility by Sharon Machlis.

From the post:

When two Boston-area organizations rolled out an interactive data visualization website last month, it represented one of the largest public uses yet for the open-source project Weave — and more are on the way.

Three years in development so far and still in beta, Weave is designed so government agencies, non-profits and corporate users can offer the public an easy-to-use platform for examining information. Want to see the relationship between low household incomes and student reading scores in eastern Mass.? How housing and transportation costs compare with income? Or maybe how obesity rates have changed over time? Load some data to generate a table, scatter plot and map.

In addition to viewing data, mousing over various entries lets you highlight items on multiple visualizations at once: map, map legend, bar chart and scatter plot, for example. Users can also add visualization elements or change data sets, as well as right-click to look up related information on the Web.

This story about Weave highlights different data sets than the last one I reported. Is this a "where there's smoke, there's fire" situation? That is to say, does public access to and manipulation of data have the potential to make a real difference?

If so, in what way? Will open access to data result in the closure of secret courts? Or an end to secret indictments and evidence? The evidence that has come to light via diplomatic cables, for example, is embarrassing for incompetent or crude individuals. Hardly the stuff of "national security." (Sorry, I don't know how to embed a drum roll in the page; maybe in HTML5 I can.)

Flotr2

Filed under: Graphics,HTML5 — Patrick Durusau @ 7:29 pm

Flotr2

From the documentation:

Flotr2 is a library for drawing HTML5 charts and graphs. It is a branch of flotr which removes the Prototype dependency and includes many improvements.

Features:

  • mobile support
  • framework independent
  • extensible plugin framework
  • custom chart types
  • FF, Chrome, IE6+, Android, iOS
  • lines
  • bars
  • candles
  • pies
  • bubbles

See the website for some interesting (non-static) examples.

February 10, 2012

Reading club on Graph databases and distributed systems

Filed under: Graphs — Patrick Durusau @ 4:15 pm

Reading club on Graph databases and distributed systems by René Pickhardt.

The short version is that René is organizing a graph reading club that meets:

The reading club will take place in D116, the "Kreuzverweisraum", every Wednesday at 15:00.

For next week, I expect anyone who wants to join to have read the Map Reduce paper by Wednesday.

I will keep everyone up to date with the results from the reading club and the announcements for next week's readings. If anyone from the web wants to join, I am pretty sure that a Skype call or Google hangout is very possible to set up!

Starts 15 February 2012, at 3:00 PM Germany, 9 AM East Coast USA, with the Map Reduce paper.

I have written to confirm the details and to request remote attendance.

SSTable and Log Structured Storage: LevelDB

Filed under: leveldb,SSTable — Patrick Durusau @ 4:14 pm

SSTable and Log Structured Storage: LevelDB by Ilya Grigorik.

From the post:

If Protocol Buffers is the lingua franca of individual data record at Google, then the Sorted String Table (SSTable) is one of the most popular outputs for storing, processing, and exchanging datasets. As the name itself implies, an SSTable is a simple abstraction to efficiently store large numbers of key-value pairs while optimizing for high throughput, sequential read/write workloads.

Unfortunately, the SSTable name itself has also been overloaded by the industry to refer to services that go well beyond just the sorted table, which has only added unnecessary confusion to what is a very simple and a useful data structure on its own. Let’s take a closer look under the hood of an SSTable and how LevelDB makes use of it.

How important is this data structure? It, or a variant of it, is used by Google's BigTable, Hadoop's HBase, and Cassandra, among others.

Whether this will work for your purposes is another question, but it never hurts to know more today than you did yesterday.
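
If you want the idea in miniature, here is a rough sketch in Python (nothing like LevelDB's actual on-disk format, and the file path is just an example): keys are written once, in sorted order, alongside an index from key to file offset so reads can seek rather than scan.

```python
import json

def write_sstable(path, records):
    """Write records (a dict) as sorted key/value lines; return a key -> offset index."""
    index = {}
    with open(path, "wb") as f:
        for key in sorted(records):
            index[key] = f.tell()          # byte offset where this key's line starts
            line = json.dumps([key, records[key]]) + "\n"
            f.write(line.encode("utf-8"))
    return index

def read_value(path, index, key):
    """Seek straight to the key's offset instead of scanning the whole file."""
    if key not in index:
        return None
    with open(path, "rb") as f:
        f.seek(index[key])
        _, value = json.loads(f.readline().decode("utf-8"))
        return value

index = write_sstable("/tmp/demo.sst", {"b": 2, "a": 1, "c": 3})
print(read_value("/tmp/demo.sst", index, "b"))   # -> 2
```

A sketch like this leaves out everything that makes LevelDB interesting, of course: in-memory tables, logs and compaction are where the "log structured" part comes in.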

Slides and replay for “A backstage tour of ggplot2”

Filed under: Ggplot2,Graphics — Patrick Durusau @ 4:12 pm

Slides and replay for “A backstage tour of ggplot2”

From the post:

Many thanks to Hadley Wickham for his informative and entertaining webinar yesterday, “A backstage tour of ggplot2”. Thanks also to everyone who submitted questions — with more than 800 attendees live on the line we had many more questions than we had time to answer.

Pointers to lots of goodies and video of the presentation!

Wolfram Alpha Pro democratizes data analysis:…

Filed under: Data Analysis — Patrick Durusau @ 4:12 pm

Wolfram Alpha Pro democratizes data analysis: an in-depth look at the $4.99 a month service by Dieter Bohn.

From the post:

On Wednesday, February 8th, Wolfram Alpha will be adding a new, “Pro” option to its already existing services. Priced at a very reasonable $4.99 a month ($2.99 for students), the new services includes the ability to use images, files, and even your own data as inputs instead of simple text entry. The “reports” that Wolfram Alpha kicks out as a result of these (or any) query are also beefed up for Pro users, some will actually become interactive charts and all of them can be more easily exported in a variety of formats. We sat down with Stephen Wolfram himself to get a tour of the new features and to discuss what they mean for his goal of “making the world’s knowledge computable.”

Computers have certainly played a leading role in the hard sciences over the last seventy or so years but I remain sceptical about their role in the difficult sciences. It is true that computers can assist in quickly locating all the uses of a particular string in Greek, Hebrew or Ugaritic. But determining the semantics of such a string requires more than the ability to count quickly.

Still, Wolfram created a significant tool for mathematical research (Mathematica) so his work on broader areas of human knowledge merits a close look.

Semantic based web search engines- Changing the world of Search

Filed under: Search Engines — Patrick Durusau @ 4:11 pm

Semantic based web search engines- Changing the world of Search by Prachi Nagpal.

From the post:

An important quality that the majority of search engines functional today lack is the ability to take into account the intention of the user behind the overall query. Basing the matching of web pages on keyword frequency and a ranking metric such as PageRank returns results that may be highly ranked but still irrelevant to the user's intended context. Therefore I explored and realized that there is a need to add semantics to web search.

Some semantic search engines have already come on the market, e.g. Hakia, Swoogle, and Kosmix, that take a semantics-based approach different from that of traditional search engines. I really liked their idea of adding semantics to web search. This provoked me to do more research in this field and to think of different ways to add semantics.

Following is an algorithm that can be used in semantic-based web search engines:

So, to find web pages on the Internet that match a user's query based not only on the important keywords in the query but also on the intention of the user behind that query, the user's entered query is first expanded using the WordNet ontology.

This algorithm focuses on work that uses the hypernym/hyponym and synset relations in WordNet for query expansion. A set of words highly related to the words in the user query is created, determined by the frequency of their occurrence in the synset and hyponym trees for each of the query terms. This set is then refined using the same relations to get a more precise and accurate expanded query.
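
For the curious, here is a minimal sketch of that expansion step using NLTK's WordNet interface. It is my simplification, not the author's code: the refinement step is reduced to a plain frequency count over synonyms, hypernyms and hyponyms.

```python
# Requires the nltk package and the WordNet corpus (nltk.download("wordnet")).
from collections import Counter
from nltk.corpus import wordnet as wn

def expand_query(query, top_n=5):
    related = Counter()
    for term in query.lower().split():
        for synset in wn.synsets(term):
            # synonyms from the synset itself
            for lemma in synset.lemma_names():
                related[lemma.replace("_", " ")] += 1
            # hypernyms and hyponyms, as in the algorithm described above
            for rel in synset.hypernyms() + synset.hyponyms():
                for lemma in rel.lemma_names():
                    related[lemma.replace("_", " ")] += 1
    expansions = [word for word, _ in related.most_common(top_n)]
    return query.split() + expansions

print(expand_query("java island"))
```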

An interesting approach, but as the comments indicate, the lack of RDF use makes it problematic.

I would rephrase the problem statement from:

…the majority of search engines functional today lack is the ability to take into account the intention of the user behind the overall query.

to: …the majority of search engines lack the ability to accurately interpret the semantics of web accessible content.

I think we would all agree that web accessible content has semantics.

The problem is how to bring those semantics to the attention of search engines?

Or perhaps better, how do we take advantage of those semantics with current search engines, which are semantically deaf and dumb?

Unicode’s “Pile of Poo” character

Filed under: Humor — Patrick Durusau @ 4:09 pm

Unicode’s “Pile of Poo” character by Cory Doctorow.

From the post:

For many years, most of the Internet ran on ASCII, a character set that had a limited number of accents and diacriticals, and which didn’t support non-Roman script at all. Unicode, a massive, sprawling replacement, has room for all sorts of characters and alphabets, and can be extended with “private use areas” that include support for Klingon.

But for all that, I never dreamt that Unicode was so vast as to contain a special character for a “pile of poo.”

See Cory’s post for the glyph in question.

I don’t know which of the following is the least surprising:

  1. That Unicode has a character for a “pile of poo.”
  2. That Cory Doctorow found a character in Unicode for a “pile of poo.”
  3. That Cory Doctorow routinely checks character sets for “pile of poo” characters.

Perhaps equally unsurprising. 😉

I will put this down as another fact I would not know but for Cory Doctorow.

(Truth be told, Cory is a gifted writer and transparency advocate, in addition to being a "poo hunter.")

Tauberer on Open Legislative Data

Filed under: Legal Informatics,Transparency — Patrick Durusau @ 4:07 pm

Tauberer on Open Legislative Data

From the post:

Dr. Joshua Tauberer of GovTrack and POPVOX has posted his House Legislative Data and Transparency Conference presentation, entitled “Data Impact and Understandability”: click here for the text; click here for the slides.

See the post for more resources from the conference and other materials.

SACO: Subject Authority Cooperative Program of the PCC

Filed under: Subject Authority,Subject Headings,Subject Identifiers — Patrick Durusau @ 4:06 pm

SACO: Subject Authority Cooperative Program of the PCC

SACO was established to allow libraries to contribute proposed subject headings to the Library of Congress.

Of particular interest is: Web Resources for SACO Proposals by Adam L. Schiff.

It is a very rich source of reference materials that you may find useful in developing subject heading proposals or subject classifications for other uses (such as topic maps).

But don’t neglect the materials you find on the SACO homepage.

ODLIS: Online Dictionary for Library and Information Science

Filed under: Dictionary,Information Science,Library — Patrick Durusau @ 4:05 pm

ODLIS: Online Dictionary for Library and Information Science by Joan M. Reitz.

ODLIS is known to all librarians and graduate library school students but perhaps not to those of us who abuse library terminology in CS and related pursuits. I can't promise it will make our usage any better, but it certainly won't make it any worse. 😉

This would make a very interesting “term for a day” type resource.

Certainly one you should bookmark and browse at your leisure.

History of the Dictionary

ODLIS began at the Haas Library in 1994 as a four-page printed handout titled Library Lingo, intended for undergraduates not fluent in English and for English-speaking students unfamiliar with basic library terminology. In 1996, the text was expanded and converted to HTML format for installation on the WCSU Libraries Homepage under the title Hypertext Library Lingo: A Glossary of Library Terminology. In 1997, many more hypertext links were added and the format improved in response to suggestions from users. During the summer of 1999, several hundred terms and definitions were added, and a generic version was created that omitted all reference to specific conditions and practices at the Haas Library.

In the fall of 1999, the glossary was expanded to 1,800 terms, renamed to reflect its extended scope, and copyrighted. In February, 2000, ODLIS was indexed in Yahoo! under “Reference – Dictionaries – Subject.” It was also indexed in the WorldCat database, available via OCLC FirstSearch. During the year 2000, the dictionary was expanded to 2,600 terms and by 2002 an additional 800 terms had been added. From 2002 to 2004, the dictionary was expanded to 4,200 terms and cross-references were added, in preparation for the print edition. Since 2004, an additional 600 terms and definitions have been added.

Purpose of the Dictionary

ODLIS is designed as a hypertext reference resource for library and information science professionals, university students and faculty, and users of all types of libraries. The primary criterion for including a term is whether a librarian or other information professional might reasonably be expected to know its meaning in the context of his or her work. A newly coined term is added when, in the author’s judgment, it is likely to become a permanent addition to the lexicon of library and information science. The dictionary reflects North American practice; however, because ODLIS was first developed as an online resource available worldwide, with an e-mail contact address for feedback, users from many countries have contributed to its growth, often suggesting additional terms and commenting on existing definitions. Expansion of the dictionary is an ongoing process.

Broad in scope, ODLIS includes not only the terminology of the various specializations within library science and information studies but also the vocabulary of publishing, printing, binding, the book trade, graphic arts, book history, literature, bibliography, telecommunications, and computer science when, in the author’s judgment, a definition might prove useful to librarians and information specialists in their work. Entries are descriptive, with examples provided when appropriate. The definitions of terms used in the Anglo-American Cataloging Rules follow AACR2 closely and are therefore intended to be prescriptive. The dictionary includes some slang terms and idioms and a few obsolete terms, often as See references to the term in current use. When the meaning of a term varies according to the field in which it is used, priority is given to the definition that applies within the field with which it is most closely associated. Definitions unrelated to library and information science are generally omitted. As a rule, definition is given under an acronym only when it is generally used in preference to the full term. Alphabetization is letter-by-letter. The authority for spelling and hyphenation is Webster’s New World Dictionary of the American Language (College Edition). URLs, current as of date of publication, are updated annually.

Dragsters, Drag Cars & Drag Racing Cars

I still remember the cover of Hot Rod magazine that announced (from memory) "The 6's are here!" Don "The Snake" Prudhomme had broken the 200 mph barrier in a drag race. Other memories follow on from that one, but I mention it to explain my interest in a recent Subject Authority Cooperative Program decision not to add cross-references to dragster (the term I would have used) from the more recent terms drag cars and drag racing cars.

The expected search (in this order) due to this decision is:

Cars (Automobiles) -> redirect to Automobiles -> narrower term -> Automobiles, Racing -> narrower term -> Dragsters

Adam L. Schiff, proposer of drag cars & drag racing cars, says below: "This just is not likely to happen."

Question: Is there a relationship between users "work[ing] their way up and down hierarchies" and relationship-display methods? Who chooses which items will be the starting points that lead to other items? How do you integrate a keyword search into such a system?

Question: And what of the full phrase/sentence AI systems where keywords work less well? How does that work with relationship display systems?

Question: I wonder whether the relationship-display methods are closer to the up-and-down hierarchies, but with less guidance.

Adam’s Dragster proposal post in full:

Dragsters

Automobiles has a UF Cars (Automobiles). Since the UF already exists on the basic heading, it is not necessary to add it to Dragsters. The proposal was not approved.

Our proposal was to add two additional cross-references to Dragsters: Drag cars, and Drag racing cars. While I understand, in principle, the reasoning behind the rejection of these additional references, I do not see how it serves users. A user coming to a catalog to search for the subject “Drag cars” will now get nothing, no redirection to the established heading. I don’t see how the presence of a reference from Cars (Automobiles) to Automobiles helps any user who starts a search with “Drag cars”. Only if they begin their search with Cars would they get led to Automobiles, and then only if they pursue narrower terms under that heading would they find Automobiles, Racing, which they would then have to follow further down to Dragsters. This just is not likely to happen. Instead they will probably start with a keyword search on “Drag cars” and find nothing, or if lucky, find one or two resources and think they have it all. And if they are astute enough to look at the subject headings on one of the records and see “Dragsters”, perhaps they will then redo their search.

Since the proposed cross-refs do not begin with the word Cars, I do not at all see how a decision like this is in the service of users of our catalogs. I think that LCSH rules for references were developed when it was expected that users would consult the big red books and work their way up and down hierarchies. While some online systems do provide for such navigation, it is doubtful that many users take this approach. Keyword searching is predominant in our catalogs and on the Web. Providing as many cross-refs to established headings as we can would be desirable. If the worry is that the printed red books will grow to too many volumes if we add more variant forms that weren’t made in the card environment, then perhaps there needs to be a way to include some references in authority records but mark them as not suitable for printing in printed products.

PS: According to ODLIS: Online Dictionary for Library and Information Science by Joan M. Reitz, UF has the following definition:

used for (UF)

A phrase indicating a term (or terms) synonymous with an authorized subject heading or descriptor, not used in cataloging or indexing to avoid scatter. In a subject headings list or thesaurus of controlled vocabulary, synonyms are given immediately following the official heading. In the alphabetical list of indexing terms, they are included as lead-in vocabulary followed by a see or USE cross-reference directing the user to the correct heading. See also: syndetic structure.

I did not attempt to reproduce the extremely rich cross-linking in this entry but commend the entire resource to your attention, particularly if you are a library science student.
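
To make the UF mechanics concrete, here is a toy sketch of the point at issue in the Dragsters discussion above: without a reference from the variant term, a keyword-style lookup finds nothing, while with the proposed references it resolves to the established heading. The table is illustrative only, not LCSH data.

```python
# A toy "used for" (UF) / see-reference table: variant term -> established heading.
see_references = {
    "cars (automobiles)": "Automobiles",
    # the two rejected proposals from Adam's post:
    "drag cars": "Dragsters",
    "drag racing cars": "Dragsters",
}

def resolve(term):
    """Return the established heading for a variant term, or the term unchanged."""
    return see_references.get(term.lower(), term)

print(resolve("Drag cars"))         # -> "Dragsters", but only because the reference exists
print(resolve("Drag racing cars"))  # -> "Dragsters"
```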

February 9, 2012

Persistent Graphs with OrientDB

Filed under: Graphs,OrientDB — Patrick Durusau @ 4:31 pm

Persistent Graphs with OrientDB by Luca Molino.

Description:

This talk will present the OrientDB open source project and its capability to handle persistent graphs in different ways. Topics: OrientDB presentation, Java Graph Native API, SQL+graph extensions, HTTP API, Blueprints API, Gremlin usage, Console tool, Studio web tool.

Having the slides would make this presentation much easier to follow.

The phrase “persistent graph” is used in this and other presentations with no readily apparent definition.

Wikipedia was up today so I checked the term Persistent Data Structure, but nothing in that article had anything in common (other than data, data structure) with the presentation.

I suspect that “persistent graph” is being used to indicate that data is being stored and different queries can be run against the data (without changing the data). I am not sure that merits an undefined term.

OrientDB: http://www.orientechnologies.com

The joy of algorithms and NoSQL revisited: the MongoDB Aggregation Framework

Filed under: Aggregation,Cheminformatics,MongoDB — Patrick Durusau @ 4:30 pm

The joy of algorithms and NoSQL revisited: the MongoDB Aggregation Framework by Davy Suvee.

From the post:

Part 1 of this article describes the use of MongoDB to implement the computation of molecular similarities. Part 2 discusses the refactoring of this solution by making use of MongoDB's built-in map-reduce functionality to improve overall performance. Part 3, finally, illustrates the use of the new MongoDB Aggregation Framework, which boosts performance beyond the capabilities of the map-reduce implementation.

In part 1 of this article, I described the use of MongoDB to solve a specific chemoinformatics problem, namely the computation of molecular similarities through Tanimoto coefficients. When employing a low target Tanimoto coefficient, however, the number of returned compounds increases exponentially, resulting in a noticeable data transfer overhead. To circumvent this problem, part 2 of this article describes the use of MongoDB's built-in map-reduce functionality to perform the Tanimoto coefficient calculation local to where the compound data is stored. Unfortunately, the execution of these map-reduce algorithms through JavaScript is rather slow, and a performance improvement can only be achieved when multiple shards are employed within the same MongoDB cluster.

Recently, MongoDB introduced its new Aggregation Framework. This framework provides a more simple solution to calculating aggregate values instead of relying upon the powerful map-reduce constructs. With just a few simple primitives, it allows you to compute, group, reshape and project documents that are contained within a certain MongoDB collection. The remainder of this article describes the refactoring of the map-reduce algorithm to make optimal use of the new MongoDB Aggregation Framework. The complete source code can be found on the Datablend public GitHub repository.

Does it occur to you that aggregation results in one or more aggregates? And if we are presented with one or more aggregates, we could persist those aggregates and add properties to them. Or have relationships between aggregates. Or point to occurrences of aggregates.

Kristina Chodorow demonstrated the use of aggregation in MongoDB in Hacking Chess with the MongoDB Pipeline, for analysis of chess games. Rather than summing the number of games in which the move "e4" is the first move for White, links to all 696 games could be treated as occurrences of that subject, which would support discovery of the players of White as well as Black.

Think of aggregation as a flexible means for merging information about subjects and their relationships. (Blind interchange requires more but this is a step in the right direction.)
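
As a rough sketch of that kind of grouping with pymongo (the collection name and the first_move field are assumptions for illustration, not Kristina's schema), note how $push keeps links to the individual games, which is what lets you treat them as occurrences of the subject rather than just a count:

```python
from pymongo import MongoClient

games = MongoClient()["chess"]["games"]   # hypothetical database/collection names

pipeline = [
    # group games by White's first move, keeping a count and links to the games
    {"$group": {"_id": "$first_move",
                "count": {"$sum": 1},
                "games": {"$push": "$_id"}}},
    {"$sort": {"count": -1}},
]

for row in games.aggregate(pipeline):
    # e.g. "e4", how many games open with it, and pointers to those games
    print(row["_id"], row["count"], len(row["games"]))
```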

Spring and Scala (Scala User Group London talk)

Filed under: Scala,Spring — Patrick Durusau @ 4:29 pm

Spring and Scala (Scala User Group London talk) by Jan Machacek.

From the post:

Many thanks to all who came to my Spring in Scala talk. The video is now available at the Skills Matter website; I am adding the slides in PDF, the source code on GitHub, and links to the other posts that explain in more detail the topics I mentioned in the talk.

It would be very nice if this became a tradition for Skills Matter presentations: video, slides, source code and a post with links to further resources.

Watch the presentation, download the slides and source code and read this post carefully. You won’t be disappointed.

Description of the presentation:

In this Spring in Scala talk, Jan Machacek will start by comparing Scala to the other languages on the Java platform. Find out that Scala code gets compiled to regular Java bytecode, making it accessible to your Spring code. You will also learn what functional programming means and how to see & apply the patterns of functional programming in what we would call enterprise code. In addition to being a functional language, Scala is a strongly typed language.

The second part of the talk will therefore explore the principles of type systems. You will find out what polymorphic functions are, and what the Scala wizards mean when they talk about type covariance and contravariance. Throughout the talk, there will be plenty of code examples comparing Spring beans in Java with their new form in Scala; together with plentiful references to the ever-growing Scala ecosystem, the talk will give you inspiration & guidance on using Scala in your Spring applications. Come over and find your functional mojo!

ElasticSearch vs. Apache Solr

Filed under: ElasticSearch,Solr — Patrick Durusau @ 4:28 pm

ElasticSearch vs. Apache Solr

Good for the information it does present, but both the pros and cons are primarily focused on ElasticSearch. Which is fine, because the author did a very good job trying to present both the pros and cons of ElasticSearch.

It would be a better (read: different) presentation if it walked through the pros and cons of ElasticSearch and Apache Solr side by side. (But an author is entitled to write the post they want, not the one desired by others.)

I don’t know if it would be worth the effort, but a common interface that searched the JIRA instances (or other issue trackers) of the common search engines and presented the issues grouped together could be quite useful.

For comparison purposes but also for cross-pollination of solutions.
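
A rough sketch of what such an interface could look like, querying Apache's JIRA for Lucene and Solr issues in one call and grouping the results by project. The endpoint is JIRA's standard REST search; the JQL string and the grouping are my assumptions, and a tracker living elsewhere (say, GitHub issues) would need its own adapter.

```python
from collections import defaultdict
import requests

def search_issues(text, max_results=20):
    """Search the Lucene and Solr JIRA projects for `text`, grouped by project key."""
    resp = requests.get(
        "https://issues.apache.org/jira/rest/api/2/search",
        params={
            "jql": f'project in (LUCENE, SOLR) AND text ~ "{text}"',
            "maxResults": max_results,
            "fields": "summary,project",
        },
    )
    resp.raise_for_status()
    grouped = defaultdict(list)
    for issue in resp.json()["issues"]:
        grouped[issue["fields"]["project"]["key"]].append(
            (issue["key"], issue["fields"]["summary"])
        )
    return grouped

for project, issues in search_issues("faceting").items():
    print(project, len(issues))
```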

A First Exploration Of SolrCloud

Filed under: Solr,SolrCloud — Patrick Durusau @ 4:26 pm

A First Exploration Of SolrCloud

From the post:

SolrCloud has recently been in the news and was merged into Solr trunk, so it was high time to have a fresh look at it.

The SolrCloud wiki page gives various examples but left a few things unclear for me. The examples only show Solr instances which host one core/shard, and it doesn’t go deep on the relation between cores, collections and shards, or how to manage configurations.

In this blog, we will have a look at an example where we host multiple shards per instance, and explain some details along the way.

If you have any interest in SolrCloud, this is a post for you. Forward it to your friends if they are interested in Solr. And family. Well, maybe not that last one. 😉

I have a weakness for posts that take the time to call out “…shard and slice are often used in ambiguous ways…,” establish a difference and then use those terms consistently.

One of the primary weaknesses of software projects is that “documentation” is treated with about the same concern as “requirements.”

The original programmers may understand the ambiguities, and if you want a cult program, that’s great. But to be successful software, that is, software that is widely used, it has to be understood by as many programmers as possible. Possibly even by users, if it is an end-use product.

Think of it this way: You don’t want to be distracted from the next great software project by endless questions that you have to stop and answer. Do the documentation along the way and you won’t lose time on the next great project. Which we are all waiting to see. Documentation is a win-win situation.

Lucene-3759: Support joining in a distributed environment

Filed under: Lucene,Query Expansion,Sharding — Patrick Durusau @ 4:26 pm

Support joining in a distributed environment.

From the description:

Add two more methods in JoinUtil to support joining in a distributed manner.

  • Method to retrieve all from values.
  • Method to create a TermsQuery based on a set of from terms.

With these two methods distributed joining can be supported following these steps:

  1. Retrieve from values from each shard
  2. Merge the retrieved from values.
  3. Create a TermsQuery based on the merged from terms and send this query to all shards.
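
The pattern itself is easy to sketch outside Lucene. This is not the JoinUtil API, just the three steps above in plain Python, with shards modeled as lists of documents:

```python
def distributed_join(from_shards, from_field, to_shards, to_field):
    # steps 1 and 2: gather the "from" values on each shard and merge them
    merged = set()
    for shard in from_shards:
        merged |= {doc[from_field] for doc in shard}
    # step 3: the equivalent of a TermsQuery run against every "to" shard
    return [doc for shard in to_shards for doc in shard
            if doc.get(to_field) in merged]

authors = [[{"name": "a"}, {"name": "b"}], [{"name": "c"}]]      # sharded "from" side
books_1 = [{"title": "t1", "author": "a"}, {"title": "t2", "author": "z"}]
books_2 = [{"title": "t3", "author": "c"}]

# join: books whose author appears anywhere on the "from" side
print(distributed_join(authors, "name", [books_1, books_2], "author"))
```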

Topic maps that have been split into shards could have values that would trigger merging if present in a single shard.

This appears to be a way to address that issue.

Time spent with Lucene is time well spent.

Using Solr’s Dismax Tie Parameter

Filed under: Solr — Patrick Durusau @ 4:25 pm

Using Solr’s Dismax Tie Parameter by Rafał Kuć.

From the post:

The Dismax query parser has been with Solr for a long time. Most of the time we use parameters like qf, pf or mm forgetting about a very useful parameter which allows us to control how the lower scoring fields are treated – the tie parameter.

Tie

The tie parameter allows one to control how the lower scoring fields affect the score for a given word. If we set the tie parameter to a 0.0 value, during the score calculation only the fields that were scored highest will matter. However, if we set it to 0.99, the fields scoring lower will have almost the same impact on the score as the highest scoring field. So let’s check if that actually works.
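
The blending itself is simple enough to sketch: DisMax takes the highest per-field score and adds tie times the sum of the remaining field scores. The field scores below are made-up numbers for illustration.

```python
def dismax_score(field_scores, tie):
    """max(field scores) + tie * (sum of the other field scores)."""
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)

scores = {"title": 3.2, "body": 1.1, "tags": 0.4}

for tie in (0.0, 0.5, 0.99):
    print(tie, round(dismax_score(scores.values(), tie), 3))
# tie=0.0  -> only the best field counts (3.2)
# tie=0.99 -> lower-scoring fields count almost fully (about 4.685)
```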

Remember that the Solr Wiki defaults to title searches when you are looking for parameters. Choose the text button next to the search box.

For qf (Query Fields), pf (Phrase Fields), mm (Minimum ‘Should’ Match), and tie (Tie Breaker), see: the Solr Wiki DisMaxQParserPlugin.

Just in case you are new to Dismax, see: What’s a “Dismax”? from Lucid Imagination.

ExtendedDismax (placeholder homepage) is covered in SOLR-2368.

A Peek Inside the Erlang Compiler

Filed under: Erlang — Patrick Durusau @ 4:24 pm

A Peek Inside the Erlang Compiler

From the post:

Erlang is a complex system, and I can’t do its inner workings justice in a short article, but I wanted to give some insight into what goes on when a module is compiled and loaded. As with most compilers, the first step is to convert the textual source to an abstract syntax tree, but that’s unremarkable. What is interesting is that the code goes through three major representations, and you can look at each of them.

Covers the following transformations:

  • Syntax trees to Core Erlang
  • Core Erlang to code for the register-based BEAM virtual machine (final output of compiler)
  • BEAM bytecode into threaded code (loader output)

Just in case you wanted to know more about Erlang than you found in the crash course. 😉

A deeper understanding of any language is useful. Understanding “why” a construction works is the first step to writing a better one.

Crash Course in Erlang

Filed under: Erlang — Patrick Durusau @ 4:23 pm

Crash Course in Erlang (PDF file) by Roy Deal Simon.

“If your language is not functional, it’s dysfunctional baby.”

I suppose I look at Erlang (and other) intros just to see if the graphics/illustrations are different from other presentations. 😉 Not enough detail to really teach you much but sometimes the graphics are worth remembering.

Not any time soon but it would be interesting to review presentations for common illustrations. Perhaps even a way to find the ones that are the best to use with particular audiences. Something to think about.

RSS River Plugin (ElasticSearch)

Filed under: ElasticSearch,RSS River Plugin — Patrick Durusau @ 4:22 pm

RSS River Plugin (ElasticSearch)

Here’s the non-technical stuff:

RSS River Plugin offers a simple way to index RSS feeds into Elasticsearch.

It reads your feeds at a regular interval and indexes their content.

As with all rivers, it’s quite simple to create an RSS river:

  • Install the plugin and start Elasticsearch
  • Create your index (with mapping if needed)
  • Define the river
  • Search for RSS content

Rivers enable you to automatically index content as it arrives.

ElasticSearch offers default rivers for CouchDB, RabbitMQ, Twitter and Wikipedia.
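
For the "define the river" step, rivers are registered by indexing a _meta document into the _river index; something along these lines should work, although the rss-specific body fields (feed url, update rate) and the name of the target index are quoted from memory and should be checked against the plugin README.

```python
import requests

river = {
    "type": "rss",
    "rss": {
        "url": "http://www.lemonde.fr/rss/une.xml",   # any feed URL
        "update_rate": 15 * 60 * 1000,                # poll every 15 minutes (ms); field name assumed
    },
}

# register the river by indexing its _meta document
resp = requests.put("http://localhost:9200/_river/my_rss_river/_meta", json=river)
print(resp.json())

# then search the indexed feed content (index name assumed to match the river name)
print(requests.get("http://localhost:9200/my_rss_river/_search",
                   params={"q": "title:java"}).json())
```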

February 8, 2012

Suffering-Oriented Programming

Filed under: Marketing,Programming,Use Cases — Patrick Durusau @ 5:14 pm

Suffering-Oriented Programming by Nathan Marz.

From the post:

Someone asked me an interesting question the other day: “How did you justify taking such a huge risk on building Storm while working on a startup?” (Storm is a realtime computation system). I can see how from an outsider’s perspective investing in such a massive project seems extremely risky for a startup. From my perspective, though, building Storm wasn’t risky at all. It was challenging, but not risky.

I follow a style of development that greatly reduces the risk of big projects like Storm. I call this style “suffering-oriented programming.” Suffering-oriented programming can be summarized like so: don’t build technology unless you feel the pain of not having it. It applies to the big, architectural decisions as well as the smaller everyday programming decisions. Suffering-oriented programming greatly reduces risk by ensuring that you’re always working on something important, and it ensures that you are well-versed in a problem space before attempting a large investment.

I have a mantra for suffering-oriented programming: “First make it possible. Then make it beautiful. Then make it fast.”

First make it possible

When encountering a problem domain with which you’re unfamiliar, it’s a mistake to try to build a “general” or “extensible” solution right off the bat. You just don’t understand the problem domain well enough to anticipate what your needs will be in the future. You’ll make things generic that needn’t be, adding complexity and wasting time.

A different perspective from Semantic Web proposals, which address problems that users don’t realize they have. (Topic maps have the same issue.)

I was going on, probably tiresomely, to my wife about a paper on transient hypernodes/hyperedges and she asked: “Is anyone using it now?” I had to admit that, 22 years after publication, it had not swept the field of IR.

She continued: “If it was so good, why isn’t everyone using it?” A question to which I don’t have a good answer.

RDF and OWL interest the W3C and funders of research projects, but few others. There is no groundswell of demand for an ontologically enabled WWW. Never has been.

At least not to compare to the demand for iPads, iPhones, photos of Madonna/Lady Gaga, NoSQL databases, etc. All of those do quite well without public support.

But then there is real demand for those goods/services.

Contrast that with the Semantic Web, which started off by constructing universal and rigid (read: fragile) solutions to semantic issues that are in a constant process of dynamic evolution. Does anyone see a problem here?

Not to excuse my own stream of writing about topic maps, which posits that everyone would be better off with mappings between representatives of subjects and their relationships to other subjects. Maybe, maybe not. And maybe, even if they would be better off, they have no interest.

For example, for sufficient investment, the World Bank could enforce transparency down to the level of national banks or lower for its disbursements. That raises the question of whether anyone would accept funding without the usual and customary opportunities for graft and corruption. I suspect the answer both within and without the World Bank would be no.

A little closer to home: a topic map that maps “equivalent” terms in a foreign language to subject headings in a library catalog, composed by local members of a minority language community. Not technically difficult, although it would require web interfaces for the editing and updating. How many libraries would welcome non-librarians making LC Subject Classifications accessible to a minority language community?

Here’s a question for suffering-oriented programming: how do we discover the suffering of others, so that our programming doesn’t address only the suffering of an audience of one?

Noun choice, sex, lies, and video

Filed under: Humor,News — Patrick Durusau @ 5:13 pm

Noun choice, sex, lies, and video by Geoffrey K. Pullum.

From the post:

Three linguistic offenses in the UK to report on this week: an injudicious noun choice, a highly illegal false assertion, and an obscene racist epithet. The latter two have led to criminal charges.

Deeply amusing post on language use in the UK.

While reading it I thought of how topic maps could be used to map the language used in either national or local news reporting.

But, then I thought such a topic map would simply be adding to the debasing of the English language. I don’t think there has been an automobile accident reported within my hearing that wasn’t “tragic” if there was a fatality. There are any number of terms that can be applied to automobile accidents but “tragic” isn’t one of them.

Any more than the “on location” reporters who say with straight faces that SUVs have “jackknifed” due to ice, snow, rain, whatever. (The link has a great animation showing that reporters in those cases are simply mouthing noises that resemble meaningful speech.)

Not to say you should not create a topic map of language usage in the news, just keep it a secret so it doesn’t act as a bad influence on others.

PSEUDOMARKER: a powerful program for joint linkage…

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 5:13 pm

PSEUDOMARKER: a powerful program for joint linkage and/or linkage disequilibrium analysis on mixtures of singletons and related individuals. By Hiekkalinna T, Schäffer AA, Lambert B, Norrgrann P, Göring HH, Terwilliger JD.

Abstract:

A decade ago, there was widespread enthusiasm for the prospects of genome-wide association studies to identify common variants related to common chronic diseases using samples of unrelated individuals from populations. Although technological advancements allow us to query more than a million SNPs across the genome at low cost, a disappointingly small fraction of the genetic portion of common disease etiology has been uncovered. This has led to the hypothesis that less frequent variants might be involved, stimulating a renaissance of the traditional approach of seeking genes using multiplex families from less diverse populations. However, by using the modern genotyping and sequencing technology, we can now look not just at linkage, but jointly at linkage and linkage disequilibrium (LD) in such samples. Software methods that can look simultaneously at linkage and LD in a powerful and robust manner have been lacking. Most algorithms cannot jointly analyze datasets involving families of varying structures in a statistically or computationally efficient manner. We have implemented previously proposed statistical algorithms in a user-friendly software package, PSEUDOMARKER. This paper is an announcement of this software package. We describe the motivation behind the approach, the statistical methods, and software, and we briefly demonstrate PSEUDOMARKER’s advantages over other packages by example.

I didn’t set out to find this particular article but was trying to update references on CRI-MAP, which is now somewhat dated software for:

… rapid, largely automated construction of multilocus linkage maps (and facilitate the attendant tasks of assessing support relative to alternative locus orders, generating LOD tables, and detecting data errors). Although originally designed to handle codominant loci (e.g. RFLPs) scored on pedigrees “without missing individuals”, such as CEPH or nuclear families, it can now (with some caveats described below) be used on general pedigrees, and some disease loci.

Just as background, you may wish to see:

CRI-MAP – Introduction

And, Multilocus linkage analysis

With multilocus linkage analysis, more than two loci are simultaneously considered for linkage. When mapping a disease gene relative to a group of markers with known intermarker recombination fractions, it is possible to perform parametric (lod score) as well as nonparametric analysis.

My interest is in the use of additional information (in the lead article, “linkage and linkage disequilibrium”) for determining linkage issues.

Not that every issue of subject identification needs or should be probabilistic or richly nuanced.

In a prison there are “free men” and prisoners.

Rather sharp and useful distinction. Doesn’t require a URL. Or a subject identifier. What does your use case require?

Calculating In-Degree using R MapReduce over Hadoop

Filed under: Hadoop,MapReduce,R — Patrick Durusau @ 5:12 pm

Calculating In-Degree using R MapReduce over Hadoop

Marko Rodriguez, the source of so many neat graph resources, demonstrates how to use an R package for MapReduce over Hadoop to calculate vertex in-degree, and concludes with a question:

“Can you tell me how to calculate a graph’s degree distribution? — HINT: it’s this MapReduce job composed with another.”

Never one to allow a question to sit for very long ;-), Marko supplies the answer and plots the results using R. (Judging by the posting times, which may be horribly incorrect, Marko waited less than an hour before posting the answer. The moral here is that if Marko asks a question, answer early and often.)
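
If you would rather see the shape of the two jobs without the R/rmr machinery, here is the same logic as a plain Python sketch: the first job computes each vertex’s in-degree, the second is composed on top of it to get the degree distribution (Marko’s hint). The edge list is made up for illustration.

```python
from collections import Counter

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c"), ("d", "b")]

# job 1: map emits (dst, 1) for every edge (src, dst); reduce sums per vertex
in_degree = Counter(dst for _, dst in edges)
print(dict(in_degree))            # {'b': 2, 'c': 3}

# job 2: map emits (degree, 1) for every (vertex, degree); reduce sums per degree
degree_distribution = Counter(in_degree.values())
print(dict(degree_distribution))  # {2: 1, 3: 1}
```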

The hypernode model and its associated query language

Filed under: Graphs,Hypergraphs,Hypernodes — Patrick Durusau @ 5:12 pm

The hypernode model and its associated query language

Abstract:

A data model called the hypernode model, whose single data structure is the hypernode, is introduced. Hypernodes are graphs whose node set can contain graphs in addition to primitive nodes. Hypernodes can be used to represent arbitrarily complex objects and can support the encapsulation of information, to any level. A declarative logic-based language for the hypernode model is introduced and shown to be query complete. It is then shown that hypernodes can represent extensional functions, nested relations, and composite objects. Thus, the model is at least as expressive as the functional and nested relational database models. It is demonstrated that the hypernode model can be regarded as an object-oriented one.

An interesting departure from hypergraphs with hyperedges, the latter being replaced in this model by hypernodes. A hypernode consists of a unique label; nodes, which may be primitive nodes or hypernodes; and edges between those nodes.

The authors went on to create an implementation and a storage model for the hypernode model.
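
A minimal sketch of the data structure as the abstract describes it (my toy rendering, not the paper’s storage model): a uniquely labelled graph whose nodes are either primitive values or other hypernodes, plus edges between those nodes.

```python
class Hypernode:
    def __init__(self, label):
        self.label = label    # unique label
        self.nodes = set()    # primitive values or Hypernode instances
        self.edges = set()    # pairs of nodes

    def add_edge(self, a, b):
        self.nodes.update((a, b))
        self.edges.add((a, b))

person = Hypernode("person#1")
person.add_edge("name", "Ada")

org = Hypernode("org#1")
org.add_edge("member", person)    # a node that is itself a hypernode
print(org.label,
      [n.label if isinstance(n, Hypernode) else n for n in org.nodes])
```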

Extending Data Beyond the Database – The Notion of “State”

Filed under: Data,Database — Patrick Durusau @ 5:12 pm

Extending Data Beyond the Database – The Notion of “State” by David Loshin

From the post:

In my last post, I essentially suggested that there is a difference between merging two static data sets and merging static data sets with dynamic ones. It is worth providing a more concrete example to demonstrate what I really mean by this idea: let’s say you had a single analytical database containing customer profile information (we’ll call this data set “Profiles”), but at the same time had access to a stream of web page transactions performed by individuals identified as customers (we can refer to this one as “WebStream”).

The challenge is that the WebStream data set may contain information with different degrees of believability. If an event can be verified as the result of a sequence of web transactions within a limited time frame, the resulting data should lead to an update of the Profiles data set. On the other hand, if the sequence does not take place, or takes place over an extended time frame, there is not enough “support” for the update and therefore the potential modification is dropped. For example, if a visitor places a set of items into a shopping cart and completes a purchase, the customer’s preferences are updated based on the items selected and purchased. But if the cart is abandoned and not picked up within 2 hours, the customer’s preferences may not be updated.

Because the update is conditional on a number of different variables, the system must hold onto some data until it can be determined whether the preferences should be updated or not. We can refer to this as maintaining some temporary state that either resolves into a modification to the Profiles data set or is thrown out after 2 hours.

Are your data sets static or dynamic? And if dynamic, how do you delay merging until some other criteria is met?
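
One minimal way to hold that temporary state, using the shopping-cart rule from the example above: cart events wait in a pending buffer, a completed purchase within the window updates the profile, and anything older than two hours is thrown out. The window and field names come from the quoted scenario; the rest is illustrative.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=2)
profiles = {"cust-1": {"preferences": set()}}
pending = {}   # customer id -> (items, time the cart was filled)

def cart_filled(customer, items, now):
    pending[customer] = (items, now)

def purchase_completed(customer, now):
    items, started = pending.pop(customer, (None, None))
    if items and now - started <= WINDOW:
        profiles[customer]["preferences"].update(items)   # state resolves into an update

def expire(now):
    for customer in [c for c, (_, t) in pending.items() if now - t > WINDOW]:
        del pending[customer]                              # abandoned: throw it out

t0 = datetime(2012, 2, 8, 12, 0)
cart_filled("cust-1", {"book"}, t0)
purchase_completed("cust-1", t0 + timedelta(minutes=30))
print(profiles["cust-1"])   # preferences now include "book"
```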

The first article David refers to is: Data Quality and State.

It is interesting that as soon as we step away from static files and data, the world explodes in complexity. Add to that dynamic notions of identity and recognition, and complexity seems like an inadequate term for what we face.

Be mindful those are just slices of what people automatically process all day long. Fix your requirements and build to spec. Leave the “real world” to wetware.

We Need To Talk About Binary Search

Filed under: Binary Search,Search Algorithms — Patrick Durusau @ 5:12 pm

We Need To Talk About Binary Search.

From the post:

Although the basic idea of binary search is comparatively straightforward, the details can be surprisingly tricky.

— Donald Knuth

Why is binary search so damn hard to get right? Why is it that 90% of programmers are unable to code up a binary search on the spot, even though it’s easily the most intuitive of the standard algorithms?

  • Firstly, binary search has a lot of potential for off-by-one errors. Do you do inclusive bounds or exclusive bounds? What’s your break condition: lo=hi+1, lo=hi, or lo=hi-1? Is the midpoint (lo+hi)/2 or (lo+hi)/2 - 1 or (lo+hi)/2 + 1? And what about the comparison, < or <=? Certain combinations of these work, but it’s easy to pick one that doesn’t.
  • Secondly, there are actually two variants of binary search: a lower-bound search and an upper-bound search. Bugs are often caused by a careless programmer accidentally applying a lower-bound search when an upper-bound search was required, or vice versa.
  • Finally, binary search is very easy to underestimate and very hard to debug. You’ll get it working on one case, but when you increase the array size by 1 it’ll stop working; you’ll then fix it for this case, but now it won’t work in the original case!

I want to generalise and nail down the binary search, with the goal of introducing a shift in the way the you perceive it. By the end of this post you should be able to code any variant of binary search without hesitation and with complete confidence. But first, back to the start: here is the binary search you were probably taught…

In order to return a result to a user or for use in some other process, you have to find it first. This post may help you do just that, reliably.
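
For the record, here is a sketch of the two variants the post distinguishes, written with half-open bounds so the off-by-one choices are explicit: lower_bound returns the first position at which x could be inserted, upper_bound the last (the same contracts as Python’s bisect_left and bisect_right).

```python
def lower_bound(a, x):
    """First index i such that a[i] >= x, with a sorted ascending."""
    lo, hi = 0, len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        if a[mid] < x:
            lo = mid + 1
        else:
            hi = mid
    return lo

def upper_bound(a, x):
    """First index i such that a[i] > x, with a sorted ascending."""
    lo, hi = 0, len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        if a[mid] <= x:
            lo = mid + 1
        else:
            hi = mid
    return lo

a = [1, 2, 2, 2, 5]
print(lower_bound(a, 2), upper_bound(a, 2))   # 1 4 -- the run of 2s is a[1:4]
```

The only differences between the two are the comparison (< versus <=) and how you read the result, which is exactly where the bugs the post describes creep in.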

Amazon S3 Price Reduction

Filed under: Amazon Web Services AWS — Patrick Durusau @ 5:11 pm

Amazon S3 Price Reduction

In case you want to gather up your email archives before turning them into a topic map, Amazon S3 prices have dropped:

As you can tell from my recent post on Amazon S3 Growth for 2011, our customers are uploading new objects to Amazon S3 at an incredible rate. We continue to innovate on your behalf to drive down storage costs and pass along the resultant savings to you at every possible opportunity. We are now happy (some would even say excited) to announce another in a series of price reductions.

With this price change, all Amazon S3 standard storage customers will see a significant reduction in their storage costs. For instance, if you store 50 TB of data on average you will see a 12% reduction in your storage costs, and if you store 500 TB of data on average you will see a 13.5% reduction in your storage costs.

I must confess disappointment that there was no change in the “Next 4000 TB” rate but I suppose I can keep some of my email archives locally. 😉

Other cloud storage options/rates?
