Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

July 4, 2012

Designing Search (part 5): Results pages

Filed under: Interface Research/Design,Search Interface,Searching — Patrick Durusau @ 4:43 pm

Designing Search (part 5): Results pages by Tony Russell-Rose.

From the post:

In the previous post, we looked at the ways in which a response to an information need can be articulated, focusing on the various forms that individual search results can take. Each separate result represents a match for our query, and as such, has the potential to fulfil our information needs. But as we saw earlier, information seeking is a dynamic, iterative activity, for which there is often no single right answer.

A more informed approach therefore is to consider search results not as competing alternatives, but as an aggregate response to an information need. In this context, the value lies not so much with the individual results but on the properties and possibilities that emerge when we consider them in their collective form. In this section we examine the most universal form of aggregation: the search results page.

As usual, Tony illustrates each of his principles with examples drawn from actual webpages. Makes a very nice checklist to use when constructing a results page. Concludes with references and links to all the prior posts in this series.

Unless you are a UI expert, defaulting to Tony’s advice is not a bad plan. It may not be a bad plan even if you are.

Mule School: Introducing Mule 3.3 in Studio

Filed under: Mule — Patrick Durusau @ 4:31 pm

Mule School: Introducing Mule 3.3 in Studio (Error Handling, Caching, Iterative Processing and Expressions) by Nial Darbey.

From the post:

Today I am going to introduce you to some powerful new features in Mule 3.3:

  • Improved Error Handling: New exception strategy patterns fully integrated in Studio. Included are Try/Catch, Rollback processing and Conditional Exception processing in Choice routers based on exception type, or just about any criteria!
  • Iterative Processing using Foreach Scope: Allows for iterative loop type processing while maintaining the original message context
  • Mule Expression Language: A new, unified expression language to improve consistency and ease of use when validating, filtering, routing, or transforming messages
  • Cache: Improve performance with “In memory” caching of messages, such as the results of service calls
  • Graphical Data Mapper: A new graphical tool to easily transform from one data format to another data format while also mapping specific fields within the message structure. Formats supported include XML, JSON, CSV, POJOs, collections of POJOs, and EXCEL

We will do so by walking you through the development of a Mule Application which exploits each of these new features. As we work our way through the development process we will describe what we want to achieve and explain how to achieve that in Mule Studio. For the sake of simplicity we will refer you to our online documentation when we touch on features already available in previous releases.

I need to install the most recent release (Mule 3.3) but just walking through the post I stumbled at:

Graphical Data Mapper

….

Here we need to transform our OrderItem instance into an OrderRequest as expected by Samsung’s webservice.

The image that follows shows a mapping from:

name: string -> name: string

quantity: integer -> quantity: integer

But I didn’t see the OrderItem or OrderRequest objects named anywhere in the mapping.

Not to mention the mapping seems pretty obvious.

More useful documentation would record who created the mapping and when. Even more useful would be the ability to capture what warranted the mapping from one to the other, not just the fact of the mapping.
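
To make that concrete, here is a minimal sketch, in plain Java rather than Mule’s own DataMapper format, of the kind of mapping record I have in mind. The class and field names are mine, not Mule’s:

```java
import java.util.Date;

// Hypothetical sketch: a field-to-field mapping that records not just the
// fact of the mapping but who asserted it, when, and on what basis.
public class FieldMapping {
    private final String sourceField;   // e.g. "OrderItem.name"
    private final String targetField;   // e.g. "OrderRequest.name"
    private final String createdBy;     // who created the mapping
    private final Date createdOn;       // when it was created
    private final String justification; // what warranted treating the two as the same

    public FieldMapping(String sourceField, String targetField,
                        String createdBy, Date createdOn, String justification) {
        this.sourceField = sourceField;
        this.targetField = targetField;
        this.createdBy = createdBy;
        this.createdOn = createdOn;
        this.justification = justification;
    }

    @Override
    public String toString() {
        return sourceField + " -> " + targetField
                + " [by " + createdBy + " on " + createdOn + ": " + justification + "]";
    }
}
```

A mapping that carries this kind of provenance can be audited, questioned and reused; the bare arrow in a mapping screenshot cannot.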

Like I said, I will have to install the latest version and walk through the examples, so expect further posts on this release.

Your comments and suggestions, as always, are welcome.

The Higgs Boson explained by PhD Comics

Filed under: Graphics,Visualization — Patrick Durusau @ 4:04 pm

The Higgs Boson explained by PhD Comics from Nathan Yau.

There are times when an explanation is so clever that it bears repetition, even if only marginally relevant. 😉

I wish I could develop explanations/promotions as clear as this one for topic maps.

You may enjoy other material from PhD Comics.

Batch Importer – Part 3 [Neo4j]

Filed under: Indexing,Neo4j — Patrick Durusau @ 3:29 pm

Batch Importer – Part 3 [Neo4j] by Max De Marzi.

From the post:

At the end of February, we took a look at Michael Hunger’s Batch Importer. It is a great tool to load millions of nodes and relationships into Neo4j quickly. The only thing it was missing was Indexing… I say was, because I just submitted a pull request to add this feature. Let’s go through how it was done so you get an idea of what the Neo4j Batch Import API looks like, and in the next blog post I’ll show you how to generate data to take advantage of it.

Another awesome post on Neo4j from Max De Marzi.

Definitely a series to follow.

In case you don’t have the links handy:

Batch Importer – Part 2

Batch Importer – Part 1
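
Until I install the release and walk through Max’s code, here is a rough sketch of what batch insertion plus indexing looks like through Neo4j’s Java batch API. The class and package names are from the 1.x line as I recall them; treat the exact signatures as assumptions to check against the current docs:

```java
import java.util.Map;

import org.neo4j.helpers.collection.MapUtil;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserterIndex;
import org.neo4j.unsafe.batchinsert.BatchInserterIndexProvider;
import org.neo4j.unsafe.batchinsert.BatchInserters;
import org.neo4j.unsafe.batchinsert.LuceneBatchInserterIndexProvider;

public class BatchImportSketch {
    public static void main(String[] args) {
        // Non-transactional bulk loading straight into the store directory.
        BatchInserter inserter = BatchInserters.inserter("target/batch-db");
        BatchInserterIndexProvider indexes = new LuceneBatchInserterIndexProvider(inserter);
        BatchInserterIndex users = indexes.nodeIndex("users", MapUtil.stringMap("type", "exact"));

        // Create a node and add the same properties to the index.
        Map<String, Object> props = MapUtil.map("name", "Max");
        long nodeId = inserter.createNode(props);
        users.add(nodeId, props);

        users.flush();       // make pending index entries visible
        indexes.shutdown();  // shut down the index provider first...
        inserter.shutdown(); // ...then the inserter, or the store can be left inconsistent
    }
}
```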

July 3, 2012

Awesome website for #rstats Mining Twitter using R

Filed under: Data Mining,Graphics,R,Tweets,Visualization — Patrick Durusau @ 7:33 pm

Awesome website for #rstats Mining Twitter using R by Ajay Ohri

From the post:

Just came across this very awesome website.

Did you know there were six kinds of wordclouds in R.

(giggles like a little boy)

https://sites.google.com/site/miningtwitter/questions/talking-about

No, I can honestly say I was unaware “…there were six kinds of wordclouds in R.” 😉

Still, it might be a useful thing to know at some point in the future.

Groves: The Past Is Close Behind

Filed under: Uncategorized — Patrick Durusau @ 6:15 pm

I was innocently looking for something else when I encountered:

In HyTime ISO/IEC 10744:1997 “3. Definitions (3.35)”: graph representation of property values is ‘An abstract data structure consisting of a directed graph of nodes in which each node may be connected to other nodes by labeled arcs.’ (http://xml.coverpages.org/groves.html)

That sounds like a data structure that a property graph can represent quite easily.

Does it sound that way to you?
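
For what it’s worth, here is a minimal sketch of that structure with Neo4j’s embedded Java API (1.x-era calls): nodes carrying property values, connected by labeled arcs. Nothing grove-specific, just the shape of the data structure:

```java
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.Transaction;
import org.neo4j.kernel.EmbeddedGraphDatabase;

public class GroveSketch {
    public static void main(String[] args) {
        GraphDatabaseService db = new EmbeddedGraphDatabase("target/grove-db");
        Transaction tx = db.beginTx();
        try {
            // Two nodes, each holding property values...
            Node element = db.createNode();
            element.setProperty("gi", "title");
            Node content = db.createNode();
            content.setProperty("text", "Lila");

            // ...connected by a labeled arc, which may itself carry properties.
            Relationship arc = element.createRelationshipTo(
                    content, DynamicRelationshipType.withName("HAS_CONTENT"));
            arc.setProperty("position", 1);

            tx.success();
        } finally {
            tx.finish();
        }
        db.shutdown();
    }
}
```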

Lucene 4.0.0 alpha, at long last!

Filed under: Lucene,Solr — Patrick Durusau @ 5:33 pm

Lucene 4.0.0 alpha, at long last! by Mike McCandless.

Grabbing enough of the post to make you crazy until you read it in full (there’s lots more):

The 4.0.0 alpha release of Lucene and Solr is finally out!

This is a major release with lots of great changes. Here I briefly describe the most important Lucene changes, but first the basics:

  • All deprecated APIs as of 3.6.0 have been removed.
  • Pre-3.0 indices are no longer supported.
  • MIGRATE.txt describes how to update your application code.
  • The index format won’t change (unless a serious bug fix requires it) between this release and 4.0 GA, but APIs may still change before 4.0.0 beta.

Please try the release and report back!

Pluggable Codec

The biggest change is the new pluggable Codec architecture, which provides full control over how all elements (terms, postings, stored fields, term vectors, deleted documents, segment infos, field infos) of the index are written. You can create your own or use one of the provided codecs, and you can customize the postings format on a per-field basis.

There are some fun core codecs:

  • Lucene40 is the default codec.
  • Lucene3x (read-only) reads any index written with Lucene 3.x.
  • SimpleText stores everything in plain text files (great for learning and debugging, but awful for production!).
  • MemoryPostingsFormat stores all postings (terms, documents, positions, offsets) in RAM as a fast and compact FST, useful for fields with limited postings (primary key (id) field, date field, etc.)
  • PulsingPostingsFormat inlines postings for low-frequency terms directly into the terms dictionary, saving a disk seek on lookup.
  • AppendingCodec avoids seeking while writing, necessary for file-systems such as Hadoop DFS.

If you create your own Codec it’s easy to confirm all of Lucene/Solr’s tests pass with it. If tests fail then likely your Codec has a bug!

A new 4-dimensional postings API (to read fields, terms, documents, positions) replaces the previous postings API.

….

A good thing that tomorrow is a holiday in the U.S. 😉
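
The per-field codec hook is the part I want to try first. Here is a hedged sketch against the 4.0.0-ALPHA API as I read the release notes (verify the class names against MIGRATE.txt and the javadocs before relying on them): keep the default Lucene40 codec, but store the postings of a primary-key style “id” field entirely in RAM.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene40.Lucene40Codec;
import org.apache.lucene.codecs.memory.MemoryPostingsFormat;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.Version;

public class PerFieldCodecSketch {
    public static void main(String[] args) {
        // Default codec for everything except the "id" field, whose postings
        // go into the RAM-resident FST-based format.
        Codec codec = new Lucene40Codec() {
            @Override
            public PostingsFormat getPostingsFormatForField(String field) {
                if ("id".equals(field)) {
                    return new MemoryPostingsFormat();
                }
                return super.getPostingsFormatForField(field);
            }
        };

        IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
        config.setCodec(codec);
        // ...open an IndexWriter with this config and index as usual.
    }
}
```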

Apache Flume Development Status Update

Filed under: Flume,Hadoop — Patrick Durusau @ 4:51 pm

Apache Flume Development Status Update by Hari Shreedharan.

From the post:

Apache Flume is a scalable, reliable, fault-tolerant, distributed system designed to collect, transfer, and store massive amounts of event data into HDFS. Apache Flume recently graduated from the Apache Incubator as a Top Level Project at Apache. Flume is designed to send data over multiple hops from the initial source(s) to the final destination(s). Click here for details of the basic architecture of Flume. In this article, we will discuss in detail some new components in Flume 1.x (also known as Flume NG), which is currently on the trunk branch, techniques and components that can be be used to route the data, configuration validation, and finally support for serializing events.

In the past several months, contributors have been busy adding several new sources, sinks and channels to Flume. Flume now supports Syslog as a source, where sources have been added to support Syslog over TCP and UDP.

Flume now has a high performance, persistent channel – the File Channel. This means that if the agent fails for any reason before events committed by the source have been removed and the transaction committed by the sink, the events will be reloaded from disk and can be taken when the agent starts up again. The events will only be removed from the channel when the transaction is committed by the sink. The File channel uses a Write Ahead Log to save events.

Among the other features that have been added to Flume is the ability to modify events “in flight.”

I would not construe “event” too narrowly.

Emails, tweets, arrivals, departures, temperatures, wind direction, speed, etc., can all be viewed as one or more “events.”

The merging and other implications of one or more event modifiers will be the subject of a future post.
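
For concreteness, here is a minimal, untested Flume NG agent configuration wiring a syslog-over-TCP source through the new File Channel into HDFS. The property names are from the 1.x user guide; the paths are placeholders:

```properties
# One agent: syslog TCP source -> durable File Channel -> HDFS sink
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

a1.sources.r1.type = syslogtcp
a1.sources.r1.host = 0.0.0.0
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1

# The File Channel persists events through a write-ahead log, so events
# committed by the source survive an agent restart until the sink's
# transaction commits and removes them.
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events
```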

FreeMind

Filed under: FreeMind,Graphs,Neo4j — Patrick Durusau @ 4:30 pm

FreeMind

From the webpage:

FreeMind is a premier free mind-mapping software written in Java. The recent development has hopefully turned it into high productivity tool. We are proud that the operation and navigation of FreeMind is faster than that of MindManager because of one-click "fold / unfold" and "follow link" operations.

So you want to write a completely new metaphysics? Why don’t you use FreeMind? You have a tool at hand that remarkably resembles the tray slips of Robert Pirsig, described in his sequel to Zen and the Art of Motorcycle Maintenance called Lila. Do you want to refactor your essays in a similar way you would refactor software? Or do you want to keep personal knowledge base, which is easy to manage? Why don’t you try FreeMind? Do you want to prioritize, know where you are, where you’ve been and where you are heading, as Stephen Covey would advise you? Have you tried FreeMind to keep track of all the things that are needed for that?

While looking at the export options I remembered (from Neo4j):

If you can sketch, you can use a graph database.

So, shouldn’t I be able to export a FreeMind map into a graph format for Neo4j?

An easy introduction to graph databases and FreeMind as well.

A win-win situation.
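
A first cut is not hard to sketch. FreeMind’s .mm files are plain XML, with nested node elements carrying a TEXT attribute, so a small walker can turn a map into an edge list that any graph loader (Neo4j’s batch importer included) can consume. Rough, untested Java; rich-text nodes and attributes are ignored:

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Walk a FreeMind .mm file and print one "parent TAB child" line per arc.
public class FreeMindToEdges {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new File(args[0]));
        // The first <node> under <map> is the root of the mind map.
        Element root = (Element) doc.getDocumentElement()
                .getElementsByTagName("node").item(0);
        walk(root);
    }

    static void walk(Element parent) {
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i) instanceof Element
                    && "node".equals(children.item(i).getNodeName())) {
                Element child = (Element) children.item(i);
                System.out.println(parent.getAttribute("TEXT")
                        + "\t" + child.getAttribute("TEXT"));
                walk(child);
            }
        }
    }
}
```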

The Science Network of Medical Data Mining

Filed under: Biomedical,Data Mining — Patrick Durusau @ 4:14 pm

The Science Network of Medical Data Mining

From the description of Unit 1:

Bar-Ilan University & The Chaim Sheba Medical Center – The Biomedical Informatics Program – The Science Network of Medical Data Mining

Course 80-665 – Medical Data Mining Spring, 2012

Lecturer: Dr. Ronen Tal-Botzer

Lectures as of today:

  • Unit 01 – Introduction & Scientific Background
  • Unit 02 – From Data to Information to Knowledge
  • Unit 03 – From Knowledge to Wisdom to Decision
  • Unit 04 – The Electronic Medical Record
  • Unit 05 – Artificial Intelligence in Medicine – Part A
  • Unit 06 – Science Network A: System Requirement Description

An enthusiastic lecturer, which counts for a lot!

Presenting medical information as intertwined with data mining seems like a sound approach to me. Assuming students are grounded in medical information (or some other field), adding data mining is an extension of the familiar.

Three Steps to Heaven: Semantic Publishing in a Real World Workflow

Filed under: Publishing,Semantics — Patrick Durusau @ 2:27 pm

Three Steps to Heaven: Semantic Publishing in a Real World Workflow by Phillip Lord, Simon Cockell, and Robert Stevens.

Abstract:

Semantic publishing offers the promise of computable papers, enriched visualisation and a realisation of the linked data ideal. In reality, however, the publication process contrives to prevent richer semantics while culminating in a `lumpen’ PDF. In this paper, we discuss a web-first approach to publication, and describe a three-tiered approach which integrates with the existing authoring tooling. Critically, although it adds limited semantics, it does provide value to all the participants in the process: the author, the reader and the machine.

With a touch of irony and gloom the authors write:

… There are significant barriers to the acceptance of semantic publishing as a standard mechanism for academic publishing. The web was invented around 1990 as a light-weight mechanism for publication of documents. It has subsequently had a massive impact on society in general. It has, however, barely touched most scientific publishing; while most journals have a website, the publication process still revolves around the generation of papers, moving from Microsoft Word or LaTeX [5], through to a final PDF which looks, feels and is something designed to be printed onto paper⁴. Adding semantics into this environment is difficult or impossible; the content of the PDF has to be exposed and semantic content retrofitted or, in all likelihood, a complex process of author and publisher interaction has to be devised and followed. If semantic data publishing and semantic publishing of academic narratives are to work together, then academic publishing needs to change.

4. This includes conferences dedicated to the web and the use of web technologies.

One could add “…includes papers about changing the publishing process” but I digress.

I don’t disagree that adding semantics to the current system has proved problematic.

I do disagree that changing the current system, which is deeply embedded in research, publishing and social practices is likely to succeed.

At least if success is defined as a general solution to adding semantics to scientific research and publishing in general. Such projects may be successful in creating new methods of publishing scientific research but that just expands the variety of methods we must account for.

That doesn’t have a “solution like” feel to me. You?

Mapping Research With WikiMaps

Filed under: Mapping,Maps,WikiMaps,Wikipedia — Patrick Durusau @ 5:12 am

Mapping Research With WikiMaps

From the post:

An international research team has developed a dynamic tool that allows you to see a map of what is “important” on Wikipedia and the connections between different entries. The tool, which is currently in the “alpha” phase of development, displays classic musicians, bands, people born in the 1980s, and selected celebrities, including Lady Gaga, Barack Obama, and Justin Bieber. A slider control, or play button, lets you move through time to see how a particular topic or group has evolved over the last 3 or 4 years. The desktop version allows you to select any article or topic.

Wikimaps builds on the fact that Wikipedia contains a vast amount of high-quality information, despite the very occasional spot of vandalism and the rare instances of deliberate disinformation or inadvertent misinformation. It also carries with each article meta data about the page’s authors and the detailed information about every single contribution, edit, update and change. This, Reto Kleeb, of the MIT Center for Collective Intelligence, and colleagues say, “…opens new opportunities to investigate the processes that lie behind the creation of the content as well as the relations between knowledge domains.” They suggest that because Wikipedia has such a great amount of underlying information in the metadata it is possible to create a dynamic picture of the evolution of a page, topic or collection of connections.

See the demo version: http://www.ickn.org/wikimaps/.

For some very cutting edge thinking, see: Intelligent Collaborative Knowledge Networks (MIT) which has a download link to “Condor,” a local version of the wikimaps software.

Wikimaps builds upon a premise similar to the original premise of the WWW. Links break, deal with it. Hypertext systems prior to the WWW had tremendous overhead to make sure links remained viable. So much overhead that none of them could scale. The WWW allowed links to break and to be easily created. That scales. (The failure of the Semantic Web can be traced to the requirement that links not fail. Just the opposite of what made the WWW workable.)

Wikimaps builds upon the premise that the facts we have may be incomplete, incorrect, partial or even contradictory. All things that most semantic systems posit as verboten. An odd requirement, since our information is always incomplete, incorrect (possibly), partial or even contradictory. We have set requirements for our information systems that we can’t meet working by hand. Not surprising that our systems fail and fail to scale.

How much information failure can you tolerate?

A question that should be asked of every information system at the design stage. If the answer is none, move onto a project with some chance of success.

I was surprised at the journal reference, not one I would usually scan. Recent origin, expensive, not in library collections I access.

Journal reference:

Reto Kleeb et al. Wikimaps: dynamic maps of knowledge. Int. J. Organisational Design and Engineering, 2012, 2, 204-224

Abstract:

We introduce Wikimaps, a tool to create a dynamic map of knowledge from Wikipedia contents. Wikimaps visualise the evolution of links over time between articles in different subject areas. This visualisation allows users to learn about the context a subject is embedded in, and offers them the opportunity to explore related topics that might not have been obvious. Watching a Wikimap movie permits users to observe the evolution of a topic over time. We also introduce two static variants of Wikimaps that focus on particular aspects of Wikipedia: latest news and people pages. ‘Who-works-with-whom-on-Wikipedia’ (W5) links between two articles are constructed if the same editor has worked on both articles. W5 links are an excellent way to create maps of the most recent news. PeopleMaps only include links between Wikipedia pages about ‘living people’. PeopleMaps in different-language Wikipedias illustrate the difference in emphasis on politics, entertainment, arts and sports in different cultures.

Just in case you are interested: International Journal of Organisational Design and Engineering, Editor in Chief: Prof. Rodrigo Magalhaes, ISSN online: 1758-9800, ISSN print: 1758-9797.

July 2, 2012

A big list of the things R can do

Filed under: R — Patrick Durusau @ 6:54 pm

A big list of the things R can do by David Smith.

From the post:

R is an incredibly comprehensive statistics package. Even if you just look at the standard R distribution (the base and recommended packages), R can do pretty much everything you need for data manipulation, visualization, and statistical analysis. And for everything else, there’s more than 5000 packages on CRAN and other repositories, and the big-data capabilities of Revolution R Enterprise.

As a result, trying to make a list of everything R can do is a difficult task. But we’ve made an effort in this list of R Language Features, a new section on the Revolution Analytics website. It’s broken up into four main sections (analytics, graphics and visualization, R applications and extensions, and programming language features), each with their own subsections:

As much for my benefit as yours. If I don’t write it down, I will remember it but not where I saw it. 😉

Topological Data Analysis

Filed under: Topological Data Analysis — Patrick Durusau @ 6:46 pm

Topological Data Analysis by Larry Wasserman.

From the post:

Topological data analysis (TDA) is a relatively new area of research that spans many disciplines including topology (in particular, homology), statistics, machine learning and computational geometry.

The basic idea of TDA is to describe the “shape of the data” by finding clusters, holes, tunnels, etc. Cluster analysis is a special case of TDA. I’m not an expert on TDA but I do find it fascinating. I’ll try to give a flavor of what this subject is about.

Just in case you want to get in on the ground floor of a new area of research.

Larry has citations to the literature in case you need to pick up beach reading.

Update on Apache Bigtop (incubating)

Filed under: Bigtop,Cloudera,Hadoop — Patrick Durusau @ 6:32 pm

Update on Apache Bigtop (incubating) by Charles Zedlewski.

If you are curious about Apache Bigtop or how Cloudera manages to distribute stable distributions of the Hadoop ecosystem, this is the post for you.

Just to whet your appetite:

From the post:

Ever since Cloudera decided to contribute the code and resources for what would later become Apache Bigtop (incubating), we’ve been answering a very basic question: what exactly is Bigtop and why should you or anyone in the Apache (or Hadoop) community care? The earliest and the most succinct answer (the one used for the Apache Incubator proposal) simply stated that “Bigtop is a project for the development of packaging and tests of the Hadoop ecosystem”. That was a nice explanation of how Bigtop relates to the rest of the Apache Software Foundation’s (ASF) Hadoop ecosystem projects, yet it doesn’t really help you understand the aspirations of Bigtop.

Building and supporting CDH taught us a great deal about what was required to be able to repeatedly assemble a truly integrated, Apache Hadoop based data management system. The build, testing and packaging cost was considerable, and we regularly observed that different projects made different design choices that made ongoing integration difficult. We also realized that more and more mission critical workload was running on CDH and the customer demand for stability, predictability and compatibility was increasing.

Apache Bigtop was part of our answer to solve these two different problems: initiate an Apache open source project that focused on creating the testing and integration infrastructure of an Apache-Hadoop based distribution. With it we hoped that:

  1. We could better collaborate within the extended Apache community to contribute to resolving test, integration & compatibility issues across projects
  2. We could create a kind of developer-focused distribution that would be able to release frequently, unencumbered by the enterprise expectations for long-term stability and compatibility.

See the post for details.

PS: The project is picking up speed and looking for developers/contributors.

Visualizing Streaming Text Data with Dynamic Maps

Filed under: Dynamic Mapping,Stream Analytics,Visualization — Patrick Durusau @ 6:17 pm

Visualizing Streaming Text Data with Dynamic Maps by Emden Gansner, Yifan Hu, and Stephen North.

Abstract:

The many endless rivers of text now available present a serious challenge in the task of gleaning, analyzing and discovering useful information. In this paper, we describe a methodology for visualizing text streams in real time. The approach automatically groups similar messages into “countries,” with keyword summaries, using semantic analysis, graph clustering and map generation techniques. It handles the need for visual stability across time by dynamic graph layout and Procrustes projection techniques, enhanced with a novel stable component packing algorithm. The result provides a continuous, succinct view of evolving topics of interest. It can be used in passive mode for overviews and situational awareness, or as an interactive data exploration tool. To make these ideas concrete, we describe their application to an online service called TwitterScope.

Or, see: TwitterScope, at http://bit.ly/HA6KIR.

Worth the visit to see the static pics in the paper in action.

Definitely a tool with a future in data exploration.

I know “Procrustes” from the classics so had to look up Procrustes transformation. Which was reported to mean:

A Procrustes transformation is a geometric transformation that involves only translation, rotation, uniform scaling, or a combination of these transformations. Hence, it may change the size, but not the shape of a geometric object.

Sounds like abuse of “Procrustes” because I would think having my limbs cut off would change my shape. 😉

Intrigued by the notion of not changing “…the shape of a geometric object.”
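
Put as a formula (my gloss, not the source’s): a Procrustes transformation sends a configuration of points $X$ to

$$X \mapsto s\,XR + \mathbf{1}\,t^{\top},$$

where $R$ is an orthogonal rotation matrix ($R^{\top}R = I$), $s > 0$ is a uniform scale and $t$ is a translation vector. Every distance is multiplied by the same $s$ and every angle is preserved, which is why size can change but shape cannot.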

Could we say that adding identifications to a subject representative does not change the subject it identifies?

igraph and structured text exploration

Filed under: Graphics,igraph,R,Visualization — Patrick Durusau @ 5:59 pm

igraph and structured text exploration

From the post:

I am in the slow process of developing a package to bridge structured text formats (i.e. classroom transcripts) with the tons of great R packages that visualize and analyze quantitative data (If you care to play with a rough build of this package (qdap) see: https://github.com/trinker/qdap). One of the packages qdap will bridge to is igraph.

A while back I came across a blog post on igraph and word statistics (LINK). It inspired me to learn a little bit about graphing and the igraph package and provided a nice intro to learn. As I play with this terrific package I feel it is my duty to share my experiences with others who are just starting out with igraph as well. The following post is a script and the plots created with a word frequency matrix (similar to a term document matrix from the tm package) and igraph:

A very nice introduction to the use of igraph for exploring texts.

One Culture. Computationally Intensive Research in the Humanities and Social Sciences…

Filed under: Humanities,Social Sciences — Patrick Durusau @ 5:45 pm

One Culture. Computationally Intensive Research in the Humanities and Social Sciences, A Report on the Experiences of First Respondents to the Digging Into Data Challenge by Christa Williford and Charles Henry. Research Design by Amy Friedlander.

From the webpage:

This report culminates two years of work by CLIR staff involving extensive interviews and site visits with scholars engaged in international research collaborations involving computational analysis of large data corpora. These scholars were the first recipients of grants through the Digging into Data program, led by the NEH, who partnered with JISC in the UK, SSHRC in Canada, and the NSF to fund the first eight initiatives. The report introduces the eight projects and discusses the importance of these cases as models for the future of research in the academy. Additional information about the projects is provided in the individual case studies below (this additional material is not included in the print or PDF versions of the published report).

Main Report Online

or

PDF file.

Case Studies:

Humanists played an important role in the development of digital computers. That role has diminished over time to the disadvantage of both humanists and computer scientists. Perhaps efforts such as this one will rekindle what was once a rich relationship.

Readersourcing—a manifesto

Filed under: Crowd Sourcing,Publishing,Reviews — Patrick Durusau @ 5:24 pm

Readersourcing—a manifesto by Stefano Mizzaro. (Mizzaro, S. (2012), Readersourcing—a manifesto. J. Am. Soc. Inf. Sci.. doi: 10.1002/asi.22668)

Abstract:

This position paper analyzes the current situation in scholarly publishing and peer review practices and presents three theses: (a) we are going to run out of peer reviewers; (b) it is possible to replace referees with readers, an approach that I have named “Readersourcing”; and (c) it is possible to avoid potential weaknesses in the Readersourcing model by adopting an appropriate quality control mechanism. The readersourcing.org system is then presented as an independent, third-party, nonprofit, and academic/scientific endeavor aimed at quality rating of scholarly literature and scholars, and some possible criticisms are discussed.

Mizzaro touches a number of issues that have speculative answers in his call for “readersourcing” of research. There is a website in progress, www.readersourcing.org.

I am interested in the approach as an aspect of crowdsourcing the creation of topic maps.

FYI, his statement that:

Readersourcing is a solution to a problem, but it immediately raises another problem, for which we need a solution: how to distinguish good readers from bad readers. If 200 undergraduate students say that a paper is good, but five experts (by reputation) in the field say that it is not, then it seems obvious that the latter should be given more importance when calculating the paper’s quality.

Seems problematic to me. Particularly for graduate students. If professors at their school rate research high or low, that should be calculated into a rating for that particular reader.

If that seems pessimistic, read: Fish, Stanley, “Transmuting the Lump: Paradise Lost, 1942-1979,” in Doing What Comes Naturally (Fish, Stanley (ed.), Duke University Press, 1989), which treats changing “expert” opinions on the closing chapters of Paradise Lost. So far as I know, the text did not change between 1942 and 1979 but “expert” opinion certainly did.

I offer that as a caution that all of our judgements are a matter of social consensus that changes over time. On some issues more quickly than others. Our information systems should reflect the ebb and flow of that semantic renegotiation.
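
In formula terms (my reading, not Mizzaro’s notation), the quality score he seems to have in mind is a reputation-weighted mean of reader ratings:

$$\hat{q}(p) = \frac{\sum_{i} w_i \, r_i(p)}{\sum_{i} w_i},$$

where $r_i(p)$ is reader $i$’s rating of paper $p$ and $w_i$ is the weight attached to that reader. Everything interesting, including the Fish point above, is hidden in how the $w_i$ are assigned and how they are allowed to drift over time.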

The strange case of eugenics:…

Filed under: Classification,Collocative Integrity,Dewey - DDC,Ontogeny — Patrick Durusau @ 4:01 pm

The strange case of eugenics: A subject’s ontogeny in a long-lived classification scheme and the question of collocative integrity by Joseph T. Tennis. (Tennis, J. T. (2012), The strange case of eugenics: A subject’s ontogeny in a long-lived classification scheme and the question of collocative integrity. J. Am. Soc. Inf. Sci., 63: 1350–1359. doi: 10.1002/asi.22686)

Abstract:

This article introduces the problem of collocative integrity present in long-lived classification schemes that undergo several changes. A case study of the subject “eugenics” in the Dewey Decimal Classification is presented to illustrate this phenomenon. Eugenics is strange because of the kinds of changes it undergoes. The article closes with a discussion of subject ontogeny as the name for this phenomenon and describes implications for information searching and browsing.

Tennis writes:

While many theorists have concerned themselves with how to design a scheme that can handle the addition of subjects, very little has been done to study how a subject changes after it is introduced to a scheme. Simply because we add civil engineering to a scheme of classification in 1920 does not signify that it means the same thing today. Almost 100 years have passed, and many things have changed in that subject. We may have subdivided this class in 1950, thereby separating the pre-1950 meaning from the post-1950 meaning and also affecting the collocative power of the class civil engineering. Other classes in the superclass of engineering might be considered too close, and are eliminated over time, affecting the way the classifier does her or his work (cf. Tennis, 2007; Tennis & Sutton, 2008). It is because of these concerns, coupled with the design requirement of collocation in classification, that we need to look at the life of a subject over time—the subject’s scheme history or ontogeny.

Deeply interesting work that has implications for topic map structures and the preservation of “collocative integrity” over time.

One suspects that preservation of “collocative integrity” is an ongoing process that requires more than simple assignments in a scheme.

What factors would you capture to trace the ontogeny of “eugenics” and how would you use them to preserve “collocative integrity” across that history using a topic map? (Remembering that users at any point in that ontogeny may be ignorant of prior (and obviously subsequent) changes in its classification.)

Bisociative Knowledge Discovery

Filed under: Bisociative,Graphs,Knowledge Discovery,Marketing,Networks,Topic Maps — Patrick Durusau @ 8:43 am

Bisociative Knowledge Discovery: An Introduction to Concept, Algorithms, Tools, and Applications by Michael R. Berthold. (Lecture Notes in Computer Science, Volume 7250, 2012, DOI: 10.1007/978-3-642-31830-6)

The volume where Berthold’s Towards Bisociative Knowledge Discovery appears.

Follow the links for article abstracts and additional information. “PDFs” are available under Springer Open Access.

If you are familiar with Steve Newcomb’s universes of discourse, this will sound hauntingly familiar.

How will diverse methodologies of bisociative knowledge discovery, being in different universes of discourse, interchange information?

Topic maps anyone?

Towards Bisociative Knowledge Discovery

Filed under: Associations,Bisociative,Knowledge Discovery — Patrick Durusau @ 8:41 am

Towards Bisociative Knowledge Discovery by Michael R. Berthold.

Abstract:

Knowledge discovery generally focuses on finding patterns within a reasonably well connected domain of interest. In this article we outline a framework for the discovery of new connections between domains (so called bisociations), supporting the creative discovery process in a more powerful way. We motivate this approach, show the difference to classical data analysis and conclude by describing a number of different types of domain-crossing connections.

What is a bisociation you ask?

Informally, bisociation can be defined as (sets of) concepts that bridge two otherwise not – or only very sparsely – connected domains whereas an association bridges concepts within a given domain. Of course, not all bisociation candidates are equally interesting and in analogy to how Boden assesses the interestingness of a creative idea as being new, surprising, and valuable [4], a similar measure for interestingness can be specified when the underlying set of domains and their concepts are known.

Berthold describes two forms of bisociation as bridging concepts and graphs, although saying subject identity and associations would be more familiar to topic map users.

This essay introduces more than four hundred pages of papers so there is much more to explore.

These materials are “open access” so take the opportunity to learn more about this developing field.

As always, terminology/identification is going to vary so there will be a role for topic maps.

July 1, 2012

Neo4j 1.8.M05 – In the Details

Filed under: Cypher,Neo4j — Patrick Durusau @ 4:49 pm

Neo4j 1.8.M05 – In the Details

A new milestone of Neo4j that merits your attention!

Download a copy today!

Neo4j continues to struggle with documentation issues.

For example, the blog post that announces Neo4j 1.8.M05 reports under Cypher:

  • String literals can now contain some escape characters, like:
    • CREATE (n {text:"single \' and double \" quotes"});

As a standards editor, I cringe when I see “…some escape characters, like:” What the hell does “…some escape characters…” mean?

I can’t begin to guess.

It gets even worse if you try to find an answer in the documentation.

Using the PDF version, searching for “escape” I found:

16.10.8. Relationship types with uncommon characters

Sometime your database will have types with non-letter characters, or with spaces in them. Use ` to escape these.

and,

16.11.4. Escaping in regular expressions

If you need a forward slash inside of your regular expression, escape it just like you expect to.

And “…just like you expect to.” would be?

The example at that point illustrates using “\” as an escape character, as in “/Some\/thing/.”

and,

19.3.12. Get typed relationships

Note that the “&” needs to be escaped for example when using cURL from the terminal.

Not only bad writing but annoying as well.

Don’t state a problem without the answer, or a link to the answer if it is too long to insert in place. BTW, the answer is to write any “&” character as “%26” (without the quotes).

and,

19.7.5. Add node to index

Associates a node with the given key/value pair in the given index.

Note

Spaces in the URI have to be escaped.

I haven’t gone back to the RFC but I think the correct term here is “encoded,” and a space is encoded as “%20” (without the quotes).

19.7.9 Find Node By Exact Match has the same issue. (There is an Asciidoc escape issue reported in 30.3.5 but it is of no relevance to the Cypher escape-character issue.)

Having searched all the current documentation, I can’t say for sure what escape characters Cypher uses or under what circumstances.

Having a small group of insiders who know the real score is fine for smallish hacker projects. Not so good if you are aspiring to be a viable commercial product.

Web query disambiguation using PageRank

Filed under: Disambiguation,Polysemy,Query Expansion,Searching — Patrick Durusau @ 4:47 pm

Web query disambiguation using PageRank by Christos Makris, Yannis Plegas, and Sofia Stamou. (Makris, C., Plegas, Y. and Stamou, S. (2012), Web query disambiguation using PageRank. J. Am. Soc. Inf. Sci.. doi: 10.1002/asi.22685)

Abstract:

In this article, we propose new word sense disambiguation strategies for resolving the senses of polysemous query terms issued to Web search engines, and we explore the application of those strategies when used in a query expansion framework. The novelty of our approach lies in the exploitation of the Web page PageRank values as indicators of the significance the different senses of a term carry when employed in search queries. We also aim at scalable query sense resolution techniques that can be applied without loss of efficiency to large data sets such as those on the Web. Our experimental findings validate that the proposed techniques perform more accurately than do the traditional disambiguation strategies and improve the quality of the search results, when involved in query expansion.

A better summary of the author’s approach lies within the article:

The intuition behind our method is that we could improve the Web users’ search experience if we could correlate the importance that the sense of a term has when employed in a query (i.e., the importance of the sense as perceived by the information seeker) with the importance the same sense has when contained in a Web page (i.e., the importance of the sense as perceived by the information provider). We rely on the exploitation of PageRank because of its effectiveness in capturing the importance of every page on the Web graph based on their links’ connectivity, and from which we may infer the importance of every page in the “collective mind” of the Web content providers/creators. To account for that, we explore whether the PageRank value of a page may serve as an indicator of how significant the dominant senses of a query-matching term in the page are and, based on that, disambiguate the query.

Which reminds me of statistical machine translation, which replaced syntax based methods years ago.

Perhaps PageRank is summing our collective linguistic preferences for some word senses over others.

If that is the case, how would you incorporate it in ranking results to be delivered to a user from a topic map? There are different possible search outcomes; how do we establish the one a user prefers?
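
As a reading aid only (my gloss of the quoted intuition, not the authors’ algorithm), the weighting comes down to something like this:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Weight each candidate sense of a query term by the PageRank mass of the
// result pages in which that sense dominates, then disambiguate toward the
// heaviest sense. Types and numbers below are illustrative stand-ins.
public class SenseByPageRank {

    static class ResultPage {
        final String dominantSense; // sense the page uses for the query term
        final double pageRank;      // the page's PageRank value
        ResultPage(String dominantSense, double pageRank) {
            this.dominantSense = dominantSense;
            this.pageRank = pageRank;
        }
    }

    static String disambiguate(List<ResultPage> results) {
        Map<String, Double> score = new HashMap<String, Double>();
        for (ResultPage page : results) {
            Double sum = score.get(page.dominantSense);
            score.put(page.dominantSense, (sum == null ? 0.0 : sum) + page.pageRank);
        }
        String best = null;
        for (Map.Entry<String, Double> e : score.entrySet()) {
            if (best == null || e.getValue() > score.get(best)) {
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<ResultPage> results = new ArrayList<ResultPage>();
        results.add(new ResultPage("jaguar (animal)", 0.004));
        results.add(new ResultPage("jaguar (car)", 0.009));
        results.add(new ResultPage("jaguar (car)", 0.002));
        System.out.println(disambiguate(results)); // prints: jaguar (car)
    }
}
```

The same kind of weight could just as easily feed the ranking of occurrences delivered from a topic map, which is where the question above comes in.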

The Case for Semantics-Based Methods in Reverse Engineering

Filed under: Reverse Engineering,Semantics — Patrick Durusau @ 4:47 pm

The Case for Semantics-Based Methods in Reverse Engineering by Rolf Rolles. (pdf – slides)

Jennifer Shockley quotes Rolf as saying:

“The goal of my RECON 2012 keynote speech was to introduce methods in academic program analysis and demonstrate — intuitively, without drawing too much on formalism — how they can be used to solve practical problems that are interesting to industrial researchers in the real world. Given that it was the keynote speech, and my goal of making the material as accessible as possible, I attempted to make my points with pictures instead of dense technical explanations.”

From his blog post: ‘RECON 2012 Keynote: The Case for Semantics-Based Methods in Reverse Engineering.’

Rolf also points to a reading list on program analysis.

Did someone say semantics? 😉

Anyone working on topic map based tools for reverse engineering?

Thinking that any improvement in sharing of results, even partial results, would improve response times.

HBase I/O – HFile

Filed under: Hadoop,HBase,HFile — Patrick Durusau @ 4:46 pm

HBase I/O – HFile by Matteo Bertozzi.

From the post:

Introduction

Apache HBase is the Hadoop open-source, distributed, versioned storage manager well suited for random, realtime read/write access.

Wait wait? random, realtime read/write access?

How is that possible? Is not Hadoop just a sequential read/write, batch processing system?

Yes, we’re talking about the same thing, and in the next few paragraphs, I’m going to explain to you how HBase achieves the random I/O, how it stores data and the evolution of the HBase’s HFile format.

Hadoop I/O file formats

Hadoop comes with a SequenceFile[1] file format that you can use to append your key/value pairs but due to the hdfs append-only capability, the file format cannot allow modification or removal of an inserted value. The only operation allowed is append, and if you want to lookup a specified key, you’ve to read through the file until you find your key.

As you can see, you’re forced to follow the sequential read/write pattern… but how is it possible to build a random, low-latency read/write access system like HBase on top of this?

To help you solve this problem Hadoop has another file format, called MapFile[1], an extension of the SequenceFile. The MapFile, in reality, is a directory that contains two SequenceFiles: the data file “/data” and the index file “/index”. The MapFile allows you to append sorted key/value pairs and every N keys (where N is a configurable interval) it stores the key and the offset in the index. This allows for quite a fast lookup, since instead of scanning all the records you scan the index which has less entries. Once you’ve found your block, you can then jump into the real data file.
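
If you have not met MapFile before, the mechanism is small enough to show in a few lines (a hedged sketch against the classic org.apache.hadoop.io API; paths are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        String dir = "/tmp/mapfile-demo"; // a MapFile is a directory: "data" plus "index"

        // Keys must be appended in sorted order; every Nth key also lands in the index.
        MapFile.Writer writer = new MapFile.Writer(conf, fs, dir, Text.class, Text.class);
        writer.setIndexInterval(128); // the default interval
        for (int i = 0; i < 1000; i++) {
            writer.append(new Text(String.format("row-%04d", i)), new Text("value-" + i));
        }
        writer.close();

        // Lookup: binary-search the in-memory index, then seek into the data file.
        MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
        Text value = new Text();
        reader.get(new Text("row-0042"), value);
        System.out.println(value); // value-42
        reader.close();
    }
}
```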

A couple of important lessons:

First, file formats evolve. They shouldn’t be entombed by programming code, no matter how clever your code may be. That is what “versions” are for.

Second, the rapid evolution of the Hadoop ecosystem makes boundary observations strictly temporary. Wait a week or so, they will change!

Apache Oozie (incubating) 3.2.0 release

Filed under: Hadoop,Oozie — Patrick Durusau @ 4:46 pm

Apache Oozie (incubating) 3.2.0 release

From the post:

This blog was originally posted on the Apache Blog for Oozie.

In June 2012, we released Apache Oozie (incubating) 3.2.0. Oozie is currently undergoing incubation at The Apache Software Foundation (see http://incubator.apache.org/oozie).

Oozie is a workflow scheduler system for Apache Hadoop jobs. Oozie Workflows are Directed Acyclical Graphs (DAGs), and they can be scheduled to run at a given time frequency and when data becomes available in HDFS.

Oozie 3.1.3 was the first incubating release. Oozie 3.1.3 added Bundle job capabilities to Oozie. A bundle job is a collection of coordinator jobs that can be managed as a single application. This is a key feature for power users that need to run complex data-pipeline applications.

Oozie 3.2.0 is the second incubating release, and the first one to include features and fixes done in the context of the Apache Community. The Apache Oozie Community is growing organically with more users, more contributors, and new committers. Speaking as one of the initial developers of Oozie, it is exciting and fulfilling to see the Apache Oozie project gaining traction and mindshare.

While Oozie 3.2.0 is a minor upgrade, it adds significant new features and fixes that make the upgrade worthwhile. Here are the most important new features:

  • Support for Hadoop 2 (YARN Map-Reduce)
  • Built in support for new workflow actions: Hive, Sqoop, and Shell
  • Kerberos SPNEGO authentication for Oozie HTTP REST API and Web UI
  • Support for proxy-users in the Oozie HTTP REST API (equivalent to Hadoop proxy users)
  • Job ACLs support (equivalent to Hadoop job ACLs)
  • Tool to create and upgrade Oozie database schema (works with Derby, MySQL, Oracle, and PostgreSQL databases)
  • Improved Job information over HTTP REST API
  • New Expression Language functions for Workflow and Coordinator applications
  • Share library per action (including only the JARs required for the specific action)

Oozie 3.2.0 also includes several improvements for performance and stability, as well as bug fixes. And, as with previous Oozie releases, we are ensuring 100% backwards compatibility with applications written for previous versions of Oozie.

For those of you who know Michael Sperberg-McQueen, these are Directed Acyclical Graphs (DAGs) put to a useful purpose in an information environment. (Yes, that is an “insider” joke.)

Another important part of the Hadoop ecosystem.

Lessons from Anime and Big Data (Ghost in the Shell)

Filed under: Information Exchange,Information Workers,Marketing,Topic Maps — Patrick Durusau @ 4:45 pm

Lessons from Anime and Big Data (Ghost in the Shell) by James Locus.

From the post:

What lessons might the anime (Japanese animation) “Ghost in the Shell” teach us about the future of big data? The show, originally a graphic novel from creator Masamune Shirow, explores the consequences of a “hyper”-connected society so advanced one is able to download one’s consciousness temporarily into human-like android shells (hence the work’s title). If this sounds familiar, it’s because Ghost in the Shell was a major point of inspiration for the Wachowski brothers, the creators of the Matrix Trilogy.

The ability to handle, process, and manipulate big data is a major theme of the show and focuses on the challenges of a high tech police unit in thwarting potential cyber crimes. The graphic novel was originally created in 1991, long before the concept of big data had grown to prominence (and for-all-intents-and-purposes even before what we now think of as the internet…)

Visions of a “Big Data” Future

While such visions of an interconnected techno-future are common in anime, what makes Ghost in the Shell special is its treatment of the power of big data. Technology is not used simply for its exploitative value, but as a means to create a greater, more capable society. Data becomes the engine that drives an entire civilization towards achieving taller buildings, faster cars, and yes – even androids.

Big data puts many of Ghost in the Shell’s “technological advances” just within reach. The show features almost instantaneous transfers of petabyte hard drives and facial recognition searches about as fast as a Google search. (emphasis added)

A big +1! to technological advances being on the cusp of something transformative, but I am less certain about what that transformation will lead to. While cutting edge research is underway to help amputees, I fully expect the first commercially viable application to be safe, virtual sex (if they are not there already).

We are talking about us. We have a long history of using technology for its exploitative value. In fact, I can’t think of a single example where technology has not been used for its exploitative value. Can you?

Although Snow Crash is a novel, there is a kernel (sorry) of truth to the proposition that the results of analysis will become items for exchange. That is true now but we lack the exchange mechanisms to make it currency.

People write books, articles, posts, but for the most part, all of those are at too large a level to be reused. We need information libraries that operate like software libraries, that are called for a particular operation. The creator of an information library gets a “credit” for your use of the information.

Not a reality today, but overcoming semantic barriers to re-use is a start in that direction. Whether technology will be used for its exploitative value or not will be settled by how it is used in fact. I know where my money is riding. Yours?

…ain’t no time to be in my neighborhood….

Filed under: Humor — Patrick Durusau @ 4:45 pm

I was reminded of Cheech and Chong (Los Cochinos (1973)) when I read:

A mathematical model that has been used for more than 80 years to determine the hunting range of animals in the wild holds promise for mapping the territories of street gangs, a UCLA-led team of social scientists reports in a new study.

“The way gangs break up their neighborhoods into unique territories is a lot like the way lions or honey bees break up space,” said lead author P. Jeffrey Brantingham, a professor of anthropology at UCLA.

Further, the research demonstrates that the most dangerous place to be in a neighborhood packed with gangs is not deep within the territory of a specific gang, as one might suppose, but on the border between two rival gangs. In fact, the highest concentration of conflict occurs within less than two blocks of gang boundaries, the researchers discovered. (emphasis added)

Like the routine says: “…ain’t no time to be in my neighborhood….”

Almost forty (40) years later, the fundamental soundness of Los Cochinos is confirmed by other research. 😉

Cascading map-side joins over HBase for scalable join processing

Filed under: HBase,Joins,Linked Data,LOD,MapReduce,RDF,SPARQL — Patrick Durusau @ 4:45 pm

Cascading map-side joins over HBase for scalable join processing by Martin Przyjaciel-Zablocki, Alexander Schätzle, Thomas Hornung, Christopher Dorner, and Georg Lausen.

Abstract:

One of the major challenges in large-scale data processing with MapReduce is the smart computation of joins. Since Semantic Web datasets published in RDF have increased rapidly over the last few years, scalable join techniques become an important issue for SPARQL query processing as well. In this paper, we introduce the Map-Side Index Nested Loop Join (MAPSIN join) which combines scalable indexing capabilities of NoSQL storage systems like HBase, that suffer from an insufficient distributed processing layer, with MapReduce, which in turn does not provide appropriate storage structures for efficient large-scale join processing. While retaining the flexibility of commonly used reduce-side joins, we leverage the effectiveness of map-side joins without any changes to the underlying framework. We demonstrate the significant benefits of MAPSIN joins for the processing of SPARQL basic graph patterns on large RDF datasets by an evaluation with the LUBM and SP2Bench benchmarks. For most queries, MAPSIN join based query execution outperforms reduce-side join based execution by an order of magnitude.

Some topic map applications include Linked Data/RDF processing capabilities.

The salient comment here being: “For most queries, MAPSIN join based query execution outperforms reduce-side join based execution by an order of magnitude.”
