Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

May 16, 2012

Managing context data for diverse operating spaces

Filed under: Context-aware,Semantic Overlay Network — Patrick Durusau @ 5:31 am

Managing context data for diverse operating spaces by Wenwei Xue, Hung Keng Pung, and Shubhabrata Sen.

Abstract:

Context-aware computing is an exciting paradigm in which applications perceive and react to changing environments in an unattended manner. To enable behavioral adaptation, a context-aware application must dynamically acquire context data from different operating spaces in the real world, such as homes, shops and persons. Motivated by the sheer number and diversity of operating spaces, we propose a scalable context data management system in this paper to facilitate data acquisition from these spaces. In our system, we design a gateway framework for all operating spaces and develop matching algorithms to integrate the local context schemas of operating spaces into a global set of domain schemas upon which SQL-based context queries can be issued from applications. The system organizes the operating space gateways as peers in semantic overlay networks and employs distributed query processing techniques over these overlays. Evaluation results on a prototype implementation demonstrate the effectiveness of our system design.
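As a rough illustration (mine, not the paper's), a context query in such a system might look like ordinary SQL issued against a global domain schema, with a gateway layer deciding which operating spaces can answer it. A minimal sketch, with invented schema, table and gateway names:

    # Hypothetical sketch: routing an SQL-style context query to operating-space
    # gateways that advertise a matching global domain schema. All names invented.

    CONTEXT_QUERY = """
        SELECT person_id, location, heart_rate
        FROM Person                -- global domain schema, not any one local schema
        WHERE location = 'home' AND heart_rate > 100
    """

    # Each gateway advertises which global domain schemas its local schemas were matched to.
    GATEWAYS = {
        "gateway-home-42":  {"Person", "Home"},
        "gateway-shop-17":  {"Shop"},
        "gateway-office-3": {"Person", "Office"},
    }

    def route(query_schema, gateways):
        """Return the gateways whose local schemas map to this domain schema."""
        return [g for g, schemas in gateways.items() if query_schema in schemas]

    if __name__ == "__main__":
        for gateway in route("Person", GATEWAYS):
            print("would forward query to", gateway)   # distributed processing happens here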

This article came up in a sweep for “semantic overlay networks.”

Encouraging recognition that results may need to vary based on physical context. Who knows? Perhaps even recognition that the terminology of one domain and its journals/authors/monographs has different semantics than that of other domains.

Imagine that, a system that manages queries across semantic domains for users, as opposed to users having to understand all the possible semantic domains in advance to have useful query results (or better query results).

The “context” metaphor may be a useful one in marketing topic maps. It is less aggressive than “silo.” Let the client come up with that word to characterize competing agencies or information sources.

“Context” in the sense of physical space is popular among the smart phone crowd so don’t neglect that as an avenue for topic maps as well. (Looking at your surroundings would mean breaking eye contact with your phone. Might miss an ad or something.)

May 15, 2012

No sorting and lack of structure undermine a chart

Filed under: Graphics,Visualization — Patrick Durusau @ 7:27 pm

No sorting and lack of structure undermine a chart

Kaiser Fung takes the Guardian newspaper, yes, that Guardian, to task for poor graphics on gay rights in the United States.

When people are critical of your graphics, take heart that even experts fail from time to time.

History matters

Filed under: Search Behavior,Search Engines,Search History — Patrick Durusau @ 7:17 pm

History matters by Gene Golovchinsky.

Whose history? Your history. Your search history. Visualized.

Interested? Read more:

Exploratory search is an uncertain endeavor. Quite often, people don’t know exactly how to express their information need, and that need may evolve over time as information is discovered and understood. This is not news.

When people search for information, they often run multiple queries to get at different aspects of the information need, to gain a better understanding of the collection, or to incorporate newly-found information into their searches. This too is not news.

The multiple queries that people run may well retrieve some of the same documents. In some cases, there may be little or no overlap between query results; at other times, the overlap may be considerable. Yet most search engines treat each query as an independent event, and leave it to the searcher to make sense of the results. This, to me, is an opportunity.

Design goal: Help people plan future actions by understanding the present in the context of the past.

While web search engines such as Bing make it easy for people to re-visit some recent queries, and early systems such as Dialog allowed Boolean queries to be constructed by combining results of previously-executed queries, these approaches do not help people make sense of the retrieval histories of specific documents with respect to a particular information need. There is nothing new under the sun, however: Mark Sanderson’s NRT system flagged documents as having been previously retrieved for a given search task, VOIR used retrieval histograms for each document, and of course a browser maintains a limited history of activity to indicate which links were followed.

Our recent work in Querium (see here and here) seeks to explore this space further by providing searchers with tools that reflect patterns of retrieval of specific documents within a search mission.
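To make the idea concrete, here is a minimal sketch (mine, not Querium's) of tracking how often each document has been retrieved across the queries in a search mission, so results can be flagged as already seen:

    from collections import defaultdict

    class SearchMission:
        """Toy retrieval history: which queries in this mission returned which documents."""

        def __init__(self):
            self.history = defaultdict(set)   # doc_id -> queries that retrieved it

        def record(self, query, doc_ids):
            for doc_id in doc_ids:
                self.history[doc_id].add(query)

        def annotate(self, doc_ids):
            """Pair each result with the earlier queries that already retrieved it."""
            return [(doc_id, sorted(self.history[doc_id])) for doc_id in doc_ids]

    mission = SearchMission()
    mission.record("exploratory search", ["d1", "d2", "d3"])
    mission.record("information seeking behavior", ["d2", "d4"])

    # Before showing results for a new query, show what the searcher has met before.
    for doc_id, seen_in in mission.annotate(["d2", "d3", "d5"]):
        print(doc_id, "previously retrieved by:", seen_in)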

Even more interested? Read Gene’s post in full.

If not, check your pulse.

SIAM Data Mining 2012 Conference

Filed under: Conferences,Data Mining — Patrick Durusau @ 7:04 pm

SIAM Data Mining 2012 Conference

Ryan Rosario writes:

From April 26-28 I had the pleasure to attend the SIAM Data Mining conference in Anaheim on the Disneyland Resort grounds. Aside from KDD2011, most of my recent conferences had been more “big data” and “data science” oriented, and I wanted to step away from the hype and just listen to talks that had more substance.

Attending a conference on Disneyland property was quite a bizarre experience. I wanted to get everything I could out of the conference, but the weather was so nice that I also wanted to get everything out of Disneyland as I could. Seeing adults wearing Mickey ears carrying Mickey shaped balloons, and seeing girls dressed up as their favorite Disney princesses screams “fun” rather than “business”, but I managed to make time for both.

The first two days started with a plenary talk from industry or research labs. After a coffee break, there were the usual breakout sessions followed by lunch. During my free 90 minutes, I ran over to Disneyland and California Adventure both days to eat lunch. I managed to run there, wait in line, guide myself through crowds, wait in line, get my food, eat it, and run back to the conference in 90 minutes on a weekend. After lunch on the first two days was another plenary session followed by breakout sessions. The evening of the first two days was reserved for poster sessions. Saturday hosted half-day and full-day workshops.

Below is my summary of the conference. Of course, such a summary is very high level; my description may miss things, or may not be entirely correct if I misunderstood the speaker.

I doubt Ryan would claim his summary is “as good as being there” but in the absence of attending, you could do far worse.

Suggestions of papers from the conference that I should read first?

Using “Punning” to Answer httpRange-14

Filed under: Linked Data,RDF,Semantic Web — Patrick Durusau @ 6:50 pm

Using “Punning” to Answer httpRange-14

Jeni Tennison writes in her introduction:

As part of the TAG’s work on httpRange-14, Jonathan Rees has assessed how a variety of use cases could be met by various proposals put before the TAG. The results of the assessment are a matrix which shows that “punning” is the most promising method, unique in not failing on either ease of use (use case J) or HTTP consistency (use case M).

In normal use, “punning” is about making jokes based around a word that has two meanings. In this context, “punning” is about using the same URI to mean two (or more) different things. It’s most commonly used as a term of art in OWL but normal people don’t need to worry particularly about that use. Here I’ll explore what that might actually mean as an approach to the httpRange-14 issue.

Jeni writes quite well and if you are really interested in the details of this self-inflicted wound, read her post in its entirety.

The post is summarized when she says:

Thus an implication of this approach is that the people who define languages and vocabularies must specify what aspect of a resource a URI used in a particular way identifies.

Her proposal makes disambiguation explicit, a strategy that is more likely to be successful than others.

Following that statement she discusses how to usefully proceed from that position. (No guarantee her position will carry the day, but it would be a good thing if it did.)
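A crude way to picture the approach (my sketch, not Jeni's or the TAG's): treat the referent as a pair of the URI and the aspect a vocabulary declares for each of its properties, rather than the URI alone. Property and aspect names below are invented:

    # Toy model of "punning": the same URI identifies different aspects of a
    # resource depending on how a vocabulary says its properties use it.

    URI = "http://example.org/toucan"

    # A vocabulary definition declares which aspect each property refers to.
    PROPERTY_ASPECT = {
        "ex:wingspan":  "the-thing-described",   # the bird itself
        "ex:pageTitle": "the-web-page",          # the document served at the URI
    }

    def referent(uri, prop):
        """Resolve what a (subject URI, property) pair actually talks about."""
        return (uri, PROPERTY_ASPECT.get(prop, "unspecified"))

    print(referent(URI, "ex:wingspan"))   # ('http://example.org/toucan', 'the-thing-described')
    print(referent(URI, "ex:pageTitle"))  # ('http://example.org/toucan', 'the-web-page')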

Open Data Visualization: Keeping Traces of the Exploration Process

Filed under: Open Data,Visualization — Patrick Durusau @ 4:49 pm

Open Data Visualization: Keeping Traces of the Exploration Process by Benoît Otjacques, Mickaël Stefas, Maël Cornil, and Fernand Feltz.

Abstract:

This paper describes a system to support the visual exploration of Open Data. During his/her interactive experience with the graphics, the user can easily store the current complete state of the visualization application (called a viewpoint). Next, he/she can compose sequences of these viewpoints (called scenarios) that can easily be reloaded. This feature allows to keep traces of a former exploration process, which can be useful in single user (to support investigation carried out in multiple sessions) as well as in collaborative setting (to share points of interest identified in the data set).

I was unaware of this paper when I wrote my “knowledge toilet” post earlier today. This looks like an interesting starting point for discussion.

Just speculating but I think there will be a “sweet spot” for how much effort users will devote to recording their input. For some purposes it will need to be almost automatic. Like the relationship between search terms and links users choose. Crude but somewhat effective.

On the other hand, there will be professional researchers/authors who want to sell their semantic annotations/mappings of resources.

And applications/use cases in between.

Operations on soft sets revisited

Filed under: Sets,Soft Sets — Patrick Durusau @ 3:59 pm

Operations on soft sets revisited by Ping Zhu and Qiaoyan Wen.

Abstract:

Soft sets, as a mathematical tool for dealing with uncertainty, have recently gained considerable attention, including some successful applications in information processing, decision, demand analysis, and forecasting. To construct new soft sets from given soft sets, some operations on soft sets have been proposed. Unfortunately, such operations cannot keep all classical set-theoretic laws true for soft sets. In this paper, we redefine the intersection, complement, and difference of soft sets and investigate the algebraic properties of these operations along with a known union operation. We find that the new operation system on soft sets inherits all basic properties of operations on classical sets, which justifies our definitions.

An interesting paper that will get you interested in soft sets if you aren’t already.

It isn’t easy going, even with the Alice and Bob examples, which I am sure the authors found immediately intuitive.

If you have data where numeric values cannot be assigned, it will be worth your while to explore this paper and the literature on soft sets.
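For readers who want something runnable before tackling the algebra, here is a small sketch using one common formulation (a soft set maps parameters to subsets of a universe). The operations below follow the usual “restricted intersection” and “extended union” definitions from the literature, which are not necessarily the exact operations redefined in this paper:

    # A soft set over a universe U is a map from parameters to subsets of U.
    U = {"h1", "h2", "h3", "h4"}            # e.g. houses under consideration

    F = {"cheap":  {"h1", "h2"},            # soft set (F, A)
         "wooden": {"h2", "h3"}}
    G = {"cheap":  {"h2", "h4"},            # soft set (G, B)
         "modern": {"h1", "h4"}}

    def restricted_intersection(F, G):
        """H(e) = F(e) & G(e) on the common parameters."""
        return {e: F[e] & G[e] for e in F.keys() & G.keys()}

    def extended_union(F, G):
        """H(e) = F(e), G(e), or their union, depending on where e lives."""
        return {e: F.get(e, set()) | G.get(e, set()) for e in F.keys() | G.keys()}

    def complement(F, universe=U):
        """Parameter-wise complement with respect to the universe."""
        return {e: universe - subset for e, subset in F.items()}

    print(restricted_intersection(F, G))   # {'cheap': {'h2'}}
    print(extended_union(F, G))
    print(complement(F))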

Improving Schema Matching with Linked Data (Flushing the Knowledge Toilet)

Filed under: Linked Data,Schema — Patrick Durusau @ 3:40 pm

Improving Schema Matching with Linked Data by Ahmad Assaf, Eldad Louw, Aline Senart, Corentin Follenfant, Raphaël Troncy, and David Trastour.

Abstract:

With today’s public data sets containing billions of data items, more and more companies are looking to integrate external data with their traditional enterprise data to improve business intelligence analysis. These distributed data sources however exhibit heterogeneous data formats and terminologies and may contain noisy data. In this paper, we present a novel framework that enables business users to semi-automatically perform data integration on potentially noisy tabular data. This framework offers an extension to Google Refine with novel schema matching algorithms leveraging Freebase rich types. First experiments show that using Linked Data to map cell values with instances and column headers with types improves significantly the quality of the matching results and therefore should lead to more informed decisions.

Personally I don’t find mapping Airport -> Airport Code all that convincing a demonstration.

The other problem I have is what happens after a user “accepts” a mapping?

Now what?

I can contribute my expertise to mappings between diverse schemas all day, even public ones.

What happens to all that human effort?

It is what I call the “knowledge toilet” approach to information retrieval/integration.

Software runs (I can’t count the number of times integration software has been run on Citeseer. Can you?) and a user corrects the results as best they are able.

Now what?

Oh, yeah, the next user or group of users does it all over again.

Why?

Because the user before them flushed the knowledge toilet.

The information had been mapped. Possibly even hand corrected by one or more users. Then it is just tossed away.

That has to seem wrong at some very fundamental level. Whatever semantic technology you choose to use.

I’m open to suggestions.

How do we stop flushing the knowledge toilet?
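One modest answer, sketched below under my own assumptions rather than anything in the paper: persist every accepted mapping with enough provenance that the next user (or the next run of the matcher) can start from it instead of from zero. Store name, field names and example values are all invented:

    import json, time

    MAPPING_STORE = "accepted_mappings.json"   # hypothetical shared store

    def record_mapping(source_field, target_type, accepted_by, confidence, store=MAPPING_STORE):
        """Append a human-confirmed schema mapping instead of throwing it away."""
        entry = {
            "source_field": source_field,      # e.g. a column header in the uploaded table
            "target_type": target_type,        # e.g. a Linked Data / Freebase type it was mapped to
            "accepted_by": accepted_by,
            "confidence": confidence,
            "timestamp": time.time(),
        }
        try:
            with open(store) as f:
                mappings = json.load(f)
        except FileNotFoundError:
            mappings = []
        mappings.append(entry)
        with open(store, "w") as f:
            json.dump(mappings, f, indent=2)

    def prior_mappings(source_field, store=MAPPING_STORE):
        """Let the next matching run reuse earlier human judgments."""
        try:
            with open(store) as f:
                return [m for m in json.load(f) if m["source_field"] == source_field]
        except FileNotFoundError:
            return []

    record_mapping("Airport", "/aviation/airport", accepted_by="analyst-1", confidence=0.9)
    print(prior_mappings("Airport"))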

Introducing Neo4j into a Relational Database Organisation

Filed under: Neo4j,RDBMS — Patrick Durusau @ 2:20 pm

Introducing Neo4j into a Relational Database Organisation

The details:

What: Neo4J User Group: Introducing Neo4j into a Relational Database Organisation
Where: The Skills Matter eXchange, London
When: 23 May 2012, starting at 18:30

From the webpage:

This month, Toby O’Rourke and Michael McCarthy present their experiences of introducing Neo4j into Gamesys: a Relational Database Organisation.

You will hear about Toby and Michael’s experiences, including

  • the path taken from spring data through tinkerpop, to straight neo then spring data again
  • Satisfying the reporting requirements of a place built on a data warehouse approach
  • Modelling our domain
  • Experience of support contracts and the community as a whole

Just in case you need an additional reason to be in London on 23 May 2012, consult London Drum City Guide. 😉

Electronic Discovery Institute

Filed under: Law,Legal Informatics — Patrick Durusau @ 2:03 pm

Electronic Discovery Institute

From the home page:

The Electronic Discovery Institute is a non-profit organization dedicated to resolving electronic discovery challenges by conducting studies of litigation processes that incorporate modern technologies. The explosion in volume of electronically stored information and the complexity of its discovery overwhelms the litigation process and the justice system. Technology and efficient processes can ease the impact of electronic discovery.

The Institute operates under the guidance of an independent Board of Diplomats comprised of judges, lawyers and technical experts. The Institute’s studies will measure the relative merits of new discovery technologies and methods. The results of the Institute’s studies will be shared with the public free of charge. In order to obtain our free publications, you must create a free log-in with a legitimate user profile. We do not sell your information. Please visit our sponsors – as they provide altruistic support to our organization.

I encountered the Electronic Discovery Institute while researching information on electronic discovery. Since law was and still is an interest of mine, I wanted to record it here.

The area of e-discovery is under rapid development, in terms of the rules that govern it, the technology it employs, and its practice in real-world situations with consequences for the players.

I commend this site/organization to anyone interested in e-discovery issues.

Natural Language Processing – Nearly Universal Case?

Filed under: Natural Language Processing — Patrick Durusau @ 1:54 pm

I was reading a paper on natural language processing (NLP) when it occurred to me to ask:

When is parsing of any data not natural language processing?

I hear the phrase, “natural language processing,” applied to a corpus of emails, blog posts, web pages, electronic texts, transcripts of international phone calls and the like.

Other than following others out of habit, why do we say those are subject to “natural language processing?”

As opposed to say a database schema?

When we “process” the column headers in a database schema, aren’t we engaged in “natural language processing?” What about SGML/XML schemas or the instances they govern?

Being mindful of semantics, synonymy and polysemy, it’s hard to think of examples that are not “natural language processing.”

At least for data that would be meaningful if read by a person. Streams of numbers perhaps not, but I would argue that the symbolism that defines their processing falls under natural language processing.
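A trivial example of what “natural language processing” on a schema might look like: splitting and normalizing column headers so their terms can be compared like any other text. A sketch, with made-up headers:

    import re

    def header_tokens(header):
        """Split a column header on underscores, dashes and camelCase, then lowercase."""
        spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", header)   # camelCase boundary
        return [t.lower() for t in re.split(r"[\s_\-]+", spaced) if t]

    # Made-up headers from two schemas that never agreed on a naming convention.
    for header in ["custAddr", "CUSTOMER_ADDRESS", "cust-addr-line1"]:
        print(header, "->", header_tokens(header))

    # Once headers are tokens, the usual moves apply: stemming, synonym lookup,
    # deciding whether "cust" and "customer" name the same subject.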

Thoughts?

May 14, 2012

Mining GitHub – Followers in Tinkerpop

Filed under: Github,GraphML,Neo4j,R,TinkerPop — Patrick Durusau @ 6:13 pm

Mining GitHub – Followers in Tinkerpop

Patrick Wagstrom writes:

Development of any moderately complex software package is a social process. Even if a project is developed entirely by a single person, there is still a social component that consists of all of the people who use the software, file bugs, and provide recommendations for enhancements. This social aspect is one of the driving forces behind the proliferation of social software development sites such as GitHub, SourceForge, Google Code, and BitBucket.

These sites combine together a variety of tools that are common for software development such as version control, bug trackers, mailing lists, release management, project planning, and wikis. In addition, some of these have more social aspects that allow you to find and follow individual developers or watch particular projects. In this post I’m going to show you how we can use some of this information to gain insight into a software development community, specifically the community around the Tinkerpop stack of tools for graph databases.

GitHub as a social community. Who knew? 😉

Very instructive walk through Gremlin, GraphML, and R with a prepared data set. It doesn’t get much better than this!
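If you want to poke at the same kind of data without the full Gremlin/R pipeline, here is a rough sketch of pulling follower edges from the GitHub API and counting in-degree. It uses the public followers endpoint as I recall it; rate limits and pagination are ignored, and the seed account should be replaced with accounts from the community you care about:

    import json
    import urllib.request
    from collections import Counter

    def followers(login):
        """Fetch one page of followers for a GitHub user (pagination ignored)."""
        url = "https://api.github.com/users/%s/followers" % login
        with urllib.request.urlopen(url) as resp:
            return [u["login"] for u in json.load(resp)]

    # "octocat" is GitHub's demo account; swap in the developers you want to study.
    seeds = ["octocat"]

    edges = [(f, s) for s in seeds for f in followers(s)]   # follower -> followed
    indegree = Counter(followed for _, followed in edges)

    for login, count in indegree.most_common():
        print(login, count)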

Finite State Automata in Lucene

Filed under: Finite State Automata,Lucene — Patrick Durusau @ 6:12 pm

Finite State Automata in Lucene by Mike McCandless

From the post:

Lucene Revolution 2012 is now done, and the talk Robert and I gave went well! We showed how we are using automata (FSAs and FSTs) to make great improvements throughout Lucene.

You can view the slides here.

This was the first time I used Google Docs exclusively for a talk, and I was impressed! The real-time collaboration was awesome: we each could see the edits the other was doing, live. You never have to “save” your document: instead, every time you make a change, the document is saved to a new revision and you can then use infinite undo, or step back through all revisions, to go back.

Finally, Google Docs covers the whole life-cycle of your talk: editing/iterating, presenting (it presents in full-screen just fine, but does require an internet connection; I exported to PDF ahead of time as a backup) and, finally, sharing with the rest of the world!

I must confess to disappointment when I read at slide 23 that “multi-token synonyms mess up graph.”

Particularly since I suspect that not only do synonyms need to be “multi-token” but “multi-dimensional” as well.

Sorting and Filtering Results in Custom Search

Filed under: Google CSE,Searching — Patrick Durusau @ 5:51 pm

Sorting and Filtering Results in Custom Search

From the post:

Using Custom Search Engine (CSE), you can create rich search experiences that make it easier for visitors to find the information they’re looking for on your site. Today we’re announcing two improvements to sorting and filtering of search results in CSE.

First, CSE now supports UI-based results sorting, which you can enable in the Basics tab of the CSE control panel. Once you’ve updated the CSE element code on your site, a “sort by” picker will become visible at the top of the results section.

I am not sure I would call this a “rich search experience” but I suppose any improvement is better than none at all.

Curious how you evaluate the use of “product rich snippets” as being similar to Newcomb’s conferral of properties? (see the post for “product rich snippets”).

Or for that matter, how you would, in an indexing context, “confer” additional information on an index entry that does not appear in the document?

To be used when the index is searched.
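One way to picture that: keep a side table of conferred properties keyed the same way as the postings, and consult both at query time. A toy sketch of mine, with invented terms and properties:

    from collections import defaultdict

    postings = defaultdict(set)        # term -> documents that literally contain it
    conferred = defaultdict(dict)      # term -> properties asserted by the indexer

    def index(doc_id, text):
        for term in text.lower().split():
            postings[term].add(doc_id)

    def confer(term, **properties):
        """Attach information to the index entry that no document states."""
        conferred[term].update(properties)

    index("d1", "cheap red widget")
    index("d2", "red gadget")
    confer("widget", product_category="hardware", deprecated=True)   # invented properties

    def search(term):
        return {"docs": sorted(postings[term]), "conferred": conferred.get(term, {})}

    print(search("widget"))
    # {'docs': ['d1'], 'conferred': {'product_category': 'hardware', 'deprecated': True}}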

Comments?

CDG – Community Data Generator

Filed under: Ctools,Data — Patrick Durusau @ 5:50 pm

CDG – Community Data Generator

From the post:

CDG is a datawarehouse generator and the newest member of the Ctools family. Given the definition of dimensions that we want, CDG will randomize data within certain parameters and output 3 different things:

  • Database and table ddl for the fact table
  • A file with inserts for the fact table
  • Mondrian schema file to be used within pentaho

While most of the documentation mentions the usage within the scope of Pentaho, there’s absolutely nothing that prevents the resulting database from being used in different contexts.

I had mentioned ctools before but not in any detail. This was the additional resource that made me pick them back up.

It isn’t hard to see how this data generator will be useful.

For subject-centric software, generating files with known “same subject” characteristics would be more useful.
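As a starting point for that “known same subject” idea, here is a rough sketch of generating records where the ground-truth duplicates are recorded up front, so a merging engine can be scored against them later (all field choices and corruption rules are invented), followed by the question below:

    import csv, random

    FIRST = ["Anna", "Ana", "Jon", "John", "Maria", "Marie"]
    LAST = ["Smith", "Smyth", "Nguyen", "Nguyen-Tran"]

    def variant(name):
        """Introduce a small, plausible corruption of a name."""
        return name.replace("o", "0") if random.random() < 0.3 else name

    rows, truth = [], []
    for i in range(100):
        first, last = random.choice(FIRST), random.choice(LAST)
        rows.append({"id": f"r{i}a", "first": first, "last": last})
        if random.random() < 0.5:                      # half the subjects get a duplicate
            rows.append({"id": f"r{i}b", "first": variant(first), "last": last})
            truth.append((f"r{i}a", f"r{i}b"))         # ground truth: same subject

    with open("records.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "first", "last"])
        writer.writeheader()
        writer.writerows(rows)

    with open("same_subject_pairs.csv", "w", newline="") as f:
        csv.writer(f).writerows(truth)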

Thoughts, suggestions or pointers to work on generation of such files?

C*Tools

Filed under: Ctools,Dashboard,Pentaho — Patrick Durusau @ 5:42 pm

C*Tools

From the webpage:

The CTools are a Webdetails Open Source project composed of a collection of Pentaho plugins. Its purpose is to streamline the implementation and design process, expanding even further the range of possibilities of Pentaho Dashboards. This page represents our effort to keep you up to date with our latest developments. Have fun, dazzle your clients and build a “masterpiece of a Dashboard”.

Tools include:

CCC: Community Charting Components (CCC) is a charting library on top of Protovis, a very powerful free and open-source visualization toolkit.

CBF: Focused on a multi-project/ multi-environment scenario, the Community Build Framework (CBF) is the way to setup and deploy Pentaho based applications.

CDA: Community Data Access (CDA) is a Pentaho plugin designed for accessing data with great flexibility. Born for overcoming some cons of the older implementation, CDA allows you to access any of the various Pentaho data sources and:

  • join different datasources just by editing an XML file
  • cache queries providing a great boost in performance.
  • deliver data in different formats (csv, xls, etc.) through the Pentaho User

CDE: The Community Dashboard Editor (CDE) is the outcome of real-world needs: It was born to greatly simplify the creation, edition and rendering of dashboards.

CDF: Community Dashboard Framework (CDF) is a project that allows you to create friendly, powerful, fully featured dashboards on top of the Pentaho BI server. Former Pentaho dashboards had several drawbacks from a developer’s point of view. The developing process was awkward, it required know-how of web technologies and programming languages, and basically it was time-consuming. CDF emerged as a need for a framework that overcame all those difficulties. The final result is a powerful framework featuring the following:

  • It is based on Open Source technologies.
  • It separates logic (JavaScript) from the presentation (HTML, CSS)
  • It features a life cycle with components interacting with each other
  • It uses AJAX
  • It is extensible, which gives the users a high level of customization: advanced users can extend the library of components.
  • They also can insert their own snippets of JavaScript and jQuery code.

CST: Community Startup Tabs (CST) represents the easiest way to define and implement the Pentaho startup tabs depending on the user that logs into the PUC. Ranging from a single institutional page to a list of dashboards or reports, among other contents, the tabs that each Pentaho user opens after logging into the PUC vary depending on the user’s preferences or his/her role in the company. Then why let Pentaho always open the same home page for everyone? The list of tabs to be opened automatically right after login can be different depending on the user, thanks to CST. Community Startup Tabs (CST) is a plugin with the following features:

  • it allows you to define different startup tabs for each user that logs into the PUC.
  • it is easy to configure.
  • it allows to define startup tabs based on user names or user roles.
  • for the definition of the startup tabs it allows you to specify user names or roles using regular expressions.

The trick to dashboards (as opposed to some, nameless, applications) is to deliver obviously useful options and information to users.

TREC Document Review Project on Hiatus, Recommind Asked to Withdraw

Filed under: Data Mining,Data Source,Open Relevance Project,TREC — Patrick Durusau @ 12:47 pm

TREC Document Review Project on Hiatus, Recommind Asked to Withdraw

From the post:

TREC Legal Track — part of the U.S. government’s Text Retrieval Conference — announced last week that the 2012 edition of its annual document review project for testing new systems is canceled, while prominent e-discovery software company Recommind confirmed that it’s been asked to leave the project for prematurely sharing results.

These difficulties highlight the need for:

  • open data sets and
  • protocols for reporting of results as they occur.

That requires a data set with relevance judgments and other work.

Have you thought about the Open Relevance Project at the Apache Foundation?

Email archives from Apache projects, the backbone of the web as we know it, are ripe for your contributions.

Let me be the first to ask Recommind to join in building a public data set for everyone.

ETL 2.0 – Data Integration Comes of Age

Filed under: Data Integration,ETL — Patrick Durusau @ 12:18 pm

ETL 2.0 – Data Integration Comes of Age by Robin Bloor PhD & Rebecca Jozwiak.

Well…., sort of.

It is a “white paper” and all that implies but when you read:

Versatility of Transformations and Scalability

All ETL products provide some transformations but few are versatile. Useful transformations may involve translating data formats and coded values between the data sources and the target (if they are, or need to be, different). They may involve deriving calculated values, sorting data, aggregating data, or joining data. They may involve transposing data (from columns to rows) or transposing single columns into multiple columns. They may involve performing look-ups and substituting actual values with looked-up values accordingly, applying validations (and rejecting records that fail) and more. If the ETL tool cannot perform such transformations, they will have to be hand coded elsewhere – in the database or in an application.

It is extremely useful if transformations can draw data from multiple sources and data joins can be performed between such sources “in flight,” eliminating the need for costly and complex staging. Ideally, an ETL 2.0 product will be rich in transformation options since its role is to eliminate the need for direct coding all such data transformations.

you start to lose what little respect you had for industry “white papers.”

Not once in this white paper is the term “semantics” used. It is also innocent of using the term “documentation.”

Don’t you think an ETL 2.0 application should enable re-use of “useful transformations?”

Wouldn’t that be a good thing?

Instead of IT staff starting from zero with every transformation request?

Failure to capture the semantics of data leaves you at ETL 2.0, while everyone else is at ETL 3.0.

What does your business sense tell you about that choice?

(ETL 3.0 – Documented, re-usable, semantics for data and data structures. Enables development of transformation modules for particular data sources.)
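A back-of-the-envelope sketch of what ETL 3.0 might ask of a transformation: the code carries its own documentation and a statement of the semantics of its inputs and outputs, so the next request against the same source can find and reuse it. The registry and field names are my own invention:

    import datetime

    TRANSFORM_REGISTRY = {}    # hypothetical shared catalogue of reusable transformations

    def transformation(source_field, target_field, semantics):
        """Register a transformation together with a statement of what it means."""
        def wrap(fn):
            TRANSFORM_REGISTRY[(source_field, target_field)] = {
                "fn": fn,
                "semantics": semantics,
                "doc": fn.__doc__,
            }
            return fn
        return wrap

    @transformation("cust_dob", "customer_age_years",
                    semantics="age in whole years as of the load date, per ISO 8601 dates")
    def dob_to_age(dob, as_of):
        """Derive customer age from date of birth; both arguments are datetime.date."""
        years = as_of.year - dob.year
        if (as_of.month, as_of.day) < (dob.month, dob.day):
            years -= 1
        return years

    # The next integration request can look the mapping up instead of re-coding it.
    entry = TRANSFORM_REGISTRY[("cust_dob", "customer_age_years")]
    print(entry["semantics"])
    print(entry["fn"](datetime.date(1980, 6, 1), datetime.date(2012, 5, 14)))   # 31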

Web Developers Can Now Easily “Play” with RDFa

Filed under: RDF,RDFa,Semantic Web — Patrick Durusau @ 9:16 am

Web Developers Can Now Easily “Play” with RDFa by Eric Franzon.

From the post:

Yesterday, we announced RDFa.info, a new site devoted to helping developers add RDFa (Resource Description Framework-in-attributes) to HTML.

Building on that work, the team behind RDFa.info is announcing today the release of “PLAY,” a live RDFa editor and visualization tool. This release marks a significant step in providing tools for web developers that are easy to use, even for those unaccustomed to working with RDFa.

“Play” is an effort that serves several purposes. It is an authoring environment and markup debugger for RDFa that also serves as a teaching and education tool for Web Developers. As Alex Milowski, one of the core RDFa.info team, said, “It can be used for purposes of experimentation, documentation (e.g. crafting an example that produces certain triples), and testing. If you want to know what markup will produce what kind of properties (triples), this tool is going to be great for understanding how you should be structuring your own data.”

A useful site for learning RDFa that is open for contributions, such as examples and documentation.

Cloudera Manager 4.0 Beta released

Filed under: Cloudera,Hadoop,MapReduce — Patrick Durusau @ 8:49 am

Cloudera Manager 4.0 Beta released by Aparna Ramani

From the post:

We’re happy to announce the Beta release of Cloudera Manager 4.0.

This version of Cloudera Manager includes support for CDH4 Beta2 and several new features for both the Free edition and the Enterprise edition.

This is the last beta before the GA release.

The details are:

I’m pleased to inform our users and customers that we have released the second and final beta of Cloudera’s Distribution Including Apache Hadoop version 4 (CDH4) today. We received great feedback from the community from the first beta and this release incorporates that feedback as well as a number of new enhancements.

CDH4 has a great many enhancements compared to CDH3.

  • Availability – a high availability namenode, better job isolation, improved hard disk failure handling, and multi-version support
  • Utilization – multiple namespaces and a slot-less resource management model
  • Performance – improvements in HBase, HDFS, MapReduce, Flume and compression performance
  • Usability – broader BI support, expanded API options, a more responsive Hue with broader browser support
  • Extensibility – HBase co-processors enable developers to create new kinds of real-time big data applications, the new MapReduce resource management model enables developers to run new data processing paradigms on the same cluster resources and storage
  • Security – HBase table & column level security and Zookeeper authentication support

Some items of note about this beta:

This is the second (and final) beta for CDH4, and this version has all of the major component changes that we’ve planned to incorporate before the platform goes GA. The second beta:

  • Incorporates the Apache Flume, Hue, Apache Oozie and Apache Whirr components that did not make the first beta
  • Broadens the platform support back out to our normal release matrix of Red Hat, CentOS, SUSE, Ubuntu and Debian
  • Standardizes our release matrix of supported databases to include MySQL, PostgreSQL and Oracle
  • Includes a number of improvements to existing components like adding auto-failover support to HDFS’s high availability feature and adding multi-homing support to HDFS and MapReduce
  • Incorporates a number of fixes that were identified during the first beta period like removing a HBase performance regression

Not as romantic as your subject analysis activities but someone has to manage the systems that implement your analysis!

Not to mention that skills here make you more attractive in any big data context.

Lucene conference touches many areas of growth in search

Filed under: BigData,Lucene,LucidWorks,Solr — Patrick Durusau @ 8:35 am

Lucene conference touches many areas of growth in search by Andy Oram.

From the post:

With a modern search engine and smart planning, web sites can provide visitors with a better search experience than Google. For instance, Google may well turn up interesting results if you search for a certain kind of shirt, but a well-designed clothing site can also pull up related trousers, skirts, and accessories. It’s not Google’s job to understand the intricate interrelationships of data on a particular web property, but the site’s own team can constantly tune searches to reflect what the site has to offer and what its visitors uniquely need.

Hence the importance of search engines like Solr, based on the Lucene library. Both are open source Apache projects, maintained by Lucid Imagination, a company founded to commercialize the underlying technology. I attended parts of Lucid Imagination’s conference this week, Lucene Revolution, and found Lucene evolving in the ways much of the computer industry is headed.

Andy’s summary of the conference will make you wonder two things:

  1. Why weren’t you at the Lucene Revolution conference this year?
  2. Where are the videos from Lucene Revolution 2012?

I won’t ever be able to answer #1 but will post an answer to #2 as soon as it is available.

Feynman on Curiosity

Filed under: Curiosity — Patrick Durusau @ 4:13 am

Feynman on Curiosity

Ethan Fosse embeds a short video of Feynman on curiosity.

I created a category of “curiosity” today, belatedly.

Curiosity is largely responsible for the variety of materials and resources on this blog.

I try to point out what may be helpful in your current or next project.

But I am also curious about what lies just beyond the technique or data I have just discussed.

Enjoy!

May 13, 2012

Synonyms in the TMDM Legend

Filed under: Synonymy,TMDM — Patrick Durusau @ 10:10 pm

I was going over some notes on synonyms this weekend when it occurred to me to ask:

How many synonyms does a topic item have in the TMDM legend?

A synonym being a term that can be freely substituted for another.

Not wanting to trust my memory, I quote from the TMDM legend (ISO/IEC 13250-2):

Two topic items are equal if they have:

  • at least one equal string in their [subject identifiers] properties,
  • at least one equal string in their [item identifiers] properties,
  • at least one equal string in their [subject locators] properties,
  • an equal string in the [subject identifiers] property of the one topic item and the [item identifiers] property of the other, or
  • the same information item in their [reified] properties.

The wording is a bit awkward for my point about synonyms but I take it that if two topic items had

at least one equal string in their [subject identifiers] properties,

I could substitute:

at least one equal string in their [item identifiers] properties, (in all relevant places)

and have the same effect.
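To make the substitution point concrete, here is a minimal sketch of the TMDM equality test as I read it (property names abbreviated, the [reified] clause omitted). The point is that the first four clauses are interchangeable sources of the same “merge these topics” signal:

    def same_topic(a, b):
        """Rough reading of the TMDM topic-equality rules (reified-property clause omitted).

        Each topic item is a dict with sets under 'subject_identifiers',
        'item_identifiers' and 'subject_locators'.
        """
        return bool(
            a["subject_identifiers"] & b["subject_identifiers"]
            or a["item_identifiers"] & b["item_identifiers"]
            or a["subject_locators"] & b["subject_locators"]
            or a["subject_identifiers"] & b["item_identifiers"]
            or a["item_identifiers"] & b["subject_identifiers"]
        )

    t1 = {"subject_identifiers": {"http://example.org/id/puccini"},
          "item_identifiers": set(), "subject_locators": set()}
    t2 = {"subject_identifiers": set(),
          "item_identifiers": {"http://example.org/id/puccini"},
          "subject_locators": set()}

    print(same_topic(t1, t2))   # True: the cross clause fires, a synonym for the first two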

I am going to be exploring the use of synonym based processing for TMDM governed topic maps.

Any thoughts or insights would be greatly appreciated.

Tika – A content analysis toolkit

Filed under: Content Analysis,Tika — Patrick Durusau @ 9:59 pm

Tika – A content analysis toolkit

From the webpage:

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries. You can find the latest release on the download page. See the Getting Started guide for instructions on how to start using Tika.

From the supported formats page:

  • HyperText Markup Language
  • XML and derived formats
  • Microsoft Office document formats
  • OpenDocument Format
  • Portable Document Format
  • Electronic Publication Format
  • Rich Text Format
  • Compression and packaging formats
  • Text formats
  • Audio formats
  • Image formats
  • Video formats
  • Java class files and archives
  • The mbox format

One suspects that even the vastness of “dark data” has a finite number of formats.

Tika may not cover all of them, but perhaps enough to get you started.
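If you want to try it without writing any Java, the runnable tika-app jar can be driven from a script. A sketch along those lines, assuming the jar has already been downloaded and using the command-line flags as I recall them:

    import subprocess

    TIKA_JAR = "tika-app.jar"    # assumed to be downloaded already

    def extract_text(path):
        """Plain text content, as Tika's parsers see it."""
        return subprocess.run(["java", "-jar", TIKA_JAR, "--text", path],
                              capture_output=True, text=True, check=True).stdout

    def extract_metadata(path):
        """Per-document metadata lines (format, author, dates, ...)."""
        return subprocess.run(["java", "-jar", TIKA_JAR, "--metadata", path],
                              capture_output=True, text=True, check=True).stdout

    print(extract_metadata("some-report.pdf"))
    print(extract_text("some-report.pdf")[:500])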

Are visual dictionaries generalizable?

Filed under: Classification,Dictionary,Image Recognition,Information Retrieval — Patrick Durusau @ 7:54 pm

Are visual dictionaries generalizable? by Otavio A. B. Penatti, Eduardo Valle, and Ricardo da S. Torres

Abstract:

Mid-level features based on visual dictionaries are today a cornerstone of systems for classification and retrieval of images. Those state-of-the-art representations depend crucially on the choice of a codebook (visual dictionary), which is usually derived from the dataset. In general-purpose, dynamic image collections (e.g., the Web), one cannot have the entire collection in order to extract a representative dictionary. However, based on the hypothesis that the dictionary reflects only the diversity of low-level appearances and does not capture semantics, we argue that a dictionary based on a small subset of the data, or even on an entirely different dataset, is able to produce a good representation, provided that the chosen images span a diverse enough portion of the low-level feature space. Our experiments confirm that hypothesis, opening the opportunity to greatly alleviate the burden in generating the codebook, and confirming the feasibility of employing visual dictionaries in large-scale dynamic environments.

The authors use the Caltech-101 image set because of its “diversity.” Odd because they cite the Caltech-256 image set, which was created to answer concerns about the lack of diversity in the Caltech-101 image set.

Not sure this paper answers the issues it raises about visual dictionaries.

I wanted to bring it to your attention because representative dictionaries (as opposed to comprehensive ones) may be lurking just beyond the semantic horizon.
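For readers new to the idea: a visual dictionary is typically built by clustering local descriptors and then describing each image as a histogram over the resulting “visual words.” A bare-bones sketch of the mechanics (not the paper's experiments), with random stand-in descriptors:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)

    # Stand-ins for SIFT-like local descriptors sampled from a training collection.
    training_descriptors = rng.normal(size=(5000, 128))

    # The "visual dictionary": k cluster centres in descriptor space.
    k = 64
    codebook = KMeans(n_clusters=k, n_init=10, random_state=0).fit(training_descriptors)

    def bag_of_visual_words(image_descriptors):
        """Encode an image as a normalized histogram over the codebook's visual words."""
        words = codebook.predict(image_descriptors)
        hist = np.bincount(words, minlength=k).astype(float)
        return hist / hist.sum()

    new_image = rng.normal(size=(300, 128))       # descriptors from one unseen image
    print(bag_of_visual_words(new_image)[:10])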

Zero Tolerance Search : 24 year old neuroscientist

Filed under: Search Engines,Searching — Patrick Durusau @ 6:48 pm

Zero Tolerance Search : 24 year old neuroscientist

Matthew Hurst writes:

[The idea behind ‘zero tolerance search’ posts is to illustrate real life search interactions that show how far we have to go in leveraging the explicit and implicit data in the web and elsewhere.]

Yesterday, I heard part of an interview on NPR. The interview was around a new book on determinism and neuroscience. The only thing I remember about the author was his young age. I wanted to recover the name of the author and the title of his new book so that I could comment on his argument against determinism (which was, essentially, ‘I’m afraid of determinism therefore it can’t be right’).

Matthew continues to outline how the text matching of major search engines fails.

How would you improve the results?

Dark Data

Filed under: BigData,Lucene,LucidWorks,Solr — Patrick Durusau @ 6:37 pm

Lucid Imagination Combines Search, Analytics and Big Data to Tackle the Problem of Dark Data

This post was too well written to break up as quotes/excerpts. I am re-posting it in full.

Organizations today have little to no idea how much lost opportunity is hidden in the vast amounts of data they’ve collected and stored.  They have entered the age of total data overload driven by the sheer amount of unstructured information, also called “dark” data, which is contained in their stored audio files, text messages, e-mail repositories, log files, transaction applications, and various other content stores.  And this dark data is continuing to grow, far outpacing the ability of the organization to track, manage and make sense of it.

Lucid Imagination, a developer of search, discovery and analytics software based on Apache Lucene and Apache Solr technology, today unveiled LucidWorks Big Data. LucidWorks Big Data is the industry’s first fully integrated development stack that combines the power of multiple open source projects including Hadoop, Mahout, R and Lucene/Solr to provide search, machine learning, recommendation engines and analytics for structured and unstructured content in one complete solution available in the cloud.

Tweet This: Lucid Imagination combines #search, analytics and #BigData in complete stack. Beta now open http://ow.ly/aMHef

With LucidWorks Big Data, Lucid Imagination equips technologists and business users with the ability to initially pilot Big Data projects utilizing technologies such as Apache Lucene/Solr, Mahout and Hadoop, in a cloud sandbox. Once satisfied, the project can remain in the cloud, be moved on premise or executed within a hybrid configuration.  This means they can avoid the staggering overhead costs and long lead times associated with infrastructure and application development lifecycles prior to placing their Big Data solution into production.

The product is now available in beta. To sign up for inclusion in the beta program, visit http://www.lucidimagination.com/products/lucidworks-search-platform/lucidworks-big-data.

Dark Data Problem Is Real

How big is the problem of dark data? The total amount of digital data in the world will reach 2.7 zettabytes in 2012, a 48 percent increase from 2011.* 90 percent of this data will be unstructured or “dark” data. Worldwide, 7.5 quintillion bytes of data, enough to fill over 100,000 Libraries of Congress get generated every day. Conversely, that deep volume of data can serve to help predict the weather, uncover consumer buying patterns or even ease traffic problems – if discovered and analyzed proactively.

“We see a strong opportunity for search to play a key role in the future of data management and analytics,” said Matthew Aslett, research manager, data management and analytics, 451 Research. “Lucid’s Big Data offering, and its combination of large-scale data storage in Hadoop with Lucene/Solr-based indexing and machine-learning capabilities, provides a platform for developing new applications to tackle emerging data management challenges.”

LucidWorks Big Data

Data analytics has traditionally been the domain of business intelligence technologies. Most of these tools, however, have been designed to handle structured data such as SQL, and cannot easily tap into the broad range of data types that can be used in a Big Data application. With the announcement of LucidWorks Big Data, organizations will be able to utilize a single platform for their Big Data search, discovery and analytics needs. LucidWorks Big Data is the only complete platform that:

  • Combines the real time, ad hoc data accessibility of LucidWorks (Lucene/Solr) with compute and storage capabilities of Hadoop
  • Delivers commonly used analytic capabilities along with Mahout’s proven, scalable machine learning algorithms for deeper insight into both content and users
  • Tackles data, both big and small with ease, seamlessly scaling while minimizing the impact of provisioning Hadoop, LucidWorks and other components
  • Supplies a single, coherent, secure and well documented REST API for both application integration and administration
  • Offers fault tolerance with data safety baked in
  • Provides choice and flexibility, via on premise, cloud hosted or hybrid deployment solutions
  • Is tested, integrated and fully supported by the world’s leading experts in open source search.
  • Includes powerful tools for configuration, deployment, content acquisition, security, and search experience that is packaged in a convenient, well-organized application

Lucid Imagination’s Open Search Platform uncovers real-time insights from any enterprise data, whether structured in databases, unstructured in formats such as emails or social channels, or semi-structured from sources such as websites.  The company’s rich portfolio of enterprise-grade solutions is based on the same proven open source Apache Lucene/Solr technology that powers many of the world’s largest e-commerce sites. Lucid Imagination’s on-premise and cloud platforms are quicker to deploy, cost less than competing products and are more easily tailored to specific needs than business intelligence solutions because they leverage innovation from the open source community.  

“We’re allowing a broad set of enterprises to test and implement data discovery and analysis projects that have historically been the province of large multinationals with large data centers. Cloud computing and LucidWorks Big Data finally level the field,” said Paul Doscher, CEO of Lucid Imagination. “Large companies, meanwhile, can use our Big Data stack to reduce the time and cost associated with evaluating and ultimately implementing big data search, discovery and analysis. It’s their data – now they can actually benefit from it.”

Multilingual Natural Language Processing Applications: From Theory to Practice

Filed under: Multilingual,Natural Language Processing — Patrick Durusau @ 12:35 pm

Multilingual Natural Language Processing Applications: From Theory to Practice by Daniel Bikel and Imed Zitouni.

From the description:

Multilingual Natural Language Processing Applications is the first comprehensive single-source guide to building robust and accurate multilingual NLP systems. Edited by two leading experts, it integrates cutting-edge advances with practical solutions drawn from extensive field experience.

Part I introduces the core concepts and theoretical foundations of modern multilingual natural language processing, presenting today’s best practices for understanding word and document structure, analyzing syntax, modeling language, recognizing entailment, and detecting redundancy.

Part II thoroughly addresses the practical considerations associated with building real-world applications, including information extraction, machine translation, information retrieval/search, summarization, question answering, distillation, processing pipelines, and more.

This book contains important new contributions from leading researchers at IBM, Google, Microsoft, Thomson Reuters, BBN, CMU, University of Edinburgh, University of Washington, University of North Texas, and others.

Coverage includes

Core NLP problems, and today’s best algorithms for attacking them

  • Processing the diverse morphologies present in the world’s languages
  • Uncovering syntactical structure, parsing semantics, using semantic role labeling, and scoring grammaticality
  • Recognizing inferences, subjectivity, and opinion polarity
  • Managing key algorithmic and design tradeoffs in real-world applications
  • Extracting information via mention detection, coreference resolution, and events
  • Building large-scale systems for machine translation, information retrieval, and summarization
  • Answering complex questions through distillation and other advanced techniques
  • Creating dialog systems that leverage advances in speech recognition, synthesis, and dialog management
  • Constructing common infrastructure for multiple multilingual text processing applications

This book will be invaluable for all engineers, software developers, researchers, and graduate students who want to process large quantities of text in multiple languages, in any environment: government, corporate, or academic.

I could not bring myself to buy it for Carol (Mother’s Day) so I will have to wait for Father’s Day (June). 😉

If you get it before then, comments welcome!

May 12, 2012

TREC 2012 Crowdsourcing Track

Filed under: Crowd Sourcing,TREC — Patrick Durusau @ 6:22 pm

TREC 2012 Crowdsourcing Track

Panos Ipeirotis writes:

TREC 2012 Crowdsourcing Track – Call for Participation
 June 2012 – November 2012

https://sites.google.com/site/treccrowd/

Goals

As part of the National Institute of Standards and Technology (NIST)’s annual Text REtrieval Conference (TREC), the Crowdsourcing track investigates emerging crowd-based methods for search evaluation and/or developing hybrid automation and crowd search systems.

This year, our goal is to evaluate approaches to crowdsourcing high quality relevance judgments for two different types of media:

  1. textual documents
  2. images

For each of the two tasks, participants will be expected to crowdsource relevance labels for approximately 20k topic-document pairs (i.e., 40k labels when taking part in both tasks). In the first task, the documents will be from an English news text corpora, while in the second task the documents will be images from Flickr and from a European news agency.

Participants may use any crowdsourcing methods and platforms, including home-grown systems. Submissions will be evaluated against a gold standard set of labels and against consensus labels over all participating teams.

Tentative Schedule

  • Jun 1: Document corpora, training topics (for image task) and task guidelines available
  • Jul 1: Training labels for the image task
  • Aug 1: Test data released
  • Sep 15: Submissions due
  • Oct 1: Preliminary results released
  • Oct 15: Conference notebook papers due
  • Nov 6-9: TREC 2012 conference at NIST, Gaithersburg, MD, USA
  • Nov 15: Final results released
  • Jan 15, 2013: Final papers due
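The call mentions evaluation against consensus labels over all participating teams. For a rough sense of what that might involve, here is a minimal sketch of majority voting plus a simple agreement score against a gold standard; this is my own construction, not the track's scoring protocol:

    # Relevance labels per (topic, document) pair from several teams: 1 = relevant.
    team_labels = {
        "team_a": {("t1", "d1"): 1, ("t1", "d2"): 0, ("t2", "d1"): 1},
        "team_b": {("t1", "d1"): 1, ("t1", "d2"): 1, ("t2", "d1"): 0},
        "team_c": {("t1", "d1"): 0, ("t1", "d2"): 1, ("t2", "d1"): 1},
    }
    gold = {("t1", "d1"): 1, ("t1", "d2"): 1, ("t2", "d1"): 0}

    def consensus(team_labels):
        """Majority vote per (topic, document) pair across all teams that labeled it."""
        pairs = set().union(*(labels.keys() for labels in team_labels.values()))
        result = {}
        for pair in pairs:
            votes = [labels[pair] for labels in team_labels.values() if pair in labels]
            result[pair] = int(sum(votes) > len(votes) / 2)
        return result

    def accuracy(labels, reference):
        shared = labels.keys() & reference.keys()
        return sum(labels[p] == reference[p] for p in shared) / len(shared)

    print(consensus(team_labels))
    print("agreement with gold:", accuracy(consensus(team_labels), gold))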

As you know, I am interested in crowd sourcing of paths through data and assignment of semantics.

Although I am puzzled: why do we continue to put emphasis on post-creation assignment of semantics?

After data is created, we look around, surprised that the data has no explicit semantics.

Like realizing you are on Main Street without your pants.

Why don’t we look to the data creation process to assign explicit semantics?

Thoughts?

Initial HTTP Speed+Mobility Open Source Prototype Now Available for Download

Filed under: HTTP Speed+Mobility,Interface Research/Design — Patrick Durusau @ 4:35 pm

Initial HTTP Speed+Mobility Open Source Prototype Now Available for Download

From the post:

Microsoft Open Technologies, Inc. has just published an initial open source prototype implementation of HTTP Speed+Mobility. The prototype is available for download on html5labs.com, where you will also find pointers to the source code.

The IETF HTTPbis workgroup met in Paris at the end of March to discuss how to approach HTTP 2.0 in order to meet the needs of an ever larger and more diverse web. It would be hard to downplay the importance of this work: it will impact how billions of devices communicate over the internet for years to come, from low-powered sensors, to mobile phones, to tablets, to PCs, to network switches, to the largest datacenters on the planet.

Prior to that IETF meeting, Jean Paoli and Sandeep Singhal announced in their post to the Microsoft Interoperability blog that Microsoft has contributed the HTTP Speed+Mobility proposal as input to that conversation.

The prototype implements the websocket-based session layer described in the proposal, as well as parts of the multiplexing logic incorporated from Google’s SPDY proposal. The code does not support header compression yet, but it will in upcoming refreshes.

The open source software comprises a client implemented in C# and a server implemented in Node.js running on Windows Azure. The client is a command line tool that establishes a connection to the server and can download a set of web pages that include html files, scripts, and images. We have made available on the server some static versions of popular web pages like http://www.microsoft.com and http://www.ietf.org, as well as a handful of simpler test pages.

I have avoided having a cell phone, much less a smart phone, all these years.

Now it looks like, to evaluate/test semantic applications, including topic maps, I am going to have to get one.

Thanks Jean and Sandeep! 😉
