Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

October 21, 2012

Relational Data to RDF [Bridge to Nowhere?]

Filed under: R2RML,RDF,SPARQL — Patrick Durusau @ 4:13 pm

Transforming Relational Data to RDF – R2RML Becomes Official W3C Recommendation by Eric Franzon.

From the post:

Today, the World Wide Web Consortium announced that R2RML has achieved Recommendation status. As stated on the W3C website, R2RML is “a language for expressing customized mappings from relational databases to RDF datasets. Such mappings provide the ability to view existing relational data in the RDF data model, expressed in a structure and target vocabulary of the mapping author’s choice.” In the life cycle of W3C standards creation, today’s announcement means that the specifications have gone through extensive community review and revision and that R2RML is now considered stable enough for wide-spread distribution in commodity software.

Richard Cyganiak, one of the Recommendation’s editors, explained why R2RML is so important. “In the early days of the Semantic Web effort, we’ve tried to convert the whole world to RDF and OWL. This clearly hasn’t worked. Most data lives in entrenched non-RDF systems, and that’s not likely to change.”

“That’s why technologies that map existing data formats to RDF are so important,” he continued. “R2RML builds a bridge between the vast amounts of existing data that lives in SQL databases and the SPARQL world. Having a standard for this makes SPARQL even more useful than it already is, because it can more easily access lots of valuable existing data. It also means that database-to-RDF middleware implementations can be more easily compared, which will create pressure on both open-source and commercial vendors, and will increase the level of play in the entire field.” (emphasis added)
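To make the pipeline concrete, here is a minimal sketch of querying RDF that an R2RML processor has already produced from a relational table. It uses rdflib and an example vocabulary of my own; neither appears in the post or the Recommendation.

```python
# Minimal sketch (my own, not from the post): SPARQL over RDF that an
# R2RML processor has already generated from a relational table.
# "employees.ttl" and the ex: vocabulary are assumptions for illustration.
from rdflib import Graph

g = Graph()
g.parse("employees.ttl", format="turtle")   # hypothetical R2RML output

query = """
PREFIX ex: <http://example.com/ns#>
SELECT ?name ?dept
WHERE {
  ?emp a ex:Employee ;
       ex:name ?name ;
       ex:department ?dept .
}
"""

for name, dept in g.query(query):
    print(name, dept)
```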

If most data resides in non-RDF systems, what do I gain by converting it into RDF for querying with SPARQL?

Some possible costs:

  • Planning the conversion from non-RDF to RDF system
  • Debugging the conversion (unless it is trivial, the first few conversions won’t be right)
  • Developing the SPARQL queries
  • Debugging the SPARQL queries
  • Updating the conversion if new data is added to the source
  • Testing the SPARQL query against updated data
  • Maintenance of both the source and the target RDF systems (unless pushing SPARQL is a way to urge migration away from the relational system)

Or to put it another way, if most data is still on non-RDF data stores, why do I need a bridge to SPARQL world?

Or is this a Sarah Palin bridge to nowhere?

DEITY Launches Indian Search

Filed under: Language,Topic Maps,Use Cases — Patrick Durusau @ 3:42 pm

DEITY Launches Indian Search by Angela Guess.

From the post:

Tech2 reports, “The Department of Electronics and Information Technology (DEITY) unveiled Internet search engine, Sandhan, yesterday to assist users searching for tourism-related information across websites. Sandhan will provide search results to user queries in five Indian languages – Bengali, Hindi, Marathi, Tamil and Telugu.

[Which further quotes Tech2:] With this service, the government aims to plug the wide gap that exists “in fulfilling the information needs of Indians not conversant with English, estimated at 90 percent of the population.”

Let’s see: 1,220,200,000 (Wikipedia, Demographics of India, 2012 estimated population) x 90% (not conversant with English) = Potential consumer population of 1,098,180,000.

In case you are curious:

1,344,130,000 (Demographics of China, 2012 estimated population) is reported to have two hundred and ninety-two (292) living languages.

503,500,000 (Demographics of EU, 2012 estimated population) has twenty-three (23) official languages.

Wikipedia has two hundred and eighty-five (285) different language editions.

No shortage of need. The question is: who has enough to gain to pay the cost of mapping?

Full stack HA in Hadoop 1: HBase’s Resilience to Namenode Failover

Filed under: Hadoop,HBase,Systems Administration — Patrick Durusau @ 2:22 pm

Full stack HA in Hadoop 1: HBase’s Resilience to Namenode Failover by Devaraj Das.

From the post:

In this blog, I’ll cover how we tested Full Stack HA with NameNode HA in Hadoop 1 with Hadoop and HBase as components of the stack.

Yes, NameNode HA is finally available in the Hadoop 1 line. The test was done with Hadoop branch-1 and HBase-0.92.x on a cluster of roughly ten nodes. The aim was to try to keep a really busy HBase cluster up in the face of the cluster’s NameNode repeatedly going up and down. Note that HBase would be functional during the time the NameNode would be down. It’d only affect those operations that require a trip to the NameNode (for example, rolling of the WAL, or compaction, or flush), and those would affect only the relevant end users (a user using the HBase get API may not be affected if that get didn’t require a new file open, for example).
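If you want to run a similar exercise yourself, a crude failover loop might look like the sketch below. The daemon script path, the HBase shell invocation, and the timings are my assumptions about a Hadoop 1 / HBase 0.92-era install, not commands taken from the post.

```python
# Illustrative only: repeatedly bounce the NameNode while pushing writes
# through HBase. Paths and timings are assumptions, not from the post.
import subprocess
import time

HADOOP_DAEMON = "/usr/lib/hadoop/bin/hadoop-daemon.sh"   # assumed location

def hbase_put(i):
    # Pipe a single put through the HBase shell; failures are reported,
    # not fatal, since some operations may stall while the NameNode is down.
    cmd = f"put 'ha_test', 'row{i}', 'f:c', 'v{i}'\n"
    return subprocess.run(["hbase", "shell"], input=cmd,
                          text=True, capture_output=True)

for round_no in range(10):
    subprocess.run([HADOOP_DAEMON, "stop", "namenode"], check=False)
    time.sleep(30)                      # leave the NameNode down for a while
    subprocess.run([HADOOP_DAEMON, "start", "namenode"], check=False)
    time.sleep(60)                      # give it time to leave safe mode
    result = hbase_put(round_no)
    print(round_no, result.returncode)
```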

A non-reliable cluster is just that, a non-reliable cluster. Not as bad as a backup that may or may not restore your data, but almost.

Regularly and routinely test any alleged HA capability along with backup restore capability. Document that testing.

As opposed to “testing” when either has to work or critical operations will fail or critical data will be lost.*

*Not Miller time but résumé time.

Just Joking: An Irreverent Look At Tech News

Filed under: Humor — Patrick Durusau @ 1:46 pm

Just Joking: An Irreverent Look At Tech News by Fritz Nelson.

From the post:

Sometimes we take ourselves too seriously in this profession … after all, it’s technology. It’s not at all funny. Or is it? Maybe not, but the people and the companies are pretty funny, or at least deserve to be made fun of. Every year at the InformationWeek 500, we kick off the awards ceremony with a fun look back at the year in technology news.

Also, each month, we do the same to kick off our Valley View live Web TV program (the next show is October 24 at 11 a.m. PT). The two video clips below are from the InformationWeek 500 event, and from the September 26 Valley View.

Rock groups “arrive” by appearing on the cover of the Rolling Stone.

Topic maps will “arrive” by appearing in the comedy portion of InformationWeek 500.

Start refining jokes today!

Collaborative Systems: Easy To Miss The Mark

Filed under: Collaboration,Project Management,Requirements,Use Cases,Users — Patrick Durusau @ 10:17 am

Collaborative Systems: Easy To Miss The Mark by Jacob Morgan.

From the post:

Map out use cases defining who you want collaborating and what results you want them to achieve. Skip this step in the beginning, and you’ll regret it in the end.

One of the things that organizations really need to consider when evaluating collaborative solutions is their use cases. Not only that, but also understanding the outcomes of those use cases and how they can map to a desired feature requirement. Use cases really help put things into perspective for companies who are seeking to understand the “why” before they figure out the “how.”

That’s what a use case is: the distilled essence of a role within your organization, how it will interact with some system, and the expected or desired result. Developing use cases makes your plans, requirements, and specifications less abstract because it forces you to come up with specific examples.

This is why we created a framework (inspired by Gil Yehuda) to address this. It breaks down as follows:

  • Identify the overall business problem you are looking to solve (typically there are several).
  • Narrow down the problem into specific use cases; each problem has several use cases.
  • Describe the situation that needs to be present for that use case to be applicable.
  • Clarify the desired action.
  • State the desired result.

For topic maps I would write:

Map out use cases defining what data you want to identify and/or integrate and what results you expect from that identification or integration. Skip this step in the beginning, and you’ll regret it in the end.

If you don’t have an expectation of a measurable result (in businesses a profitable one), your efforts at semantic integration are premature.
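To make that concrete, a use case recorded against the framework quoted above might look like the sketch below; the field names and the example content are mine, not Morgan’s.

```python
# A hypothetical use case captured in the problem / use case / situation /
# action / result framework quoted above. All content here is invented
# for illustration.
use_case = {
    "business_problem": "Customer records are duplicated across CRM and billing",
    "use_case": "Merge customer identities for the sales team",
    "situation": "A rep looks up a customer who appears under different keys "
                 "in each system",
    "desired_action": "Map both records to one subject and show a single view",
    "desired_result": "One profile per real-world customer, measured as a "
                      "drop in duplicate records per month",
}

def has_measurable_result(uc):
    # The warning above, restated: no measurable result, premature effort.
    return "measured" in uc["desired_result"]

print(has_measurable_result(use_case))   # True
```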

How will you know when you have reached the end of a particular effort?

The personal cloud series

Filed under: Cloud Computing,Users,WWW — Patrick Durusau @ 9:52 am

The personal cloud series by Jon Udell.

Excellent source of ideas on the web/cloud as we experience it today and as we may experience it tomorrow.

Going through prior posts now and will call some of them out for further discussion.

Which ones impress you the most?

The Last Semantic Mile

Filed under: BigData,Marketing,Topic Maps — Patrick Durusau @ 9:31 am

Unless we conquer the last semantic mile, big data will be an expensive lesson in hardware, software and missed opportunities.

Consider Sid Probstein’s take on Simplify Big Data – Or It’ll Be Useless for Sales (Lareina Yee and Jigar Patel of McKinsey & Company):

I am tempted to refer to this as using a venerable telecom term: “the last mile”. Investing in analytics to help optimize business is a good thing, but without also considering how the information will be integrated, correlated and accessed, the benefit will be severely limited. Here are some concrete steps that companies can take to improve in this regard:

  • Take inventory. Even the mere attempt at cataloging will help identify opportunities for massive productivity increases. In the world of sales, those translate directly to revenue. Cost savings can also be uncovered here. This is the role Attivio plays at a major financial services firm: receiving, versioning, analyzing and tagging data sets as they come in. End users use SQL or search queries to correlate the information at query time, and can seamlessly employ their choice of analytic tool (e.g. QlikView, Tibco Spotfire or Tableau).
  • Push information — not data. Most end users don’t want data; they want information that helps them make a decision. Identify key people (like Maria in the article) and find out why they do the analysis they do, and move that process upstream. The day they get the analysis, not the data, will be the day they become 10x more productive. As the article concludes, it’s the key to “mask all that complexity”.
  • Think in segments. One thing I really liked about the Forbes article is that it focuses on data-driven selling. Customer service, regulatory risk, internal performance management — each of these has its own sources and methods. Understanding and organizing your inventory in this way is key to understanding where to focus.
    (emphasis added)

    How would you use topic maps to “simplify” big data for sales? (Using their concerns and terminology is recommended.)

    The block quote and inspiration from: Simplify, Simplify, Simplify. Three Key Steps to Big Data Business Value by Sid Probstein, Attivio.

    Economic Opportunities for Topic Maps

    Filed under: Marketing,Topic Maps — Patrick Durusau @ 4:49 am

    Alex Williams reports In Big Data To Drive $232 Billion In IT Spending Through 2016 that:

    Big data will drive $232 billion in spending through 2016. It will directly or indirectly drive $96 billion of worldwide IT spending in 2012, and is forecast to drive $120 billion of IT spending in 2013.

    …They draw several conclusions from their research:

    • Big data is not a distinct market. More so, data is everywhere, impacting business in any imaginable way. Its influx will force a change in products, practices and solutions. The change is so rapid that companies may have to retire early existing solutions that are not up to par.
    • In 2012, “IT spending driven by big data functional demands will total $28 billion.” Most of that will go toward adapting existing solutions to new demands driven by machine data, social data and the unpredictable velocity that comes with it.
    • Making big data something that has a functional use will drive $4.3 billion in software sales in 2012. The balance will go toward IT services such as outside experts and internal staff.
    • New spending will go toward social media, social network analysis and content analytics with up to 45% of new spending each year.
    • It will cost a significant amount in services to support big data efforts — as much as 20 times higher relative to software purchases. People with the right skill sets are rare and in high demand.

    All of that sounds like music to my topic map ears:

    • “…not a distinct market.” Not surprising: people want their data to make sense, including with other data. Translates into almost limitless potential application areas for topic maps. At least the ones that can pay the freight.
    • “…adapting existing solutions to new demands…” Going to be hard without understanding the semantics of data and the existing solutions.
    • “…[m]aking big data something that has a functional use…” One of my favorites. Simply having big data isn’t enough.
    • “New spending will go toward social media…” Easier to make the case that same string != same semantic in social media.

      (Apologies to those with the “same string = same semantic” business models. You can fool some of the people some of the time….)

    • “…services…as much as 20 times higher relative to software purchases.” Twenty? A little on the low side but I would say it’s a good starting point for a discussion of professional semantic services.

    You?

    Successful Dashboard Design

    Filed under: Dashboard,Interface Research/Design — Patrick Durusau @ 3:59 am

    Following formulaic rules will not make you a good author. Studying the work of good authors may, no guarantees, give you the skills to be a good author. The same is true of interface/dashboard design.

    Examples of good dashboard design and why certain elements are thought to be exemplary can be found in: 2012 Perceptual Edge Dashboard Design Competition: We Have a Winner! by Stephen Few.

    Unlike a traditional topic map node/arc display, these designs allow quick comparison of information between subjects.

    Even if a topic map underlies the presentation, the nature of the data and expectations of your users will (should) be driving the presentation.

    Looking forward to the appearance of the second edition of Information Dashboard Design (by Stephen Few), which will incorporate examples from this contest.

    October 20, 2012

    US presidential election fundraising: help us explore the FEC data

    Filed under: FEC,Government,Government Data — Patrick Durusau @ 4:22 pm

    US presidential election fundraising: help us explore the FEC data by Simon Rogers.

    From the post:

    Interactive: Which candidate has raised the most cash? Where do the donors live? Find your way around the latest data from the Federal Election Commission with this interactive graphic by Craig Bloodworth at the Information Lab and Andy Cotgreave of Tableau.

    • What can you find in the data? Let us know in the comments below

    Being able to track donations against who gets face time with the president would be more helpful.

    Would enable potential donors to gauge how much to donate for X amount of face time.

    Until then, practice with this data.

    PivotPaths: a Fluid Exploration of Interlinked Information Collections

    Filed under: Graphics,Navigation,PivotPaths,Visualization — Patrick Durusau @ 3:58 pm

    PivotPaths: a Fluid Exploration of Interlinked Information Collections

    From Information Aesthetics:

    PivotPaths [mariandoerk.de], developed by Marian Dörk and several academic collaborators, is an interactive visualization for exploring the interconnections between multiple resources. In its current demo rendition, the visualization is linked to an academic publication database, so one can filter for a specific research keyword or the name of an academic researcher.

    PivotPaths was particularly designed in such a way that it should encourage users to “take a stroll” in terms of interacting with the information and serendipitously discovering patterns that are worthwhile. PivotPaths took its name through its prominent use of “pivot operations”: lightweight interaction techniques that trigger gradual and animated transitions between views.

    More detailed information can be found here. PivotPaths was today presented at the IEEE Infovis 2012 conference in Seattle.

    See the original post for the image that I mistook for a presentation from a topic map.

    The need for information navigation has increased since the start of ISO 13250 and continues to do so.

    How Google’s Dremel Makes Quick Work of Massive Data

    Filed under: BigData,Dremel,Intelligence — Patrick Durusau @ 3:13 pm

    How Google’s Dremel Makes Quick Work of Massive Data by Ian Armas Foster.

    From the post:

    The ability to process more data and the ability to process data faster are usually mutually exclusive. According to Armando Fox, professor of computer science at University of California at Berkeley, “the more you do one, the more you have to give up on the other.”

    Hadoop, an open-source, batch processing platform that runs on MapReduce, is one of the main vehicles organizations are driving in the big data race.

    However, Mike Olson, CEO of Cloudera, an important Hadoop-based vendor, is looking past Hadoop and toward today’s research projects. That includes one named Dremel, possibly Google’s next big innovation that combines the scale of Hadoop with the ever-increasing speed demands of the business intelligence world.

    “People have done Big Data systems before,” Fox said, “but before Dremel, no one had really done a system that was that big and that fast.”

    On Dremel, see: Dremel: Interactive Analysis of Web-Scale Datasets, as well.

    Are you looking (or considering looking) beyond Hadoop?

    Accuracy and timeliness beyond the average daily intelligence briefing will drive demand for your information product.

    Your edge is agility. Use it.

    Sneak Peek into Skybox Imaging’s Cloudera-powered Satellite System [InaaS?]

    Filed under: BigData,Cloudera,Geographic Data,Geography,Intelligence — Patrick Durusau @ 3:02 pm

    Sneak Peek into Skybox Imaging’s Cloudera-powered Satellite System by Justin Kestelyn (@kestelyn)

    This is a guest post by Oliver Guinan, VP Ground Software, at Skybox Imaging. Oliver is a 15-year veteran of the internet industry and is responsible for all ground system design, architecture and implementation at Skybox.

    One of the great promises of the big data movement is using networks of ubiquitous sensors to deliver insights about the world around us. Skybox Imaging is attempting to do just that for millions of locations across our planet.

    Skybox is developing a low cost imaging satellite system and web-accessible big data processing platform that will capture video or images of any location on Earth within a couple of days. The low cost nature of the satellite opens the possibility of deploying tens of satellites which, when integrated together, have the potential to image any spot on Earth within an hour.

    Skybox satellites are designed to capture light in the harsh environment of outer space. Each satellite captures multiple images of a given spot on Earth. Once the images are transferred from the satellite to the ground, the data needs to be processed and combined to form a single image, similar to those seen within online mapping portals.

    With any sensor network, capturing raw data is only the beginning of the story. We at Skybox are building a system to ingest and process the raw data, allowing data scientists and end users to ask arbitrary questions of the data, then publish the answers in an accessible way and at a scale that grows with the number of satellites in orbit. We selected Cloudera to support this deployment.

    Now is the time to start planning topic map based products that can incorporate this type of data.

    There are lots of folks who are “curious” about what is happening next door, in the next block, a few “klicks” away, across the border, etc.

    Not all of them have the funds for private “keyhole” satellites and vacuum data feeds. But they may have money to pay you for efficient and effective collation of intelligence data.

    Topic maps empowering “Intelligence as a Service (InaaS)”?

    Why oDesk has no spammers

    Filed under: Crowd Sourcing,oDesk — Patrick Durusau @ 2:37 pm

    Why oDesk has no spammers by Panos Ipeirotis.

    From the post:

    So, in my last blog post, I described a brief outline on how to use oDesk to execute automatically a set of tasks, in a “Mechanical Turk” style (i.e., no interviews for hiring and completely computer-mediated process for posting a job, hiring, and ending a contract).

    A legitimate question appeared in the comments:

    “Well, the concept is certainly interesting. But is there a compelling reason to do microtasks on oDesk? Is it because oDesk has a rating system?”

    So, here is my answer: If you hire contractors on oDesk you will not run into any spammers, even without any quality control. Why is that? Is there a magic ingredient at oDesk? Short answer: Yes, there is an ingredient: Lack of anonymity!

    Well, when you put it that way. 😉

    Question: How open are your topic maps?

    Question: Would you use lack of anonymity to prevent spam in a publicly curated topic map?

    Question: If we want a lack of anonymity to provide transparency and accountability in government, why isn’t that the case with public speech?

    10 Caveats Neo4j users should be familiar with

    Filed under: Graphs,Neo4j — Patrick Durusau @ 2:17 pm

    10 Caveats Neo4j users should be familiar with by Yaron Naveh.

    Yaron discovered some aspects of Neo4j the hard way and shares them here.

    If you want to see Michael Hunger’s comments (Neo4j), see the original post (above).

    I first saw this at DZone.

    Cheat Sheet: Cypher Query Language for Graph Databases [Marketing Question]

    Filed under: Cypher,Graphs,Neo4j — Patrick Durusau @ 1:58 pm

    Cheat Sheet: Cypher Query Language for Graph Databases

    Cypher is the declarative query language for the Neo4j graph database. Download this cheat sheet to get quickly up to speed on querying graphs and to learn how Cypher:

    • Matches patterns of nodes and relationships in the graph, to extract information or modify the data.
    • Has the concept of identifiers which denote named, bound elements and parameters.
    • Can mutate graph data by creating, updating, and removing nodes, relationships, and properties.
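    If you have not seen Cypher in use, the sketch below exercises those three bullets from Python. It uses the current official neo4j driver (my choice, not part of the cheat sheet); the URI, credentials, and sample data are assumptions.

```python
# A minimal sketch of the cheat sheet's three bullets: pattern matching,
# identifiers and parameters, and graph mutation. Connection details and
# data are assumptions for illustration.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

with driver.session() as session:
    # Mutate the graph: create nodes, a relationship, and properties.
    session.run(
        "CREATE (a:Person {name: $a})-[:KNOWS]->(b:Person {name: $b})",
        a="Alice", b="Bob",
    )
    # Match a pattern: 'p' and 'friend' are identifiers bound to elements,
    # and $name is a parameter.
    result = session.run(
        "MATCH (p:Person {name: $name})-[:KNOWS]->(friend) "
        "RETURN friend.name AS friend_name",
        name="Alice",
    )
    for record in result:
        print(record["friend_name"])

driver.close()
```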

    A very good “cheat sheet.”

    Up to you to decide if giving up your phone number is worth it for a free cheat sheet.

    People get pissed off over passive tracking.

    How will they react to overreaching marketing departments collecting phone numbers and addresses?

    Something to keep in mind when designing your marketing efforts.

    Jubatus:…Realtime Analysis of Big Data [XLDB2012 Presentation]

    Filed under: BigData,Jubatus,Machine Learning — Patrick Durusau @ 10:43 am

    XLDB2012: Jubatus: Distributed Online Machine Learning Framework for Realtime Analysis of Big Data by Hiroyuki Makino.

    I first pointed to Jubatus here.

    The presentation reviews some impressive performance numbers and one technique that merits special mention.

    Intermediate results are shared among the servers during processing to improve their accuracy. That may be common in distributed machine learning systems but it was the first mention I have encountered.
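    As a toy illustration of what sharing intermediate results can buy you, the sketch below has several workers train linear models on their own shards and periodically average their weights. This is my own simplification, not Jubatus code.

```python
# Toy illustration (not Jubatus code): workers periodically share their
# intermediate models by averaging weights, so every worker benefits from
# the data the others have seen.
import numpy as np

def local_sgd_pass(w, X, y, lr=0.05):
    # One least-squares SGD pass over this worker's shard.
    for xi, yi in zip(X, y):
        w = w - lr * (w @ xi - yi) * xi
    return w

def mix(weights):
    # The sharing step: replace every local model with the global mean.
    avg = np.mean(weights, axis=0)
    return [avg.copy() for _ in weights]

rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0, 0.5])
shards = []
for _ in range(4):                       # four workers, four data shards
    X = rng.normal(size=(50, 3))
    y = X @ true_w + rng.normal(scale=0.1, size=50)
    shards.append((X, y))

weights = [np.zeros(3) for _ in shards]
for _ in range(10):
    weights = [local_sgd_pass(w, X, y) for w, (X, y) in zip(weights, shards)]
    weights = mix(weights)               # share intermediate results

print(weights[0])                        # close to true_w on every worker
```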

    In parallel processing of topic maps, has anyone considered sharing merging information across servers?

    Bio4j 0.8, some numbers

    Filed under: Bio4j,Bioinformatics,Genome,Graphs — Patrick Durusau @ 10:29 am

    Bio4j 0.8, some numbers by Pablo Pareja Tobes.

    Bio4j 0.8 was recently released and now it’s time to have a deeper look at its numbers (as you can see we are quickly approaching the 1 billion relationships and 100M nodes):

    • Number of Relationships: 717,484,649
    • Number of Nodes: 92,667,745
    • Relationship types: 144
    • Node types: 42

    If Pablo gets tired of his brilliant career in bioinformatics he can always run for office in the United States with claims like: “…we are quickly approaching the 1 billion relationships….” 😉

    Still, a stunning achievement!

    See Pablo’s post for more analysis.

    Pass the project along to anyone with doubts about graph databases.

    DRAKON-Erlang: Visual Functional Programming

    Filed under: Erlang,Flowchart,Functional Programming,Graphics,Visualization — Patrick Durusau @ 10:16 am

    DRAKON-Erlang: Visual Functional Programming

    DRAKON is a visual programming language developed for the Buran Space Project.

    I won’t repeat the surplus of adjectives used to describe DRAKON. Its long term use in the Russian space program is enough to recommend review of its visual techniques.

    The DRAKON-Erlang project is an effort to combine DRAKON as a flow language/representation with Erlang.

    A graphical notation for topic maps never caught on, but with the rise of big data, a visual representation of merging algorithms could be quite useful.

    I am not suggesting DRAKON-Erlang as a solution to those issues but as a data point to take into account.

    Others?

    Navigating the Star Wars Universe with Neo4j

    Filed under: Graphs,Neo4j — Patrick Durusau @ 5:37 am

    Navigating the Star Wars Universe with Neo4j by Andy McGrath.

    For those who prefer “realistic” science fiction, 😉 Andy provides a start on using Neo4j to explore the Star Wars Universe.

    October 19, 2012

    Seeing beyond reading: a survey on visual text analytics

    Filed under: Graphics,Interface Research/Design,Visualization — Patrick Durusau @ 4:11 pm

    Seeing beyond reading: a survey on visual text analytics by Aretha B. Alencar, Maria Cristina F. de Oliveira, Fernando V. Paulovich. (Alencar, A. B., de Oliveira, M. C. F. and Paulovich, F. V. (2012), Seeing beyond reading: a survey on visual text analytics. WIREs Data Mining Knowl Discov, 2: 476–492. doi: 10.1002/widm.1071)

    Abstract:

    We review recent visualization techniques aimed at supporting tasks that require the analysis of text documents, from approaches targeted at visually summarizing the relevant content of a single document to those aimed at assisting exploratory investigation of whole collections of documents. Techniques are organized considering their target input material—either single texts or collections of texts—and their focus, which may be at displaying content, emphasizing relevant relationships, highlighting the temporal evolution of a document or collection, or helping users to handle results from a query posed to a search engine. We describe the approaches adopted by distinct techniques and briefly review the strategies they employ to obtain meaningful text models, discuss how they extract the information required to produce representative visualizations, the tasks they intend to support and the interaction issues involved, and strengths and limitations. Finally, we show a summary of techniques, highlighting their goals and distinguishing characteristics. We also briefly discuss some open problems and research directions in the fields of visual text mining and text analytics.

    Papers like this one make me wish for a high resolution color printer. 😉

    With three tables of representations, twenty-nine (29) entries and sixty (60) footnotes, it isn’t really possible to provide a useful summary beyond quoting the authors’ conclusion:

    This survey has provided an overview of the lively field of visual text analytics. The variety of tasks and situations addressed introduces a demand for many domain-specific and/or task-oriented solutions. Nonetheless, despite the impressive number of contributions and wide variety of approaches identified in the literature, the field is still in its infancy. Deployment of existing and novel techniques to a wider audience of users performing real-life tasks remains a challenge that requires tackling multiple issues.

    One issue is to foster tighter integration with traditional text mining tasks and algorithms. Various contributions are found in the literature reporting usage of visual interfaces or visualizations to support interpretation of the output of traditional text mining algorithms. Still, visualization has the potential to give users a much more active role in text mining tasks and related activities, and concrete examples of such usage are still scarce. Many rich possibilities remain open to further exploration. Better visual text analytics will also likely require more sophisticated text models, possibly integrating results and tools from research on natural language processing. Finally, providing usable tools also requires addressing several issues related to scalability, i.e., the capability of effectively handling very large text documents and textual collections.

    However, what I can do is track down the cited literature and point back to this article as the origin for my searching.

    It merits wider readership than its publisher’s access policies are likely to permit.

    Random Forest Methodology – Bioinformatics

    Filed under: Bioinformatics,Biomedical,Random Forests — Patrick Durusau @ 3:47 pm

    Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics by Anne-Laure Boulesteix, Silke Janitza, Jochen Kruppa, Inke R. König

    (Boulesteix, A.-L., Janitza, S., Kruppa, J. and König, I. R. (2012), Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics. WIREs Data Mining Knowl Discov, 2: 493–507. doi: 10.1002/widm.1072)

    Abstract:

    The random forest (RF) algorithm by Leo Breiman has become a standard data analysis tool in bioinformatics. It has shown excellent performance in settings where the number of variables is much larger than the number of observations, can cope with complex interaction structures as well as highly correlated variables and return measures of variable importance. This paper synthesizes 10 years of RF development with emphasis on applications to bioinformatics and computational biology. Special attention is paid to practical aspects such as the selection of parameters, available RF implementations, and important pitfalls and biases of RF and its variable importance measures (VIMs). The paper surveys recent developments of the methodology relevant to bioinformatics as well as some representative examples of RF applications in this context and possible directions for future research.

    Something to expand your horizons a bit.

    And a new way to say “curse of dimensionality,” to-wit,

    ‘n ≪ p curse’

    New to me anyway.
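    For the curious, here is a small n ≪ p example with variable importance measures, using scikit-learn as the implementation (my choice for illustration, not one discussed in the paper):

```python
# Small n, large p: 80 observations, 2,000 variables, only two informative.
# scikit-learn is my choice of RF implementation for illustration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n, p = 80, 2000
X = rng.normal(size=(n, p))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # only variables 0 and 1 matter

rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(X, y)

# Variable importance measures (VIMs): the informative variables should
# rank near the top, the other 1,998 near the bottom.
top = np.argsort(rf.feature_importances_)[::-1][:5]
print(top)
print(rf.feature_importances_[top])
```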

    I was amused to read at the Wikipedia article on random forests that its disadvantages include:

    Unlike decision trees, the classifications made by Random Forests are difficult for humans to interpret.

    Turn about is fair play since many classifications made by humans are difficult for computers to interpret. 😉

    Matches are the New Hotness

    Filed under: Graphs,Neo4j,Search Behavior,Searching — Patrick Durusau @ 3:42 pm

    Matches are the New Hotness by Max De Marzi.

    From the post:

    (graphic omitted)

    How do you help a person without a job find one online? A search screen. How do you help a person find love online? A search screen. How do you find which camera to buy online? A search screen. How do you help a sick person self diagnose online? I have no idea, I go to the doctor. Doesn’t matter, what I want to tell you is that there is another way.

    Max continues with:

    Now, search is great. It usually helps people find what they’re looking for… but sometimes they have to dig through tons of stuff they don’t really want. Why? Because people can usually think of what they want, but not of what they don’t want to come back. So you end up with tons of results that are not very relevant to your user…. and unless you are one of the major search engines, your search is not very smart. (emphasis added)

    I like that, not thinking about what they want to exclude.

    And why should they? They have no idea how much material is available, at least not until they are overwhelmed with search results.

    Max walks through using Neo4j to solve this type of problem. By delivering matches, not pages of search results.

    He even remarks:

    Both the job candidate and job post are thinking about the same things, but if you look at a resume and a job description, you will realize they aren’t speaking the same language. Why not? It’s so obvious it has been driving me crazy for years and was one of the reasons I built Vouched and got into this Graph Database stuff in the first place. So let’s solve this problem with a graph.

    I do have a quibble with his approach of “solving” the different-language problem, say for job skills, with sub-string matching.

    What happens if the job seeker lists skills including “mapreduce” and “yarn,” but the ad says “Hadoop”? You or I would recognize the need for a match.

    I don’t see that in Max’s solution.

    Do you?
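    One way to catch that case is an explicit mapping from alternative names to a common subject before matching, rather than sub-string comparison. The table below is a made-up illustration, not Max’s code:

```python
# Made-up illustration: resolve skill names to subject identifiers first,
# then match on subjects instead of sub-strings. The mapping itself is the
# judgment call a topic map would record.
SKILL_SUBJECTS = {
    "hadoop": "subject:hadoop",
    "mapreduce": "subject:hadoop",    # judged to indicate Hadoop experience
    "yarn": "subject:hadoop",
    "hdfs": "subject:hadoop",
    "neo4j": "subject:neo4j",
    "cypher": "subject:neo4j",
}

def subjects(terms):
    return {SKILL_SUBJECTS.get(t.lower(), "subject:" + t.lower()) for t in terms}

candidate = subjects(["MapReduce", "yarn", "Python"])
job_ad = subjects(["Hadoop", "Python"])

print(candidate & job_ad)   # {'subject:hadoop', 'subject:python'}
```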


    I posted the gist of this in a comment at Max’s blog.

    Visit Max’s post to see his response in full, but in short, Max favors normalization of data.

    Normalization is a choice you can make, but it should not be a default or unconscious one.

    What’s New in CDH4.1 Hue

    Filed under: Hadoop,Hue — Patrick Durusau @ 3:36 pm

    What’s New in CDH4.1 Hue by Romain Rigaux

    From the post:

    Hue is a Web-based interface that makes it easier to use Apache Hadoop. Hue 2.1 (included in CDH4.1) provides a new application on top of Apache Oozie (a workflow scheduler system for Apache Hadoop) for creating workflows and scheduling them repetitively. For example, Hue makes it easy to group a set of MapReduce jobs and Hive scripts and run them every day of the week.

    In this post, we’re going to focus on the Workflow component of the new application.

    “[E]very day of the week” includes the weekend.

    That got your attention?

    Let Hue manage the workflow and you enjoy the weekend.

    Masterful design of the everyday baggage tag

    Filed under: Design,Usability,Users — Patrick Durusau @ 3:35 pm

    Masterful design of the everyday baggage tag by Nathan Yau.

    Nathan points to a post on the history of baggage tags that includes the following quote:

    Just as you can track, step-by-step, a package you’ve sent by FedEx, airlines use bar-coded tags to sort and track bags automatically, through the airport, and across the world. That’s a huge change from the old days, when bags were dropped into the “black box” of a manually sorted baggage system. But crucially, an ABT doesn’t just contain a bar code—it’s also custom-printed with your name, flight details, and destination. That made the global implementation of ABTs much easier, because early-adopters could introduce them long before every airport was ready—a huge advantage when it comes to seamlessly connecting the world’s least and most advanced airports. And of course, ABTs can still be read manually when systems break down.

    There is a goal for design:

    It works with fully manual systems, fully automated systems, and everything in between.

    What about your topic map? Or is it enslaved by the need for electronic power?

    If I can read a street map by sun/moon light, then why not a topic map? (At least sometimes.)

    Suggestions?

    Situational Aware Mappers with JAQL

    Filed under: Hadoop,JAQL,MapReduce — Patrick Durusau @ 3:32 pm

    Situational Aware Mappers with JAQL

    From the post:

    Adapting MapReduce for a higher performance has been one of the popular discussion topics. Let’s continue with our series on Adaptive MapReduce and explore the feature available via JAQL in IBM BigInsights commercial offering. This implementation also points to a much more vital corollary that enterprise offerings of Apache Hadoop are not just mere packaging and re-sell but have a bigger research initiative going on beneath the covers.

    Two papers are explored by the post:

    [1] Rares Vernica, Andrey Balmin, Kevin S. Beyer, Vuk Ercegovac: Adaptive MapReduce using situation-aware mappers. EDBT 2012: 420-431

    Abstract:

    We propose new adaptive runtime techniques for MapReduce that improve performance and simplify job tuning. We implement these techniques by breaking a key assumption of MapReduce that mappers run in isolation. Instead, our mappers communicate through a distributed meta-data store and are aware of the global state of the job. However, we still preserve the fault-tolerance, scalability, and programming API of MapReduce. We utilize these “situation-aware mappers” to develop a set of techniques that make MapReduce more dynamic: (a) Adaptive Mappers dynamically take multiple data partitions (splits) to amortize mapper start-up costs; (b) Adaptive Combiners improve local aggregation by maintaining a cache of partial aggregates for the frequent keys; (c) Adaptive Sampling and Partitioning sample the mapper outputs and use the obtained statistics to produce balanced partitions for the reducers. Our experimental evaluation shows that adaptive techniques provide up to 3x performance improvement, in some cases, and dramatically improve performance stability across the board.
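    The Adaptive Combiner idea, a mapper-side cache of partial aggregates for frequent keys, can be sketched roughly as below. This is a word-count flavored toy of my own, not the paper’s implementation:

```python
# Toy sketch of an adaptive-combiner-style cache (not the paper's code):
# the mapper keeps partial sums for frequent keys in a bounded cache and
# only emits a key when it is evicted or at the end of the split.
from collections import OrderedDict

class AdaptiveCombiner:
    def __init__(self, emit, capacity=1000):
        self.emit = emit                 # downstream emit(key, partial_sum)
        self.capacity = capacity
        self.cache = OrderedDict()       # LRU-ish: frequent keys stay cached

    def add(self, key, value):
        if key in self.cache:
            self.cache[key] += value
            self.cache.move_to_end(key)  # mark as recently used
        else:
            if len(self.cache) >= self.capacity:
                old_key, partial = self.cache.popitem(last=False)
                self.emit(old_key, partial)
            self.cache[key] = value

    def flush(self):
        for key, partial in self.cache.items():
            self.emit(key, partial)
        self.cache.clear()

# Usage: aggregate locally before anything is sent to the reducers.
out = []
combiner = AdaptiveCombiner(emit=lambda k, v: out.append((k, v)), capacity=2)
for word in ["a", "b", "a", "c", "a", "b"]:
    combiner.add(word, 1)
combiner.flush()
print(out)
```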

    [2] Andrey Balmin, Vuk Ercegovac, Rares Vernica, Kevin S. Beyer: Adaptive Processing of User-Defined Aggregates in Jaql. IEEE Data Eng. Bull. 34(4): 36-43 (2011)

    Abstract:

    Adaptive techniques can dramatically improve performance and simplify tuning for MapReduce jobs. However, their implementation often requires global coordination between map tasks, which breaks a key assumption of MapReduce that mappers run in isolation. We show that it is possible to preserve fault-tolerance, scalability, and ease of use of MapReduce by allowing map tasks to utilize a limited set of high-level coordination primitives. We have implemented these primitives on top of an open source distributed coordination service. We expose adaptive features in a high-level declarative query language, Jaql, by utilizing unique features of the language, such as higher-order functions and physical transparency. For instance, we observe that maintaining a small amount of global state could help improve performance for a class of aggregate functions that are able to limit the output based on a global threshold. Such algorithms arise, for example, in Top-K processing, skyline queries, and exception handling. We provide a simple API that facilitates safe and efficient development of such functions.

    The bar for excellence in the use of Hadoop keeps getting higher!

    Ngram Viewer 2.0 [String Usage != Semantic Usage]

    Filed under: GoogleBooks,Natural Language Processing,Ngram Viewer — Patrick Durusau @ 3:32 pm

    Ngram Viewer 2.0 by Jon Orwant.

    From the post:

    Since launching the Google Books Ngram Viewer, we’ve been overjoyed by the public reception. Co-creator Will Brockman and I hoped that the ability to track the usage of phrases across time would be of interest to professional linguists, historians, and bibliophiles. What we didn’t expect was its popularity among casual users. Since the launch in 2010, the Ngram Viewer has been used about 50 times every minute to explore how phrases have been used in books spanning the centuries. That’s over 45 million graphs created, each one a glimpse into the history of the written word. For instance, comparing flapper, hippie, and yuppie, you can see when each word peaked:

    (graphic omitted)

    Meanwhile, Google Books reached a milestone, having scanned 20 million books. That’s approximately one-seventh of all the books published since Gutenberg invented the printing press. We’ve updated the Ngram Viewer datasets to include a lot of those new books we’ve scanned, as well as improvements our engineers made in OCR and in hammering out inconsistencies between library and publisher metadata. (We’ve kept the old dataset around for scientists pursuing empirical, replicable language experiments such as the ones Jean-Baptiste Michel and Erez Lieberman Aiden conducted for our Science paper.)

    Tracking the usage of phrases through time is no mean feat, but tracking their semantics would be far more useful.

    For example, “freedom of speech” did not have the same “semantic” in the early history of the United States that it does today. Otherwise, how would you explain criminal statutes against blasphemy and their enforcement after the ratification of the US Constitution? (I have not verified this, but Wikipedia, Blasphemy Law in the United States, reports a person being jailed for blasphemy in the 1830s.)

    Or the guarantee of “freedom of speech,” in Article 125 of the 1936 Constitution of the USSR.

    Those three usages, current United States, early United States, USSR 1936 (English translation), don’t have the same semantics to me.

    You?

    Focusing on the Reader: Engagement Trumps Satisfaction

    Filed under: Marketing,Usability,Users — Patrick Durusau @ 3:31 pm

    Focusing on the Reader: Engagement Trumps Satisfaction by Rachel Davis Mersey, Edward C. Malthouse and Bobby J. Calder. (Journalism & Mass Communication Quarterly published online 5 September 2012 DOI: 10.1177/1077699012455391)

    Abstract:

    Satisfaction is commonly monitored by news organizations because it is an antecedent to readership. In fact, countless studies have shown the satisfaction–readership relationship to be true. Still, an essential question remains: Is satisfaction the only, or even the critical, thing to focus on with readership? This research indicates that the answer is no. Two other related constructs, reader experiences and engagement, affect reader behavior even more than does satisfaction. The discussion provides examples of how to increase engagement and calls for experimental research to understand how news organizations can positively affect engagement and thereby readership.

    In the course of the paper, the authors discuss which definition of “engagement” they will be using:

    In both arenas, marketing and journalism, the term engagement has been readily used, and often misused—both causing confusion about the definition of the word and affecting the usefulness of the concept in research and in practice. The disagreement regarding the nature of the role of television in civic engagement, whether the influence of television be positive or negative, is an example of how differing definitions, and specifically how the construct of engagement is operationalized, can create different results even in high-quality research.11 So while researchers tend to rely on mathematically reliable multi-item measures of engagement, as in work by Livingstone and Markham, we cannot be assured that engagement is similarly defined in each body of research.12

    An opportunity for topic maps that I won’t discuss right now.

    Earlier the authors note:

    If content, however distributed, fails to attract readers/users, no business model can ultimately be successful.

    That seems particularly relevant to semantic technologies.

    I won’t spoil the conclusion for you but the social aspects of using the information in day-to-day interaction play an unexpectedly large role in engagement.

    Will successful topic map application designers ask users how they use information to interact with others?

    Then foster that use by design of the topic map interface and/or its content?

    What’s New in CDH4.1 Pig

    Filed under: Cloudera,Hadoop,Pig — Patrick Durusau @ 3:28 pm

    What’s New in CDH4.1 Pig by Cheolsoo Park.

    From the post:

    Apache Pig is a platform for analyzing large data sets that provides a high-level language called Pig Latin. Pig users can write complex data analysis programs in an intuitive and compact manner using Pig Latin.

    Among many other enhancements, CDH4.1, the newest release of Cloudera’s open-source Hadoop distro, upgrades Pig from version 0.9 to version 0.10. This post provides a summary of the top seven new features introduced in CDH4.1 Pig.

    Cheolsoo covers these new features:

    • Boolean Data Type
    • Nested FOREACH and CROSS
    • Ruby UDFs
    • LIMIT / SPLIT by Expression
    • Default SPLIT Destination
    • Syntactical Sugar for TOTUPLE, TOBAG, and TOMAP
    • AvroStorage Improvements

    Enjoy!

    October 18, 2012

    A Glance at Information-Geometric Signal Processing

    Filed under: Image Processing,Image Recognition,Semantics — Patrick Durusau @ 2:18 pm

    A Glance at Information-Geometric Signal Processing by Frank Nielsen.

    Slides from the MAHI workshop (Methodological Aspects of Hyperspectral Imaging).

    From the workshop homepage:

    The scope of the MAHI workshop is to explore new pathways that can potentially lead to breakthroughs in the extraction of the informative content of hyperspectral images. It will bring together researchers involved in hyperspectral image processing and in various innovative aspects of data processing.

    Images, their informational content and the tools to analyze them have semantics too.
