Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

April 23, 2013

Ack 2.0 enhances the “grep for source code”

Filed under: Programming,Searching — Patrick Durusau @ 7:06 pm

Ack 2.0 enhances the “grep for source code”

From the post:

The developers of ack have released version 2.0 of their grep-like tool optimised for searching source code. Described as “designed for programmers”, ack has been available since 2005 and is based on Perl’s regular expressions engine. It minimises false positives by ignoring version control directories by default and has flexible highlighting for matches. The newly released ack 2.0 introduces a more flexible identification system, better support for ackrc configuration files and the ability to read the list of files to be searched from stdin.

Its developers say that ack is designed to perform in a similar fashion to GNU grep but to improve on it when searching source code repositories. The program’s web site at beyondgrep.com lists a number of reasons why programmers might want to use ack instead of grep when searching through source code, the least of which being that the ack command is quicker to type than grep. But ack brings a lot more to the table than that as it is specifically designed to deal with source code and understand a large number of programming languages and tools such as build systems and version control software.

Is there any ongoing discussion of semantic searching for source code?

Announcing TokuDB v7: Open Source and More

Filed under: MySQL,TokuDB — Patrick Durusau @ 6:57 pm

Announcing TokuDB v7: Open Source and More by Martin Farach-Colton.

From the post:

The free Community Edition is fully functional and fully performant. It has all the compression you’ve come to expect from TokuDB. It has hot schema changes: no-down-time column insertion, deletion, renaming, etc., as well as index creation. It has clustering secondary keys. We are also announcing an Enterprise Edition (coming soon) with additional benefits, such as a support package and advanced backup and recovery tools.

You may have noticed the screaming performance numbers I have cited in posts about TokuDB.

Now the origin of those numbers is open source.

I am curious: what questions will you ask differently, or what new questions will you ask, as processing power increases?

Or to ask it the other way, what questions have you not asked because of a lack of processing power?

Data Socializing

Filed under: Data,Social Media — Patrick Durusau @ 6:48 pm

If you need more opportunities for data socializing, KDnuggets has compiled: Top 30 LinkedIn Groups for Analytics, Big Data, Data Mining, and Data Science.

Here’s an interesting test:

Write down your LinkedIn groups and compare your list to this one.

Enjoy!

MRQL – a SQL on Hadoop Miracle

Filed under: Hadoop,MRQL — Patrick Durusau @ 6:43 pm

MRQL – a SQL on Hadoop Miracle by Edward J. Yoon.

From the post:

Recently, the Apache Incubator accepted a new query engine for Hadoop and Hama, called MRQL (pronounced miracle), which was initially developed in 2011 by Leonidas Fegaras.

MRQL (MapReduce Query Language) is a query processing and optimization system for large-scale, distributed data analysis, built on top of Apache Hadoop and Hama. MRQL has some overlapping functionality with Hive, Impala and Drill, but one major difference is that it can capture many complex data analysis algorithms that can not be done easily in those systems in declarative form. So, complex data analysis tasks, such as PageRank, k-means clustering, and matrix multiplication and factorization, can be expressed as short SQL-like queries, while the MRQL system is able to evaluate these queries efficiently.

Another difference from these systems is that the MRQL system can run these queries in BSP (Bulk Synchronous Parallel) mode, in addition to the MapReduce mode. With BSP mode, it achieves lower latency and higher speed. According to MRQL team, “In near future, MRQL will also be able to process very large data effectively fast without memory limitation and significant performance degradation in the BSP mode”.

Maybe I should turn my back on the newsfeed more often. 😉

I suspect the announcement and my distraction were unrelated.

This looks very important.

I can feel another Apache list subscription in the offing.

Resources and Readings for Big Data Week DC Events

Filed under: BigData,Data,Data Mining,Natural Language Processing — Patrick Durusau @ 6:33 pm

Resources and Readings for Big Data Week DC Events

This is Big Data week in DC and Data Community DC has put together a list of books, articles and posts to keep you busy all week.

Very cool!

Apologies for Sudden Slowdown

Filed under: Standards — Patrick Durusau @ 6:28 pm

Sorry about the sudden slowdown!

I have a couple of posts for today and will be back at full strength tomorrow.

I got distracted by a standards dispute at OASIS where a TC wanted an “any model” proposal to be approved as an OASIS standard.

Literally, the conformance clause says “must” but when you look at the text, it says any old model will do.

Hard to think of that as a standard.

If you are interested, see: Voting No on TGF at OASIS.

Deadline is tomorrow so if you know anyone who is interested, spread the word.

April 22, 2013

Marketing: Heads I Win, Tails You Lose

Filed under: Marketing — Patrick Durusau @ 3:50 pm

I haven’t seen this marketing tip in any of the manuals:

Watch for bad news, then explain how your technology saved the day!

Like the claim by FLIR Corp. that their thermal imager helped spot Dzhokhar Tsarnaev (Boston Marathon bomber) hiding in a boat.

Or more precisely:

FLIR’s thermal imaging gear was able to discern a live, moving individual hiding in a recreational boat being stored in the backyard of a Watertown home, even though the human being could not be seen beneath a covering tarpaulin by video surveillance cameras or the naked eye.

It is not clear from announcements by law enforcement authorities, and news accounts, whether it was the FLIR system that first discovered the wounded alleged terrorist, Dzhokhar Tsarnaev, and led police on the ground to surround the boat and eventually take Tsarnaev into custody. Or whether it was the tipoff from a man living in the Watertown house to blood on the tarpaulin that first led police to the injured alleged terrorist. (From Thermal imager from FLIR Corp. helps spot Boston Marathon terrorist beneath boat tarp)

It’s not that “unclear:”

The manhunt for Dzhokhar Tsarnaev lasted all day Friday and left Boston streets deserted as police asked everyone to stay indoors. Then after the request was lifted, authorities got a tip: A Watertown man told police someone was hiding in his boat in the backyard, bleeding. It was their suspect, Watertown police Chief Edward Deveau said.

Officers spotted Tsarnaev poking through the tarp covering the boat, and a shootout erupted, Deveau said. Police used “flash-bangs,” devices meant to stun people with a loud noise, and negotiated with Tsarnaev for about half an hour.

“We used a robot to pull the tarp off the boat,” David Procopio of the Massachusetts State Police said. “We were also watching him with a thermal imaging camera in our helicopter. He was weakened by blood loss — injured last night, most likely.” (From: As Boston reeled, younger bombing suspect partied)

After the boat was pointed out, the thermal imager could see the suspect through a cover.

Not as impressive, is it?

If you are going to market based on bad news, pick something that isn’t contradicted in published news accounts.

If you are reading marketing, read carefully, very carefully.

TSDW:… [Enterprise Disambiguation]

Filed under: Natural Language Processing,Wikipedia,Word Meaning — Patrick Durusau @ 2:09 pm

TSDW: Two-stage word sense disambiguation using Wikipedia by Chenliang Li, Aixin Sun, Anwitaman Datta. (Li, C., Sun, A. and Datta, A. (2013), TSDW: Two-stage word sense disambiguation using Wikipedia. J. Am. Soc. Inf. Sci. doi: 10.1002/asi.22829)

Abstract:

The semantic knowledge of Wikipedia has proved to be useful for many tasks, for example, named entity disambiguation. Among these applications, the task of identifying the word sense based on Wikipedia is a crucial component because the output of this component is often used in subsequent tasks. In this article, we present a two-stage framework (called TSDW) for word sense disambiguation using knowledge latent in Wikipedia. The disambiguation of a given phrase is applied through a two-stage disambiguation process: (a) The first-stage disambiguation explores the contextual semantic information, where the noisy information is pruned for better effectiveness and efficiency; and (b) the second-stage disambiguation explores the disambiguated phrases of high confidence from the first stage to achieve better redisambiguation decisions for the phrases that are difficult to disambiguate in the first stage. Moreover, existing studies have addressed the disambiguation problem for English text only. Considering the popular usage of Wikipedia in different languages, we study the performance of TSDW and the existing state-of-the-art approaches over both English and Traditional Chinese articles. The experimental results show that TSDW generalizes well to different semantic relatedness measures and text in different languages. More important, TSDW significantly outperforms the state-of-the-art approaches with both better effectiveness and efficiency.

TSDW works because Wikipedia is a source of unambiguous phrases, which can also be used to disambiguate phrases that remain ambiguous on a first pass.

But Wikipedia did not always exist and was built out of the collaboration of thousands of users over time.

Does that offer a clue as to building better search tools for enterprise data?

What if statistically improbable phrases are mined from new enterprise documents and links created to definitions for those phrases?

I think picking a current starting point avoids a “…boil the ocean…” scenario before benefits can be shown.

Current content is also more likely to be a search target.

Domain expertise and literacy required.

Expertise in logic or ontologies not.
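
To make that concrete, here is a minimal Python sketch of the idea: mine phrases that are unusually frequent in new documents relative to a background corpus, then attach definition links to them. The bigram heuristic, thresholds, corpora and definitions mapping are all invented for illustration, not a recommendation of any particular tool.

```python
# Sketch: mine "statistically improbable phrases" from a new document by
# comparing bigram frequencies against a background corpus, then link each
# mined phrase to a definition. All data and thresholds are hypothetical.

from collections import Counter
from itertools import islice

def bigrams(tokens):
    """Yield adjacent word pairs from a token list."""
    return zip(tokens, islice(tokens, 1, None))

def improbable_phrases(doc_tokens, background_tokens, min_count=3, ratio=5.0):
    """Return bigrams much more frequent in the document than in the
    background corpus (a crude stand-in for a real phrase-scoring measure)."""
    doc_counts = Counter(bigrams(doc_tokens))
    bg_counts = Counter(bigrams(background_tokens))
    doc_total = max(sum(doc_counts.values()), 1)
    bg_total = max(sum(bg_counts.values()), 1)
    phrases = []
    for phrase, count in doc_counts.items():
        if count < min_count:
            continue
        doc_rate = count / doc_total
        bg_rate = (bg_counts[phrase] + 1) / bg_total  # add-one smoothing
        if doc_rate / bg_rate >= ratio:
            phrases.append(" ".join(phrase))
    return phrases

def link_phrases(phrases, definitions):
    """Pair each mined phrase with a definition URL, if one exists."""
    return {p: definitions.get(p) for p in phrases}

if __name__ == "__main__":
    doc = "quarterly risk ledger reconciliation for the risk ledger team".split()
    background = "the team met for the quarterly review of the budget".split()
    defs = {"risk ledger": "http://wiki.example.com/risk-ledger"}  # hypothetical
    print(link_phrases(improbable_phrases(doc, background, min_count=2, ratio=2.0), defs))
```

A production version would use a proper phrase-scoring measure and whatever definition store the enterprise already maintains.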

April 21, 2013

Enabling action: Digging deeper into strategies for learning

Filed under: Learning,Search Behavior,Searching — Patrick Durusau @ 4:59 pm

Enabling action: Digging deeper into strategies for learning by Thom Haller. (Haller, T. (2013), Enabling action: Digging deeper into strategies for learning. Bul. Am. Soc. Info. Sci. Tech., 39: 42–43. doi: 10.1002/bult.2013.1720390413)

Abstract:

A central goal for information architects is to understand how people use information, make choices as they navigate a website and accomplish their objectives. If the goal is learning, we often assume it relates to an end point, a question to answer, a problem to which one applies new understanding. Benjamin Bloom’s 1956 taxonomy of learning breaks down the cognitive process, starting from understanding needs and progressing to action and final evaluation. Carol Kuhlthau’s 1991 outline of the information search process similarly starts with awareness of a need, progresses through exploring options, refining requirements and collecting solutions, and ends with decision making and action. Recognizing the stages of information browsing, learning and action can help information architects build sites that better meet searchers’ needs.

Thom starts with Bloom, cruises by Kuhlthau and ends up with Jared Pomranky restating Kuhlthau in: Seeking Knowledge: Denver, Web Design, And The Stages of Learning:

According to Kuhlthau, the six stages of learning are:

  • Initiation — the person becomes aware that they need information. Generally, it’s assumed that visitors to your website have this awareness already, but there are circumstances in which you can generate this kind of awareness as well.
  • Exploration — the person sees the options that are available to choose between. Quite often, especially online, ‘analysis paralysis’ can set in and make a learner quit at this stage because they can’t decide which of the options are worth further pursuit.
  • Formulation — the person sees that they’re going to have to create further requirements before they’re able to make a final selection, and they make decisions to narrow the field. Confidence returns.
  • Collection — the person has clearly articulated their precise needs and is able to evaluate potential solutions. They gather all available solutions and begin to weigh them based on relevant criteria.
  • Action — the person makes their final decision and acts on it based on their understanding.

Many web designers assume that their surfers are at the Collection stage, and craft their entire webpage toward moving their reader from Collection to Action — but statistically, most people are going to be at Exploration or Formulation when they arrive at your site.

Does that mean that you should build a website that encourages people to go read other options and learn more, hoping they’ll return to your site for their Action? Not at all — but it does mean that by understanding what people are looking for at each stage of their learning process, we can design websites that guide them through the whole thing. This, by no coincidence whatsoever, also results in websites and web content that is useful, user-friendly, and entirely Google-appropriate.

We all use models of online behavior, learning if you like, but I would caution against using models disconnected from your users.

Particularly models disconnected from your users and re-interpreted by you as reflecting your users.

A better course would be to study the behavior of your users and to model your content on their behavior.

Otherwise you will be the seekers who: “… came looking for [your users], only to find Zarathustra.” Thus Spake Zarathustra

Collaborative annotation… [Human + Machine != Semantic Monotony]

Collaborative annotation for scientific data discovery and reuse by Kirk Borne. (Borne, K. (2013), Collaborative annotation for scientific data discovery and reuse. Bul. Am. Soc. Info. Sci. Tech., 39: 44–45. doi: 10.1002/bult.2013.1720390414)

Abstract:

Human classification alone, unable to handle the enormous quantity of project data, requires the support of automated machine-based strategies. In collaborative annotation, humans and machines work together, merging editorial strengths in semantics and pattern recognition with the machine strengths of scale and algorithmic power. Discovery informatics can be used to generate common data models, taxonomies and ontologies. A proposed project of massive scale, the Large Synoptic Survey Telescope (LSST) project, will systematically observe the southern sky over 10 years, collecting petabytes of data for analysis. The combined work of professional and citizen scientists will be needed to tag the discovered astronomical objects. The tag set will be generated through informatics and the collaborative annotation efforts of humans and machines. The LSST project will demonstrate the development and application of a classification scheme that supports search, curation and reuse of a digital repository.

A persuasive call to arms to develop “collaborative annotation:”

Humans and machines working together to produce the best possible classification label(s) is collaborative annotation. Collaborative annotation is a form of human computation [1]. Humans can see patterns and semantics (context, content and relationships) more quickly, accurately and meaningfully than machines. Human computation therefore applies to the problem of annotating, labeling and classifying voluminous data streams.

And more specifically for the Large Synoptic Survey Telescope (LSST):

The discovery potential of this data collection would be enormous, and its long-term value (through careful data management and curation) would thus require (for maximum scientific return) the participation of scientists and citizen scientists as well as science educators and their students in a collaborative knowledge mark-up (annotation and tagging) data environment. To meet this need, we envision a collaborative tagging system called AstroDAS (Astronomy Distributed Annotation System). AstroDAS is similar to existing science knowledge bases, such as BioDAS (Biology Distributed Annotation System, www.biodas.org).

As you might expect, semantic diversity is going to be present with “collaborative annotation.”

Semantic Monotony (aka Semantic Web) has failed for machines alone.

No question it will fail for humans + machines.

Are you ready to step up to the semantic diversity of collaborative annotation (humans + machines)?

PLOS Text Mining Collection

Filed under: Data Mining,Text Mining — Patrick Durusau @ 3:43 pm

The PLOS Text Mining Collection has launched!

From the webpage:

Across all realms of the sciences and beyond, the rapid growth in the number of works published digitally presents new challenges and opportunities for making sense of this wealth of textual information. The maturing field of Text Mining aims to solve problems concerning the retrieval, extraction and analysis of unstructured information in digital text, and to revolutionize how scientists access and interpret data that might otherwise remain buried in the literature.

Here PLOS acknowledges the growing body of work in the area of Text Mining by bringing together major reviews and new research studies published in PLOS journals to create the PLOS Text Mining Collection. It is no coincidence that research in Text Mining in PLOS journals is burgeoning: the widespread uptake of the Open Access publishing model developed by PLOS and other publishers now makes it easier than ever to obtain, mine and redistribute data from published texts. The launch of the PLOS Text Mining Collection complements related PLOS Collections on Open Access and Altmetrics, and further underscores the importance of the PLOS Application Programming Interface, which provides an open source interface with which to mine PLOS journal content.

The Collection is now open across the PLOS journals to all authors who wish to submit research or reviews in this area. Articles are presented below in order of publication date and new articles will be added to the Collection as they are published.

An impressive start to what promises to be a very rich resource!

I first saw this at: New: PLOS Text Mining.

The Pragmatic Haskeller – Episode 2 & 3

Filed under: Haskell — Patrick Durusau @ 3:30 pm

The Pragmatic Haskeller – Episode 2 – Snap by Alfredo Di Napoli.

Using Snap, this episode gets a minimal web app up and running.

The Pragmatic Haskeller – Episode 3 – Configurator by Alfredo Di Napoli.

Eliminates the hard coding of configuration information.

Here’s a semantic question for you:

If hard coding configuration information is bad practice, why is it acceptable to hard code semantics?

Abstract Maps For Powerful Impact

Filed under: Graphics,Maps,Visualization — Patrick Durusau @ 2:03 pm

Abstract Maps For Powerful Impact by Jim Vallandingham.

You can follow the abstraction, even from the bare slides.

Still, it is a slide deck that makes you wish for the video.

The OpenStreetMap Package Opens Up

Filed under: Geographic Data,ggmap,Mapping,Maps,OpenStreetMap — Patrick Durusau @ 1:50 pm

The OpenStreetMap Package Opens Up

From the post:

A new version of the OpenStreetMap package is now up on CRAN, and should propagate to all the mirrors in the next few days. The primary purpose of the package is to provide high resolution map/satellite imagery for use in your R plots. The package supports base graphics and ggplot2, as well as transformations between spatial coordinate systems.

The biggest change in the new version is the addition of dozens of tile servers, giving the user the option of many different map looks, including those from Bing, MapQuest and Apple.

Very impressive display of the new capabilities in OpenStreetMap and this note about OpenStreetMap and ggmap:

Probably the main alternative to OpenStreetMap is the ggmap package. ggmap is an excellent package, and it is somewhat unfortunate that there is a significant duplication of effort between it and OpenStreetMap. That said, there are some differences that may help you decide which to use:

Reasons to favor OpenStreetMap:

  • More maps: OpenStreetMap supports more map types.
  • Better image resolution: ggmap only fetches one png from the server, and thus is limited to the resolution of that png, whereas OpenStreetMap can download many map tiles and stitch them together to get an arbitrarily high image resolution.
  • Transformations: OpenStreetMap can be used with any map coordinate system, whereas ggmap is limited to long-lat.
  • Base graphics: Both packages support ggplot2, but OpenStreetMap also supports base graphics.

Reasons to favor ggmap:

  • No Java dependency: ggmap does not require Java to be installed.
  • Geocoding: ggmap has functions to do reverse geocoding.
  • Google maps: While OpenStreetMap has more map types, it currently does not support Google Maps.

Fair enough?

Deducer: R Graphic Interface For Everyone

Filed under: R — Patrick Durusau @ 1:38 pm

Deducer: R Graphic Interface For Everyone

From the webpage:

Deducer is designed to be a free, easy-to-use alternative to proprietary data analysis software such as SPSS, JMP, and Minitab. It has a menu system to do common data manipulation and analysis tasks, and an Excel-like spreadsheet in which to view and edit data frames. The goal of the project is twofold.

  1. Provide an intuitive graphical user interface (GUI) for R, encouraging non-technical users to learn and perform analyses without programming getting in their way.
  2. Increase the efficiency of expert R users when performing common tasks by replacing hundreds of keystrokes with a few mouse clicks. Also, as much as possible the GUI should not get in their way if they just want to do some programming.

Deducer is designed to be used with the Java based R console JGR, though it supports a number of other R environments (e.g. Windows RGUI and RTerm).

You may also be interested in: How to install R, JGR and Deducer in Ubuntu. If so, add: Solving OpenJDK install errors in Ubuntu.

As you might imagine, data analysis has multiple languages (sound familiar?). Which one you choose is largely a matter of personal preference.

Google search:… [GDM]

Filed under: Search Behavior,Search Engines,Searching — Patrick Durusau @ 12:46 pm

Google search: three bugs to fix with better data science by Vincent Granville.

Vincent outlines three issues with Google search results:

  1. Outdated search results
  2. Wrongly attributed articles
  3. Favoring irrelevant pages

See Vincent’s post for advice on how Google can address these issues. (It might help in a Google interview to tell them how to fix such long-standing problems.)

More practically, how does your TM application rate on the outdated search results?

Do you just dump content on the user to sort out (the Google dump model (GDM)) or are your results a bit more user friendly?

April 20, 2013

NodeXL HowTo

Filed under: Excel,Graphics,NodeXL,Visualization — Patrick Durusau @ 1:18 pm

Rolling out a “How-To” Software Series

A long preface that ends with a list of posts on “how to” use NodeXL.

Looks very good!

Enjoy!

Fast algorithm for successively merging k-overlapping sets?

Filed under: Merging,Overlapping Sets,Sets — Patrick Durusau @ 12:56 pm

Fast algorithm for successively merging k-overlapping sets?

As posted:

Consider the following algorithm for clustering sets: Begin with n sets, S1, S2,…,Sn, such that sum_{i = 1}^n |Si| = m, and successively merge sets with at least k elements in common. E.g., if S1 = {1, 2, 3}, S2 = {3, 4, 5}, and S3 = {5, 6, 7}, and k = 1 then S1 can be merged with S2 to create S1′ = {1, 2, 3, 4, 5}, and S1′ can be merged with S3 to create S1” = {1,…,7}.

Warmup question: Can this clustering algorithm be implemented efficiently for k = 1?

Answer to warmup question: If the sets only need to overlap by one element to be merged as in the example above, the clustering can be performed in O(m) time using connected components if you are careful.

Harder question: Suppose the sets must overlap by at least 2 (or k) elements to be merged. Is there an efficient algorithm for this case (i.e., close to linear time)? The challenge here is that you can have cases like S1 = {1, 2, 3}, S2 = {2, 4, 5}, S3 = {1, 4, 5}, with k = 2. Note that in this case S1 can be merged with S2 and S3, but only after S2 and S3 are merged to create S2′ so that S1 and S2′ share both elements 1 and 2.

I saw this on Theoretical Computer Science Stack Exchange earlier today.

Reminded me of the overlapping set member tests for merging topics: [subject identifiers], [item identifiers], [subject locators], a [subject identifier] of one matching an [item identifier] of the other, and [reified] properties. (Well, reified is a simple match, not a set.)

I have found some early work on the overlapping set member question but also work on other measures of similarity on set members.

Working up a list of papers.
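
For the warmup question, here is a minimal union-find sketch of the connected-components answer. To be clear, it is my own illustration, not code from the Stack Exchange thread, and it does not attempt the harder k ≥ 2 case.

```python
# Sketch of the k = 1 case: treat elements as nodes, union the sets that share
# an element, and read the merged sets off the connected components.

def merge_overlapping(sets):
    """Merge sets that share at least one element (the k = 1 case)."""
    parent = {}

    def find(x):
        # Find the root, halving paths as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for s in sets:
        items = list(s)
        for x in items:
            parent.setdefault(x, x)
        for x in items[1:]:
            union(items[0], x)

    merged = {}
    for x in parent:
        merged.setdefault(find(x), set()).add(x)
    return list(merged.values())

if __name__ == "__main__":
    print(merge_overlapping([{1, 2, 3}, {3, 4, 5}, {5, 6, 7}, {9, 10}]))
    # One merged set {1, ..., 7} and the untouched {9, 10}
```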

The Amateur Data Scientist and Her Projects

Filed under: Authoring Topic Maps,Data Science — Patrick Durusau @ 10:33 am

The Amateur Data Scientist and Her Projects by Vincent Granville.

From the post:

With so much data available for free everywhere, and so many open tools, I would expect to see the emergence of a new kind of analytic practitioner: the amateur data scientist.

Just like the amateur astronomer, the amateur data scientist will significantly contribute to the art and science, and will eventually solve mysteries. Could the Boston bomber be found thanks to thousands of amateurs analyzing publicly available data (images, videos, tweets, etc.) with open source tools? After all, amateur astronomers have been able to detect exoplanets and much more.

Also, just like the amateur astronomer only needs one expensive tool (a good telescope with data recording capabilities), the amateur data scientist only needs one expensive tool (a good laptop and possibly subscription to some cloud storage/computing services).

Amateur data scientists might earn money from winning Kaggle contests, working on problems such as identifying a botnet, explaining the stock market flash crash, defeating Google page-ranking algorithms, helping find new complex molecules to fight cancer (analytical chemistry), predicting solar flares and their intensity. Interested in becoming an amateur data scientist? Here’s a first project for you, to get started:

Amateur data scientist, I rather like the sound of that.

And it would be an intersection of interests and talents, just as it is for professional data scientists.

Vincent’s example of posing entry level problems is a model I need to follow for topic maps.

Amateur topic map authors?

DELTACON: A Principled Massive-Graph Similarity Function

Filed under: Graphs,Networks,Similarity — Patrick Durusau @ 10:24 am

DELTACON: A Principled Massive-Graph Similarity Function by Danai Koutra, Joshua T. Vogelstein, Christos Faloutsos.

Abstract:

How much did a network change since yesterday? How different is the wiring between Bob’s brain (a left-handed male) and Alice’s brain (a right-handed female)? Graph similarity with known node correspondence, i.e. the detection of changes in the connectivity of graphs, arises in numerous settings. In this work, we formally state the axioms and desired properties of the graph similarity functions, and evaluate when state-of-the-art methods fail to detect crucial connectivity changes in graphs. We propose DeltaCon, a principled, intuitive, and scalable algorithm that assesses the similarity between two graphs on the same nodes (e.g. employees of a company, customers of a mobile carrier). Experiments on various synthetic and real graphs showcase the advantages of our method over existing similarity measures. Finally, we employ DeltaCon to real applications: (a) we classify people to groups of high and low creativity based on their brain connectivity graphs, and (b) do temporal anomaly detection in the who-emails-whom Enron graph.

How different is your current topic map from a prior version?

Could be an interesting marketing ploy to colorize the distinct portions of the graph.

Not to mention using “similarity” to mean the same subject for some purposes. Group subjects come to mind.

And for other types of analysis.
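
For a quick feel of “how different is the current version,” here is a naive baseline sketch. It is not the DeltaCon algorithm from the paper, just a Jaccard comparison of edge sets plus the changed edges you might colorize in a diff view.

```python
# Sketch: compare two snapshots of the same topic map (or any graph over the
# same nodes) by edge overlap. Node and edge names are invented for the example.

def edge_jaccard(edges_old, edges_new):
    """Similarity in [0, 1] between two snapshots over the same node set."""
    old, new = set(edges_old), set(edges_new)
    if not old and not new:
        return 1.0
    return len(old & new) / len(old | new)

def changed_edges(edges_old, edges_new):
    """Edges added or removed between the two snapshots."""
    old, new = set(edges_old), set(edges_new)
    return {"added": new - old, "removed": old - new}

if __name__ == "__main__":
    v1 = {("topicA", "topicB"), ("topicB", "topicC")}
    v2 = {("topicA", "topicB"), ("topicB", "topicD")}
    print(edge_jaccard(v1, v2))   # 0.333...
    print(changed_edges(v1, v2))  # the portion to colorize
```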

Match Making with NEO4J

Filed under: Graphs,Neo4j — Patrick Durusau @ 10:03 am

Match Making with NEO4J by Max De Marzi.

From the post:

In the “Matches are the new Hotness” blog post, I showed how to connect a person to a job via a location and skills. We’re going to look at a variation on the theme today by matching people to other people by what they want in a potential mate. We’re gonna use Neo4j to bring the love.

There are a ton of opinions on what’s wrong with current dating sites. I don’t claim to know how to fix them, I’m just giving what may be a piece of the puzzle. We could try to match people on the things they have in common, but the saying “opposites attract” exists for a reason. We often don’t want mirrors of ourselves, but rather to supplement some perceived deficiency. However complete opposites may result in exciting relationships, but may not be long-lasting. Some kind of happy middle ground is probably best.

This should come with a warning:

Don’t try this at home!

😉

Romantic advice, even from close friends, is fraught with peril. The professionals are getting paid for the risk.

Still, graphing high school/college romantic relationships could interest young people in computing and graphs.

Max has a great pic at this post. I had forgotten how beautiful she was.

Data Storytelling: The Ultimate Collection of Resources

Filed under: Communication,Data,Data Storytelling — Patrick Durusau @ 9:47 am

Data Storytelling: The Ultimate Collection of Resources by Zach Gemignani.

From the post:

The hot new concept in data visualization is “data storytelling”; some are calling it the next evolution of visualization (I’m one of them). However, we’re early in the discussion and there are more questions than answers:

  • Is data storytelling more than a catchy phrase?
  • Where does data storytelling fit into the broader landscape of data exploration, visualization, and presentation?
  • How can the traditional tools of storytelling improve how we communicate with data?
  • Is it more about story-telling or story-finding?

Many of the bright minds in the data visualization field have started to tackle these questions — and it is something that we’ve been exploring at Juice in our work. Below you’ll find a collection of some of the best blog posts, presentations, research papers, and other resources that take on this topic.

I count ten (10) blog posts, four (4) presentations, five (5) papers and eight (8) tools, examples and other resources.

Get yourself a fresh cup of coffee. You are going to be here a while.

PS: I don’t know that “data storytelling” is new or if the last century or so suffered a real drought in “data storytelling.”

Medieval cathedrals were exercises in storytelling but a modern/literate audience fails to appreciate them as designed.

PhenoMiner:..

PhenoMiner: quantitative phenotype curation at the rat genome database by Stanley J. F. Laulederkind, et al. (Database (2013) 2013: bat015 doi: 10.1093/database/bat015)

Abstract:

The Rat Genome Database (RGD) is the premier repository of rat genomic and genetic data and currently houses >40 000 rat gene records as well as human and mouse orthologs, >2000 rat and 1900 human quantitative trait loci (QTLs) records and >2900 rat strain records. Biological information curated for these data objects includes disease associations, phenotypes, pathways, molecular functions, biological processes and cellular components. Recently, a project was initiated at RGD to incorporate quantitative phenotype data for rat strains, in addition to the currently existing qualitative phenotype data for rat strains, QTLs and genes. A specialized curation tool was designed to generate manual annotations with up to six different ontologies/vocabularies used simultaneously to describe a single experimental value from the literature. Concurrently, three of those ontologies needed extensive addition of new terms to move the curation forward. The curation interface development, as well as ontology development, was an ongoing process during the early stages of the PhenoMiner curation project.

Database URL: http://rgd.mcw.edu

The line:

A specialized curation tool was designed to generate manual annotations with up to six different ontologies/vocabularies used simultaneously to describe a single experimental value from the literature.

sounded relevant to topic maps.

Turns out to be five ontologies and the article reports:

The ‘Create Record’ page (Figure 4) is where the rest of the data for a single record is entered. It consists of a series of autocomplete text boxes, drop-down text boxes and editable plain text boxes. All of the data entered are associated with terms from five ontologies/vocabularies: RS, CMO, MMO, XCO and the optional MA (Mouse Adult Gross Anatomy Dictionary) (13)

Important to note that authoring does not require the user to make explicit the properties underlying any of the terms from the different ontologies.

Some users probably know that level of detail but what is important is the capturing of their knowledge of subject sameness.

A topic map extension/add-on to such a system could flesh out those bare terms to provide a basis for treating terms from different ontologies as terms for the same subjects.

That merging/mapping detail need not bother an author or casual user.

But it increases the odds that future data sets can be reliably integrated with this one.

And issues with the correctness of a mapping can be meaningfully investigated.

If it helps, think of correctness of mapping as accountability, for someone else.
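
A minimal sketch of what such an extension might record, assuming nothing about RGD’s actual software: each bare term keeps its source ontology and a subject identifier, so terms from different ontologies can later be recognized as naming the same subject. The terms and identifiers below are made up for illustration; only the ontology abbreviations (CMO, MMO, and so on) come from the article.

```python
# Sketch: a subject map that ties (ontology, term) pairs to subject identifiers,
# so "same subject" questions can be answered across ontologies.

from collections import defaultdict

class SubjectMap:
    def __init__(self):
        self._by_identifier = defaultdict(set)  # subject identifier -> {(ontology, term)}
        self._by_term = {}                      # (ontology, term) -> subject identifier

    def add(self, ontology, term, subject_identifier):
        """Associate an ontology term with a subject identifier."""
        self._by_identifier[subject_identifier].add((ontology, term))
        self._by_term[(ontology, term)] = subject_identifier

    def same_subject(self, a, b):
        """Do two (ontology, term) pairs name the same subject?"""
        return (a in self._by_term and b in self._by_term
                and self._by_term[a] == self._by_term[b])

    def terms_for(self, subject_identifier):
        return self._by_identifier[subject_identifier]

if __name__ == "__main__":
    sm = SubjectMap()
    sm.add("CMO", "blood glucose level", "http://example.org/subject/blood-glucose")
    sm.add("MMO", "glucometry", "http://example.org/subject/blood-glucose")
    print(sm.same_subject(("CMO", "blood glucose level"), ("MMO", "glucometry")))  # True
```

The author or casual user never sees this bookkeeping; it simply records the subject sameness they already know.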

Data Science Markets [Marketing]

Filed under: Data Science,Marketing,Topic Maps — Patrick Durusau @ 8:38 am

Data Visualization: The Data Industry by Sean Gonzalez.

From the post:

In any industry you either provide a service or a product, and data science is no exception. Although the people who constitute the data science workforce are in many cases rebranded from statistician, physicist, algorithm developer, computer scientist, biologist, or anyone else who has had to systematically encode meaning from information as the product of their profession, data scientists are unique from these previous professions in that they operate across verticals as opposed to diving ever deeper down the rabbit hole.

Sean identifies five (5) market segments in data science and a visualization product for each one:

  1. New Recruits
  2. Contributors
  3. Distillers
  4. Consultants
  5. Traders

See Sean’s post for the details.

Have you identified market segments and the needs they have for topic map based data and/or software?

Yes, I said their needs.

You may want a “…more just, verdant, and peaceful world” but that’s hardly a common requirement.

Starting with a potential customer’s requirements is more likely to result in a sale.

Data Computation Fundamentals [Promoting Data Literacy]

Filed under: Data,Data Science,R — Patrick Durusau @ 8:08 am

Data Computation Fundamentals by Daniel Kaplan and Libby Shoop.

From the first lesson:

Teaching the Grammar of Data

Twenty years ago, science students could get by with a working knowledge of a spreadsheet program. Those days are long gone, says Danny Kaplan, DeWitt Wallace Professor of Mathematics and Computer Science. “Excel isn’t going to cut it,” he says. “In today’s world, students can’t escape big data. Though it won’t be easy to teach it, it will only get harder as they move into their professional training.”

To that end, Kaplan and computer science professor Libby Shoop have developed a one-credit class called Data Computation Fundamentals, which is being offered beginning this semester. Though Kaplan doesn’t pretend the course can address all the complexities of specific software packages, he does hope it will provide a framework that students can apply when they come across databases or data-reliant programs in biology, chemistry, and physics. “We believe we can give students that grammar of data that they need to use these modern capabilities,” he says.

Not quite “have developed.” Should say, “are developing, in conjunction with a group of about 20 students.”

Data literacy impacts the acceptance and use of data and tools for using data.

Teaching people to read and write is not a threat to commercial authors.

By the same token, teaching people to use data is not a threat to competent data analysts.

Help the authors and yourself by reviewing the course and offering comments for its improvement.

I first saw this at: A Course in Data and Computing Fundamentals.

The Matasano Crypto Challenges

Filed under: Cryptography,Cybersecurity,Security — Patrick Durusau @ 4:31 am

The Matasano Crypto Challenges by Maciej Ceglowski.

From the post:

I recently took some time to work through the Matasano crypto challenges, a set of 48 practical programming exercises that Thomas Ptacek and his team at Matasano Security have developed as a kind of teaching tool (and baited hook).

Much of what I know (or think I know) about security has come from reading tptacek’s comments on Hacker News, so I was intrigued when I first saw him mention the security challenges a few months ago. At the same time, I worried that I’d be way out of my depth attempting them.

As a programmer, my core strengths have always been knowing how to apologize to users, and composing funny tweets. While I can hook up a web template to a database and make the squigglies come out right, I cannot efficiently sort something for you on a whiteboard, or tell you where to get a monad. From my vantage point, crypto looms as high as Mount Olympus.

To my delight, though, I was able to get through the entire sequence. It took diligence, coffee, and a lot of graph paper, but the problems were tractable. And having completed them, I’ve become convinced that anyone whose job it is to run a production website should try them, particularly if you have no experience with application security.

Since the challenges aren’t really documented anywhere, I wanted to describe what they’re like in the hopes of persuading busy people to take the plunge.

You get the challenges in batches of eight by emailing cryptopals at Matasano, and solve them at your own pace, in the programming language of your choice. Once you finish a set, you send in the solutions and Sean unlocks the next eight. (Curiously, after the third set, Gmail started rejecting my tarball as malware.)

Most of the challenges take the form of practical attacks against common vulnerabilities, many of which will be sadly familiar to you from your own web apps. To keep things fun and fair for everyone, they ask you not to post the questions or answers online. (I cleared this post with Thomas to make sure it was spoiler-free.)

The challenges start with some basic string manipulation tasks, but after that they are grouped by theme. In most cases, you first implement something, then break it in several enlightening ways. The constructions you use will be familiar to any web programmer, but this may be the first time you have ever taken off the lid and looked at the moving parts inside.

While honoring the request not to post the questions/answers online, mapping the vulnerabilities you uncover would make a good start on a security topic map.

I first saw this in Four short links: 19 April 2013 by Nat Torkington.

April 19, 2013

Serious Topic Maps Avoid CNN

Filed under: Humor — Patrick Durusau @ 5:42 pm

The topic map committee chose not to provide guidance on creating topic maps.

In hindsight, I think that was a mistake. A big one.

How else can users know to avoid CNN.com when creating serious topic maps?

Instead of ISO/IEC SC34/WG3, people have to rely on Jon Stewart to get that information:

Jon Stewart Rips Into CNN For Lying About The Boston Marathon

I like Jon Stewart but an SDO he’s not. I doubt he is even a member of a national body.

I can imagine using CNN for a topic map, one about sexual graffiti in the catacombs of Rome being investigated by Geraldo Rivera.

And for a topic map on the descent of journalism into 24×7 infotainment.

But outside of that…, not a chance!

NLP Programming Tutorial

Filed under: Natural Language Processing — Patrick Durusau @ 5:07 pm

NLP Programming Tutorial by Graham Neubig.

From the webpage:

This is a tutorial I did at NAIST for people to start learning how to program basic algorithms for natural language processing.

You should need very little programming experience to start out, but each of the tutorials builds on the stuff from the previous tutorials, so it is highly recommended that you do them in order. You can also download the data for the practice exercises.

These are slides, so you will need to supply reading materials, references, local data sets of interest, etc.

Broccoli: Semantic Full-Text Search at your Fingertips

Filed under: Indexing,Search Algorithms,Search Engines,Semantic Search — Patrick Durusau @ 4:49 pm

Broccoli: Semantic Full-Text Search at your Fingertips by Hannah Bast, Florian Bäurle, Björn Buchhold, Elmar Haussmann.

Abstract:

We present Broccoli, a fast and easy-to-use search engine for what we call semantic full-text search. Semantic full-text search combines the capabilities of standard full-text search and ontology search. The search operates on four kinds of objects: ordinary words (e.g., edible), classes (e.g., plants), instances (e.g., Broccoli), and relations (e.g., occurs-with or native-to). Queries are trees, where nodes are arbitrary bags of these objects, and arcs are relations. The user interface guides the user in incrementally constructing such trees by instant (search-as-you-type) suggestions of words, classes, instances, or relations that lead to good hits. Both standard full-text search and pure ontology search are included as special cases. In this paper, we describe the query language of Broccoli, a new kind of index that enables fast processing of queries from that language as well as fast query suggestion, the natural language processing required, and the user interface. We evaluated query times and result quality on the full version of the English Wikipedia (32 GB XML dump) combined with the YAGO ontology (26 million facts). We have implemented a fully functional prototype based on our ideas, see http://broccoli.informatik.uni-freiburg.de.

The most impressive part of an impressive paper was the new index, context lists.

The second idea, which is the main idea behind our new index, is to have what we call context lists instead of inverted lists. The context list for a prefix contains one index item per occurrence of a word starting with that prefix, just like the inverted list for that prefix would. But along with that it also contains one index item for each occurrence of an arbitrary entity in the same context as one of these words.

The performance numbers speak for themselves.
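
Here is a toy sketch of my reading of the context-list idea (not the authors’ implementation, and it ignores the prefix-matching machinery): each context contributes postings for its words and for the entities that occur in it, so a word and an entity can be intersected cheaply.

```python
# Sketch: build "context lists" where words and entities from the same context
# share one index, then answer "where do this word and this entity co-occur?"

from collections import defaultdict

def build_context_lists(contexts):
    """contexts: iterable of (context_id, words, entities)."""
    index = defaultdict(list)  # token -> [(context_id, kind)]
    for cid, words, entities in contexts:
        for w in words:
            index[w.lower()].append((cid, "word"))
        for e in entities:
            index[e].append((cid, "entity"))
    return index

def contexts_with(index, word, entity):
    """Context ids where the word and the entity co-occur."""
    word_ctx = {cid for cid, kind in index.get(word, []) if kind == "word"}
    ent_ctx = {cid for cid, kind in index.get(entity, []) if kind == "entity"}
    return word_ctx & ent_ctx

if __name__ == "__main__":
    contexts = [
        (1, ["broccoli", "is", "edible"], ["Broccoli", "Plant"]),
        (2, ["the", "engine", "is", "fast"], ["Broccoli"]),
    ]
    idx = build_context_lists(contexts)
    print(contexts_with(idx, "edible", "Plant"))  # {1}
```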

This should be a feature in the next release of Lucene/Solr. Or perhaps even configurable for the number of entities that can appear in a “context list.”

Was it happenstance or a desire for simplicity that caused the original indexing engines to parse text into single tokens?

Literature references on that point?

Schema on Read? [The virtues of schema on write]

Filed under: BigData,Database,Hadoop,Schema,SQL — Patrick Durusau @ 3:48 pm

Apache Hadoop and Data Agility by Ofer Mendelevitch.

From the post:

In a recent blog post I mentioned the 4 reasons for using Hadoop for data science. In this blog post I would like to dive deeper into the last of these reasons: data agility.

In most existing data architectures, based on relational database systems, the data schema is of central importance, and needs to be designed and maintained carefully over the lifetime of the project. Furthermore, whatever data fits into the schema will be stored, and everything else typically gets ignored and lost. Changing the schema is a significant undertaking, one that most IT organizations don’t take lightly. In fact, it is not uncommon for a schema change in an operational RDBMS system to take 6-12 months if not more.

Hadoop is different. A schema is not needed when you write data; instead the schema is applied when using the data for some application, thus the concept of “schema on read”.

If a schema is supplied “on read,” how is data validation accomplished?

I don’t mean in terms of datatypes such as string, integer, double, etc. Those are trivial forms of data validation.

How do we validate the semantics of data when a schema is supplied on read?

Mistakes do happen in RDBMS systems but with a schema, which defines data semantics, applications can attempt to police those semantics.

I don’t doubt that schema “on read” supplies a lot of useful flexibility, but how do we limit the damage that flexibility can cause?

For example, many years ago, area codes (for telephones) in the USA were tied to geographic exchanges. Data from the era still exists in the bowels of some data stores. That is no longer true in many cases.

Let’s assume I have older data where area codes are tied to geographic areas and newer data where they are not. Without a schema to define the area code semantics in both cases, how would I know to treat the two sets of area code data differently?
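
A hedged sketch of the problem: with schema on read, the reader has to know, from somewhere outside the data, which vintage it is looking at and apply the matching interpretation. The field names and cutover year below are invented for illustration.

```python
# Sketch: the same bytes read under two different "on read" schemas.

import csv
import io

LEGACY_CUTOVER_YEAR = 1995  # hypothetical: before this, area code implies geography

def read_call_records(raw_csv, schema_vintage):
    """Apply a schema at read time; one file, two interpretations."""
    rows = csv.DictReader(io.StringIO(raw_csv))
    for row in rows:
        record = {"number": row["number"], "area_code": row["number"][:3]}
        if schema_vintage < LEGACY_CUTOVER_YEAR:
            # Old semantics: area code is a geographic exchange.
            record["region_hint"] = row.get("region", "unknown")
        else:
            # New semantics: area code carries no reliable geography.
            record["region_hint"] = None
        yield record

if __name__ == "__main__":
    data = "number,region\n4145551234,Milwaukee\n"
    print(list(read_call_records(data, schema_vintage=1990)))
    print(list(read_call_records(data, schema_vintage=2013)))
```

Nothing in the data itself tells the reader which branch is correct, which is exactly the concern.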

I concede that schema “on read” can be quite flexible.

On the other hand, let’s not discount the value of schema “on write” as well.
