Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

September 3, 2011

PCAT

Filed under: Data Analysis,Data Mining,PCAT — Patrick Durusau @ 6:45 pm

PCAT – Public Comment Analysis Toolkit

A cloud-based analysis service.

PCAT can import:

Federal Docket Management System Archives
Email, Blog and Wiki Content
Plain text, HTML, or XML Documents
Microsoft Word and Adobe PDFs
Excel or CSV Spreadsheets
Archived RSS Feeds
CAT-style Datasets

PCAT capabilities:

Search for key concepts & code text
Remove duplicates & cluster similar comments
Form peer & project networks
Establish credentials & permissions
Assign multiple coders to tasks
Annotate coding with shared memos
Easily measure inter-coder reliability (see the sketch after this list)
Adjudicate valid & invalid coder decisions
Generate reports in RTF, CSV, PDF or XML format
Archive or share completed projects online
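
The inter-coder reliability item deserves a note: PCAT doesn’t say how it computes agreement, but Cohen’s kappa is the usual choice for two coders and is easy enough to sketch. This is an illustration only, not PCAT’s code, and the categories are made up:

from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Chance-corrected agreement between two coders' labels."""
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected agreement if each coder kept their label frequencies but labeled at random.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two coders, six public comments, hypothetical categories.
print(cohens_kappa(["pro", "con", "con", "pro", "other", "pro"],
                   ["pro", "con", "pro", "pro", "other", "con"]))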

If you have used PCAT, please comment.

August 27, 2011

What is a Customer?

Filed under: Data Analysis,Data Integration — Patrick Durusau @ 9:11 pm

I ran across a series of posts where David Loshin explores the question: “What is a Customer?” or as he puts it in The Most Dangerous Question to Ask Data Professionals:

Q: What is the most dangerous question to ask data professionals?

A: “What is the definition of customer?”

And he includes some examples:

  • “customer” = person who gave us money for some of our stuff
  • “customer” = the person using our stuff
  • “customer” = the guy who approved the payment for our stuff
  • “customer account manager” = salesperson
  • “customer service” = complaints office
  • “customer representative” = gadfly

and explores the semantic confusion about how we use “customer.”

In Single Views Of The Customer, David explores the hazards and dangers of a single definition of customer.

When Is A Customer Not A Customer? starts to stray into more familiar territory when he says:

Here are the two pieces of our current puzzle: we have multiple meanings for the term “customer” but we want a single view of whatever that term means. To address this Zen-like conundrum we have to go beyond our context and think differently about the end, not the means. Here are two ideas to drill into: entity vs. role and semantic differentiation.

and after some interesting discussion (which you should read) he concludes:

What we can start to see is that a “customer” is not really a data type, nor is it really a customer. Rather, a “customer” is a role played by some entity (in this case, either an individual or an organization) within some functional context at different points of particular business processes. In the next post let’s decide how we can use this to rethink the single view of the customer.

I will be posting an update when the next post appears.

August 20, 2011

B.A.D. Data Is Not Always Bad…If You Have a Data Scientist

Filed under: Data Analysis,Marketing — Patrick Durusau @ 8:07 pm

B.A.D. Data Is Not Always Bad…If You Have a Data Scientist by Frank Coleman.

From the post:

How many times have you heard, “Bad data means bad decisions”? Starting with the Best Available Data (B.A.D.) is a great approach because it gets the inspection process moving. The best way to engage key stakeholders is to show them their numbers, even if you have low confidence in the results. If done well, you will be speaking with a group of passionate colleagues!

People are often afraid to start measuring a project or initiative because they have low confidence in the quality of the data they are accessing. But there is a great deal you can do with B.A.D. data; start by looking for trends. Many times the trend is all you really need to get going. Make sure you also understand what the distribution of this data looks like. You don’t have to be a Six Sigma black belt (though it helps) to know if the data has a normal distribution. From there you can “geek out” if you want, but your time will be better served by keeping it simple – especially at this stage.
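
A minimal sketch of that first look (a trend line and a quick distribution check), assuming NumPy/SciPy are at hand; the numbers below are invented:

import numpy as np
from scipy import stats

# Hypothetical weekly counts pulled from the best available data.
weekly_counts = [120, 131, 128, 140, 152, 149, 161, 170, 168, 181]

# Trend: slope of a least-squares line over time.
weeks = np.arange(len(weekly_counts))
slope, intercept = np.polyfit(weeks, weekly_counts, 1)
print(f"trend: {slope:+.1f} per week")

# Distribution: a quick normality check before anything fancier.
stat, p_value = stats.shapiro(weekly_counts)
print("looks roughly normal" if p_value > 0.05 else "not normal; look closer")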

A bit “practical” for my tastes, ;-), but worth your attention.

The secrets to successful data visualization

Filed under: Data Analysis,Visualization — Patrick Durusau @ 8:07 pm

The secrets to successful data visualization by Reena Jana.

From the post:

Effective data visualization is about more than designing an eye-catching graphic. It’s about telling a clear and accurate story that draws readers in via powerful choices of shapes and colors. These are some of the observations you’ll find in the insightful new book Visualize This: The Flowing Data Guide to Design, Visualization, and Statistics (Wiley) by Nathan Yau, the blogger behind the popular site Flowing Data. On his blog, Yau analyzes a wide variety of graphs and charts from around the world–and often sparks online discussions and debates among designers.

You have seen the Flowing Data site mentioned here more than once or twice.

If you don’t read another post this weekend, go to Reena’s post and read it. You will get something from it.

August 18, 2011

Building data startups: Fast, big, and focused

Filed under: Analytics,BigData,Data,Data Analysis,Data Integration — Patrick Durusau @ 6:54 pm

Building data startups: Fast, big, and focused (O’Reilly original)

Republished by Forbes as:
Data powers a new breed of startup

Based on the talk Building data startups: Fast, Big, and Focused

by Michael E. Driscoll

From the post:

A new breed of startup is emerging, built to take advantage of the rising tides of data across a variety of verticals and the maturing ecosystem of tools for its large-scale analysis.

These are data startups, and they are the sumo wrestlers on the startup stage. The weight of data is a source of their competitive advantage. But like their sumo mentors, size alone is not enough. The most successful of data startups must be fast (with data), big (with analytics), and focused (with services).

Describes the emerging big data stack and says:

The competitive axes and representative technologies on the Big Data stack are illustrated here. At the bottom tier of data, free tools are shown in red (MySQL, Postgres, Hadoop), and we see how their commercial adaptations (InfoBright, Greenplum, MapR) compete principally along the axis of speed; offering faster processing and query times. Several of these players are pushing up towards the second tier of the data stack, analytics. At this layer, the primary competitive axis is scale: few offerings can address terabyte-scale data sets, and those that do are typically proprietary. Finally, at the top layer of the big data stack lies the services that touch consumers and businesses. Here, focus within a specific sector, combined with depth that reaches downward into the analytics tier, is the defining competitive advantage.

The future isn’t going to be about getting users to develop topic maps, but about your use of topic maps (and other tools) to create data products of interest to users.

Think of it as being the difference between selling oil change equipment versus being the local Jiffy Lube. (Sorry, for non-U.S. residents: Jiffy Lube is a chain of oil change and other services, with some 2,000 locations in North America.) I dare say that Jiffy Lube and its competitors do more auto services than users of oil change equipment.

August 11, 2011

Building The Ocean With Big Data

Filed under: Analytics,BigData,Data Analysis — Patrick Durusau @ 6:33 pm

Building The Ocean With Big Data

From the post:

While working at an agency with a robust analytics group is exciting, it can also be frustrating at times. Clients challenge us with questions that are often difficult to answer with a simple data pull/request. For example, an auto client may ask how digital media is driving auto sales for a specific model in a specific location. Another client may like to better understand how much they need to spend on digital media, and to that end, which media sequencing is most effective (e.g. search -> display -> search -> social, etc.). Questions like these require multiple large sets of data, often in varying formats and time ranges. So the question becomes, with data collection and aggregation more important than ever, what steps can we take to make sure we analyze Big Data in a meaningful way?

Topic maps face the same issue as analysis of Big Data: where do you start?

If you start with no plan or a poorly planned one, you can work very hard for little or no gain. This article, while framed for analysis, has good principles for organizing analysis or mapping of Big Data.

August 8, 2011

Suicide Note Classification…ML Correct 78% of the time.

Filed under: Data Analysis,Data Mining,Machine Learning — Patrick Durusau @ 6:41 pm

Suicide Note Classification Using Natural Language Processing: A Content Analysis

Punch line (for the impatient):

…trainees accurately classified notes 49% of the time, mental health professionals accurately classified notes 63% of the time, and the best machine learning algorithm accurately classified the notes 78% of the time.

Abstract:

Suicide is the second leading cause of death among 25–34 year olds and the third leading cause of death among 15–25 year olds in the United States. In the Emergency Department, where suicidal patients often present, estimating the risk of repeated attempts is generally left to clinical judgment. This paper presents our second attempt to determine the role of computational algorithms in understanding a suicidal patient’s thoughts, as represented by suicide notes. We focus on developing methods of natural language processing that distinguish between genuine and elicited suicide notes. We hypothesize that machine learning algorithms can categorize suicide notes as well as mental health professionals and psychiatric physician trainees do. The data used are comprised of suicide notes from 33 suicide completers and matched to 33 elicited notes from healthy control group members. Eleven mental health professionals and 31 psychiatric trainees were asked to decide if a note was genuine or elicited. Their decisions were compared to nine different machine-learning algorithms. The results indicate that trainees accurately classified notes 49% of the time, mental health professionals accurately classified notes 63% of the time, and the best machine learning algorithm accurately classified the notes 78% of the time.
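
The abstract doesn’t say which algorithms were used, but a generic text-classification baseline of the kind being benchmarked looks something like this (a scikit-learn sketch; the notes and labels below are placeholders, not the study’s data):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Placeholder corpus; the real work would load the 33 genuine and 33 elicited notes.
notes = ["text of note one", "text of note two", "text of note three",
         "text of note four", "text of note five", "text of note six"]
labels = ["genuine", "elicited", "genuine", "elicited", "genuine", "elicited"]

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
print("mean accuracy:", cross_val_score(clf, notes, labels, cv=3).mean())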

The researchers concede that the data set is small but apparently it is the only one of its kind.

I mention the study here as a reason to consider using ML techniques in your next topic map project.

Merging the results from different ML algorithms re-creates the original topic maps use case (how do you merge indexes made by different indexers?), but that can’t be helped. More patterns to discover to use as the basis for merging rules!*

PS: I spotted this at Improbable Results: Machines vs. Professionals: Recognizing Suicide Notes.


* I wonder if we could apply the lessons from ensembles of classifiers to a situation where multiple classifiers are used by different projects? One part of me says that an ensemble is developed by a person or group that shares an implicit view of the data and so that makes the ensemble workable.

Another part wants to say that no, the results of classifiers, whether programmed by the same group or different groups, should not make a difference. Well, other than having to “merge” the results of the classifiers, which happens with an ensemble anyway. In that case you might have to think about it more.

Hard to say. Will have to investigate further.

July 26, 2011

Using Tools You Already Have

Filed under: Data Analysis,Shell Scripting — Patrick Durusau @ 6:22 pm

Using Tools You Already Have

A useful post on why every data scientist should know something about bash scripting.

July 24, 2011

KNIME Version 2.4.0 released

Filed under: Data Analysis,Data Integration,Data Mining — Patrick Durusau @ 6:45 pm

KNIME Version 2.4.0 released

From the release notice:

We have just released KNIME v2.4, a feature release with a lot of new functionality and some bug fixes. The highlights of this release are:

  • Enhancements around meta node handling (collapse/expand & custom dialogs)
  • Usability improvements (e.g. auto-layout, fast node insertion by double-click)
  • Polished loop execution (e.g. parallel loop execution available from labs)
  • Better PMML processing (added PMML preprocessing, which will also be presented at this year's KDD conference)
  • Many new nodes, including a whole suite of XML processing nodes, cross-tabulation and nodes for data preprocessing and data mining, including ensemble learning methods.

In case you aren’t familiar with KNIME, it is self-described as:

KNIME (Konstanz Information Miner) is a user-friendly and comprehensive open-source data integration, processing, analysis, and exploration platform. From day one, KNIME has been developed using rigorous software engineering practices and is currently being used actively by over 6,000 professionals all over the world, in both industry and academia.

What would you do the same/differently for a topic map interface?

July 13, 2011

RecordBreaker: Automatic structure for your text-formatted data

Filed under: Data Analysis,Data Mining — Patrick Durusau @ 7:30 pm

RecordBreaker: Automatic structure for your text-formatted data

From the post:

This post was contributed by Michael Cafarella, an assistant professor of computer science at the University of Michigan. Mike’s research interests focus on databases, in particular managing Web data. Before becoming a professor, he was one of the founders of the Nutch and Hadoop projects with Doug Cutting. This first version of RecordBreaker was developed by Mike in conjunction with Cloudera.

RecordBreaker is a project that automatically turns your text-formatted data (logs, sensor readings, etc) into structured data, without any need to write parsers or extractors. In particular, RecordBreaker targets Avro as its output format. The project’s goal is to dramatically reduce the time spent preparing data for analysis, enabling more time for the analysis itself.
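
RecordBreaker’s learned parsers are far more sophisticated, but the basic move (turn repetitive text lines into typed records without hand-writing a parser) can be sketched; the log lines here are invented:

def coerce(token):
    """Guess the narrowest type for a token: int, float, or leave it a string."""
    for cast in (int, float):
        try:
            return cast(token)
        except ValueError:
            pass
    return token

def infer_records(lines):
    """Split each line on whitespace and emit typed records."""
    return [[coerce(tok) for tok in line.split()] for line in lines]

# Hypothetical sensor log.
log = ["2011-07-13 sensor-7 21.5 OK",
       "2011-07-13 sensor-9 19.8 OK",
       "2011-07-14 sensor-7 22.1 FAIL"]
for record in infer_records(log):
    print(record)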

Not quite “automatic” but a step in that direction and a useful one.

Think of “automatic” identification of subjects and associations in such files.

Like the files from campaign financing authorities.

July 12, 2011

MADlib goes beta!

Filed under: Data Analysis,SQL,Statistics — Patrick Durusau @ 7:08 pm

MADlib goes beta! Serious in-database analytics

From the post:

MADlib is an open-source statistical analytics package for SQL that I kicked off last year with friends at EMC-Greenplum. Last Friday we saw it graduate from alpha, to the first beta release version, 0.20beta. Hats off to the MADlib team!

Forget your previous associations with low-tech SQL analytics, including so-called “business intelligence”, “olap”, “data cubes” and the like. This is the real deal: statistical and machine learning methods running at scale within the database, massively parallel, close to the data. Much of the code is written in SQL (a language that doesn’t get enough credit as a basis for parallel statistics), with key extensions in C/C++ for performance, and the occasional Python glue code. The suite of methods in the beta includes:

  • standard statistical methods like multi-variate linear and logistic regressions,
  • supervised learning methods including support-vector machines, naive Bayes, and decision trees
  • unsupervised methods including k-means clustering, association rules and Latent Dirichlet Allocation
  • descriptive statistics and data profiling, including one-pass Flajolet-Martin and CountMin sketch methods (my personal contributions to the library) to compute distinct counts, range-counts, quantiles, various types of histograms, and frequent-value identification
  • statistical support routines including an efficient sparse vector library and array operations, and conjugate gradient optimization.

Kudos to EMC:

And hats off to EMC-Greenplum for putting significant development resources behind this open-source effort. I started this project at Greenplum before they were acquired, and have been happy to see EMC embrace it and push it further.

Not every acquisition has that happy result.
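
The one-pass distinct-count sketches in the feature list (Flajolet-Martin) are simple enough to show in a few lines. This is a toy single-hash version with high variance, not MADlib’s implementation; real code averages over many hash functions:

import hashlib

def rho(n):
    """Position of the least significant 1 bit (number of trailing zeros)."""
    return (n & -n).bit_length() - 1 if n else 0

def fm_estimate(items):
    """Rough distinct-count estimate from a single pass over the items."""
    max_r = 0
    for item in items:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        max_r = max(max_r, rho(h))
    return (2 ** max_r) / 0.77351  # correction constant from the original paper

# 100,000 items, only 500 distinct values; expect a rough answer near 500.
print(fm_estimate("user-%d" % (i % 500) for i in range(100000)))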

July 9, 2011

Data journalism, data tools, and the newsroom stack

Filed under: Data Analysis,Data Mining — Patrick Durusau @ 7:02 pm

Data journalism, data tools, and the newsroom stack by Alex Howard.

From the post:

MIT’s recent Civic Media Conference and the latest batch of Knight News Challenge winners made one reality crystal clear: as a new era of technology-fueled transparency, innovation and open government dawns, it won’t depend on any single CIO or federal program. It will be driven by a distributed community of media, nonprofits, academics and civic advocates focused on better outcomes, more informed communities and the new news, whatever form it is delivered in.

The themes that unite this class of Knight News Challenge winners were data journalism and platforms for civic connections. Each theme draws from central realities of the information ecosystems of today. Newsrooms and citizens are confronted by unprecedented amounts of data and an expanded number of news sources, including a social web populated by our friends, family and colleagues. Newsrooms, the traditional hosts for information gathering and dissemination, are now part of a flattened environment for news, where news breaks first on social networks, is curated by a combination of professionals and amateurs, and then analyzed and synthesized into contextualized journalism.

Pointers to the newest resources and analysis of the issues of “transparency, innovation and open government….”

Until government transparency becomes public and cumulative, it will be personal and transitory.

Topic maps have the capability to make it the former instead of the latter.

June 27, 2011

Spark – Lightning-Fast Cluster Computing

Filed under: Clustering (servers),Data Analysis,Scala,Spark — Patrick Durusau @ 6:39 pm

Spark – Lightning-Fast Cluster Computing

From the webpage:

What is Spark?

Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark provides primitives for in-memory cluster computing: your job can load data into memory and query it repeatedly much quicker than with disk-based systems like Hadoop MapReduce.

To make programming faster, Spark integrates into the Scala language, letting you manipulate distributed datasets like local collections. You can also use Spark interactively to query big data from the Scala interpreter.

What can it do?

Spark was initially developed for two applications where keeping data in memory helps: iterative algorithms, which are common in machine learning, and interactive data mining. In both cases, Spark can outperform Hadoop by 30x. However, you can use Spark’s convenient API for general data processing too. Check out our example jobs.

Spark runs on the Mesos cluster manager, so it can coexist with Hadoop and other systems. It can read any data source supported by Hadoop.

Who uses it?

Spark was developed in the UC Berkeley AMP Lab. It’s used by several groups of researchers at Berkeley to run large-scale applications such as spam filtering, natural language processing and road traffic prediction. It’s also used to accelerate data analytics at Conviva. Spark is open source under a BSD license, so download it to check it out!
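
The “load it into memory and query it repeatedly” point is the heart of it. A minimal sketch using Spark’s Python API (the quoted text describes the Scala side; the file path and local master setting here are placeholders):

from pyspark import SparkContext

sc = SparkContext("local[*]", "cached-queries")

# Load once, filter, and pin the result in memory.
errors = sc.textFile("hdfs:///logs/app.log").filter(lambda l: "ERROR" in l).cache()

print(errors.count())                                   # first action materializes the cache
print(errors.filter(lambda l: "timeout" in l).count())  # later queries hit memory, not disk
print(errors.map(lambda l: l.split()[0]).distinct().count())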

Hadoop must be doing something right to be treated as the solution to beat.

Still, depending on your requirements, Spark definitely merits your consideration.

June 23, 2011

Personal Analytics

Filed under: Analytics,Conferences,Data,Data Analysis — Patrick Durusau @ 1:49 pm

Personal Analytics

An O’Reilly Online Strata Conference.

Free

July 12, 2011

16:00 – 18:30 UTC

From the website:

It’s only in the past decade that we’ve become aware of how much of our lives is recorded. From phone companies to merchants, social networks to employers, everyone’s building a record of us―except us. That’s changing. Once, recording every aspect of your life might have seemed obsessive. Now, armed with the latest smartphones and comfortable with visualizations and analytics, life-logging is no longer fringe behavior. In this Strata OLC, we’ll look at the rapidly growing field of personal analytics. We’ll discuss tool stacks for recording lives, and hear surprising stories about what happens when introspection meets technology.

O’Reilly Strata Online is a fast-paced, web-based conference series tackling the impact of a data-driven, always-on world. It combines thorough tutorials, provocative panel discussions, real-world case studies, and deep-dives into technology stacks.

This could be fun, not to mention a model for mini-conferences perhaps for topic maps.

June 12, 2011

U.S. DoD Is Buying. Are You Selling?

Filed under: BigData,Data Analysis,Data Integration,Data Mining — Patrick Durusau @ 4:14 pm

CTOVision.com reports: Big Data is Critical to the DoD Science and Technology Investment Agenda

Of the seven reported priorities:

(1) Data to Decisions – science and applications to reduce the cycle time and manpower requirements for analysis and use of large data sets.

(2) Engineered Resilient Systems – engineering concepts, science, and design tools to protect against malicious compromise of weapon systems and to develop agile manufacturing for trusted and assured defense systems.

(3) Cyber Science and Technology – science and technology for efficient, effective cyber capabilities across the spectrum of joint operations.

(4) Electronic Warfare / Electronic Protection – new concepts and technology to protect systems and extend capabilities across the electro-magnetic spectrum.

(5) Counter Weapons of Mass Destruction (WMD) – advances in DoD’s ability to locate, secure, monitor, tag, track, interdict, eliminate and attribute WMD weapons and materials.

(6) Autonomy – science and technology to achieve autonomous systems that reliably and safely accomplish complex tasks, in all environments.

(7) Human Systems – science and technology to enhance human-machine interfaces to increase productivity and effectiveness across a broad range of missions

I don’t see any where topic maps would be out of place.

Do you?

A Few Subjects Go A Long Way

Filed under: Data Analysis,Language,Linguistics,Text Analytics — Patrick Durusau @ 4:11 pm

A post by Rich Cooper (Rich AT EnglishLogicKernel DOT com) Analyzing Patent Claims demonstrates the power of small vocabularies (sets of subjects) for the analysis of patent claims.

It is a reminder that a topic map author need not identify every possible subject, but only so many of those as necessary. Other subjects abound and await other authors who wish to formally recognize them.

It is also a reminder that a topic map need only be as complex or as complete as necessary for a particular task. My topic map may not be useful for Mongolian herdsmen or even the local bank. But the test isn’t abstract, it’s practical: does it meet the needs of its intended audience?

Dremel: Interactive Analysis of Web-Scale Datasets

Filed under: BigData,Data Analysis,Data Structures,Dremel,MapReduce — Patrick Durusau @ 4:10 pm

Google, along with Bing and Yahoo!, has been attracting a lot of discussion for venturing into web semantics without asking permission.

However that turns out, please don’t miss:

Dremel: interactive analysis of web-scale datasets

Abstract:

Dremel is a scalable, interactive ad hoc query system for analysis of read-only nested data. By combining multilevel execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. In this paper, we describe the architecture and implementation of Dremel, and explain how it complements MapReduce-based computing. We present a novel columnar storage representation for nested records and discuss experiments on few-thousand node instances of the system.

I am still working through the article but “…aggregation queries over trillion-row tables in seconds,” is obviously of interest for a certain class of topic map.
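
The repetition/definition-level encoding for nested records is the clever part of the paper and too much to reproduce here, but the basic columnar payoff (an aggregation touches only the columns it needs) is easy to sketch:

# Row layout: every query walks whole records.
rows = [{"user": "a", "country": "US", "bytes": 120},
        {"user": "b", "country": "DE", "bytes": 310},
        {"user": "c", "country": "US", "bytes": 95}]

# Column layout: the same data pivoted into one list per field.
columns = {field: [r[field] for r in rows] for field in rows[0]}

# SELECT sum(bytes): the columnar version never looks at "user" or "country".
print(sum(r["bytes"] for r in rows))  # row store reads every field of every record
print(sum(columns["bytes"]))          # column store reads one contiguous list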

June 11, 2011

Image gallery: 22 free tools for data visualization and analysis

Filed under: Data Analysis,Visualization — Patrick Durusau @ 12:42 pm

Image gallery: 22 free tools for data visualization and analysis

A chart of data visualization and analysis tools from ComputerWorld with required skill levels for use. Accompanies a story which reviews each tool, usefully, albeit briefly. Gives references for further study. If you are looking for a new visualization/analysis tool or just want an overview of the area, this is a good place to start.

May 29, 2011

Exploring NYT news and its authors

Filed under: Data Analysis,Data Mining,Visualization — Patrick Durusau @ 7:05 pm

Exploring NYT news and its authors

To say this project/visualization is clever is an understatement!

A completely inadequate description, but: the interface constructs a mythic “single” reporter on any topic you choose from stories in the New York Times. The interface also gives you the reporters who wrote stories on that topic. You can then find what “other” stories the mythic single reporter wrote, as well as compare the stories written by actual NYT reporters.

A project of the IBM Center for Social Software, see: NYTWrites: Exploring The New York Times Authorship.

April 25, 2011

Inside Horizon: interactive analysis at cloud scale

Filed under: Cloud Computing,Data Analysis,Data Mining — Patrick Durusau @ 3:36 pm

Inside Horizon: interactive analysis at cloud scale

From the website:

Late last year, we were honored to be invited to talk at Reflections|Projections, ACM@UIUC’s annual student-run computing conference. We decided to bring a talk about Horizon, our system for doing aggregate analysis and filtering across very large amounts of data. The video of the talk was posted a few weeks back on the conference website.

Horizon started as research project / technology demonstrator built as part of Palantir’s Hack Week – a periodic innovation sprint that our engineering team uses to build brand new ideas from whole cloth. It was then used by the Center For Public Integrity in their Who’s Behind The Subprime Meltdown report. We produced a short video on the subject, Beyond the Cloud: Project Horizon, released on our analysis blog. Subsequently, it was folded into our product offering, under the name Object Explorer.

In this hour-long talk, two of the engineers that built this technology tell the story of how Horizon came to be, how it works, and show a live demo of doing analysis on hundreds of millions of records in interactive time.

From the presentation:

Mission statement: Organize the world’s information and make it universally accessible and useful. -> Google’s statement

Which should say:

Organize the world’s [public] information and make it universally accessible and useful.

Palantir’s mission:

Organize the world’s [private] information and make it universally accessible and useful.

Closes on human-driven analysis.

A couple of points:

The demo was of a pre-beta version even though the product version shipped several months prior to the presentation. What’s with that?

Long on general statements and short on any specifics.

Did mention this is a column-store solution. Appears to work well with very clean data, but then what solution doesn’t?

Good emphasis on user interface and interactive responses to queries.

I wonder if the emphasis on interactive responses creates unrealistic expectations among customers?

Or an emphasis on problems that can be solved or appear to be solvable, interactively?

My comments about intelligence community bias the other day for example. You can measure and visualize tweets that originate in Tahrir Square, but if they are mostly from Western media, how meaningful is that?

April 22, 2011

Intuition = …because I said so!

Filed under: Data Analysis,Machine Learning — Patrick Durusau @ 1:05 pm

Intuition & Data-Driven Machine Learning

From the post:

Clever algorithms and pages of mathematical formulas filled with probability and optimization theory are usually the associations that get invoked when you ask someone to describe the fields of AI and Machine Learning. Granted, there is definitely an abundance of both, but this mental picture also tends to obscure some of the more interesting and recent developments in these fields: data driven learning, and the fact that you are often better off developing simple intuitive insights instead of complicated domain models which are meant to represent every attribute of the problem.

I wonder about the closing observation:

you are often better off developing simple intuitive insights instead of complicated domain models which are meant to represent every attribute of the problem.

Does that apply to identifications of subjects as well?

May we not be better off to capture the conclusion of an analyst that “X” is a fact, from some large body of data, rather than finding a clever way in the data to map their conclusion to that of other analysts?

Both said “X,” what more do we need? True enough we need to identify “X” in some way but that is simpler than trying to justify the conclusion in data.

I suppose I am arguing there should be room in subject identification for human intuition, that is, “…because I said so!” 😉

Full-Text Indexing for Optimizing Selection Operations in Large-Scale Data Analytics

Filed under: Data Analysis,Hadoop,Indexing,MapReduce — Patrick Durusau @ 1:04 pm

Full-Text Indexing for Optimizing Selection Operations in Large-Scale Data Analytics by Jimmy Lin, Dmitriy Ryaboy, and Kevin Weil.

Abstract:

MapReduce, especially the Hadoop open-source implementation, has recently emerged as a popular framework for large-scale data analytics. Given the explosion of unstructured data begotten by social media and other web-based applications, we take the position that any modern analytics platform must support operations on free-text fields as first-class citizens. Toward this end, this paper addresses one inefficient aspect of Hadoop-based processing: the need to perform a full scan of the entire dataset, even in cases where it is clearly not necessary to do so. We show that it is possible to leverage a full-text index to optimize selection operations on text fields within records. The idea is simple and intuitive: the full-text index informs the Hadoop execution engine which compressed data blocks contain query terms of interest, and only those data blocks are decompressed and scanned. Experiments with a proof of concept show moderate improvements in end-to-end query running times and substantial savings in terms of cumulative processing time at the worker nodes. We present an analytical model and discuss a number of interesting challenges: some operational, others research in nature.
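
The core idea is easy to prototype outside Hadoop: index which block each term appears in, then scan only those blocks. A sketch, with in-memory lists standing in for compressed HDFS blocks:

from collections import defaultdict

# Toy "blocks"; in the paper these are compressed data blocks in HDFS.
blocks = [["error in module a", "user logged in"],
          ["checkout completed", "payment failed"],
          ["error in module b", "cache warmed"]]

# Full-text index: term -> set of block ids that contain it.
index = defaultdict(set)
for block_id, records in enumerate(blocks):
    for record in records:
        for term in record.split():
            index[term].add(block_id)

def select(term):
    """Scan only the blocks the index says can match; skip the rest."""
    hits = []
    for block_id in sorted(index.get(term, ())):
        hits.extend(r for r in blocks[block_id] if term in r.split())
    return hits

print(select("error"))  # scans blocks 0 and 2, skips block 1 entirely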

I always hope when I see first-class citizen(s) in CS papers that it is going to be talking about data structures and/or metadata (hopefully both).

Alas, I was disappointed once again but the paper is an interesting one and will repay close study.

Oh, the reason I mention treating data structures and metadata as first class citizens is that then I can avoid the my way, your way or the highway sort of choices when it comes to metadata and formats.

Granted, some formats may be easier to use in some contexts, such as HDF5 (for space data), FITS (astronomical images), XML (for data and documents) or COBOL (for financial transactions), but if I can see formats as first class citizens, then I can map between them.

Not in a conversion sense, I can see them as though they are the same format as I prefer. Extract data from them, write data to them, etc.
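
A crude sketch of what “seeing them as the same format” might look like in practice. The field names and the mapping below are invented; the point is only that a mapping, not a conversion, does the work:

import csv, io, json

# One subject ("observation time"), two formats, two native names for it.
field_map = {"observation_time": {"csv": "obs_ts", "json": "timestamp"}}

def read_field(subject, fmt, record):
    """Pull the same logical field out of whichever format the record is in."""
    return record[field_map[subject][fmt]]

csv_row = next(csv.DictReader(io.StringIO("obs_ts,temp\n2011-04-22T13:05,21.4\n")))
json_rec = json.loads('{"timestamp": "2011-04-22T13:05", "temp": 21.4}')

print(read_field("observation_time", "csv", csv_row))
print(read_field("observation_time", "json", json_rec))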

April 10, 2011

When the Data Struts Its Stuff

Filed under: Data Analysis,Visualization — Patrick Durusau @ 2:52 pm

When the Data Struts Its Stuff

A New York Times piece on data visualization.

Probably not anything you don’t already know or at least suspect but it is well written and emphasizes the riches that await discovery.

Think of it as setting the bar for topic map applications that are going to attract a lot of positive press.

March 16, 2011

KNIME – 4th Annual User Group Meeting

Filed under: Data Analysis,Heterogeneous Data,Mapping,Subject Identity — Patrick Durusau @ 3:14 pm

KNIME – 4th Annual User Group Meeting

From the website:

The 4th KNIME Workshop and Users Meeting at Technopark in Zurich, Switzerland took place between February 28th and March 4th, 2011 and was a huge success.

The meeting was very well attended by more than 130 participants. The presentations ranged from customer intelligence and applications of KNIME in soil and fuel research through to high performance data analytics and KNIME applications in the Life Science industry. The second meeting of the special interest group attracted more than 50 attendees and was filled with talks about how KNIME can be put to use in this fast growing research area.

Presentations are available.

A new version of KNIME is available for download with the features listed in ChangeLog 2.3.3.

Focused on data analytics and work flow, another software package that could benefit from an interchangeable subject-oriented approach.

March 8, 2011

Summify’s Technology Examined

Filed under: Data Analysis,Data Mining,MongoDB,MySQL,Redis — Patrick Durusau @ 9:54 am

Summify’s Technology Examined

Phil Whelan writes an interesting review of the underlying technology for Summify.

Many of those same components are relevant to the construction of topic map based services.

Interesting that Summify uses MySQL, Redis and MongoDB.

I rather like the idea of using the best tool for a particular job.

Worth a close read.

March 2, 2011

OSCON Data 2011 Call for Participation

Filed under: Conferences,Data Analysis,Data Mining,Data Models,Data Structures — Patrick Durusau @ 7:07 am

OSCON Data 2011 Call for Participation

Deadline: 11:59pm 03/14/2011 PDT

From the website:

The O’Reilly OSCON Data conference is the first of its kind: bringing together open source culture and data hackers to cover data management at a very practical level. From disks and databases through to big data and analytics, OSCON Data will have instruction and inspiration from the people who actually do the work.

OSCON Data will take place July 25-27, 2011, in Portland, Oregon. We’ll be co-located with OSCON itself.

Proposals should include as much detail about the topic and format for the presentation as possible. Vague and overly broad proposals don’t showcase your skills and knowledge, and our volunteer reviewers aren’t mind readers. The more you can tell us, the more likely the proposal will be selected.

Proposals that seem like a “vendor pitch” will not be considered. The purpose of OSCON Data is to enlighten, not to sell.

Submit a proposal.

Yes, it is right before Balisage but I think worth considering if you are on the West Coast and can’t get to Balisage this year or if you are feeling really robust. 😉

Hmmm, I wonder how a proposal that merges the indexes of the different NoSQL volumes from O’Reilly would be received? You are aware that O’Reilly is re-creating the X-Windows problem that was the genesis of both topic maps and DocBook?

I will have to write that up in detail at some point. I wasn’t there but have spoken to some of the principals who were. Plus I have the notes, etc.

February 25, 2011

…a grain of salt

Filed under: Data Analysis,Data Models,Data Structures,Marketing — Patrick Durusau @ 5:46 pm

Benjamin Bock asked me recently about how I would model a mole of salt in a topic map.

That is a good question but I think we had better start with a single grain of salt and then work our way up from there.

At first blush, and only at first blush, many subjects look quite easy to represent in a topic map.

A grain of salt looks simple to at first glance, just create a PSI (Published Subject Identifier), put that as the subjectIdentifier on a topic and be done with it.

Well…, except that I don’t want to talk about a particular grain of salt, I want to talk about salt more generally.

OK, one of those, I see.

Alright, same answer as before, except make the PSI for salt in general, not some particular grain of salt.

Well,…., except that when I go to the Wikipedia article on salt, Salt, I find that salt is a compound of chlorine and sodium.

A compound, oh, that means something made up of more than one subject. In a particular type of relationship.

Sounds like an association to me.

Of a particular type, an ionic association. (I looked it up, see: Ionic Compound)

And this association between chlorine and sodium has several properties reported by Wikipedia, here are just a few of them:

  • Molar mass: 58.443 g/mol
  • Appearance: Colorless/white crystalline solid
  • Odor: Odorless
  • Density: 2.165 g/cm3
  • Melting point: 801 °C, 1074 K, 1474 °F
  • Boiling point: 1413 °C, 1686 K, 2575 °F
  • … and several others.

If you are interested in scientific/technical work, please be aware of CAS, a work product of the American Chemical Society, with a very impressive range of unique identifiers. (56 million organic and inorganic substances, 62 million sequences and they have a counter that increments while you are on the page.)

Note that unlike my suggestion, CAS takes the assign-a-unique-identifier view for the substances, sequences and chemicals that they curate.

Oh, sorry, got interested in the CAS as a source for subject identification. In fact, that is a nice segue to consider how to represent the millions and millions of compounds.

We could create associations with the various components being role players but then we would have to reify those associations in order to hang additional properties off of them. Well, technically speaking in XTM we would create non-occurrence occurrences and type those to hold the additional properties.

Sorry, I was presuming the decision to represent compounds as associations. Shout out when I start to presume that sort of thing. 😉

The reason I would represent compounds as associations is that the components of the associations are then subjects I can talk about and even add additional properties to, or create mappings between.
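
Roughly what I have in mind, sketched as plain data structures rather than any particular syntax (the PSIs are placeholders and this is not XTM, just the shape of the idea):

# Topics for the role players, with placeholder PSIs.
sodium   = {"id": "sodium",   "psi": "http://example.org/psi/sodium"}
chlorine = {"id": "chlorine", "psi": "http://example.org/psi/chlorine"}

# The compound as an association: typed, with role players and its own properties.
salt = {
    "type": "ionic-compound",
    "roles": {"cation": sodium, "anion": chlorine},
    "properties": {"molar mass": "58.443 g/mol", "melting point": "801 °C"},
}

# Because the role players are subjects in their own right, they can carry more
# properties or be mapped to other identifiers (CAS numbers, older designations).
sodium["properties"] = {"symbol": "Na", "atomic number": 11}
print(salt["roles"]["cation"]["properties"]["atomic number"])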

I suspect that CAS has chemistry from the 1800s fairly well covered but what about older texts? Substances before then may not be of interest to commercial chemists but certainly would be of interest to historians and other scholars.

Use of a topic map plus the CAS identifiers would enable scholars studying older materials to effectively share information about older texts, which have different designations for substances than CAS would record.

You could argue that I could use a topic for compounds, much as CAS does, and rely upon searching in order to discover relationships.

Tis true, tis true, but my modeling preference is for relationships seen as subjects, although I must confess I would prefer a next generation syntax that avoids the reification overhead of XTM.

Given the prevalence of complex relationships/associations as you see from the CAS index, I think a simplification of the representation of associations is warranted.

Sorry, I never did quite reach Benjamin’s question about a mole of salt but I will take up that gauntlet again tomorrow.

We will see that measurement (which figured into his questions about recipes as well) is an interesting area of topic map design.
*****

PS: Comments and/or suggestions on areas to post about are most welcome. Subject analysis for topic maps is not unlike cataloging in library science to a degree, except that what classification you assign is entirely the work product of your experience, reading and analysis. There are no fixed answers, only the ones that you find the most useful.

February 14, 2011

“Data Bootcamp” tutorial at O’Reilly’s Strata Conference 2011

Filed under: Data Analysis,Data Mining — Patrick Durusau @ 10:39 am

“Data Bootcamp” tutorial at O’Reilly’s Strata Conference 2011

All the materials from the “Data Bootcamp.”

I haven’t had time to review the materials but am looking forward to it.

February 11, 2011

Dealing with Data

Filed under: Data Analysis,Data Mining,Marketing — Patrick Durusau @ 12:45 pm

Dealing with Data

From the website:

In the 11 February 2011 issue, Science joins with colleagues from Science Signaling, Science Translational Medicine, and Science Careers to provide a broad look at the issues surrounding the increasingly huge influx of research data. This collection of articles highlights both the challenges posed by the data deluge and the opportunities that can be realized if we can better organize and access the data.

Science is making access to this entire collection FREE (simple registration is required for non-subscribers).

The growing concern over the influx of data represents a golden marketing opportunity for topic maps!

First, the predictions about increasing amounts of data are coming true.

That means impressive numbers to cite and even more impressive predictions about the future.

Second, the coming data deluge represents a range of commercial opportunities.

Opportunities for reuse, comparison, and mining of such data abound, and they only increase as more data comes online.

Are you going to be the Facebook of some data area?

Third, and the reason unique to topic maps:

The format that contains data is recognized as composed of subjects.

Subjects that can be identified, placed in associations, and have properties added to them.

That one insight is critical to re-use, combination and comparison of data in the data deluge.

If you identify the subjects that compose those structures, as well as the subject thought to be recognized by those data structures, you can then create maps between diverse data sets.

It is the identification of subjects that enables the creation and interchange of maps of where to swim in this vast sea of data.

*****
PS: I am going to take a slow walk through these articles and will be posting about opportunities that I see for topic maps. Your comments/feedback welcome!

February 10, 2011

The unreasonable effectiveness of simplicity

Filed under: Authoring Topic Maps,Crowd Sourcing,Data Analysis,Subject Identity — Patrick Durusau @ 1:50 pm

The unreasonable effectiveness of simplicity from Panos Ipeirotis suggests that simplicity should be considered in the construction of information resources.

The simplest aggregation technique: Use the majority vote as the correct answer.
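
In code the whole technique fits in a few lines (a sketch; the votes are made up):

from collections import Counter

def majority_vote(labels):
    """The simplest aggregation: the most common answer wins."""
    return Counter(labels).most_common(1)[0][0]

# Five people answer yes/no on the same identification.
print(majority_vote(["yes", "yes", "no", "yes", "no"]))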

I am mindful of the discussion several years ago about visual topic maps, a proposal to use images as identifiers. Certainly doable now, but the simplicity angle suggests an interesting possibility.

Would not work for highly abstract subjects, but what if users were presented with images when called upon to make identification choices for a topic map?

For example, marking entities in a newspaper account, the user is presented with images near each marked entity and chooses yes/no.

Or in legal discovery or research, a similar mechanism, along with the ability to annotate any string with an image/marker and that image/marker appears with that string in the rest of the corpus.

Unknown to the user is further information about the subject they have identified that forms the basis for merging identifications, linking into associations, etc.

A must read!
