Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

September 24, 2012

High Dimensional Undirected Graphical Models

Filed under: Graphs,High Dimensionality,Uncertainty — Patrick Durusau @ 4:06 pm

High Dimensional Undirected Graphical Models by Larry Wasserman.

Larry discusses uncertainty in high dimensional graphs. No answers, but he does illustrate the problem.

Alpinism & Natural Language Processing

Filed under: Natural Language Processing,Topic Maps — Patrick Durusau @ 3:53 pm

Alpinism & Natural Language Processing

You will find a quote in this posting that reads:

“In linguistics and cultural studies, the change of language use over time, special terminology and cultural shifts are of interest. The “speaking” about mountains is characterised by cultural, historical and social factors; therefore, language use can be viewed as a mirror of these factors. The extra-linguistic world, the essence of a culture, can be reconstructed through analyzing language use within alpine literature in terms of temporal and local specifics that emerged from this typical use of language (Bubenhofer, 2009). For instance, frequent use of personal pronouns and specific intensifiers in texts between 1930 and 1950 can be interpreted as a shift to a more subjective, personal role that mountaineering played in society. In contrary, between 1880 and 1900, the language surface shows less emotionality which probably is a mirror of a period when the mountain pioneers claimed more seriousness (Bubenhofer and Schröter, 2010).”

I thought this might prove interesting to topic map friends who live in areas where mountains and mountain climbing are common.

Oracle ADF Core Functionality Now Available for Free…

Filed under: Oracle — Patrick Durusau @ 3:28 pm

Oracle ADF Core Functionality Now Available for Free – Presenting Oracle ADF Essentials by Shay Shmeltzer.

From the post:

We are happy to announce the new Oracle ADF Essentials – a free to develop and deploy version of the core technologies at the base of Oracle ADF – Oracle’s strategic development framework that was used, among other things, to build the new generation of the enterprise Oracle Fusion Applications.

This release is aligned with the new Oracle JDeveloper 11.1.2.3 version that we released today.

Oracle ADF Essentials enables developers to use the following for free:

  • Oracle ADF Faces Rich Client components – over 150 JSF 2.0 components that include extensive charting and data visualization components, supports skinning, internationalization, accessibility and touch gestures and providing advanced Ajax, windowing, drag and drop and other UI capabilities in a declarative way.
  • Oracle ADF Controller – an extension on top of the JSF controller providing complete process flow definition and enabling advanced reusability of flows inside page’s regions.
  • Oracle ADF Binding – a declarative way to bind various business services to JSF user interfaces eliminating tedious managed-beans coding.
  • Oracle ADF Business Components – a declarative layer for building Java based business services on top of relational databases.

The lesson here is to give away tools that let people write interfaces to the products you are interested in selling. Particularly if interfaces aren’t in your product line.

Like applying topic maps to relational database content. Just as an example.

I first saw this at DZone.

How to Build a Recommendation Engine

Filed under: Recommendation — Patrick Durusau @ 3:14 pm

How to Build a Recommendation Engine by John F. McGowan.

From the post:

This article shows how to build a simple recommendation engine using GNU Octave, a high-level interpreted language, primarily intended for numerical computations, that is mostly compatible with MATLAB. A recommendation engine is a program that recommends items such as books and movies for customers, typically of a web site such as Amazon or Netflix, to purchase. Recommendation engines frequently use statistical and mathematical methods to estimate what items a customer would like to buy or would benefit from purchasing.

From a purely business point of view, one would like to maximize the profit from a customer, discounted for time (a dollar today is worth more than a dollar next year), over the duration that the customer is a customer of the business. In a long term relationship with a customer, this probably means that the customer needs to be happy with most purchases and most recommendations.

Recommendation engines are “hot” right now. There are many attempts to apply advanced statistics and mathematics to predict what customers will buy, what purchases will make customers happy and buy again, and what purchases deliver the most value to customers. Data scientists are trying to apply a range of methods with fancy technical names such as principal component analysis (PCA), neural networks, and support vector machines (SVM) — amongst others — to predicting successful purchases and personalizing recommendations for individual customers based on their stated preferences, purchasing history, demographics and other factors.

This article presents a simple recommendation engine using Pearson’s product moment correlation coefficient, also known as the linear correlation coefficient. The engine uses the correlation coefficient to identify customers with similar purchasing patterns, and presumably tastes, and recommends items purchased by one customer to the other similar customer who has not purchased those items.
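
Not McGowan’s Octave code, but a minimal Python sketch of the same idea, with an invented purchase matrix: compute Pearson’s correlation between customers’ purchase vectors, find the most similar customer, and recommend what they bought that you have not.

import numpy as np

def pearson(a, b):
    # Pearson product moment correlation between two purchase vectors.
    a, b = np.asarray(a, float), np.asarray(b, float)
    a_c, b_c = a - a.mean(), b - b.mean()
    denom = np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())
    return (a_c * b_c).sum() / denom if denom else 0.0

def recommend(target, purchases):
    # Recommend items the most similar customer bought that the target has not.
    others = {name: vec for name, vec in purchases.items() if name != target}
    best = max(others, key=lambda name: pearson(purchases[target], others[name]))
    return [i for i, (mine, theirs) in
            enumerate(zip(purchases[target], purchases[best])) if theirs and not mine]

# Rows are customers, columns are items (1 = purchased). Toy data only.
purchases = {
    "alice": [1, 0, 1, 1, 0],
    "bob":   [1, 0, 1, 0, 1],
    "carol": [0, 1, 0, 1, 0],
}
print(recommend("alice", purchases))   # bob is most similar to alice, so this prints [4]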

Probably not the recommendation engine you will use for commercial deployment.

But, it will give you a good start on understanding the principles of recommendation engines.

My interest in recommendations isn’t so much to identify the subjects of recommendation, which are topics in their own rights, as in probing the basis for subject identification by multiple users.

That is, there is some identification that underlies the choice of one book or movie over another. It may not be possible to identify the components of that identification, but we do have the aftermath of that identification.

Rather than collapsing dimensions, I think we should expand the dimensions around choices to see if any patterns emerge.

I first saw this at DZone.

Schedule This! Strata + Hadoop World Speakers from Cloudera

Filed under: Cloudera,Conferences,Hadoop — Patrick Durusau @ 2:38 pm

Schedule This! Strata + Hadoop World Speakers from Cloudera by Justin Kestelyn.

Oct. 23-25, 2012, New York City

From the post:

We’re getting really close to Strata Conference + Hadoop World 2012 (just over a month away), schedule planning-wise. So you may want to consider adding the tutorials, sessions, and keynotes below to your calendar! (Start times are always subject to change of course.)

The ones listed below are led or co-led by Clouderans, but there is certainly a wide range of attractive choices beyond what you see here. We just want to ensure that you put these particular ones high on your consideration list.

Just in case the Clouderans aren’t enough incentive to attend (they should be), consider the full schedule for the conference.

ZooKeeper 3.4.4 is Now Available

Filed under: Zookeeper — Patrick Durusau @ 2:17 pm

ZooKeeper 3.4.4 is Now Available by Mahadev Konar.

From the post:

Apache ZooKeeper release 3.4.4 is now available. This is a bug fix release including 50 bug fixes. Following is a summary of the critical issues fixed in the release.

Cool!

September 23, 2012

Congress.gov: New Official Source of U.S. Federal Legislative Information

Filed under: Government,Government Data,Law,Law - Sources,Legal Informatics — Patrick Durusau @ 7:50 pm

Congress.gov: New Official Source of U.S. Federal Legislative Information

Legal Informatics has gathered up links to a number of reviews/comments on the new legislative interface for the U.S. federal government.

You can see the beta version at: Congress.gov.

Personally I like search and popularity being front and center, but that makes me wonder what isn’t available. Like bulk downloads in some reasonable format (can you say XML?).

What do you think about the interface?

Java: Parsing CSV files

Filed under: CSV,Java — Patrick Durusau @ 7:37 pm

Java: Parsing CSV files by Mark Needham

Mark is switching to OpenCSV.

See his post for how he is using OpenCSV and other info.

Meet the Committer, Part Two: Matt Foley [Ambari herein]

Filed under: Clustering (servers),Hadoop — Patrick Durusau @ 7:30 pm

Meet the Committer, Part Two: Matt Foley by Kim Truong

From the post:

For the next installation of “Future of Apache Hadoop” webinar series, I would like to introduce to you Matt Foley and Ambari. Matt is a member of Hortonworks technical staff, Committer and PMC member for Apache Hadoop core project and will be our guest speaker on September 26, 2012 @10am PDT / 1pm EDT webinar: Deployment and Management of Hadoop Clusters with AMBARI.

Get to know Matt in this second installment of our “Meet the Committer” series.

No pressure but I do hope this compares well to the Alan Gates webinar on Pig. No pressure. 😉

In case you want to investigate/learn/brush up on Ambari.

Qlikview and Google BigQuery…

Filed under: Google BigQuery,Qlikview — Patrick Durusau @ 4:57 pm

Qlikview and Google BigQuery – Data Visualization for Big Data by Istvan Szegedi.

From the post:

Google have launched its BigQuery cloud service in May to support interactive analysis of massive datasets up to billions of rows. Shortly after this launch Qliktech, one of the market leaders in BI solutions who is known for its unique associative architecture based on column store, in-memory database demonstrated a Qlikview Google BigQuery application that provided data visualization using BigQuery as backend. This post is about how Qlikview and Google BigQuery can be integrated to provide easy-to-use data analytics application for business users who work on large datasets.

A “big data” offering to limber you up for the coming week!

neo4j: The Batch Inserter and the sunk cost fallacy

Filed under: Design,Neo4j — Patrick Durusau @ 4:47 pm

neo4j: The Batch Inserter and the sunk cost fallacy by Mark Needham.

From the post:

About a year and a half ago I wrote about the sunk cost fallacy which is defined like so:

The Misconception: You make rational decisions based on the future value of objects, investments and experiences.

The Truth: Your decisions are tainted by the emotional investments you accumulate, and the more you invest in something the harder it becomes to abandon it.

Over the past few weeks Ashok and I have been doing some exploration of one of our client’s data by modelling it in a neo4j graph and seeing what interesting things the traversals reveal.

Taking his own advice reduced a load time from 20 to 2 minutes.

Worth your time to read and consider.

Five User Experience Lessons from Tom Cruise

Filed under: Interface Research/Design,Usability,Users — Patrick Durusau @ 3:24 pm

Five User Experience Lessons from Tom Cruise by Steve Tengler.

From the post:

As previously said best by Steve Jobs, “The broader one’s understanding of the human experience, the better designs we will have.” And the better the design, the more your company will thrive.

But how can we clarify some basics of User Experience for the masses? The easiest and obvious point of reference is pop culture; something to which we all can relate. My first inclination was to make this article “Five User Experience Lessons from Star Wars” since, at my core, I am a geek. But that’s like wearing a “KICK ME” sign at recess, so I thought better of it. Instead, I looked to a source of some surprisingly fantastic examples: movie characters played by Tom Cruise. I know, I’m playing up to my female readers, but hey, they represent 51% of the population … so I’m simply demonstrating that understanding your customer persona is part of designing a good user experience!

Tengler’s Five Lessons:

Lesson #1: Social Media Ratings of User Experiences Can Be Powerful

Lesson #2: Arrange Your User Interface around the Urgent Tasks

Lesson #3: Design Your System with a Multimodal Interface

Lesson #4: You Must Design For Human Error Upfront For Usability

Lesson #5: Style Captures the Attention

Whether you are a female reader or not, you will find the movie examples quite useful.

What actor/actress and movies would you choose for these principles?

Walk your users through the lessons and ask them to illustrate the lessons with movies they have seen.

A good way to break the ice for designing a user interface.

Pig Out to Hadoop (Replay) [Restore Your Faith in Webinars]

Filed under: Hadoop,Hortonworks,Pig — Patrick Durusau @ 3:08 pm

Pig Out to Hadoop with Alan Gates (Link to the webinar page at Hortonworks. Scroll down for this webinar. You have to register/login to view.)

From the description:

Pig has added some exciting new features in 0.10, including a boolean type, UDFs in JRuby, load and store functions for JSON, bloom filters, and performance improvements. Join Alan Gates, Hortonworks co-founder and long-time contributor to the Apache Pig and HCatalog projects, to discuss these new features, as well as talk about work the project is planning to do in the near future. In particular, we will cover how Pig can take advantage of changes in Hadoop 0.23.

I should have been watching more closely for this webinar recording to get posted.

Not only is it a great webinar on Pig, but it will restore your faith in webinars as a means of content delivery.

I have suffered through several lately where introductions took more time than actual technical content of the webinar.

Hard to know until you have already registered and spent time expecting substantive content.

Is there a public tally board for webinars on search, semantics, big data, etc.?

Analysis of Boolean Functions

Filed under: Boolean Functions,Mathematics — Patrick Durusau @ 2:39 pm

Analysis of Boolean Functions. Course by Ryan O’Donnell.

The course description:

Boolean functions, f : {0,1}^n → {0,1}, are perhaps the most basic object of study in theoretical computer science. They also arise in several other areas of mathematics, including combinatorics (graph theory, extremal combinatorics, additive combinatorics), metric and Banach spaces, statistical physics, and mathematical social choice.

In this course we will study Boolean functions via their Fourier transform and other analytic methods. Highlights will include applications in property testing, social choice, learning theory, circuit complexity, pseudorandomness, constraint satisfaction problems, additive combinatorics, hypercontractivity, Gaussian geometry, random graph theory, and probabilistic invariance principles.

If you look at the slides from Lecture One, 2007, you will see all the things that “boolean function” means across several disciplines.

It should also give you an incentive to keep up with the videos of the 2012 version.
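
For anyone who has not seen the machinery before, the starting point is the Fourier expansion of a Boolean function. This is the standard textbook formulation, not lifted from O’Donnell’s slides:

f(x) \;=\; \sum_{S \subseteq [n]} \hat{f}(S)\,\chi_S(x),
\qquad
\chi_S(x) \;=\; \prod_{i \in S} (-1)^{x_i},
\qquad
\hat{f}(S) \;=\; \frac{1}{2^n} \sum_{x \in \{0,1\}^n} f(x)\,\chi_S(x).

Property testing, learning theory and the other applications in the course all come down to reading information about f off the coefficients \hat{f}(S).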

Working More Effectively With Statisticians

Filed under: Bioinformatics,Biomedical,Data Quality,Statistics — Patrick Durusau @ 10:33 am

Working More Effectively With Statisticians by Deborah M. Anderson. (Fall 2012 Newsletter of Society for Clinical Data Management, pages 5-8)

Abstract:

The role of the clinical trial biostatistician is to lend scientific expertise to the goal of demonstrating safety and efficacy of investigative treatments. Their success, and the outcome of the clinical trial, is predicated on adequate data quality, among other factors. Consequently, the clinical data manager plays a critical role in the statistical analysis of clinical trial data. In order to better fulfill this role, data managers must work together with the biostatisticians and be aligned in their understanding of data quality. This article proposes ten specific recommendations for data managers in order to facilitate more effective collaboration with biostatisticians.

See the article for the details but the recommendations are generally applicable to all data collection projects:

Recommendation #1: Communicate early and often with the biostatistician and provide frequent data extracts for review.

Recommendation #2: Employ caution when advising sites or interactive voice/web recognition (IVR/IVW) vendors on handling of randomization errors.

Recommendation #3: Collect the actual investigational treatment and dose group for each subject.

Recommendation #4: Think carefully and consult the biostatistician about the best way to structure investigational treatment exposure and accountability data.

Recommendation #5: Clarify in electronic data capture (EDC) specifications whether a question is only a “prompt” screen or whether the answer to the question will be collected explicitly in the database.

Recommendation #6: Recognize the most critical data items from a statistical analysis perspective and apply the highest quality standards to them.

Recommendation #7: Be alert to protocol deviations/violations (PDVs).

Recommendation #8: Plan for a database freeze and final review before database lock.

Recommendation #9: Archive a snapshot of the clinical database at key analysis milestones and at the end of the study.

Recommendation #10: Educate yourself about fundamental statistical principles whenever the opportunity arises.

I first saw this at John Johnson’s Data cleaning is harder than statistical analysis.

The Cost of Strict Global Consistency [Or Rules for Eventual Consistency]

Filed under: Consistency,Database,Finance Services,Law,Law - Sources — Patrick Durusau @ 10:15 am

What if all transactions required strict global consistency? by Matthew Aslett.

Matthew quotes Basho CTO Justin Sheehy on eventual consistency and traditional accounting:

“Traditional accounting is done in an eventually-consistent way and if you send me a payment from your bank to mine then that transaction will be resolved in an eventually consistent way. That is, your bank account and mine will not have a jointly-atomic change in value, but instead yours will have a debit and mine will have a credit, each of which will be applied to our respective accounts.”

And Matthew comments:

The suggestion that bank transactions are not immediately consistent appears counter-intuitive. Comparing what happens in a transaction with a jointly atomic change in value, like buying a house, with what happens in normal transactions, like buying your groceries, we can see that for normal transactions this statement is true.

We don’t need to wait for the funds to be transferred from our accounts to a retailer before we can walk out the store. If we did we’d all waste a lot of time waiting around.

This highlights a couple of things that are true for both database transactions and financial transactions:

  • that eventual consistency doesn’t mean a lack of consistency
  • that different transactions have different consistency requirements
  • that if all transactions required strict global consistency we’d spend a lot of time waiting for those transactions to complete.

All of which is very true but misses an important point about financial transactions.

Financial transactions (involving banks, etc.) are eventually consistent according to the same rules.

That’s no accident. It didn’t just happen that banks adopted ad hoc rules that resulted in a uniform eventual consistency.

It didn’t happen overnight, but the current set of rules for “uniform eventual consistency” of banking transactions is spelled out by the Uniform Commercial Code. (And other laws and regulations, but the UCC is a major part of it.)

Dare we say a uniform semantic for financial transactions was hammered out without the use of formal ontologies or web addresses? And that it supports billions of transactions on a daily basis? To become eventually consistent?

Think about the transparency (to you) of your next credit card transaction. Standards and eventual consistency make that possible.
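
A toy sketch of Sheehy’s point, mine and not from either post, with invented account names: the debit and the credit are recorded as independent entries, and the two balances only agree after a later reconciliation pass.

from collections import defaultdict

ledgers = defaultdict(list)   # account -> list of applied entries
pending = []                  # entries recorded but not yet applied

def transfer(src, dst, amount):
    # Record a debit and a credit as separate entries; nothing is applied atomically.
    pending.append((src, -amount))
    pending.append((dst, +amount))

def reconcile():
    # Eventually apply every recorded entry to its account's ledger.
    while pending:
        account, amount = pending.pop(0)
        ledgers[account].append(amount)

def balance(account):
    return sum(ledgers[account])

transfer("your_bank", "my_bank", 100)
# Before reconcile() runs, the two balances do not yet reflect the transfer...
reconcile()
# ...but they converge once every entry has been applied.
print(balance("your_bank"), balance("my_bank"))   # -100 100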

September 22, 2012

Datasets! Datasets! Get Your Datasets Here!

Filed under: Data,Dataset — Patrick Durusau @ 3:59 pm

Datasets from René Pickhardt’s group:

The project KONECT (Koblenz Network Collection) has extracted and made available four new network datasets based on information in the English Wikipedia, using data from the DBpedia project. The four network datasets are: The bipartite network of writers and their works (113,000 nodes and 122,000 edges) The bipartite network of producers and the works they […]

Assume you have a knowledge base containing entities and their properties or relations with other entities. For instance, think of a knowledge base about movies, actors and directors. For the movies you have structured knowledge about the title and the year they were made in, while for the actors and directors you might have their […]

The Institute for Web Science and Technologies (WeST) at the University of Koblenz-Landau is making available a new series of datasets: The Wikipedia hyperlink networks in the eight largest Wikipedia languages: http://konect.uni-koblenz.de/networks/wikipedia_link_en – English http://konect.uni-koblenz.de/networks/wikipedia_link_de – German http://konect.uni-koblenz.de/networks/wikipedia_link_fr – French http://konect.uni-koblenz.de/networks/wikipedia_link_ja – Japanese http://konect.uni-koblenz.de/networks/wikipedia_link_it – Italian http://konect.uni-koblenz.de/networks/wikipedia_link_pt – Portuguese http://konect.uni-koblenz.de/networks/wikipedia_link_ru – Russian The largest dataset, […]

I found an article about ohloh, a directory created by Black Duck Software with over 500,000 open source projects. They offer a RESTful API and the data is available under the Creative Commons Attribution 3.0 licence. An interesting aspect is Kudos. With a Kudo, an ohloh user can thank another user for his or her contribution, so […]

I started to mention these earlier in the week but decided they needed a separate post.

Damerau-Levenshtein Edit Distance

Filed under: Damerau-Levenshtein Edit Distance,Edit Distance,Levenshtein Distance — Patrick Durusau @ 3:42 pm

Damerau-Levenshtein Edit Distance by Kevin Stern.

From the post:

The Damerau-Levenshtein distance admits all of the operations from the Levenshtein distance and further allows for swapping of adjacent characters, with the caveat that cost of two adjacent character swaps be at least the cost of a character deletion plus the cost of a character insertion (this caveat enables a fast dynamic programming solution to the problem). There is a sub-variant of the Damerau-Levenshtein distance known as the restricted edit distance which further specifies that no substring be modified more than once, which is primarily what I found when searching for algorithms for computing Damerau-Levenshtein distance, since, I presume, this sub-variant is a bit more straight forward to compute. In addition, I’ve had a difficult time finding a good explanation of the algorithm for computing the full Damerau-Levenshtein distance – hence, the motivation behind this blog post.

A variation on the Levenshtein edit distance algorithm that you may find useful.
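
As a companion to Stern’s post, here is a minimal Python sketch of the restricted variant (optimal string alignment) that he distinguishes from the full Damerau-Levenshtein distance. Unit costs are assumed; see his post for the unrestricted algorithm.

def osa_distance(a, b):
    # Restricted (optimal string alignment) edit distance with unit costs:
    # insertions, deletions, substitutions and adjacent transpositions, with
    # the restriction that no substring is edited more than once.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

print(osa_distance("ca", "abc"))   # 3 here; the full Damerau-Levenshtein distance is 2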

I first saw this at DZone.

Damn Cool Algorithms: Cardinality Estimation

Filed under: Algorithms,Cardinality Estimation — Patrick Durusau @ 3:12 pm

Damn Cool Algorithms: Cardinality Estimation by Nick Johnson.

From the post:

Suppose you have a very large dataset – far too large to hold in memory – with duplicate entries. You want to know how many duplicate entries, but your data isn’t sorted, and it’s big enough that sorting and counting is impractical. How do you estimate how many unique entries the dataset contains? It’s easy to see how this could be useful in many applications, such as query planning in a database: the best query plan can depend greatly on not just how many values there are in total, but also on how many unique values there are.

I’d encourage you to give this a bit of thought before reading onwards, because the algorithms we’ll discuss today are quite innovative – and while simple, they’re far from obvious.

Duplicate entries?

They are singing our song!

😉

I found this looking around for newer entries after stumbling on the older one.

Enjoy!
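
If you want to play with the core trick before reading the refinements, here is a crude Python sketch of my own, not Johnson’s code: hash each value, route it to a bucket by its low bits, and track the longest run of trailing zero bits per bucket. The raw combination overestimates by a constant factor; the LogLog/HyperLogLog variants discussed in the post multiply in a correction.

import hashlib

def trailing_zeros(n):
    # Number of trailing zero bits in n; cap the degenerate n == 0 case.
    if n == 0:
        return 64
    return (n & -n).bit_length() - 1

def estimate_cardinality(values, bucket_bits=10):
    # Crude distinct-count estimate: low hash bits pick a bucket, the rest of
    # the hash feeds the trailing-zeros estimator. No bias correction applied.
    num_buckets = 1 << bucket_bits
    max_zeros = [0] * num_buckets
    for v in values:
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        bucket = h & (num_buckets - 1)
        max_zeros[bucket] = max(max_zeros[bucket], trailing_zeros(h >> bucket_bits))
    mean = sum(max_zeros) / num_buckets
    return int(num_buckets * 2 ** mean)

# Toy check: 300,000 values, 100,000 of them distinct. Expect the right order
# of magnitude, biased high because the correction constant is omitted.
print(estimate_cardinality(i % 100000 for i in range(300000)))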

Damn Cool Algorithms: Levenshtein Automata

Filed under: Indexing,Levenshtein Distance — Patrick Durusau @ 3:06 pm

Damn Cool Algorithms: Levenshtein Automata by Nick Johnson.

From the post:

In a previous Damn Cool Algorithms post, I talked about BK-trees, a clever indexing structure that makes it possible to search for fuzzy matches on a text string based on Levenshtein distance – or any other metric that obeys the triangle inequality. Today, I’m going to describe an alternative approach, which makes it possible to do fuzzy text search in a regular index: Levenshtein automata.

Introduction

The basic insight behind Levenshtein automata is that it’s possible to construct a Finite state automaton that recognizes exactly the set of strings within a given Levenshtein distance of a target word. We can then feed in any word, and the automaton will accept or reject it based on whether the Levenshtein distance to the target word is at most the distance specified when we constructed the automaton. Further, due to the nature of FSAs, it will do so in O(n) time with the length of the string being tested. Compare this to the standard Dynamic Programming Levenshtein algorithm, which takes O(mn) time, where m and n are the lengths of the two input words! It’s thus immediately apparent that Levenshtein automata provide, at a minimum, a faster way for us to check many words against a single target word and maximum distance – not a bad improvement to start with!

Of course, if that were the only benefit of Levenshtein automata, this would be a short article. There’s much more to come, but first let’s see what a Levenshtein automaton looks like, and how we can build one.

Not recent but I think you will enjoy the post anyway.
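
To make the idea concrete, a small Python sketch of my own that simulates the Levenshtein NFA for a target word rather than compiling the explicit DFA the post describes. States are (characters of the target consumed, errors spent); the DFA compilation is what buys the O(n) guarantee.

def levenshtein_accepts(target, word, k):
    # Accepts iff the Levenshtein distance between word and target is <= k.

    def close(states):
        # Epsilon moves: deleting a character of the target costs one error.
        stack, closed = list(states), set(states)
        while stack:
            pos, err = stack.pop()
            if pos < len(target) and err < k and (pos + 1, err + 1) not in closed:
                closed.add((pos + 1, err + 1))
                stack.append((pos + 1, err + 1))
        return closed

    states = close({(0, 0)})
    for ch in word:
        step = set()
        for pos, err in states:
            if pos < len(target) and target[pos] == ch:
                step.add((pos + 1, err))              # match
            if err < k:
                step.add((pos, err + 1))              # insertion
                if pos < len(target):
                    step.add((pos + 1, err + 1))      # substitution
        states = close(step)
        if not states:
            return False
    return any(pos == len(target) for pos, _ in states)

print(levenshtein_accepts("banana", "bananas", 1))   # True  (one insertion)
print(levenshtein_accepts("banana", "bandana", 1))   # True  (one insertion)
print(levenshtein_accepts("banana", "bonobo", 1))    # False (too many edits)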

I first saw this at DZone.

Building a “Data Eye in the Sky”

Filed under: Intelligence,Prediction — Patrick Durusau @ 2:50 pm

Building a “Data Eye in the Sky” by Erwin Gianchandani.

From the post:

Nearly a year ago, tech writer John Markoff published a story in The New York Times about Open Source Indicators (OSI), a new program by the Federal government’s Intelligence Advanced Research Projects Activity (IARPA) seeking to automatically collect publicly available data, including Web search queries, blog entries, Internet traffic flows, financial market indicators, traffic webcams, changes in Wikipedia entries, etc., to understand patterns of human communication, consumption, and movement. According to Markoff:

It is intended to be an entirely automated system, a “data eye in the sky” without human intervention, according to the program proposal. The research would not be limited to political and economic events, but would also explore the ability to predict pandemics and other types of widespread contagion, something that has been pursued independently by civilian researchers and by companies like Google.

This past April, IARPA issued contracts to three research teams, providing funding potentially for up to three years, with continuation beyond the first year contingent upon satisfactory progress. At least two of these contracts are now public (following the link):

Erwin reviews what is known about programs at Virginia Tech and BBN Technologies.

And concludes with:

Each OSI research team is being required to make a number of warnings/alerts that will be judged on the basis of lead time, or how early the alert was made; the accuracy of the warning, such as the where/when/what of the alert; and the probability associated with the alert, that is, high vs. very high.

To learn more about the OSI program, check out the IARPA website or a press release issued by Virginia Tech.

Given the complexities of semantics, what has my curiosity up is how “warnings/alerts” are going to be judged?

Recalling that “all the lights were blinking red” before 9/11.

If all the traffic lights in the U.S. flashed three (3) times at the same time, without more, it could mean anything from the end of the Mayan calendar to free beer. One just never knows.

Do you have the stats on the oracle at Delphi?

Might be a good baseline for comparison.

Dancing With Dirty Data Thanks to SAP Visual Intelligence [Kinds of Dirty?]

Filed under: Identifiers,SAP,SAP Visual Intelligence — Patrick Durusau @ 2:19 pm

Dancing With Dirty Data Thanks to SAP Visual Intelligence by Timo Elliott.

From the post:

(graphic omitted)

Here’s my entry for the SAP Ultimate Data Geek Challenge, a contest designed to “show off your inner geek and let the rest of world know your data skills are second to none.” There have already been lots of great submissions with people using the new SAP Visual Intelligence data discovery product.

I thought I’d focus on one of the things I find most powerful: the ability to create visualizations quickly and easily even from real-life, messy data sources. Since it’s election season in the US, I thought I’d use some polling data on whether voters believe the country is “headed in the right direction.” There is lots of different polling data on this (and other topics) available at pollingreport.com.

Below you can see the data set I grabbed: as you can see, the polling date field is particularly messy, since it has extra letters (e.g. RV for “registered voter”), includes polls that were carried out over several days, and is not consistent (the month is not always included, sometimes spaces around the middle dash, sometimes not…).

Take a closer look at Timo’s definition of “dirty” data: “…polling date field is particularly messy, since it has extra letters (e.g. RV for “registered voter”), includes polls that were carried out over several days, and is not consistent….”

Sure, that’s “dirty” data all right, but only one form of dirty data. It is dirty data that arises from typographical inconsistency. Inconsistency that prevents reliable automated processing.

Another form of dirty data arises from identifier inconsistency. That is, one or more identifiers are used for the same subject, and/or the same identifier is used for different subjects.

I take the second form, identifier inconsistency, to be distinct from typographical inconsistency. The two can turn out to overlap, but conceptually I find it helpful to distinguish them.

Resolution of either form of inconsistency requires judgement about the reference being made by the identifiers.

Question: If you are resolving typographical inconsistency, do you keep a map of the resolution? If not, why not?

Question: Same questions for identifier inconsistency.
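
To make the question concrete, here is a small sketch with invented field values, not Timo’s data: normalize the messy polling-date strings but keep a map from each original string to its resolution, so the judgement calls stay auditable.

import re

raw_dates = ["9/14-16/12 RV", "Sept. 8-10, 2012", "9/7 - 9/9/12"]   # invented examples

resolution_map = {}   # original string -> (normalized value, note)

def normalize(field):
    # Strip annotations like 'RV', tighten spacing, and record the decision.
    cleaned = re.sub(r"\b(RV|LV|A)\b", "", field)      # drop population flags
    cleaned = re.sub(r"\s*-\s*", "-", cleaned).strip(" ,")
    resolution_map[field] = (cleaned, "dropped flags, tightened dashes")
    return cleaned

normalized = [normalize(d) for d in raw_dates]
print(normalized)
print(resolution_map["9/14-16/12 RV"])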

The Ultimate Data Geek Challenge

Filed under: Challenges,SAP,SAP Visual Intelligence — Patrick Durusau @ 1:59 pm

The Ultimate Data Geek Challenge by Nic Smith.

From the post:

Are You the Ultimate Data Geek?

The time has come to show off your inner geek and let the rest of world know your data skills are second to none.

We’re excited to announce the Ultimate Data Geek Challenge. Grab your data and share your visual creation in a video, screen capture, or blog post on the SCN. Once you enter, you’ll have a chance to be crowned the Ultimate Data Geek.

How Do I Enter?

It’s easy – just four simple steps:

Important note: Challenge entries will be accepted up until November 30, 2012, at 11:59 p.m. Pacific.

There are videos and other materials to help you learn SAP Visual Intelligence.

Another tool to find subjects and data about subjects. I haven’t looked at SAP Visual Intelligence so would appreciate a shout if you have.

I first saw this at: Dancing With Dirty Data Thanks to SAP Visual Intelligence

The Stages of Database Development (video)

Filed under: Database,Design — Patrick Durusau @ 1:32 pm

The Stages of Database Development (video) by Jeremiah Peschka.

The description:

Strong development practices don’t spring up overnight; they take time, effort, and teamwork. Database development practices are doubly hard because they involve many moving pieces – unit testing, integration testing, and deploying changes that could have potential side effects beyond changing logic. In this session, Microsoft SQL Server MVP Jeremiah Peschka will discuss ways users can move toward a healthy cycle of database development using version control, automated testing, and rapid deployment.

Nothing you haven’t heard before in one form or another.

Question: How does your database environment compare to the one Jeremiah describes?

(Never mind that you have “reasons” (read excuses) for the current state of your database environment.)

Doesn’t just happen with databases or even servers.

What about your topic map development environment?

Or other development environment.

Looking forward to a sequel (sorry) to this video.

Real-Time Twitter Search by @larsonite

Filed under: Indexing,Java,Relevance,Searching,Tweets — Patrick Durusau @ 1:18 pm

Real-Time Twitter Search by @larsonite by Marti Hearst.

From the post:

Brian Larson gives a brilliant technical talk about how real-time search works at Twitter. He really knows what he’s talking about given that he’s the tech lead for search and relevance at Twitter!

The coverage of real-time indexing, the Java memory model, and safe publication was particularly good.

As a bonus, also discusses relevance near the end of the presentation.

You may want to watch this more than once!

Brian recommends Java Concurrency in Practice by Brian Goetz as having good coverage of the Java memory model.

Gnip Introduces Historical PowerTrack for Twitter [Gnip Feed Misses What?]

Filed under: Semantics,Tweets — Patrick Durusau @ 4:31 am

Gnip Introduces Historical PowerTrack for Twitter

From the post:

Gnip, the largest provider of social data to the world, is launching Historical PowerTrack for Twitter, which makes available every public Tweet since the launch of Twitter in March of 2006.

People use Twitter to connect with and share information on the things they care about. To date, analysts have had incomplete access to historical Tweets. Starting today, companies can now analyze a full six years of discussion around their brands and product launches to better understand the impact of these conversations. Political reporters can compare Tweets around the 2008 Election to the activity we are seeing around this year’s Election. Financial firms can backtest their trading algorithms to model how incorporating Twitter data generates additional signal. Business Intelligence companies can incorporate six years of Tweets into their data offerings so their customers can identify correlation with key business metrics like inventory and revenue.

“We’ve been developing Historical PowerTrack for Twitter for more than a year,” said Chris Moody, President and COO of Gnip. “During our early access phase, we’ve given companies like Esri, Brandwatch, Networked Insights, Union Metrics, Waggener Edstrom Worldwide and others the opportunity to take advantage of this amazing new data. With today’s announcement, we’re making this data fully available to the entire data ecosystem.” (emphasis added)

Can you name one thing that Gnip’s “PowerTrack for Twitter” is not capturing?

Think about it for a minute. I am sure they have all the “text” of tweets, along with whatever metadata was in the stream.

So what is Gnip missing and cannot deliver to you?

In a word, semantics.

The one thing that makes one message valuable and another irrelevant.

Example: In a 1950’s episode of “I Love Lucy,” Lucy says to Ricky over the phone, “There’s a man here making passionate love to me.” Didn’t have the same meaning in the 1950’s as it does now (and Ricky was in on the joke).

A firehose of tweets may be impressive, but so is an open fire plug in the summer.

Without direction (read semantics), the water just runs off into the sewer.

September 21, 2012

The 2012 ACM Computing Classification System toc

Filed under: Classification,Ontology — Patrick Durusau @ 7:39 pm

The 2012 ACM Computing Classification System toc

From the post:

The 2012 ACM Computing Classification System has been developed as a poly-hierarchical ontology that can be utilized in semantic web applications. It replaces the traditional 1998 version of the ACM Computing Classification System (CCS), which has served as the de facto standard classification system for the computing field. It is being integrated into the search capabilities and visual topic displays of the ACM Digital Library. It relies on a semantic vocabulary as the single source of categories and concepts that reflect the state of the art of the computing discipline and is receptive to structural change as it evolves in the future. ACM will provide tools to facilitate the application of 2012 CCS categories to forthcoming papers and a process to ensure that the CCS stays current and relevant. The new classification system will play a key role in the development of a people search interface in the ACM Digital Library to supplement its current traditional bibliographic search.

The full CCS classification tree is freely available for educational and research purposes in these downloadable formats: SKOS (xml), Word, and HTML. In the ACM Digital Library, the CCS is presented in a visual display format that facilitates navigation and feedback.

I will be looking at how the classification has changed since 1998. And since we have so much data online, it should not be all that hard to see how well the 1998 categories work for 1988, or 1977.

All for a classification that is “current and relevant.”

Still, don’t want papers dropping off the edge of the semantic world due to changes in classification.
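
If you download the SKOS version, here is a minimal rdflib sketch for listing concepts and their preferred labels. The filename is my guess; substitute whatever the ACM download is actually called.

from rdflib import Graph
from rdflib.namespace import RDF, SKOS

g = Graph()
g.parse("acm_ccs2012.xml", format="xml")    # placeholder filename for the SKOS download

# Each CCS category is a skos:Concept with a preferred label.
for concept in g.subjects(RDF.type, SKOS.Concept):
    for label in g.objects(concept, SKOS.prefLabel):
        print(concept, label)

# skos:broader links carry the poly-hierarchy the announcement mentions.
print(len(list(g.subject_objects(SKOS.broader))), "broader links")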

WeST researcher’s summary of SocialCom 2012 in Amsterdam

Filed under: Graphity,Graphs,Networks — Patrick Durusau @ 7:26 pm

WeST researcher’s summary of SocialCom 2012 in Amsterdam by René Pickhardt.

René has a new principal blogging site and reports on SocialCom 2012.

In the beginning of this month I was attending my first major conference, IEEE SocialCom 2012, in Amsterdam. I was presenting my work Graphity. On the following URL you can find the slides, videos, data sets, source code and of course the full paper!

www.rene-pickhardt.de/graphity

In this article I want to talk about the conference itself. About what presentations I particularly liked and share some impressions.

I got distracted by the Graphity paper but promise I will read the rest of René’s comments on the conference this weekend!

Easy and customizable maps with TileMill

Filed under: Mapping,Maps — Patrick Durusau @ 7:18 pm

Easy and customizable maps with TileMill by Nathan Yau.

From the post:

I’m late to this party. TileMill, by mapping platform MapBox, is open source software that lets you quickly and easily create and edit maps. It’s available for OS X, Windows, and Ubuntu. Just download and install the program, and then load a shapefile for your point of interest.

For those unfamiliar with shapefiles, it’s a file format that describes geospatial data, such as polygons (e.g. countries), lines (e.g. roads), and points (e.g. landmarks), and they’re pretty easy to find these days. For example, you can download detailed shapefiles for roads, bodies of water, and blocks in the United States from the Census Bureau in just a few clicks.

Very cool!

Makes me wonder about shapefiles and relating information to them as information products.

You can download a road shapefile but does it include the road blocking accidents for the last five (5) years?
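
For the curious, a minimal sketch of inspecting a downloaded shapefile before handing it to TileMill, using geopandas. The filename and any column names are placeholders for whatever the Census download actually contains.

import geopandas as gpd

# A roads shapefile downloaded from the Census Bureau (placeholder path).
roads = gpd.read_file("tl_2012_us_primaryroads.shp")

print(roads.crs)        # coordinate reference system
print(roads.columns)    # attribute fields that ship with the geometry
print(roads.head())     # first few road segments

# Joining your own data onto these shapes (accidents, closures, ...) is where
# the "information product" question above comes in.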

RDF triple stores — an overview

Filed under: RDF — Patrick Durusau @ 4:26 pm

RDF triple stores — an overview by Lars Marius Garshol.

From the post:

There’s a huge range of triple stores out there, and it’s not trivial to find the one most suited for your exact needs. I reviewed all those I could find earlier this year for a project, and here is the result. I’ve evaluated the stores against the requirements that mattered for that particular project. I haven’t summarized the scores, as everyone’s weights for these requirements will be different.

I’ve deliberately left out rows for whether these tools support things like R2RML, query federation, data binding, SDshare, and so on, even though many of them do. The rationale is that if you pick a triple store that doesn’t support these things you can get support anyway through separate components.

I’ve also deliberately left out cloud-only offerings, as I feel these are a different type of product from the databases you can install and maintain locally.

If you are looking for an RDF triple store, check the post for the full table.

I first saw this at SemanticWeb.com.

