Another Word For It
Patrick Durusau on Topic Maps and Semantic Diversity

January 19, 2012

Linked Ancient World Data Institute

Filed under: Ancient World,Conferences — Patrick Durusau @ 7:35 pm

Linked Ancient World Data Institute

From the webpage:

Applications due February 17

New York University’s Institute for the Study of the Ancient World (ISAW) will host the Linked Ancient World Data Institute (LAWDI) from May 31st to June 2nd, 2012 in New York City. “Linked Open Data” is an approach to the creation of digital resources that emphasizes connections between diverse information on the basis of published and stable web addresses (URIs) that identify common concepts and individual items. LAWDI, funded by the Office of Digital Humanities of the National Endowment for Humanities, will bring together an international faculty of practitioners working in the field of Linked Data with twenty attendees who are implementing or planning the creation of digital resources.

LAWDI’s intellectual scope is the Ancient Mediterranean and Ancient Near East, two fields in which a large and increasing number of digital resources is available, with rich coverage of the archaeology, literature and history of these regions. Many of these resources publish stable URIs for their content and so are enabling links and re-use that create a varied research and publication environment. LAWDI attendees will learn how to take advantage of these resources and also how to contribute to the growing network of linked scholarly materials.

The organizers encourage applications from faculty, university staff, graduate students, librarians, museum professionals, archivists and others with a serious interest in creating digital resources for the study of the Ancient World. Applications to attend should take the form of an attached (MS-Word, PDF or other common format) one-page statement of interest e-mailed to <sebastian.heath@nyu.edu> by Friday, February 17. A discussion of current or planned work should be a prominent part of this statement. As part of the curriculum, successful applicants will be asked to present their work and be ready to actively participate in conversations about topics presented by faculty and the other participants.

The announcement for LAWDI is here and the organizers are grateful for any circulation of this information.

A second session of LAWDI will also take place from May 30 to June 1 of 2013 at Drew University in New Jersey (http://drew.edu).

If you know of anyone who would be interested, please forward them a link to this post.

RDF silos

Filed under: Linked Data,RDF — Patrick Durusau @ 7:35 pm

Bibliographic Framework: RDF and Linked Data

Karen Coyle writes:

With the newly developed enthusiasm for RDF as the basis for library bibliographic data we are seeing a number of efforts to transform library data into this modern, web-friendly format. This is a positive development in many ways, but we need to be careful to make this transition cleanly without bringing along baggage from our past.

Recent efforts have focused on translating library record formats into RDF with the result that we now have:
    ISBD in RDF
    FRBR in RDF
    RDA in RDF

and will soon have
    MODS in RDF

In addition there are various applications that convert MARC21 to RDF, although none is “official.” That is, none has been endorsed by an appropriate standards body.

Each of these efforts takes a single library standard and, using RDF as its underlying technology, creates a full metadata schema that defines each element of the standard in RDF. The result is that we now have a series of RDF silos, each defining data elements as if they belong uniquely to that standard. We have, for example, at least four different declarations of “place of publication”: in ISBD, RDA, FRBR and MODS, each with its own URI. There are some differences between them (e.g. RDA separates place of publication, manufacture, production while ISBD does not) but clearly they should descend from a common ancestor:
(emphasis added)

Karen makes a very convincing argument about RDF silos and libraries.

I am less certain about her prescription that libraries concentrate on creating data and build records for that data separately.

In part because there aren’t any systems where data exists separate from either an implied or explicit structure to access it. And those structures are just as much “data” as the “data” they enclose. We may not often think of it that way, but that shortcoming on our part doesn’t change either our data or the “data” that encloses it.
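
To make the silo problem concrete, here is a minimal sketch in Java of the kind of crosswalk a consumer of these vocabularies ends up writing: every silo-specific “place of publication” property gets mapped onto a single canonical key before records can be compared or merged. The property URIs below are placeholders, not the actual ISBD/RDA/FRBR/MODS URIs.

    import java.util.HashMap;
    import java.util.Map;

    // Sketch only: collapse vocabulary-specific "place of publication" properties
    // onto one canonical key so records from different silos can be compared.
    public class PlaceOfPublicationCrosswalk {

        // Each silo declares its own URI for what is conceptually the same property.
        private static final Map<String, String> CANONICAL = new HashMap<>();
        static {
            CANONICAL.put("http://example.org/isbd/placeOfPublication", "placeOfPublication");
            CANONICAL.put("http://example.org/rda/placeOfPublication",  "placeOfPublication");
            CANONICAL.put("http://example.org/frbr/placeOfPublication", "placeOfPublication");
            CANONICAL.put("http://example.org/mods/placeOfPublication", "placeOfPublication");
        }

        // Rewrite a record's properties onto canonical keys.
        public static Map<String, String> normalize(Map<String, String> record) {
            Map<String, String> out = new HashMap<>();
            for (Map.Entry<String, String> e : record.entrySet()) {
                out.put(CANONICAL.getOrDefault(e.getKey(), e.getKey()), e.getValue());
            }
            return out;
        }

        public static void main(String[] args) {
            Map<String, String> isbdRecord = new HashMap<>();
            isbdRecord.put("http://example.org/isbd/placeOfPublication", "London");
            System.out.println(normalize(isbdRecord)); // {placeOfPublication=London}
        }
    }

Every consuming application that wants to cross the silos has to maintain some version of that table, which is exactly the cost Karen is warning about.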

Guided Exploration = Faceted Search, Backwards

Filed under: Faceted Search,Guided Exploration — Patrick Durusau @ 7:33 pm

Guided Exploration = Faceted Search, Backwards by Daniel Tunkelang.

Daniel starts off:

Information Scent

In the early 1990s, PARC researchers Peter Pirolli and Stuart Card developed the theory of information scent (more generally, information foraging) to evaluate user interfaces in terms of how well users can predict which paths will lead them to useful information. Like many HCIR researchers and practitioners, I’ve found this model to be a useful way to think about interactive information seeking systems.

Specifically, faceted search is an exemplary application of the theory of information scent. Faceted search allows users to express an information need as a keyword search, providing them with a series of opportunities to improve the precision of the initial result set by restricting it to results associated with particular facet values.

For example, if I’m looking for folks to hire for my team, I can start my search on LinkedIn with the keywords [information retrieval], restrict my results to Location: San Francisco Bay Area, and then further restrict to School: CMU.

But quickly comes to:

Guided exploration exchanges the roles of precision and recall. Faceted search starts with high recall and helps users increase precision while preserving as much recall as possible. In contrast, guided exploration starts with high precision and helps users increase recall while preserving as much precision as possible.

That sounds great in theory, but how can we implement guided exploration in practice?

A very interesting look at how to expand a result set and maintain precision at the same time.
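
To make Daniel’s contrast concrete, here is a minimal Java sketch. The Doc type, the facets and the sample data are invented for illustration; faceted search starts broad and narrows by facet value, while guided exploration starts from a seed set and widens to items sharing a facet value with the seeds.

    import java.util.*;
    import java.util.stream.Collectors;

    // Sketch only: the documents and facets are invented.
    public class FacetedVsGuided {

        record Doc(String title, Map<String, String> facets) {}

        // Faceted search: start with high recall (keyword match), then raise
        // precision by restricting to a facet value.
        static List<Doc> facetedSearch(List<Doc> corpus, String keyword,
                                       String facet, String value) {
            return corpus.stream()
                    .filter(d -> d.title().toLowerCase().contains(keyword))
                    .filter(d -> value.equals(d.facets().get(facet)))
                    .collect(Collectors.toList());
        }

        // Guided exploration: start with high precision (a seed set), then raise
        // recall by pulling in documents sharing a facet value with any seed.
        static List<Doc> guidedExploration(List<Doc> corpus, List<Doc> seeds, String facet) {
            Set<String> seedValues = seeds.stream()
                    .map(d -> d.facets().get(facet))
                    .filter(Objects::nonNull)
                    .collect(Collectors.toSet());
            return corpus.stream()
                    .filter(d -> seedValues.contains(d.facets().get(facet)))
                    .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            List<Doc> corpus = List.of(
                    new Doc("information retrieval engineer", Map.of("location", "SF Bay Area")),
                    new Doc("information retrieval researcher", Map.of("location", "Pittsburgh")),
                    new Doc("search quality analyst", Map.of("location", "SF Bay Area")));

            System.out.println(facetedSearch(corpus, "information retrieval", "location", "SF Bay Area"));
            System.out.println(guidedExploration(corpus, corpus.subList(0, 1), "location"));
        }
    }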

Of particular interest for anyone who wants to implement dynamic merging of proxies based on subject similarity.

An open field of research that offers a number of exciting possibilities.

January 18, 2012

Spreadsheet -> Topic Maps: Wrong Direction?

Filed under: Marketing,Spreadsheets,Topic Maps — Patrick Durusau @ 8:01 pm

After reading BI’s Dirty Secrets – Why Business People are Addicted to Spreadsheets and the post it points to, I started to wonder if the spreadsheet -> topic maps path runs in the wrong direction.

For example, Spreadsheet Data Connector Released bills itself:

This project contains an abstract layer on top of the Apache POI library. This abstraction layer provides the Spreadsheet Query Language – eXql and additional method to access spreadsheets. The current version is designed to support the XLS and XLSX format of Microsoft© Excel® files.

The Spreadsheet Data Connector is well suited for all use cases where you have to access data in Excel sheets and you need a sophisticated language to address and query the data.

Do you remember “Capt. Wrongway Peachfuzz” from Bullwinkle? That is what this sounds like to me.

You are much more likely to be in Excel and to need the subject identity/merging capabilities of topic maps there. I won’t guess at the ratio of trips to Excel versus trips to topic maps; it’s too embarrassing.

If the right direction is topic maps -> spreadsheet, where should we locate the subject identity/merging capabilities?

What about configurable connectors that accept specification of data sources and subject identity/merging tests?

The BI user sees the spreadsheet just as they always have, as a UI.
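
As a rough sketch of what such a connector might look like (plain Java, no Apache POI; the sources, column names and identity test are invented), think of three parts: the data sources, a pluggable subject identity/merging test, and merged rows handed back to whatever writes the sheet the BI user already works in.

    import java.util.*;
    import java.util.function.BiPredicate;

    // Sketch of a "configurable connector": data sources plus a pluggable
    // subject identity/merging test. Everything here is invented for illustration.
    public class SpreadsheetConnector {

        // A row is just column name -> value.
        static List<Map<String, String>> merge(List<Map<String, String>> left,
                                               List<Map<String, String>> right,
                                               BiPredicate<Map<String, String>, Map<String, String>> sameSubject) {
            List<Map<String, String>> merged = new ArrayList<>();
            for (Map<String, String> l : left) {
                Map<String, String> row = new LinkedHashMap<>(l);
                for (Map<String, String> r : right) {
                    if (sameSubject.test(l, r)) {
                        row.putAll(r);      // same subject: combine the properties
                    }
                }
                merged.add(row);
            }
            return merged;
        }

        public static void main(String[] args) {
            List<Map<String, String>> crm = List.of(
                    Map.of("cust_no", "42", "name", "ACME Corp."));
            List<Map<String, String>> billing = List.of(
                    Map.of("customer_id", "42", "balance", "1200.00"));

            // The identity test is supplied as configuration, not baked into the connector.
            BiPredicate<Map<String, String>, Map<String, String>> sameSubject =
                    (a, b) -> Objects.equals(a.get("cust_no"), b.get("customer_id"));

            // These merged rows would be written out as a sheet for the BI user.
            System.out.println(merge(crm, billing, sameSubject));
        }
    }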

Sounds plausible to me. How does it sound to you?

Neo4j Challenge – Seed the Cloud

Filed under: Contest,Graphs,Neo4j — Patrick Durusau @ 7:59 pm

Neo4j Challenge

Important Dates: January 18 – February 13, 2012

From the challenge webpage:

Challenge: Seed the Cloud

Join Neo4j on Heroku, then help others get started by creating a Heroku-ready template or demo application using Neo4j.

The best project templates will win recognition and prizes. Use any language, any framework, with Neo4j!

  1. Create a Project using the Neo4j Add-on
  2. Share the Project as a Template on Gensen
  3. Win a place in the clouds (and cool prizes)

Neo4j has thrown down their gage. Will you be the one that picks it up?

Amazon DynamoDB

Filed under: Amazon DynamoDB,Amazon Web Services AWS — Patrick Durusau @ 7:58 pm

Amazon DynamoDB – a Fast and Scalable NoSQL Database Service Designed for Internet Scale Applications by Werner Vogels.

From the post:

Today is a very exciting day as we release Amazon DynamoDB, a fast, highly reliable and cost-effective NoSQL database service designed for internet scale applications. DynamoDB is the result of 15 years of learning in the areas of large scale non-relational databases and cloud services. Several years ago we published a paper on the details of Amazon’s Dynamo technology, which was one of the first non-relational databases developed at Amazon. The original Dynamo design was based on a core set of strong distributed systems principles resulting in an ultra-scalable and highly reliable database system. Amazon DynamoDB, which is a new service, continues to build on these principles, and also builds on our years of experience with running non-relational databases and cloud services, such as Amazon SimpleDB and Amazon S3, at scale. It is very gratifying to see all of our learning and experience become available to our customers in the form of an easy-to-use managed service.

Amazon DynamoDB is a fully managed NoSQL database service that provides fast performance at any scale. Today’s web-based applications often encounter database scaling challenges when faced with growth in users, traffic, and data. With Amazon DynamoDB, developers scaling cloud-based applications can start small with just the capacity they need and then increase the request capacity of a given table as their app grows in popularity. Their tables can also grow without limits as their users store increasing amounts of data. Behind the scenes, Amazon DynamoDB automatically spreads the data and traffic for a table over a sufficient number of servers to meet the request capacity specified by the customer. Amazon DynamoDB offers low, predictable latencies at any scale. Customers can typically achieve average service-side latencies in the single-digit milliseconds. Amazon DynamoDB stores data on Solid State Drives (SSDs) and replicates it synchronously across multiple AWS Availability Zones in an AWS Region to provide built-in high availability and data durability.

Impressive numbers and I am sure this is impressive software.

Two questions: Werner starts off talking about “internet scale” and then in the second paragraph says there is “…fast performance at any scale.”

Does anybody know what “internet scale” means? If they said U.S. Census scale, where I know software has been developed for record linkage on billion row tables, then I might have some idea of what is meant. If you know or can point to someone who does, please comment.

Second question: So if I need the Amazon DynamoDB because it handles “internet scale,” why would I need it for something less? My wife needs a car to go back and forth to work, but that doesn’t mean she needs a Hummer. Yes? I would rather choose a tool that is fit for the intended purpose. If you know a sensible break point for choosing the Amazon DynamoDB, please comment.

Disclosure: I buy books and other stuff at Amazon. But I don’t think my purchases past, present or future have influenced my opinions in this post. 😉

First seen at: myNoSQL as: Amazon DynamoDB – a Fast and Scalable NoSQL Database Service Designed for Internet Scale Applications.

Compact Binary Relation Representations with Rich Functionality

Filed under: Algorithms,Binary Relations,Graphs,Trees,Wavelet Trees — Patrick Durusau @ 7:57 pm

Compact Binary Relation Representations with Rich Functionality by Jérémy Barbay, Francisco Claude, Gonzalo Navarro.

Abstract:

Binary relations are an important abstraction arising in many data representation problems. The data structures proposed so far to represent them support just a few basic operations required to fit one particular application. We identify many of those operations arising in applications and generalize them into a wide set of desirable queries for a binary relation representation. We also identify reductions among those operations. We then introduce several novel binary relation representations, some simple and some quite sophisticated, that not only are space-efficient but also efficiently support a large subset of the desired queries.

Read the introduction (runs about two of the thirty-two pages) and tell me you aren’t interested. Go ahead.
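
If the abstract feels dry, here is a toy Java baseline for the kind of operations the paper generalizes. It is nothing like the authors’ compact structures, just a naive boolean-matrix representation that makes the query set concrete: row access, column access and counting over a binary relation.

    import java.util.BitSet;

    // Naive baseline for a binary relation between n "objects" and m "labels":
    // one BitSet per object. The paper's structures answer the same queries in
    // far less space; this only illustrates what the queries are.
    public class BinaryRelation {
        private final BitSet[] rows;   // rows[i].get(j) == true <=> object i relates to label j

        BinaryRelation(int objects, int labels) {
            rows = new BitSet[objects];
            for (int i = 0; i < objects; i++) rows[i] = new BitSet(labels);
        }

        void relate(int object, int label) { rows[object].set(label); }

        // "Which labels does object i have?" (row access)
        BitSet labelsOf(int object) { return (BitSet) rows[object].clone(); }

        // "Which objects carry label j?" (column access)
        BitSet objectsWith(int label) {
            BitSet result = new BitSet(rows.length);
            for (int i = 0; i < rows.length; i++) if (rows[i].get(label)) result.set(i);
            return result;
        }

        // "How many pairs are in the relation?"
        int pairCount() {
            int count = 0;
            for (BitSet row : rows) count += row.cardinality();
            return count;
        }

        public static void main(String[] args) {
            BinaryRelation r = new BinaryRelation(3, 4);
            r.relate(0, 1); r.relate(2, 1); r.relate(2, 3);
            System.out.println(r.objectsWith(1)); // {0, 2}
            System.out.println(r.pairCount());    // 3
        }
    }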

Just the start of what looks like a very interesting line of research.

Flake: A Decentralized, K-Ordered Unique ID Generator in Erlang

Filed under: Erlang,Identifiers — Patrick Durusau @ 7:54 pm

Flake: A Decentralized, K-Ordered Unique ID Generator in Erlang

From the post:

At Boundary we have developed a system for unique id generation. This started with two basic goals:

  • Id generation at a node should not require coordination with other nodes.
  • Ids should be roughly time-ordered when sorted lexicographically. In other words they should be k-ordered [1, 2].

All that is required to construct such an id is a monotonically increasing clock and a location [3]. K-ordering dictates that the most-significant bits of the id be the timestamp. UUID-1 contains this information, but arranges the pieces in such a way that k-ordering is lost. Still other schemes offer k-ordering with either a questionable representation of ‘location’ or one that requires coordination among nodes.

Just in case you are looking for a decentralized source of K-ordered unique IDs. 😉
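
If you want to see why such ids sort roughly by creation time, here is a minimal Java sketch of the general scheme: timestamp in the most-significant bits, then a node/location component, then a per-millisecond sequence. It follows the idea described above, not Boundary’s actual Erlang implementation, bit layout or failure handling.

    import java.math.BigInteger;

    // Sketch of a k-ordered id: the most-significant bits are the timestamp, so
    // ids sort roughly by creation time; a node id and a sequence keep them unique.
    public class KOrderedIdSketch {
        private final long nodeId;     // stands in for the "location" component
        private long lastMillis = -1;
        private int sequence = 0;

        KOrderedIdSketch(long nodeId) { this.nodeId = nodeId; }

        synchronized BigInteger next() {
            long now = System.currentTimeMillis();
            if (now == lastMillis) {
                sequence++;            // several ids within one millisecond
            } else {
                lastMillis = now;
                sequence = 0;
            }
            // timestamp | nodeId (48 bits) | sequence (16 bits)
            return BigInteger.valueOf(now).shiftLeft(64)
                    .or(BigInteger.valueOf(nodeId).shiftLeft(16))
                    .or(BigInteger.valueOf(sequence));
        }

        public static void main(String[] args) {
            KOrderedIdSketch gen = new KOrderedIdSketch(0x1234L);
            System.out.println(gen.next());
            System.out.println(gen.next()); // numerically (and roughly time-) ordered
        }
    }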

First seen at: myNoSQL as: Flake: A Decentralized, K-Ordered Unique ID Generator in Erlang.

Neo4j on Heroku (pts. 1, 2 & 3)

Filed under: Graphs,Heroku,Neo4j,Ruby — Patrick Durusau @ 7:53 pm

Neo4j on Heroku Part 1 starts out:

On his blog Marko A. Rodriguez showed us how to make A Graph-Based Movie Recommender Engine with Gremlin and Neo4j.

In this two part series, we are going to take his work from the Gremlin shell and put it on the web using the Heroku Neo4j add-on and altering the Neovigator project for our use case. Heroku has a great article on how to get an example Neo4j application up and running on their Dev Center and Michael Hunger shows you how to add JRuby extensions and provides sample code using the Neo4j.rb Gem by Andreas Ronge.

We are going to follow their recipe, but we are going to add a little spice. Instead of creating a small 2 node, 1 relationship graph, I am going to show you how to leverage the power of Gremlin and Groovy to build a much larger graph from a set of files.

Neo4j on Heroku Part 2 starts out:

We are picking up where we left off on Neo4j on Heroku – Part One, so make sure you’ve read it or you’ll be a little lost. So far, we have cloned the Neoflix project, set up our Heroku application and added the Neo4j add-on to our application. We are now ready to populate our graph.

CAUTION: Part 2 populates the graph with over one million relationships! If you are looking for trivial uses of Neo4j, you had better stop here in part 2.

Neo4j on Heroku Part 3 starts out:

This week we learned that leaving the create_graph method accessible to the world was a bad idea. So let’s go ahead and delete that route in Sinatra, and instead create a Rake Task for it.

And announces the Neo4j Challenge!

Thanks Max De Marzi!

Statistics 110: Introduction to Probability

Filed under: Mathematics,Statistics — Patrick Durusau @ 7:52 pm

Statistics 110: Introduction to Probability by Joseph Blitzstein.

Description:

Statistics 110 (Introduction to Probability), taught at Harvard University by Joe Blitzstein in Fall 2011. Lecture videos, homework, review material, practice exams, and a large collection of practice problems with detailed solutions are provided. This course is an introduction to probability as a language and set of tools for understanding statistics, science, risk, and randomness. The ideas and methods are useful in statistics, science, philosophy, engineering, economics, finance, and everyday life. Topics include the following. Basics: sample spaces and events, conditional probability, Bayes’ Theorem. Random variables and their distributions: cumulative distribution functions, moment generating functions, expectation, variance, covariance, correlation, conditional expectation. Univariate distributions: Normal, t, Binomial, Negative Binomial, Poisson, Beta, Gamma. Multivariate distributions: joint, conditional, and marginal distributions, independence, transformations, Multinomial, Multivariate Normal. Limit theorems: law of large numbers, central limit theorem. Markov chains: transition probabilities, stationary distributions, reversibility, convergence.

Like Michael Heise, I haven’t watched the lectures but I would appreciate hearing comments from anyone who does.

Particularly in an election year where people are going to be using (mostly abusing) statistics to influence your vote in city, county (parish in Louisiana), state and federal elections.

First seen at Statistics via iTunes by Michael Heise.

Hadoop World 2011 Videos and Slides Available

Filed under: Cloudera,Conferences,Hadoop — Patrick Durusau @ 7:51 pm

Hadoop World 2011 Videos and Slides Available

From the post:

Last November in New York City, Hadoop World, the largest conference of Apache Hadoop practitioners, developers, business executives, industry luminaries and innovative companies took place. The enthusiasm for the possibilities in Big Data management and analytics with Hadoop was palpable across the conference. Cloudera CEO, Mike Olson, eloquently summarizes Hadoop World 2011 in these final remarks.

Those who attended Hadoop World know how difficult navigating a route between two days of five parallel tracks of compelling content can be—particularly since Hadoop World 2011 consisted of sixty-five informative sessions about Hadoop. Understanding that it is nearly impossible to obtain and/or retain all the valuable information shared live at the event, we have compiled all the Hadoop World presentation slides and videos for perusing, sharing and for reference at your convenience. You can turn to these resources for technical Hadoop help and real-world production Hadoop examples, as well as information about advanced data science analytics.

Comments if you attended or suggestions of which ones to watch first?

BI’s Dirty Secrets – Why Business People are Addicted to Spreadsheets

Filed under: Business Intelligence,Marketing,Spreadsheets — Patrick Durusau @ 7:51 pm

BI’s Dirty Secrets – Why Business People are Addicted to Spreadsheets by Rick Sherman.

Microsoft Excel spreadsheets are the top BI tool of choice. That choking sound you hear is vendors and IT people reacting viscerally when they confront this fact. Their responses include:

  • Business people are averse to change; they don’t want to invest time in learning a new tool
  • Business people don’t understand that BI tools such as dashboards are more powerful than spreadsheets; they’re foolish not to use them
  • Spreadsheets are filled with errors
  • Spreadsheets are from hell

IDC estimated that the worldwide spend on business analytics in 2011 was $90 billion. Studies have found that many firms have more than one BI tool in use, and often more than six BI tools. Yet a recent study found that enterprises have been “stuck” at about a 25% adoption rate of BI tools by business people for a few years.

So why have adoption rates flatlined in enterprises that have had these tools for a while? Are the pundits correct in saying that business people are averse to change, lazy or just ignorant of how wonderful BI tools are?

The answers are very different if you put yourself in the business person’s position.

Read Rick’s blog to see what business people think about changing from spreadsheets.

Have you ever heard the saying: If you can’t lick ’em, join ’em?

There have been a number of presentations/papers on going from spreadsheets to XTM topic maps.

I don’t recall any papers that address adding topic map capabilities to spreadsheets. Do you?

Seems to me the question is:

Should topic maps try for a percentage of the 25% slice of the BI pie (against other competing tools) or try for a percentage of the 75% of the BI pie owned by spreadsheets?

To avoid the dreaded pie chart, I make images of the respective market shares, one three times the size of the other:

[Image: BI Market Shares]

Question: If you could only have 3% of a market, which market would you pick?*
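
For the arithmetic, treating the $90 billion worldwide business analytics spend quoted above as the whole pie and the 25%/75% adoption split as the shares: 3% of the spreadsheet slice is 0.03 × 0.75 × $90B ≈ $2.0 billion, while 3% of the BI-tool slice is 0.03 × 0.25 × $90B ≈ $0.7 billion.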

See, you are on your way to being a topic map maven and a successful entrepreneur.


* Any resemblance to a question on any MBA exam is purely coincidental.

January 17, 2012

Why Semantic Web Software Must Be Easy(er) to Use

Filed under: Semantic Web — Patrick Durusau @ 8:23 pm

Why Semantic Web Software Must Be Easy(er) to Use

Lee Feigenbaum of Cambridge Semantics writes:

Over on my personal blog, I’ve written a couple of posts that outline two key thoughts on the transformative effects that Semantic Web technologies can have in the enterprise:

There’s a key corollary of these two observations that you need to keep in mind when building, browsing, or buying Semantic Web software. Semantic Web software must be easy to use.

On the surface, this sounds a bit trite. Surely we should demand that all software be easy to use, right? While ease of use is clearly an important goal in software design in general, I’d argue that it’s absolutely crucial to successfully realizing the value from Semantic Web software….

I think Lee has a point that software, in this case Semantic Web software, needs to be easy to use.

It isn’t that hard to come up with parallel examples from W3C specs. Take XML for example. Sure, there are DocBook users, but compare the number of XML users counted as DocBook users with the number counted as users of OpenOffice, LibreOffice, KOffice or MS Word. The latter wins by several orders of magnitude. Why? Because those applications offer far easier interfaces for authoring XML than any that exist for DocBook.

Where I disagree with Lee is where he claims:

The point of semantic web tech is not that it’s revolutionary – it’s not cold fusion, interstellar flight, quantum computing – it’s an evolutionary advantage – you could do these projects with traditional techs but they’re just hard enough to be impractical, so IT shops don’t – that’s what’s changing here. Once the technologies and tools are good enough to turn “no-go” into “go”, you can start pulling together the data in your department’s 3 key databases; you can start automating data exchange between your group and a key supply-chain partner; you can start letting your line-of-business managers define their own visualizations, reports, and alerts that change on a daily basis. And when you start solving enough of these sorts of problems, you derive value that can fundamentally affect the way your company does business. (from Asking the Wrong Question)

and

Calendar time is what matters. If my relational database application renders a sales forecast report in 500 milliseconds while my Semantic Web application takes 5 seconds, you might hear people say that the relational approach is 10 times faster than the Semantic Web approach. But if it took six months to design and build the relational solution versus two weeks for the Semantic Web solution, Semantic Sam will be adjusting his supply chain and improving his efficiencies long before Relational Randy has even seen his first report. The Semantic Web lets you do things fast, in calendar time. (from Saving Months, Not Milliseconds)

First, you will notice that Lee doesn’t cite any examples in either case, examples being the first thing you would expect to see in a marketing document: “Our foobar is quicker, faster, better at X than its competitors.” Even if the test results are cooked, they still give concrete examples.

Second, the truth is that for the Semantic Web (original recipe) or the Semantic Web (linked data special blend) or topic maps or conceptual graphs or whatever, semantic integration is hard. If it were easy, do you think we would have witnessed the ten-year slide from the original Scientific American Semantic Web vision to the current-day, linked data version?

Third, semantic diversity has existed for the length of recorded language, depending on whose estimates you accept, 4,000 to 5,000 years. And there has been no shortage of people with a plan to eliminate semantic diversity all that time. Semantic diversity persists to this day. If people haven’t been able to eliminate semantic diversity in 4,000 to 5,000 years, what chance does an automated abacus have?

Topic Maps as Jigsaw Puzzles?

Filed under: Cyc,Ontology,SUMO,Topic Maps — Patrick Durusau @ 8:20 pm

I ran across:

How could a data governance framework possibly predict how you will assemble the puzzle pieces? Or how the puzzle pieces will fit together within your unique corporate culture? Or which of the many aspects of data governance will turn out to be the last (or even the first) piece of the puzzle to fall into place in your organization? And, of course, there is truly no last piece of the puzzle, since data governance is an ongoing program because the business world constantly gets jumbled up by change.

So, data governance frameworks are useful, but only if you realize that data governance frameworks are like jigsaw puzzles. (emphasis added)

in A Data Governance Framework Jigsaw Puzzle by Jim Harris.

I rather liked the comparison to a jigsaw puzzle and the argument that the last piece seems magical only because it is the last piece. You could jumble them up and some other piece would be the last piece.

The other part that I liked was the conclusion that “…the business world constantly gets jumbled up by change.”

Might want to read that again: “…the business world constantly gets jumbled up by change.”

I will boldly generalize that to: the world constantly gets jumbled by change.

Well, perhaps not such a bold statement as I think anyone old enough to be reading this blog realizes the world of today isn’t the world it was ten years ago. Or five years ago. Or in many cases one year ago.

I think that may explain some of my unease with ontologies that claim to have captured something fundamental rather than something fit for a particular use.

At one time an ontology based on earth, wind, fire and water would have been sufficient for most purposes. It isn’t necessary to claim more than fitness for use and in so doing, it leaves us the ready option to change should a new use come along. One that isn’t served by the old ontology.

Interchange is one use case and if you want to claim that Cyc or SUMO are appropriate for a particular case of interchange, that is a factual claim that can be evaluated. Or to claim that either one is sufficient for “reasoning” about a particular domain. Again, a factual question subject to evaluation.

But the world that produced both Cyc and SUMO isn’t the world of today. Both remain useful but the times they are a changing. Enough change and both ontologies and topic maps will need to change to suit your present needs.

Ontologies and topic maps are jigsaw puzzles with no final piece.

NIST CC Business Use Cases Working Group

Filed under: Cloud Computing,Marketing — Patrick Durusau @ 8:19 pm

NIST CC Business Use Cases Working Group

From the description:

NIST will lead interested USG agencies and industry to define target USG Cloud Computing business use cases (set of candidate deployments to be used as examples) for Cloud Computing model options, to identify specific risks, concerns and constraints.

Not about topic maps per se but certainly about opportunities to apply topic maps! USG agencies, to say nothing of industry, are a hotbed of semantic diversity.

The more agencies move towards “cloud” computing, the more likely they are to encounter “foreign” or “rogue” data.

Someone is going to have to assist with their assimilation or understanding of that data. May as well be you!

The ClioPatria Semantic Web server

Filed under: Prolog,RDF,Semantic Web — Patrick Durusau @ 8:18 pm

The ClioPatria Semantic Web server

I ran across this whitepaper about the ClioPatria Semantic Web server that reads in part:

What is ClioPatria?

ClioPatria is a (SWI-)Prolog hosted HTTP application-server with libraries for Semantic Web reasoning and a set of JavaScript libraries for presenting results in a browser. Another way to describe ClioPatria is as “Tomcat+Sesame (or Jena) with additional reasoning libraries in Prolog, completed by JavaScript presentation components”.

Why is ClioPatria based on Prolog?

Prolog is a logic-based language using a simple depth-first resolution strategy (SLD resolution). This gives two readings to the same piece of code: the declarative reading and the procedural reading. The declarative reading facilitates understanding of the code and allows for reasoning about it. The procedural reading allows for specifying algorithms and sequential aspects of the code, something which we often need to describe interaction. In addition, Prolog is reflexive: it can reason about Prolog programs and construct them at runtime. Finally, Prolog is, like the RDF triple-model, relational. This match of paradigms avoids the complications involved with using Object Oriented languages for handling RDF (see below). We illustrate the fit between RDF and Prolog by translating an example query from the official SPARQL document:…

Just in case you are interested in RDF or Prolog or both.

Data Governance Next Practices: The 5 + 2 Model

Filed under: Data Governance — Patrick Durusau @ 8:16 pm

Data Governance Next Practices: The 5 + 2 Model by Jill Dyché.

From the post:

If you’re a regular reader of this newsletter or one of my blogs, odds are I’ve already ripped the doors off of some of your closely held paradigms about data governance. I’d like to flatter myself and say that this is because I enjoy being provocative and I’m a little sassy. Though both of these statements are factual, indeed empirically tested, the real reason is because I’m with clients all the time and I see what does and doesn’t work. And one thing I’ve learned from overseeing dozens of client engagement is this: There’s no single right way to deliver data governance.

Companies that have succeeded with data governance have deliberately designed their data governance efforts. They’ve assembled a core working group, normally comprised of half a dozen or so people from both business and IT functions, who have taken the time to envision what data governance will look like before deploying it. These core teams then identify the components, putting them into place like a Georges Seurat painting, the small pieces comprising the larger landscape.

Well, I don’t have any real loyalty to one particular approach to data governance over another. And I am very much interested in data governance approaches that succeed as opposed to those that don’t.

What Jill says makes sense, at least to me but I do have one question that perhaps one of you can answer. (I am asking at her blog as well and will report back with her response.)

In the illustrations there is a circle that surrounds the entire process, labeled “Continuous Measurement.” Happens twice. I searched the text for some explanation of “continuous measurement” or even “measurement” and came up empty.

So, “continuous measurement” of what?

I ask because if I were using this process with a topic map, I would be interested in measuring how closely the mapping of subject identity was meeting the needs of the various groups.

Particularly since semantics change over time, some more quickly than others. That is, the data governance project would never be completed, although it might be more or less active depending upon the rate of semantic change.

I am sure there are aspects of data governance other than semantic identity, but it happens to be the one of greatest interest to me.

Communities of Practice

Filed under: Communities of Practice — Patrick Durusau @ 8:10 pm

Communities of Practice: a brief introduction by Etienne Wenger.

Etienne Wenger is the originator of the term “communities of practice,” although he concedes the social act it names is quite old:

The term “community of practice” is of relatively recent coinage, even though the phenomenon it refers to is age-old. The concept has turned out to provide a useful perspective on knowing and learning. A growing number of people and organizations in various sectors are now focusing on communities of practice as a key to improving their performance.

This brief and general introduction examines what communities of practice are and why researchers and practitioners in so many different contexts find them useful as an approach to knowing and learning.

What are communities of practice?

Communities of practice are formed by people who engage in a process of collective learning in a shared domain of human endeavor: a tribe learning to survive, a band of artists seeking new forms of expression, a group of engineers working on similar problems, a clique of pupils defining their identity in the school, a network of surgeons exploring novel techniques, a gathering of first-time managers helping each other cope. In a nutshell:

Communities of practice are groups of people who share a concern or a passion for something they do and learn how to do it better as they interact regularly.

Note that this definition allows for, but does not assume, intentionality: learning can be the reason the community comes together or an incidental outcome of member’s interactions. Not everything called a community is a community of practice. A neighborhood for instance, is often called a community, but is usually not a community of practice.

Etienne’s post is a good summary of his work with pointers to additional resources if you are interested in the details.

For any number of circumstances, but particularly professional activities, I suspect the term “community of practice” will resonate with potential users/customers.

Be aware that “communities of practice” is a narrower term than interpretive communities, which was coined by Stanley Fish.

Not for actual encounters with clients but good training for the same:

Do you think interpretive communities or communities of practice are more useful in developing a model to be represented as a topic map? Why? Choosing one, discuss how you would develop such a model. (Whom you would ask, what you would ask, etc.)

The Long Tail of Semantics

Filed under: Search Engines,Searching,Semantics — Patrick Durusau @ 8:09 pm

It came up in a conversation with Sam Hunting recently that search engines are holding a large end of a long tail of semantics. Well, that is how I would summarize almost 30 minutes of hitting at and around the idea!

Think about it: search engines, by their present construction, are bound to a large end of a long tail of search results. That is the end of the long tail that they report to users, with varying degrees of filtering and enhancement, not to mention paid ads.

Problem: The long tail of semantics hasn’t been established in general, for some particular set of terms, and certainly not for any particular user. Oops. (as Rick Perry would say)

And each search result represents some unknown position in some long tail of semantics for a particular user. Oops, again.

Search engines do well enough to keep users coming back, so they are hitting some part of the long tail of semantics, they just don’t know what part for any particular user.

I am sure it is easier to count occurrences, queries and the like and trust that the search engine is hitting high enough somewhere on the long tail to justify ad rates.

But what if we could improve that? That is, not banging around somewhere on a long tail of semantics in general but hitting some particular sub-tail of semantics.

For example, we know when terminology is being taken from an English-language journal on cardiology. We have three semantic indicators: English as a language, journal as the means of publication and cardiology as a subject area. What is more, we can discover, without too much difficulty, the papers cited by authors of that journal, which more likely than not would be recognized by other readers of that journal. So what if we kept the results from that area segregated from other search results and did the same (virtually) with other recognized areas? (Mathematics, for example, has varying terms even within its disciplines, set theory among them, so work would remain to be done.)

Rather than putting search results together and later trying to disambiguate them, start that process at the beginning and preserve as much data as we can that may help distinguish part of a long tail into smaller ones.
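
A minimal Java sketch of what “preserving the distinguishing data from the start” could look like (the context keys and documents are invented): tag each document with its semantic indicators at index time and keep per-context result lists instead of one undifferentiated pile.

    import java.util.*;

    // Sketch: keep results segregated by semantic context (language, publication
    // type, subject area) from the start, instead of mixing everything and trying
    // to disambiguate later. Contexts and documents are invented.
    public class ContextualIndex {

        // A context key is just the concatenation of the indicators known at index time.
        static String context(String language, String pubType, String subjectArea) {
            return language + "/" + pubType + "/" + subjectArea;
        }

        private final Map<String, List<String>> byContext = new HashMap<>();

        void index(String docTitle, String language, String pubType, String subjectArea) {
            byContext.computeIfAbsent(context(language, pubType, subjectArea),
                                      k -> new ArrayList<>()).add(docTitle);
        }

        // Search within one semantic sub-tail rather than across everything.
        List<String> search(String keyword, String contextKey) {
            List<String> hits = new ArrayList<>();
            for (String title : byContext.getOrDefault(contextKey, List.of())) {
                if (title.toLowerCase().contains(keyword.toLowerCase())) hits.add(title);
            }
            return hits;
        }

        public static void main(String[] args) {
            ContextualIndex idx = new ContextualIndex();
            idx.index("Stent thrombosis after PCI", "en", "journal", "cardiology");
            idx.index("Large cardinals and determinacy", "en", "journal", "set-theory");
            System.out.println(idx.search("stent", context("en", "journal", "cardiology")));
        }
    }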

(This sounds like “personalization” to me as I write it but personalization has its own hazards and dangers. Some of which can be avoided by asking a librarian. More on that another time.)

Did Web Search kill Artificial Intelligence?

Filed under: Artificial Intelligence — Patrick Durusau @ 8:08 pm

Did Web Search kill Artificial Intelligence?

Matthew Hurst writes (in part):

…, we currently have the following:

  • Search engines that don’t understand language and which attempt to mediate between people (searches by people and documents by people),
  • The best and the brightest coming to work for document oriented web companies.

I can’t help but wonder where the AI project would be today if web search (as it is currently envisioned) hadn’t gobbled up so much bandwidth.

No doubt it would be different, i.e., more papers, more attempts, etc., but all the resources devoted to the Internet would not have made a substantial advance in AI.

Why?

Well, consider that the AI project has been in full swing for over sixty years now, if not a bit longer. True enough, there are scanning miracles that have vastly changed medicine and research in a number of areas, and there is voice recognition, but these are all tightly defined tasks that are capable of precise description.

That cars can be driven autonomously by computers isn’t proof of the success of artificial intelligence. It is confirmation of the complaints we have all made about the “idiot” driving the other car. Granted, it is a sensor- and computation-heavy task, but with enough hardware it is doable.

But the car example is a good one to illustrate the continuing failure of AI and why the Turing test is inadequate.

First, a question:

Given the same location with the same inputs from its sensors, would a car being driven by an autonomous agent:

  1. Take the same path as on a previous run, or
  2. Choose to take another path?

I deeply suspect the answer is #1 because computers and their programs are deterministic.

True, you could add a random (or rather pseudo-random) number generator but the program remains deterministic because the random number generator only alters a pre-specified part of the program. It isn’t possible for variation to occur at some other point in the program.
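
A tiny Java illustration of that point: seed the generator and the “random” route choices are exactly reproducible, run after run.

    import java.util.List;
    import java.util.Random;

    // With a fixed seed, the "random" choices are exactly reproducible: the
    // variation is confined to the one pre-specified decision point.
    public class DeterministicRoutes {
        public static void main(String[] args) {
            List<String> paths = List.of("surface streets", "highway", "scenic route");
            for (int run = 0; run < 2; run++) {
                Random rng = new Random(42L);        // same seed every run
                System.out.print("run " + run + ":");
                for (int decision = 0; decision < 5; decision++) {
                    System.out.print(" " + paths.get(rng.nextInt(paths.size())));
                }
                System.out.println();                // both runs print identical choices
            }
        }
    }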

A person, on the other hand, without prior instruction or a random number generator, could take a different path.

Consider the case of Riemannian geometry. The computers that generate geometry proofs, from which humans select the significant ones, aren’t capable of that sort of insight. Why? Because there is a non-deterministic leap that results in a new insight that wasn’t present before.

Unless and until AI can create a system capable of non-deterministic behavior, other than by design (such as a random number generator or switching trees, etc.), it will not have created artificial intelligence. Perhaps a mimic of intelligence, but nothing more.

January 16, 2012

Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD)

Filed under: Database,Knowledge Discovery,Machine Learning — Patrick Durusau @ 2:43 pm

The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD) will take place in Bristol, UK from September 24th to 28th, 2012.

Dates:

Abstract submission deadline: Thu 19 April 2012
Paper submission deadline: Mon 23 April 2012
Early author notification: Mon 28 May 2012
Author notification: Fri 15 June 2012
Camera ready submission: Fri 29 June 2012
Conference: Mon – Fri, 24-28 September, 2012.

From the call for papers:

The European Conference on “Machine Learning” and “Principles and Practice of Knowledge Discovery in Databases” (ECML-PKDD) provides an international forum for the discussion of the latest high-quality research results in all areas related to machine learning and knowledge discovery in databases and other innovative application domains.

Submissions are invited on all aspects of machine learning, knowledge discovery and data mining, including real-world applications.

The overriding criteria for acceptance will be a paper’s:

  • potential to inspire the research community by introducing new and relevant problems, concepts, solution strategies, and ideas;
  • contribution to solving a problem widely recognized as both challenging and important;
  • capability to address a novel area of impact of machine learning and data mining.

Other criteria are scientific rigour and correctness, challenges overcome, quality and reproducibility of the experiments, and presentation.

I rather like that: quality and reproducibility of the experiments.

As opposed to “just believe in the power of …” and you will get all manner of benefits, though no one can produce data to prove those claims.

Reminds me of the astronomer in Samuel Johnson’s Rasselas, who claimed:

I have possessed for five years the regulation of the weather and the distribution of the seasons. The sun has listened to my dictates, and passed from tropic to tropic by my direction; the clouds at my call have poured their waters, and the Nile has overflowed at my command. I have restrained the rage of the dog-star, and mitigated the fervours of the crab. The winds alone, of all the elemental powers, have hitherto refused my authority, and multitudes have perished by equinoctial tempests which I found myself unable to prohibit or restrain. I have administered this great office with exact justice, and made to the different nations of the earth an impartial dividend of rain and sunshine. What must have been the misery of half the globe if I had limited the clouds to particular regions, or confined the sun to either side of the equator?’”

And when asked how he knew this to be true, replied:

“‘Because,’ said he, ‘I cannot prove it by any external evidence; and I know too well the laws of demonstration to think that my conviction ought to influence another, who cannot, like me, be conscious of its force. I therefore shall not attempt to gain credit by disputation. It is sufficient that I feel this power that I have long possessed, and every day exerted it. But the life of man is short; the infirmities of age increase upon me, and the time will soon come when the regulator of the year must mingle with the dust. The care of appointing a successor has long disturbed me; the night and the day have been spent in comparisons of all the characters which have come to my knowledge, and I have yet found none so worthy as thyself.’” (emphasis added)

Project Gutenberg has a copy online: Rasselas, Prince of Abyssinia, by Samuel Johnson.

For my part, I think semantic integration has been, is and will be hard, not to mention expensive.

Determining your ROI is just as necessary for a semantic integration project, whatever technology you choose, as for any other project.

Legislation Identity Issues

Filed under: Government,Government Data,Transparency — Patrick Durusau @ 2:41 pm

After posting House Launches Transparency Portal I started to think about all the identity issues that such a resource raises. None of them new but with greater access to the stuff of legislation, the more those issues come to the fore.

The easy ones are going to be identifying the bills themselves, what parts of the U.S. Code they modify, their legislative history (in terms of amendments), etc. And the current legislation can be tracked, etc.

Legislation identifies the subject matter to which it applies, what the rules on the subject are to become, and a host of other details.

But more than that, legislation, indirectly, identifies who will benefit from the legislation and who will bear the costs of it. Not identified in the sense that we think of social security numbers, addresses or geographic location, but just as certainly identification.

For example, what if a bill in Congress says that it applies to all cities with more than two million inhabitants? (New York, Los Angeles, Chicago, Houston – largest to smallest) Sounds fair on the face of it, but only four cities in four different states are going to benefit from it.

Another set of identity issues will be who wrote the legislation. Oh, err, members of Congress are “credited” with writing bills but it is my understanding that is a polite fiction. Bills are written by specialists in legislative writing. Some work for the government, some for lobbyists, some for other interest groups, etc.

It would make a very interesting subject identity project to use authorship techniques to try to identify when Covington & Burling LLP, Arnold & Porter LLP, Monsanto, or the hand of the ACLU can be detected in legislation.
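
As a toy illustration of one common authorship-attribution technique, character n-gram frequency profiles compared by cosine similarity, something like the following could flag which known drafting style a bill’s text most resembles. The bill and sample texts are invented; real attribution work would use richer features and real corpora.

    import java.util.HashMap;
    import java.util.Map;

    // Toy authorship check: character trigram frequency profiles compared with
    // cosine similarity. The strings are invented for illustration.
    public class AuthorshipSketch {

        static Map<String, Integer> trigrams(String text) {
            String t = text.toLowerCase().replaceAll("\\s+", " ");
            Map<String, Integer> counts = new HashMap<>();
            for (int i = 0; i + 3 <= t.length(); i++) {
                counts.merge(t.substring(i, i + 3), 1, Integer::sum);
            }
            return counts;
        }

        static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
            double dot = 0, normA = 0, normB = 0;
            for (Map.Entry<String, Integer> e : a.entrySet()) {
                dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
                normA += e.getValue() * e.getValue();
            }
            for (int v : b.values()) normB += v * v;
            return dot / (Math.sqrt(normA) * Math.sqrt(normB) + 1e-9);
        }

        public static void main(String[] args) {
            String bill = "notwithstanding any other provision of law, the term shall apply to";
            String firmSample = "notwithstanding any other provision of law, the provisions herein";
            String blogSample = "we had a great time at the conference and learned a lot about graphs";
            System.out.printf("similarity to firm sample: %.3f%n", cosine(trigrams(bill), trigrams(firmSample)));
            System.out.printf("similarity to blog sample: %.3f%n", cosine(trigrams(bill), trigrams(blogSample)));
        }
    }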

Whether you identify the “actual” author of a bill or not, there is also the question of the identity of whoever paid for the legislation.

All of these “identity” issues and others have always existed with regard to legislation, regulations, executive orders, etc., but making bills available electronically may change how those issues are approached.

Not a plan of action, but just imagine, say, a number of people interested enough in a particular bill to loosely organize and produce an annotated version that ties it to existing laws, probable sources of the bill and who it benefits. Other people, perhaps specialists in campaign finances or even local politics for an area, could further the analysis started by others.

I have been told that political blogging works that way, unlike the conventional news services that hoard information and therefore only offer partial coverage of any event.

Whatever semantic technology is used to produce the annotations, RDF, linked data or topic maps (my favorite), you are still going to face complex identity issues with legislation.

Suggestions on possible approaches or interfaces?

Neo4j 1.6.M03 “Jörn Kniv”

Filed under: Graphs,Neo4j — Patrick Durusau @ 2:39 pm

Neo4j 1.6.M03 “Jörn Kniv”

Didn’t mean for today to turn into a Neo4j day but I missed this release last week.

From the post:

Kernel changes

This release includes a popular feature request: the ability to ensure that key-value pairs for entities are unique!

If you look up entities (nodes or relationships) using an external key, you’ll want exactly one entity to correspond to each value of the key. For example, if you have nodes representing people, and you look these up using Social Security Number (SSN), you’ll want exactly one node for each SSN. This is easily achieved if you load all your data sequentially, because you can add a new node each time you meet a value of the key (a new SSN). However, up to now, it has been awkward to maintain this uniqueness when multiple processes are adding data simultaneously (via web requests for example).

Since this is a common use-case, we’ve improved the API to make it easy to enforce entity uniqueness for a given key-value pair. At the index level, we’ve added a new method putIfAbsent which ensures that only one entity will be indexed for the key-value pair, even if lots of threads are using the same key-value pair at the same time. Alternatively, if you’d prefer to work with nodes or relationships rather than with the underlying indexes, there’s a higher level API provided by UniqueFactory. This makes it easy to retrieve an entity using get-or-create semantics, i.e. it returns a matching entity if one exists, otherwise it creates one. Again, this mechanism is thread-safe, so it doesn’t matter how many threads call getOrCreate simultaneously, only one entity will be created for each key-value pair. This functionality is also exposed through the REST API, via a ?unique query parameter.

Cypher

Array properties have been supported in Neo4j for a long time, but until now it wasn’t possible to query on them. This milestone makes it possible to filter on array properties in Cypher. We have also improved aggregation performance.

Lucene upgrade

Neo4j uses Apache Lucene for its indexing features – this allows you to find “entry points” into the graph before starting graph-based queries. Lucene is an actively developed project in its own right, and is constantly being enhanced and improved. In this Neo4j release, we’re taking the opportunity to upgrade to a newer stable release of Apache Lucene, so that all users get the benefits of recent enhancements in Lucene. We’ve moved to Lucene 3.5; for details on all the changes, have a look at their changelog.

Hmmm, so putIfAbsent returns “the previously indexed entity,” so I should be able to determine if properties I have are absent from the entity. If they are, add them; if not, ignore them (or add additional provenance information, etc.).
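
Based on the API described in the quoted release notes, a get-or-create on SSN using the index-level call might look roughly like the sketch below. This is my reading of the Neo4j 1.6 embedded Java API, so check the current javadocs before relying on the exact signatures.

    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.Transaction;
    import org.neo4j.graphdb.index.Index;

    // Sketch of get-or-create with the index-level putIfAbsent described above.
    // Signatures follow my reading of the 1.6 javadocs; verify before use.
    public class UniquePerson {

        public static Node getOrCreatePerson(GraphDatabaseService db, String ssn) {
            Index<Node> people = db.index().forNodes("people");
            Transaction tx = db.beginTx();
            try {
                Node candidate = db.createNode();
                candidate.setProperty("ssn", ssn);
                // Returns the previously indexed node if one already exists for this
                // key-value pair; otherwise indexes our candidate and returns null.
                Node existing = people.putIfAbsent(candidate, "ssn", ssn);
                if (existing != null) {
                    candidate.delete();    // lost the race: discard the duplicate
                    tx.success();
                    return existing;
                }
                tx.success();
                return candidate;
            } finally {
                tx.finish();
            }
        }
    }

The point above about adding absent properties or provenance would slot in right where the existing node comes back: that is the moment you know which entity already stands for the subject.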

Spring Data Neo4j Webinar Follow Up

Filed under: Graphs,Neo4j,Spring Data — Patrick Durusau @ 2:38 pm

Spring Data Neo4j Webinar Follow Up

OK, I’ll ‘fess up, I missed the Spring Data Neo4j webinar. 🙁

But, all is not lost!

Not only can you watch the webinar, grab the slides, and other resources, but questions left when the webinar ended are answered.

Graph Theory and Network Science (Aurelius)

Filed under: Graphs,Neo4j,Networks — Patrick Durusau @ 2:37 pm

Graph Theory and Network Science (Aurelius)

When a post that starts out:

Graph theory and network science are two related academic fields that have found application in numerous commercial industries. The terms ‘graph’ and ‘network’ are synonymous and one or the other is favored depending on the domain of application. A Rosetta Stone of terminology is provided below to help ground the academic terms to familiar, real-world structures.

And ends:

Ranking web pages is analogous to determining the most influential people in a social network or finding the most relevant concepts in a knowledge network. Finally, all these problems are variations of one general process—graph traversing. Graph traversing is the simple process of moving from one vertex to another vertex over the edges in the graph and either mutating the structure or collecting bits of information along the way. The result of a traversal is either an evolution of the graph or a statistic about the graph.

The tools and techniques developed by graph theorists and networks scientists has an astounding number of practical applications. Interestingly enough, once one has a general understanding of graph theory and network science, the world’s problems start to be seen as one in the same problem.

With about as nice an introduction to why graphs/networks are important as I have read in a long time, you know it is going to be a good day!

Particularly when the source cited on graph traversal is none other than Marko A. Rodriguez and Peter Neubauer of Neo4j fame. (They may be famous for other reasons as well but you will have to contribute those.) (Yes, I noticed who is associated with the site.)

I find the notion of mutating the structure of a graph based on traversal to be deeply interesting and useful in a topic maps context.
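
For a flavor of “mutating the structure based on traversal,” here is a small self-contained Java sketch: a toy adjacency-set graph (not Neo4j or Gremlin) traversed breadth-first, merging any neighbor that carries the same identifier as the node it was reached from, which is roughly what a topic map merge does when two proxies turn out to name the same subject.

    import java.util.*;

    // Toy graph: a BFS traversal that mutates the structure along the way by
    // merging neighbors sharing an identifier with the node they were reached from.
    public class MergeOnTraversal {

        static class Node {
            final String id;                       // stand-in for a subject identifier
            final Set<Node> neighbors = new HashSet<>();
            Node(String id) { this.id = id; }
        }

        static void traverseAndMerge(Node start) {
            Deque<Node> queue = new ArrayDeque<>(List.of(start));
            Set<Node> seen = new HashSet<>(List.of(start));
            while (!queue.isEmpty()) {
                Node current = queue.poll();
                boolean merged = false;
                for (Node n : new ArrayList<>(current.neighbors)) {
                    if (n != current && n.id.equals(current.id)) {
                        // Same subject: fold n's edges into current and detach n.
                        for (Node m : new ArrayList<>(n.neighbors)) {
                            m.neighbors.remove(n);
                            if (m != current) {
                                m.neighbors.add(current);
                                current.neighbors.add(m);
                            }
                        }
                        current.neighbors.remove(n);
                        n.neighbors.clear();
                        merged = true;
                    } else if (seen.add(n)) {
                        queue.add(n);
                    }
                }
                if (merged) queue.add(current);    // revisit: merging may add neighbors
            }
        }

        public static void main(String[] args) {
            Node a = new Node("subj-1"), b = new Node("subj-1"), c = new Node("subj-2");
            a.neighbors.add(b); b.neighbors.add(a);
            b.neighbors.add(c); c.neighbors.add(b);
            traverseAndMerge(a);
            System.out.println(a.neighbors.size());  // 1: c now hangs directly off a
        }
    }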

The Graph Traversal Pattern

Filed under: Graphs,Neo4j — Patrick Durusau @ 2:34 pm

The Graph Traversal Pattern by Marko A. Rodriguez and Peter Neubauer.

Abstract:

A graph is a structure composed of a set of vertices (i.e., nodes, dots) connected to one another by a set of edges (i.e., links, lines). The concept of a graph has been around since the late 19th century, however, only in recent decades has there been a strong resurgence in both theoretical and applied graph research in mathematics, physics, and computer science. In applied computing, since the late 1960s, the interlinked table structure of the relational database has been the predominant information storage and retrieval model. With the growth of graph/network-based data and the need to efficiently process such data, new data management systems have been developed. In contrast to the index-intensive, set-theoretic operations of relational databases, graph databases make use of index-free, local traversals. This article discusses the graph traversal pattern and its use in computing.

Cited in Graph Theory and Network Science and one of those bibliography items I have been meaning to pick up.

A good resource for understanding the importance of traversal in graph database applications.

Introducing Meronymy SPARQL Database Server

Filed under: RDF,Semantic Web,SPARQL — Patrick Durusau @ 2:33 pm

Introducing Meronymy SPARQL Database Server

Inge Henriksen writes:

I am pleased to announce today that the Meronymy SPARQL Database Server is ready for release later in 2012. Meronymy SPARQL Database Server is a high performance RDF Enterprise Database Management System (DBMS).

Our goal has been to make a really fast, ACID, OS portable, user friendly, secure, SPARQL-driven RDF database server usable with most programming languages.

Let’s not start any language wars about Meronymy being written in C++/assembly, 😉 , and concentrate on its performance in actual use.

Suggested RDF data sets to use to test that performance? (Knowing Inge, I trust it is fast, but the question is how fast and under what circumstances.)

Or other RDF engines to test along side of it?

PS: If you don’t know SPARQL, check out Learning SPARQL by Bob DuCharme.

Workshop on Entity-Oriented Search (EOS) – Beijing – Proceedings

Filed under: Conferences,Entities,Entity Extraction,Entity Resolution,Search Data,Searching — Patrick Durusau @ 2:32 pm

Workshop on Entity-Oriented Search (EOS) – Beijing – Proceedings (PDF file)

There you will find:

Session 1:

  • High Performance Clustering for Web Person Name Disambiguation Using Topic Capturing by Zhengzhong Liu, Qin Lu, and Jian Xu (The Hong Kong Polytechnic University)
  • Extracting Dish Names from Chinese Blog Reviews Using Suffix Arrays and a Multi-Modal CRF Model by Richard Tzong-Han Tsai (Yuan Ze University, Taiwan)
  • LADS: Rapid Development of a Learning-To-Rank Based Related Entity Finding System using Open Advancement by Bo Lin, Kevin Dela Rosa, Rushin Shah, and Nitin Agarwal (Carnegie Mellon University)
  • Finding Support Documents with a Logistic Regression Approach by Qi Li and Daqing He (University of Pittsburgh)
  • The Sindice-2011 Dataset for Entity-Oriented Search in the Web of Data by Stephane Campinas (National University of Ireland), Diego Ceccarelli (University of Pisa), Thomas E. Perry (National University of Ireland), Renaud Delbru (National University of Ireland), Krisztian Balog (Norwegian University of Science and Technology) and Giovanni Tummarello (National University of Ireland)

Session 2

  • Cross-Domain Bootstrapping for Named Entity Recognition by Ang Sun and Ralph Grishman (New York University)
  • Semi-supervised Statistical Inference for Business Entities Extraction and Business Relations Discovery by Raymond Y.K. Lau and Wenping Zhang (City University of Hong Kong)
  • Unsupervised Related Entity Finding by Olga Vechtomova (University of Waterloo)

Session 3

  • Learning to Rank Homepages For Researcher-Name Queries by Sujatha Das, Prasenjit Mitra, and C. Lee Giles (The Pennsylvania State University)
  • An Evaluation Framework for Aggregated Temporal Information Extraction by Enrique Amigó (UNED University), Javier Artiles (City University of New York), Heng Ji (City University of New York) and Qi Li (City University of New York)
  • Entity Search Evaluation over Structured Web Data by Roi Blanco (Yahoo! Research), Harry Halpin (University of Edinburgh), Daniel M. Herzig (Karlsruhe Institute of Technology), Peter Mika (Yahoo! Research), Jeffrey Pound (University of Waterloo), Henry S. Thompson (University of Edinburgh) and Thanh Tran Duc (Karlsruhe Institute of Technology)

A good start on what promises to be a strong conference series on entity-oriented search.

Nuxeo World 2011 – Update

Filed under: Conferences,Humor — Patrick Durusau @ 2:31 pm

Nuxeo World 2011 – Update

Videos from Nuxeo World 2011 are available!

Something to watch if your team got knocked out of Super Bowl contention. 😉

Admittedly, there are no GoDaddy ads as you will likely see during the Super Bowl.

I wonder why that is?

That GoDaddy doesn’t sponsor web commercials for technical conferences?

There are a couple of conferences I would like to see have GoDaddy commercials.

To drive registration and not just web “hits,” one of the GoDaddy girls could give away GoDaddy swag at the conference.

Please post if you know how to get that done.

