Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

August 27, 2011

Big Ball of Mud

Filed under: Architecture,Design — Patrick Durusau @ 9:12 pm

Big Ball of Mud by Brian Foote and Joseph Yoder.

I ran across a reference to this paper by John Schmidt in his reply to a comment on his post Four Canonical Techniques That Really Work (Or Not).

The authors present seven patterns of software systems:

  • BIG BALL OF MUD
  • THROWAWAY CODE
  • PIECEMEAL GROWTH
  • KEEP IT WORKING
  • SHEARING LAYERS
  • SWEEPING IT UNDER THE RUG
  • RECONSTRUCTION

All the superlatives have been used before, so I will simply say: read it.

Think about topic maps, Semantic Web apps, information systems you have helped write or design. Do you recognize any of them after reading this paper? What would you do differently today?

What is a Customer?

Filed under: Data Analysis,Data Integration — Patrick Durusau @ 9:11 pm

I ran across a series of posts where David Loshin explores the question: “What is a Customer?” or as he puts it in The Most Dangerous Question to Ask Data Professionals:

Q: What is the most dangerous question to ask data professionals?

A: “What is the definition of customer?”

And he includes some examples:

  • “customer” = person who gave us money for some of our stuff
  • “customer” = the person using our stuff
  • “customer” = the guy who approved the payment for our stuff
  • “customer account manager” = salesperson
  • “customer service” = complaints office
  • “customer representative” = gadfly

and explores the semantic confusion about how we use “customer.”

In Single Views Of The Customer, David explores the hazards and dangers of a single definition of customer.

When Is A Customer Not A Customer? starts to stray into more familiar territory when he says:

Here are the two pieces of our current puzzle: we have multiple meanings for the term “customer” but we want a single view of whatever that term means. To address this Zen-like conundrum we have to go beyond our context and think differently about the end, not the means. Here are two ideas to drill into: entity vs. role and semantic differentiation.

and after some interesting discussion (which you should read) he concludes:

What we can start to see is that a “customer” is not really a data type, nor is it really a customer. Rather, a “customer” is a role played by some entity (in this case, either an individual or an organization) within some functional context at different points of particular business processes. In the next post let’s decide how we can use this to rethink the single view of the customer.
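Loshin's entity-vs-role distinction drops straight into code. A minimal sketch (the entities, roles, and contexts below are invented for illustration, not drawn from his posts):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    """An individual or organization, identified independently of any role."""
    entity_id: str
    name: str

@dataclass(frozen=True)
class Role:
    """A role ("customer", "approver", ...) played by an entity in a context."""
    entity: Entity
    role_name: str
    context: str  # the business process or functional context

acme = Entity("E1", "Acme Corp")
jane = Entity("E2", "Jane Doe")

roles = [
    Role(acme, "customer", "billing"),      # gave us money for our stuff
    Role(jane, "customer", "support"),      # uses our stuff
    Role(jane, "approver", "procurement"),  # approved the payment
]

def customers(roles):
    """All entities playing the 'customer' role, in any context."""
    return {r.entity for r in roles if r.role_name == "customer"}

print(sorted(e.name for e in customers(roles)))  # ['Acme Corp', 'Jane Doe']
```

The same entity can play different roles in different contexts without being duplicated, which is the point of the quoted conclusion.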

I will be posting an update when the next post appears.

Visualizing Jane Austen

Filed under: Visualization — Patrick Durusau @ 9:10 pm

Visualizing Jane Austen by Matthew Hurst.

Visualization based on the occurrence of character names in paragraphs (the color of paragraphs is changed). In part 2, the columns indicate chapters.

I mention it as an example of exploring a text with a relatively unsophisticated tool. Sometimes that may be all you need.
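The technique itself is modest: test each paragraph for each name and mark the hits. A sketch with invented text (Hurst, of course, works from the actual novels):

```python
# For each paragraph, record which character names occur, then render one row
# per paragraph and one column per name ('#' = name present, '.' = absent).
paragraphs = [
    "Emma Woodhouse, handsome, clever, and rich...",
    "Mr. Knightley, a sensible man...",
    "Emma and Mr. Knightley disagreed, as usual.",
]
names = ["Emma", "Knightley"]

def occurrence_grid(paragraphs, names):
    return [[name in p for name in names] for p in paragraphs]

for row in occurrence_grid(paragraphs, names):
    print("".join("#" if hit else "." for hit in row))
# prints:
# #.
# .#
# ##
```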

Factual’s Crosswalk API

Filed under: Crosswalk,Mapping — Patrick Durusau @ 9:09 pm

Factual’s Crosswalk API by Matthew Hurst.

From the post:

Factual, which is mining the web for knowledge using a variety of web mining methods, has released an API in the local space which aims to expose, for a specific local entity (e.g. a restaurant) the places on the web that it is mentioned. For example, you might find for a restaurant its homepage, its listing on Yelp, its listing on UrbanSpoon, etc.

This mapping between entities and mentions is potentially a powerful utility. Given all these mentions, if some of the data changes (e.g. via a user update on a Yelp page) then the central knowledge base information for that entity can be updated.

When I looked, the Crosswalk API was still limited to the US. Matthew uncovers accuracy-of-mapping issues known all too well to topic mappers.

From the Factual site:

Factual Crosswalk does four things:

  1. Converts a Factual ID into 3rd party identifiers and URLs
  2. Converts a 3rd party URL into a Factual canonical record
  3. Converts a 3rd party namespace and ID into a Factual canonical record
  4. Provides a list of URLs where a given Factual entity is found on the Internet
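Read operationally, those four conversions amount to lookups over an identifier table. A toy in-memory sketch (the records, namespaces, and URLs are invented; this is not Factual's actual data model or API):

```python
# Canonical records keyed by a Factual-style ID, plus a crosswalk table
# linking that ID to 3rd-party namespaces, IDs, and URLs.
records = {
    "factual:123": {"name": "Joe's Diner"},
}
crosswalk = [
    {"factual_id": "factual:123", "namespace": "yelp",
     "namespace_id": "joes-diner", "url": "http://yelp.example/joes-diner"},
    {"factual_id": "factual:123", "namespace": "urbanspoon",
     "namespace_id": "4711", "url": "http://urbanspoon.example/4711"},
]

def ids_and_urls(factual_id):                 # operation 1
    return [(r["namespace"], r["namespace_id"], r["url"])
            for r in crosswalk if r["factual_id"] == factual_id]

def record_for_url(url):                      # operation 2
    for r in crosswalk:
        if r["url"] == url:
            return records[r["factual_id"]]

def record_for_ns(namespace, namespace_id):   # operation 3
    for r in crosswalk:
        if r["namespace"] == namespace and r["namespace_id"] == namespace_id:
            return records[r["factual_id"]]

def urls_for(factual_id):                     # operation 4
    return [r["url"] for r in crosswalk if r["factual_id"] == factual_id]
```

Note that nothing in the table says *why* the two 3rd-party IDs are believed to identify the same entity, which is the complaint below.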

Don’t know about you but I am unimpressed.

In part because of the flatland mapping approach to identification. If all I know is Identifier1 was mapped to Identifier2, that is better than a poke with a sharp stick for identification purposes, but only barely. How do I discover what entity you thought was represented by Identifier1 or Identifier2?

I suppose piling up identifiers is one approach but we can do better than that.


PS: I am adding Crosswalk as a category so I can cover traditional crosswalks as developed by librarians. I am interested in what implicit parts of crosswalks should become explicit in a topic map. Pointers and suggestions welcome. Or conversions of crosswalks into topic maps.

August 26, 2011

Trinity Podcast

Filed under: Graphs,Trinity — Patrick Durusau @ 6:32 pm

Microsoft Research: Trinity is a Graph Database and a Distributed Parallel Platform for Graph Data

Episode of Hanselminutes, a weekly audio talk show with noted web developer and technologist Scott Hanselman and hosted by Carl Franklin.

Scott talks via Skype to Haixun Wang at Microsoft Research Asia about Trinity: a distributed graph database and computing platform. What is a GraphDB? How is it different from a traditional Relational DB, a Document DB or even just a naive in-memory distributed data structure? Will your next database be a graph database?

The interview is quite entertaining and leaving the booster comments aside, is quite informative.

Relational database vendors may be surprised to hear their products described as good for “small data.” In their defense (as if they needed one), I would note there is a lot of money to be made in “small data.”

For further information see the Trinity project homepage at Microsoft Research.

Trinity code is available only for internal release 🙁 but you can look at the Trinity Manual.

First play with Neo4j

Filed under: Neo4j — Patrick Durusau @ 6:31 pm

First play with Neo4j

Interesting post on a first experience with Neo4j:

There are many cool things that have emerged over the last few years that I’ve wanted to play with but not had a real reason to. One of these is Neo4j, a graph database; another tool to emerge from the “noSQL movement”. Well, when I was younger, I spent a LOT of time playing on MUDs (indeed, it was this that provided me with my first opportunity to write Real Code); for those who haven’t experienced them before, a common MUD family (ROM) has a movement system that consists of Rooms, discrete locations with a number of Exits to other rooms. Hey look ma, a graph!

Looks like another MUD engine is under development!
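The rooms-and-exits model really is just a directed graph, and even without Neo4j it can be sketched as an adjacency map (room and exit names invented):

```python
from collections import deque

# Rooms as nodes, exits as directed, labeled edges: the ROM movement model
# the post describes, as a plain adjacency map rather than a graph database.
exits = {
    "temple": {"south": "square"},
    "square": {"north": "temple", "east": "market"},
    "market": {"west": "square"},
}

def shortest_path(start, goal):
    """Breadth-first search over rooms; returns the list of rooms visited."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for room in exits.get(path[-1], {}).values():
            if room not in seen:
                seen.add(room)
                queue.append(path + [room])

print(shortest_path("temple", "market"))  # ['temple', 'square', 'market']
```

A graph database earns its keep once the world grows past what fits comfortably in one process, or once you want traversals expressed declaratively instead of hand-rolled.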

My Data Mining Weblog

Filed under: Data Mining — Patrick Durusau @ 6:30 pm

My Data Mining Weblog by Ridzuan Daud.

I stumbled across this blog when I found the Python Data Mining Resources post.

I poked around after reading that post and thought the blog itself needed separate mention. Appears to be a good source of current information as well as listings of books, software, tutorials, etc. Definitely a place to spend some time if you are interested in data mining.

Python Data Mining Resources

Filed under: Data Mining,Python — Patrick Durusau @ 6:29 pm

Python Data Mining Resources

From the post:

Python for data mining has been gaining some interest from the data mining community because it is an open source, general purpose programming and web scripting language. Below are some resources to kick start doing data mining using Python:

A resource for Lars Marius to point others to when they have questions about his data mining techniques. Errant Perl programmers for instance. 😉

Neoclipse

Filed under: Graphs,Neo4j,Neoclipse — Patrick Durusau @ 6:29 pm

Neoclipse

I am not sure how I missed mentioning this but I did.

From the webpage:

Neoclipse is a subproject of Neo4j which aims to be a tool that supports the development of Neo4j applications.

Main features:

  • visualize the graph
  • increase/decrease the traversal depth
  • filter the view by relationship types
  • add/remove nodes/relationships
  • create relationship types
  • add/remove/change properties on nodes and relationships
  • highlight nodes/relationships in different ways
  • add icons to nodes

Information Systems Category Editor Needed for Computing Reviews

Filed under: Jobs — Patrick Durusau @ 6:27 pm

Information Systems Category Editor Needed for Computing Reviews

Anyone interested in topic maps is likely to be interested in positions such as this one. Your chance to contribute back to the community.

Computing Reviews, the post-publication review and comment journal of ACM, is seeking a volunteer editor interested in serving as category editor for the information systems area (encompassing models & principles, database management, information storage & retrieval, and information systems applications).

The qualified candidate will be prepared to check written reviews of already-published items for quality, and the classification terms from ACM’s CCS for accuracy, as well as use a Web-based editing system to make any suggested changes to the CCS terms or to the review itself. Most importantly, the category editor provides feedback to the review’s author so that existing guidelines are met. He or she also works with staff and reviewers to develop additional features for the publication. This is an opportunity for an enthusiast in the discipline to use specialist knowledge to contribute to a product that helps others navigate and sift through the computing literature. The time commitment is approximately 1-2 hours per week.

It may just be boiler-plate but I would be most interested in:

He or she also works with staff and reviewers to develop additional features for the publication.

The first feature I would like to see is a common login for my ACM + Digital Library account and Computing Reviews.

The second feature would be citation networks for articles.

What features would you like to see?

Riak 1.0 Overview (webinar)

Filed under: Riak — Patrick Durusau @ 6:26 pm

Riak 1.0 Overview (webinar)

Details:

Webinar Date and Time

Wednesday, September 21, 2011 at 2:00 pm, Eastern Daylight Time (New York, GMT-04:00)

Wednesday, September 21, 2011 at 11:00 am, Pacific Daylight Time (San Francisco, GMT-07:00)

Wednesday, September 21, 2011 at 8:00 pm, Europe Summer Time (Berlin, GMT+02:00)

Join Basho Technologies’ CTO Justin Sheehy and Principal Architect Andy Gross for an in-depth overview of the key features and enhancements found in Riak 1.0, including:

  • Increased Query Capabilities
  • Usability Enhancements
  • Greater Reliability and Stability
  • Enhanced Scalability

In addition to a sneak peek at Riak 1.0, attendees will also learn about what is in store for the Riak platform beyond this milestone 1.0 release, as well as discover additional services and products available from Basho Technologies, the creators of Riak.

As always, attendees will have the chance to have their questions addressed by our Riak experts on hand. We hope you can join us as we review this landmark release.

Couchbase Server 2.0 Tour and Demo

Filed under: Couchbase — Patrick Durusau @ 6:26 pm

Couchbase Server 2.0 Tour and Demo

From the post:

It’s been a busy few weeks since CouchConf San Francisco, where we announced and demo’d the developer preview of Couchbase Server 2.0, which integrates Apache CouchDB, Membase and Memcached into a single, powerful NoSQL database solution.

We just finished an update to the developer preview and it is now available. Be sure to download the latest version and let us know what you think.

If you missed the demo at CouchConf (or if you were there and just want to see it again), here is the video of the presentation and demo that Damien and I did at the show. I hope you enjoy it!

A Genome Sequence Analysis…

Filed under: Hypertable — Patrick Durusau @ 6:25 pm

A Genome Sequence Analysis System Built With Hypertable by Doug Judd.

Interesting use of matching to discover new or novel genetic information (deletes matches, what’s left is new/novel).
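The delete-the-matches idea can be sketched with k-mer sets: index the known sequence, subtract its k-mers from the sample, and what survives is candidate novel sequence. (Toy sequences and a toy k; the scale that Hypertable provides in the real system is exactly what this sketch omits.)

```python
# Build the set of all length-k substrings (k-mers) of a sequence.
def kmers(seq, k=4):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

reference = "ACGTACGTGG"
sample    = "ACGTACGTTTTT"  # shares a prefix with the reference

# Delete the matches; what's left is new/novel relative to the reference.
novel = kmers(sample) - kmers(reference)
print(sorted(novel))  # ['CGTT', 'GTTT', 'TTTT']
```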

Realtime Search: Solr vs Elasticsearch

Filed under: ElasticSearch,Solr — Patrick Durusau @ 6:24 pm

Realtime Search: Solr vs Elasticsearch by Ryan Sonnek.

Comparison of Solr and Elasticsearch for realtime searching.

Where “realtime” means you are updating the index while performing searches.

I’m not convinced that “realtime” requirements are any more common than those of “BigData.” They do exist and when they do, use the appropriate solution. On the other hand, don’t plan or build for “realtime” or “BigData” unless those are your requirements.

August 25, 2011

Developer Contest: Win Apple Stuff!

Filed under: Graphs,InfiniteGraph — Patrick Durusau @ 7:06 pm

Developer Contest: Win Apple Stuff!

From the website:

Build a cool software application, web or mobile service around social, game and/or location-based networks, using InfiniteGraph to traverse the objects and relationships in your data. You could win up to $12,000 worth of Apple products, gear and tech!

Presentation and code due 30 September 2011.

What objects and relationships are in your data?

InfiniteGraph Steps Out of Beta…

Filed under: Graphs,InfiniteGraph — Patrick Durusau @ 7:05 pm

InfiniteGraph Steps Out Of Beta To Help Companies Identify Deep Relationships In Large Data Sets

From the article:

Working with these kinds of large enterprises requires support for billions of data points, and so InfiniteGraph has built a system to enable scaling and big data capacity, with realtime functionality. Today, InfiniteGraph is expanding its reach to businesses and developers looking to mine their data stores for complex relationships, be they enterprise apps targeting SMBs, SMEs themselves, or Fortune 500 companies.

But the important thing to point out about InfiniteGraph’s commercial release (the system has been being developed in beta over the last year) is that it doesn’t require developers to re-engineer their databases from scratch to benefit from the technology. Developers can simply use the platform’s dedicated graph API to leverage InfiniteGraph’s relationship mining on top of their existing data. It also offers a high-scale database management system, which is a nice bonus.

Other features of note in InfiniteGraph’s commercial release include parallel data loading and accelerated ingest, meaning that developers can import and continuously feed apps with data from multiple input streams more speedily. The graph database also allows developers to choose from different indexing options that suit their company’s specific needs (from automatic to manual), as well as enabling devs to view, verify, and test data models in customizable approaches. (emphasis added)

Sounds like topic map navigation of information, doesn’t it?

Check it out: http://www.infinitegraph.com

SERIMI…. (Have you washed your data?)

Filed under: Linked Data,LOD,RDF,Similarity — Patrick Durusau @ 7:04 pm

SERIMI – Resource Description Similarity, RDF Instance Matching and Interlinking

From the website:

The interlinking of datasets published in the Linked Data Cloud is a challenging problem and a key factor for the success of the Semantic Web. Manual rule-based methods are the most effective solution for the problem, but they require skilled human data publishers going through a laborious, error prone and time-consuming process for manually describing rules mapping instances between two datasets. Thus, an automatic approach for solving this problem is more than welcome. We propose a novel interlinking method, SERIMI, for solving this problem automatically. SERIMI matches instances between a source and a target datasets, without prior knowledge of the data, domain or schema of these datasets. Experiments conducted with benchmark collections demonstrate that our approach considerably outperforms state-of-the-art automatic approaches for solving the interlinking problem on the Linked Data Cloud.

SERIMI-TECH-REPORT-v2.pdf

From the Results section:

The poor performance of SERIMI in the Restaurant1-Restaurant2 pair is mainly due to missing alignments in the reference set. The poor performance in the Person21-Person22 pair is due to the nature of the data. These datasets were built by adding spelling mistakes to the properties and literal values of their original datasets. Also, only instances of class Person were retrieved into the pseudo-homonym sets during the interlinking process.

Impressive work overall but isn’t dirty data really the test? Just about any process can succeed with clean data.

Or is that really the weakness of the Semantic Web? That it requires clean data?
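Stripped of its reference-set machinery, the heart of instance matching without schema knowledge is string similarity over labels with a cutoff. A rough sketch using difflib (SERIMI's actual method is considerably more sophisticated; the names below are invented):

```python
from difflib import SequenceMatcher

# Source and target instance labels from two hypothetical datasets.
source = ["Tim Berners-Lee", "Grace Hopper"]
target = ["T. Berners-Lee", "Grace M. Hopper", "Alan Turing"]

def best_match(label, candidates, cutoff=0.6):
    """Return the highest-similarity candidate, or None below the cutoff."""
    scored = [(SequenceMatcher(None, label.lower(), c.lower()).ratio(), c)
              for c in candidates]
    score, match = max(scored)
    return match if score >= cutoff else None

links = {s: best_match(s, target) for s in source}
print(links)
```

Even this toy makes the dirty-data point: similarity ratios degrade gracefully under spelling variation, but the cutoff that works for one dataset pair will misfire on another.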

Graph Data Management: Techniques and Applications

Filed under: Graphs — Patrick Durusau @ 7:03 pm

Graph Data Management: Techniques and Applications

I haven’t seen the book but the following is just too amusing to pass up:

$185: Graph Data Management: Techniques and Applications – Amazon.

$164.45: Graph Data Management: Techniques and Applications – Walmart.

I did a search for the exact title, thinking I would pick up the usual suspects and maybe a review.

I don’t normally think of shopping at Walmart for technical computer books. Maybe that needs to change.

I suspect this is the detailed Table of Contents.

Homepages:

Dr. Sherif Sakr

Eric Pardede

Neo4j Manual v.1.5 Snapshot

Filed under: Neo4j — Patrick Durusau @ 7:03 pm

Neo4j Manual v.1.5 Snapshot

From the preface:

The material is practical, technical, and focused on answering specific questions. It addresses how things work, what to do and what to avoid to successfully run Neo4j in a production environment. After a brief introduction, each topic area assumes general familiarity as it addresses the particular details of Neo4j.

The goal is to be thumb-through and rule-of-thumb friendly.

Each section should stand on its own, so you can hop right to whatever interests you. When possible, the sections distill “rules of thumb” which you can keep in mind whenever you wander out of the house without this manual in your back pocket.

Good whether you need the details or are simply exploring what you can try next.

Erlang Community Site

Filed under: Erlang,Marketing — Patrick Durusau @ 7:02 pm

Erlang Community site: www.trapexit.org

Interesting collection of links to various Erlang resources.

Includes Try Erlang site, where you can try Erlang in your browser.

I have seen topic maps displayed in web browsers. I have seen fairly ugly topic map editors in web browsers. No, don’t think I have seen a “Try Topic Maps” type site. Have I just missed it?

Thoughts? Suggestions?

Growing a DSL with Clojure

Filed under: Clojure,DSL — Patrick Durusau @ 7:01 pm

Growing a DSL with Clojure: Clojure Makes DSL Writing Straightforward by Ambrose Bonnaire-Sergeant.

From the post:

From seed to full bloom, Ambrose takes us through the steps to grow a domain-specific language in Clojure.

Lisps like Clojure are well suited to creating rich DSLs that integrate seamlessly into the language.

You may have heard Lisps boasting about code being data and data being code. In this article we will define a DSL that benefits handsomely from this fact.

We will see our DSL evolve from humble beginnings, using successively more of Clojure’s powerful and unique means of abstraction.

Is a DSL in your subject identity future?

Learn to Use DiscoverText – Free Tutorial Webinar

Filed under: Data Mining,DiscoverText,Text Extraction — Patrick Durusau @ 7:00 pm

Learn to Use DiscoverText – Free Tutorial Webinar

From the announcement:

This free, live Webinar introduces DiscoverText and key features used to ingest, filter, search & code text. We take your questions and demonstrate the newest tools, including a Do-It-Yourself (DIY) machine-learning classifier. You can create a classification scheme, train the system, and run the classifier in less than 20 minutes.

DiscoverText’s latest feature additions can be easily trained to perform customized mood, sentiment and topic classification. Any custom classification scheme or topic model can be created and implemented by the user. Once a classification scheme is created, you can then use advanced, threshold-sensitive filters to look at just the documents you want.

You can also generate interactive, custom, salient word clouds using the “Cloud Explorer” and drill into the most frequently occurring terms or use advanced search and filters to create “buckets” of text.

The system makes it possible to capture, share and crowd source text data analysis in novel ways. For example, you can collect text content off Facebook, Twitter & YouTube, as well as other social media or RSS feeds.
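The DIY workflow described, namely create a scheme, train, then classify, can be sketched in miniature. A bare-bones word-overlap classifier (the training snippets are invented; DiscoverText's actual classifier is not reproduced here):

```python
from collections import Counter

# Tiny labeled "scheme": a few training snippets per label.
training = {
    "positive": ["great service loved it", "really happy great product"],
    "negative": ["terrible service never again", "unhappy awful product"],
}

# "Train": build a word-frequency profile per label.
profiles = {label: Counter(w for doc in docs for w in doc.split())
            for label, docs in training.items()}

def classify(text):
    """Assign the label whose word profile overlaps the text most."""
    words = set(text.split())
    return max(profiles, key=lambda lab: sum(profiles[lab][w] for w in words))

print(classify("great product"))  # positive
print(classify("awful service"))  # negative
```

Twenty minutes to a working classifier is believable precisely because the core mechanism is this small; the hard part is curating the scheme and the training examples.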

Apologies: as the posting date shows, this announcement went up only the day before the webinar, so I am late with it.

It puzzles me why webinars tend to be announced only a day or two in advance. Why not a week?

They have recorded prior versions of this presentation so you can still learn something about DiscoverText.

PageRank Implementation in Pig

Filed under: Pig,Software — Patrick Durusau @ 6:59 pm

PageRank Implementation in Pig

Simple implementation of PageRank using Pig. Think of it as an easy intro to Pig.
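The Pig script distributes the computation; the algorithm itself fits in a few lines. A plain power-iteration sketch on a toy graph (damping factor 0.85, as in the original PageRank formulation; the graph is invented):

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict of node -> list of out-links.
    Assumes every node has at least one out-link (no dangling nodes)."""
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, outs in links.items():
            for m in outs:  # each node shares its rank among its out-links
                new[m] += damping * rank[n] / len(outs)
        rank = new
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # 'c' collects links from both 'a' and 'b'
```

Pig's contribution is running exactly this kind of iterative join-and-aggregate over data too large for one machine.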

If you don’t know Pig, see: Pig. 😉 Sorry.

Saw this in NoSQL Weekly (Issue 39 – Aug 25, 2011). I can’t point you to the issue, the NoSQL Weekly site reports “beta” and asks if you want a sample copy.

Micro Cloud Foundry

Filed under: Cloud Computing,Software — Patrick Durusau @ 6:59 pm

Micro Cloud Foundry

Described in NoSQL Weekly as:

VMware has issued a free version of its Cloud Foundry Platform-as-a-Service (PaaS) stack that can run on a single laptop or desktop computer. The idea behind this package, called Micro Cloud Foundry, is to give developers an easy way to build out Cloud Foundry applications and test them before moving them to an actual Cloud Foundry service. The package includes all the components in the full-fledged Cloud Foundry stack, including the Spring framework for Java, Ruby on Rails, the Sinatra Ruby framework, the JavaScript Node.js library, the Grails framework, and the MongoDB, MySQL, and Redis data stores.

Are you building topic map applications for Cloud Foundry services? Interested in your comments, experiences.

August 24, 2011

Do You CTRL+F?

Filed under: Marketing,Search Interface,Searching,Topic Maps — Patrick Durusau @ 7:00 pm

College students stumped by search engines

This link was forwarded to me by Sam Hunting.

That college students can’t do adequate searching isn’t a surprise.

What did surprise me was the finding: “…90 percent of American Google users do not know how to use CTRL or Command+F to find a word on a page.”

That finding was reported in: Crazy: 90 Percent of People Don’t Know How to Use CTRL+F.

Or as it appears in the article:

This week, I talked with Dan Russell, a search anthropologist at Google, about the time he spends with random people studying how they search for stuff. One statistic blew my mind. 90 percent of people in their studies don’t know how to use CTRL/Command + F to find a word in a document or web page! I probably use that trick 20 times per day and yet the vast majority of people don’t use it at all.

“90 percent of the US Internet population does not know that. This is on a sample size of thousands,” Russell said. “I do these field studies and I can’t tell you how many hours I’ve sat in somebody’s house as they’ve read through a long document trying to find the result they’re looking for. At the end I’ll say to them, ‘Let me show one little trick here,’ and very often people will say, ‘I can’t believe I’ve been wasting my life!'”

How should this finding influence subject identity tests and/or user interfaces for topic maps?

Should this push us towards topic map based data products, as data products, not topic maps?

Sesame 2.5.0 Release

Filed under: RDF,Sesame,SPARQL — Patrick Durusau @ 7:00 pm

Sesame 2.5.0 Release

From the webpage:

  • SPARQL 1.1 Query Language support
    Sesame 2.5 features near-complete support for the SPARQL 1.1 Query Language Last Call Working Draft, including all new builtin functions and operators, improved aggregation behavior and more.
  • SPARQL 1.1 Update support
    Sesame 2.5 has full support for the new SPARQL 1.1 Update Working Draft. The Repository API has been extended to support creation of SPARQL Update operations, the SAIL API has been extended to allow Update operations to be passed directly to the underlying backend implementation for optimized execution. Also, the Sesame Workbench application has been extended to allow easy execution of SPARQL update operations on your repositories.
  • SPARQL 1.1 Protocol support
    Sesame 2.5 fully supports the SPARQL 1.1 Protocol for RDF Working Draft. The Sesame REST protocol has been extended to allow update operations via SPARQL on repositories. A Sesame server therefore now automatically publishes any repository as a fully compliant SPARQL endpoint.
  • Binary RDF support
    Sesame 2.5 includes a new binary RDF serialization format. This format has been derived from the existing binary tuple results format. Its main features are reduced parsing overhead and minimal memory requirements (for handling really long literals, a.o.t.).

Clojure: The Essence of Programming

Filed under: Clojure — Patrick Durusau @ 6:59 pm

Clojure: The Essence of Programming by Howard Lewis Ship.

From the description:

Howard Lewis Ship talks about Clojure, a language more concise, testable, and readable than Java, letting the developer focus on his work rather than a verbose syntax.

Really basic survey but I was struck by the phrase: dumb data structures – wrapped with smart functions. (I am paraphrasing.)

The International Foundation for Information Technology (IF4IT)

Filed under: Taxonomy — Patrick Durusau @ 6:59 pm

The International Foundation for Information Technology (IF4IT)

The Foundation has released:

A Glossary Taxonomy that provides a hierarchy of Glossaries, Terms and Definitions that are semantically grouped by relevant domain area.

A File Plan Taxonomy that specifically correlates with the previously published Records Management Taxonomy and the Records Taxonomy.

A Service Taxonomy that covers the majority of all enterprise and IT services.

A Software Taxonomy that itemizes the many different categories of enterprise and IT software.

It may be easier than coming up with your own taxonomy.

The twenty-four (24), yes, twenty-four, social media options (including email) on every page reminded me that one “killer” semantic web/topic map app would be to create a common interface to all of those. It would need to include set intersection for the contacts on the various services, and manage the identity of contacts across the services.
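Managing contact identity across services is the interesting part: each service has its own handles, so those must be resolved to one identity before contact sets can be intersected. A minimal sketch (services and handles invented):

```python
# Map (service, handle) pairs to a single resolved identity.
identities = {
    ("twitter", "@pat_example"): "patrick",
    ("email", "pat@example.com"): "patrick",
    ("twitter", "@sam_example"): "sam",
    ("email", "sam@example.com"): "sam",
}
# Each service's contact list, in its own handle vocabulary.
contacts = {
    "twitter": {"@pat_example", "@sam_example"},
    "email": {"pat@example.com"},
}

def resolved(service):
    """Contacts on a service, as resolved identities rather than handles."""
    return {identities[(service, h)] for h in contacts[service]}

# People reachable on both services:
print(resolved("twitter") & resolved("email"))  # {'patrick'}
```

Without the identity mapping, the raw handle sets have an empty intersection even when the same people are on both services.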

10 Lessons Learned by BigData Pioneers

Filed under: BigData — Patrick Durusau @ 6:58 pm

10 Lessons Learned by BigData Pioneers

After the third lesson, I gave up.

That Information Week knows this little about web design doesn’t give me a lot of confidence in its content.

What was wrong? Well, having to wade through ad content in order to find snippets of content was a pain. The navigation looked like it was designed in the mid-1990s. And there was no option to switch to a print view, which would have avoided most of the meaningless ads.

I suppose IBM has enough money to waste on ads to simply take up space on webpages but I would rather see that spent at alphaWorks than here.

How Browsers Work:…

Filed under: Interface Research/Design,Search Interface,Web Applications — Patrick Durusau @ 6:56 pm

How Browsers Work: Behind the Scenes of Modern Web Browsers by Tali Garsiel.

If you are delivering topic map content to web browsers, ;-), you will probably find something in this article that is useful.

Enjoy.

