Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

February 2, 2012

Open-source Weave liberates data for journalists, citizens

Filed under: Graphics,News,Visualization,Weave — Patrick Durusau @ 3:39 pm

Open-source Weave liberates data for journalists, citizens: The software can help journalists create infinitely interactive visualizations, by Andrew Phelps.

From the post:

Data nerds from government and academia gathered Friday at Northeastern University to show off the latest version of Weave, an open-source, web-based platform designed to visualize “any available data by anyone for any purpose.” The software has a lot of potential for journalists.

Weave is supported by the Open Indicators Consortium, an unusual partnership of planning agencies and universities who wanted better tools to inform public policy and community decision-making. The groups organized and agreed to share data and code in 2008, well before Gov 2.0 was hot.

Think of Weave as more programming language than app. It powers websites such as the Connecticut Data Collaborative and Rhode Island’s RI DataHUB. The newly relaunched MetroBoston DataCommon, a project of eastern Massachusetts’ regional planning agency, really shows off the software’s power. There, users can upload their own datasets (Weave claims to be able to handle virtually any format) or browse sample visualizations (e.g., Children in Families Below Poverty).

Data is linked, which means you can view the same datapoint from many angles. Drag your cursor across a few dozen cities and towns and watch as those data are simultaneously illuminated on a histogram and a scatter plot. Add another datapoint to find correlations or trim the data to create subsets. The software keeps track of state, which means you would be able to visually undo and redo changes and save that series of steps as an animation. The end result, powered by Flash, is easily embeddable into a web page.

This is a truly remarkable piece of work!

Having said that, the MetroBoston DataCommon illustrates the limitations of the approach. Follow the link and report back what category you would look under for:

  • Crime statistics by type of crime
  • Crime statistics by arrest/conviction
  • Parolee or sex offenders by district

I tried Public Safety and Demographics with no luck. Maybe they don’t have any crime in Boston. Maybe. That’s the problem with government choosing which data sets to release: it can produce a very odd view of local conditions.

Not the fault of Weave, but I thought I should mention that “released data” != accurate picture.

The New SolrCloud: Overview

Filed under: Solr,SolrCloud — Patrick Durusau @ 3:38 pm

The New SolrCloud: Overview by Rafał Kuć.

From the post:

Just the other day we wrote about Sensei, the new distributed, real-time full-text search database built on top of Lucene and here we are again writing about another “new” distributed, real-time, full-text search server also built on top of Lucene: SolrCloud.

In this post we’ll share some interesting SolrCloud bits and pieces that matter mostly to those working with large data and query volumes, but that all search lovers should find really interesting, too. If you have any questions about what we wrote (or did not write!) in this post, please leave a comment – we’re good at following up to comments! Or just ask @sematext!

Please note that functionality described in this post is now part of trunk in Lucene and Solr SVN repository. This means that it will be available when Lucene and Solr 4.0 are released, but you can also use trunk version just like we did, if you don’t mind living on the bleeding edge.

A good overview and one that could be useful with semi-technical management types.

If you need more details, see: SolrCloud wiki.
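For a quick sense of what the cluster awareness buys you on the client side, here is a minimal SolrJ sketch against a SolrCloud cluster from the Solr 4.x trunk era the post describes. It assumes ZooKeeper is reachable at localhost:9983 and that a collection named “collection1” already exists; both are assumptions on my part, not details from the post.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SolrCloudQuery {
  public static void main(String[] args) throws Exception {
    // CloudSolrServer reads the cluster state from ZooKeeper, so queries are
    // routed to live shards without hard-coding individual Solr URLs.
    CloudSolrServer server = new CloudSolrServer("localhost:9983");
    server.setDefaultCollection("collection1");

    QueryResponse response = server.query(new SolrQuery("*:*"));
    for (SolrDocument doc : response.getResults()) {
      System.out.println(doc.getFieldValue("id"));
    }
    server.shutdown();
  }
}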

February 1, 2012

Multiple Recognitions: Reconsidered

Filed under: Context,Identification,Identifiers,Semantics — Patrick Durusau @ 4:39 pm

Yesterday I closed with these lines:

Requirement: A system of identification must support the same identifiers resolving to different identifications.

The consequences of deciding otherwise on such a requirement, I will try to take up tomorrow. (Multiple Recognitions)

Rereading that for today’s post, I don’t agree with myself.

The requirement isn’t a requirement at all but an observation that the same identifier may have multiple resolutions.

Better to say that designers of systems of identification should be aware of that observation, to avoid situations like the one I posed yesterday with the “I will call you a cab” example.

A fortuitous mistake because it leads to the next issue that I wanted to address: Do identifiers have contexts in which they have only a single resolution?

Yesterday’s mistake has made me more wary of sweeping pronouncements so I am posing the context issue as a question. 😉

Can you think of any counter-examples?

The easiest place to look would be in comedy, where mistaken identity (such as in Shakespeare), double meanings, etc., are the bread and butter of the art. Two or more people hear or see the same identifier and reach different resolutions.

If we had a rule that identifiers could only have a single resolution, we would simply have to skip over those cases. That seems like an inelegant solution.

Or would you shrink the context down to the individuals who had the different resolutions of an identifier?

Perhaps, but then what is your solution when, later in the play, one or more individuals discover their mistake and now hold a common resolution while still remembering the one that was in error? Or perhaps more than one that was in error? How do we describe the context(s) there?

There is a long history of such situations in comedy. You may be tempted to say that recreational literature can be excluded, that “fictional” work isn’t the first place we want semantic technologies to work.

Perhaps, but remember that comedy and “fiction” have their origin in our day-to-day affairs. The misunderstandings they parody are our misunderstandings.

The saying “what did X know and when did they know it?” takes on new meaning when we talk about the interpretation of identifiers. Perhaps “freedom fighter” is a more sympathetic term until you “know” those forces are operating death squads. It may also have different legal consequences.

How do you think boundaries for contexts should be set/designated? That seems like an important issue to take up.
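To make the question concrete, here is a purely illustrative sketch (mine, not anything proposed in these posts) of one way to record the same identifier resolving differently per context, while keeping each context’s earlier, mistaken resolutions around after a common resolution is reached. All of the names in it are hypothetical.

import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class IdentifierResolution {
  // identifier -> context (e.g., a particular hearer) -> history of
  // resolutions, most recent first, so mistaken resolutions are remembered.
  private final Map<String, Map<String, Deque<String>>> resolutions = new HashMap<>();

  public void resolve(String identifier, String context, String identification) {
    resolutions
        .computeIfAbsent(identifier, k -> new HashMap<>())
        .computeIfAbsent(context, k -> new ArrayDeque<>())
        .addFirst(identification);
  }

  public String currentResolution(String identifier, String context) {
    Map<String, Deque<String>> byContext =
        resolutions.getOrDefault(identifier, Collections.emptyMap());
    Deque<String> history = byContext.get(context);
    return history == null ? null : history.peekFirst();
  }

  public static void main(String[] args) {
    IdentifierResolution r = new IdentifierResolution();
    // "I will call you a cab": two hearers resolve the same identifier differently.
    r.resolve("cab", "hearer-A", "a taxi the speaker will summon");
    r.resolve("cab", "hearer-B", "a name the speaker is calling me");
    // Later, hearer-B discovers the mistake; the earlier resolution is still kept.
    r.resolve("cab", "hearer-B", "a taxi the speaker will summon");
    System.out.println(r.currentResolution("cab", "hearer-B"));
  }
}

Whether the “context” key should be an individual, a scene, or something larger is exactly the boundary question above.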

Coca-Cola, Toucans and Charles Sanders Peirce

Filed under: Peirce,Semantics,Semiotics — Patrick Durusau @ 4:39 pm

Coca-Cola, Toucans and Charles Sanders Peirce by Mike Bergman.

I have gone back and forth about this one, even though I have to agree with:

Global is Neither Indiscriminate Nor Unambiguous

Names, references, identity and meaning are not absolutes. They are not philosophically, and they are not in human language. To expect machine communications to hold to different standards and laws than human communications is naive. To effect machine communications our challenge is not to devise new rules, but to observe and apply the best rules and practices that human communications instruct.

There has been an unstated hope at the heart of the semantic Web enterprise that simply expressing statements in the right way (syntax) and in the right form (RDF) is sufficient to facilitate machine communications. But this hope, too, is naive and silly. Just as we do not accept all human utterances as truth, neither will we accept all machine transmissions as reliable. Some of the information will be posted in error; some will be wrong or ill-fitting to our world view; some will be malicious or intended to deceive. Spam and occasionally lousy search results on the Web tell us that Web documents are subject to these sources of unsuitability, why is not the same true of data?

Thus, global data access via the semantic Web is not — and can never be — indiscriminate nor unambiguous. We need to understand and come to trust sources and provenance; we need interpretation and context to decide appropriateness and validity; and we need testing and validation to ensure messages as received are indeed correct. Humans need to do these things in their normal courses of interaction and communication; our machine systems will need to do the same.

These confirmations and decisions as to whether the information we receive is actionable or not will come about via still more information. Some of this information may come about via shared conventions. But most will come about because we choose to provide more context and interpretation for the core messages we hope to communicate.

It is well-written and so pleasant to read. See what you think about the process by which Mike reaches his conclusions.

[HBase] Coprocessor Introduction

Filed under: HBase,HBase Coprocessor — Patrick Durusau @ 4:39 pm

[HBase] Coprocessor Introduction by Trend Micro Hadoop Group: Mingjie Lai, Eugene Koontz and Andrew Purtell.

From the post:

HBase has very effective MapReduce integration for distributed computation over data stored within its tables, but in many cases – for example simple additive or aggregating operations like summing, counting, and the like – pushing the computation up to the server where it can operate on the data directly without communication overheads can give a dramatic performance improvement over HBase’s already good scanning performance.

Also, before 0.92, it was not possible to extend HBase with custom functionality except by extending the base classes. Due to Java’s lack of multiple inheritance this required extension plus base code to be refactored into a single class providing the full implementation, which quickly becomes brittle when considering multiple extensions. Who inherits from whom? Coprocessors allow a much more flexible mixin extension model.

In this article I will introduce the new Coprocessors feature of HBase, a framework for both flexible and generic extension, and of distributed computation directly within the HBase server processes. I will talk about what it is, how it works, and how to develop coprocessor extensions.

If you are using HBase, this looks like a must-read article. It also covers how to write your own coprocessor extensions.

I first saw this at myNoSQL.
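To give a rough flavor of the observer style of extension the article describes, here is a minimal sketch, assuming the 0.92-era coprocessor API (the class name, column family, and policy are hypothetical, and the method signatures changed in later HBase releases):

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

// Runs inside the region server process, so the check happens next to the data
// rather than after a round trip to the client.
public class RestrictedFamilyObserver extends BaseRegionObserver {
  private static final byte[] RESTRICTED = Bytes.toBytes("restricted");

  @Override
  public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                     Put put, WALEdit edit, boolean writeToWAL)
      throws IOException {
    // Hypothetical policy: reject any write that touches the "restricted" family.
    if (put.getFamilyMap().containsKey(RESTRICTED)) {
      throw new IOException("Writes to the restricted column family are not allowed");
    }
  }
}

Observers like this are typically registered on a table descriptor or cluster-wide through the hbase.coprocessor.region.classes configuration property; see the article for the details of loading.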

GraphInsight

Filed under: Data Analysis,Data Structures,Graphs,Visualization — Patrick Durusau @ 4:38 pm

GraphInsight

From the webpage:

Interactive graph exploration

GraphInsight is a visualization software that lets you explore graph data through high quality interactive representations.

(video omitted)

Data exploration and knowledge extraction from graphs is of great interest nowadays: Knowledge is disseminated in social networks, and services are powered by cloud computing platforms. Data miners deal with graphs every day.

Humans are extremely good in identifying patterns and outliers. We believe that interacting visually with your data can give you a better intuition, and higher confidence on what you are looking for.

The video is just a little over one (1) minute long and is worth seeing.

It won’t tell you how best to display your data, but it does illustrate some of the capabilities of the software.

There are a number of graph rendering packages already but interactive ones are less common.

Now, if we could have interactive graph software that hides/displays the graph underlying a text, along with all of the sub-graphs related to its content, so that it starts to mimic regular reading practice that goes off on tangents and finds support for ideas in unlikely spaces, that would be something really different.

Amazon S3 Growth for 2011 – Now 762 Billion Objects

Filed under: Amazon Web Services AWS,Semantics — Patrick Durusau @ 4:37 pm

Amazon S3 Growth for 2011 – Now 762 Billion Objects

Just a quick illustration of how one data locale is outstripping efforts to embed semantics in web-based content.

Journal of Web Semantics: Special Issue on Scalability

Filed under: Scalability,Semantics — Patrick Durusau @ 4:36 pm

Journal of Web Semantics: Special Issue on Scalability

Navigation Note:

Article titles link to the download page for each article.

Author names link to pages listing that author’s publications in the Journal of Web Semantics.

Editorial – Special Issue “Scalability”
Jeff Heflin, Heiner Stuckenschmidt

Scalable Distributed Indexing and Query Processing over Linked Data
Marcel Karnstedt, Kai-Uwe Sattler, Manfred Hauswirth

Searching Web Data: an Entity Retrieval and High-Performance Indexing Model
Renaud Delbru, Stephane Campinas, Giovanni Tummarello

WebPIE: A Web-scale parallel inference engine using MapReduce
Jacopo Urbani, Spyros Kotoulas, Jason Massen, Frank van Harmelen, Henri Bal

Scalable and Distributed Methods for Entity Matching, Consolidation and Disambiguation over Linked Data Corpora
Aidan Hogan, Antoine Zimmermann, Juergen Umbrich, Axel Polleres, Stefan Decker

Pentaho open sources ‘big data’ integration tools under Apache 2.0

Filed under: BigData,Data Integration,Kettle — Patrick Durusau @ 4:35 pm

Pentaho open sources ‘big data’ integration tools under Apache 2.0

Chris Kanaracus writes:

Business intelligence vendor Pentaho is releasing as open source a number of tools related to “big data” in the 4.3 release of its Kettle data-integration platform and has moved the project overall to the Apache 2.0 license, the company announced Monday.

While Kettle had always been available in a community edition at no charge, the tools being open sourced were previously only available in the company’s commercialized edition. They include integrations for Hadoop’s file system and MapReduce as well as connectors to NoSQL databases such as Cassandra and MongoDB.

Those technologies are some of the most popular tools associated with the analysis of “big data,” an industry buzzword referring to the ever-larger amounts of unstructured information being generated by websites, sensors and other sources, along with transactional data from enterprise applications.

The big data components will still be offered as part of a commercial package, Pentaho Business Analytics Enterprise Edition, which bundles in tech support, maintenance and additional functionality, said Doug Moran, company co-founder and big data product manager.

Who would have thought as recently as two years ago that big data analysis would face an embarrassment of open source riches?

Even though they are “open source,” production use of any of these tools in a big data environment requires a substantial investment of human and technical resources.

I see the usual promotional webinars, but for unstructured data I wonder why we don’t see the usual suspects in competitions like TREC.

Ranking in such an event should not be the only consideration, but it would at least be a public test of the various software offerings.

Bio4j: A pioneer graph based database…

Filed under: Bio4j,Bioinformatics,Neo4j — Patrick Durusau @ 4:35 pm

Bio4j: A pioneer graph based database for the integration of biological Big Data by Pablo Pareja Tobes.

Great slide deck by the principal developer for Bio4j.

Take a close look at slide 19 and tell me what it reminds you of.

😉

One year on: 10 times bigger, masses more data… and a new API

Filed under: Corporate Data,Dataset,Government Data — Patrick Durusau @ 4:35 pm

One year on: 10 times bigger, masses more data… and a new API

From the post:

Was it just a year ago that we launched OpenCorporates, after just a couple months’ coding? When we opened up over 3 million companies and allowed searching across multiple jurisdictions (admittedly there were just three of them to start off with)?

Who would have thought that 12 months later we would have become 10 times bigger, with over 30 million companies and over 35 jurisdictions, and lots of other data too. So we could use this as an example to talk about some of the many milestones in that period, about all the extra data we’ve added, about our commitment to open data, and the principles behind it.

We’re not going to do that however, instead we’d rather talk about the new API we’ve just launched, allowing full access to all the info, and importantly allowing searches via the API too. In fact, we’ve now got a full website devoted to the api, http://api.opencorporates.com, and on it you’ll find all the documentation, example API calls, versioning information, error messages, etc.

Congratulations to OpenCorporates on a stellar year!

The collection of dots to connect has gotten dramatically larger!
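If you want to try the new API, here is a minimal sketch of a search call. The endpoint path, parameter name, and response shape are assumptions on my part; http://api.opencorporates.com has the authoritative documentation, example calls, and versioning information.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class OpenCorporatesSearch {
  public static void main(String[] args) throws Exception {
    // Hypothetical company search across all jurisdictions.
    String query = URLEncoder.encode("acme", "UTF-8");
    URL url = new URL("http://api.opencorporates.com/companies/search?q=" + query);

    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
      StringBuilder json = new StringBuilder();
      String line;
      while ((line = in.readLine()) != null) {
        json.append(line);
      }
      // The response is JSON; parse it with the JSON library of your choice.
      System.out.println(json);
    } finally {
      conn.disconnect();
    }
  }
}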
