Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

December 9, 2011

British Museum Semantic Web Collection Online

Filed under: British Museum,Linked Data,SPARQL — Patrick Durusau @ 8:24 pm

British Museum Semantic Web Collection Online

From the webpage:

Welcome to this Linked Data and SPARQL service. It provides access to the same collection data available through the Museum’s web presented Collection Online, but in a computer readable format. The use of the W3C open data standard, RDF, allows the Museum’s collection data to join and relate to a growing body of linked data published by other organisations around the world interested in promoting accessibility and collaboration.

The data has also been organised using the CIDOC-CRM (Conceptual Reference Model) crucial for harmonising with other cultural heritage data. The current version is beta and development work continues to improve the service. We hope that the service will be used by the community to develop friendly web applications that are freely available to the community.

Please use the SPARQL menu item to use the SPARQL user interface or click here.

With both the British National Bibliography and the British Museum accessible via SPARQL, and Bob DuCharme’s Learning SPARQL book in hand, the excuses for not knowing SPARQL cold are few and far between.
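
If you want to poke at an endpoint like this from code, here is a minimal sketch using Python and the SPARQLWrapper library. Treat the endpoint URL as an assumption on my part; use whatever address the Museum’s SPARQL page actually gives you.

from SPARQLWrapper import SPARQLWrapper, JSON

# Assumed endpoint address; substitute the one listed on the Museum's SPARQL page.
ENDPOINT = "http://collection.britishmuseum.org/sparql"

sparql = SPARQLWrapper(ENDPOINT)
# A deliberately generic query: pull a handful of triples to see the shape of the data.
sparql.setQuery("""
    SELECT ?s ?p ?o
    WHERE { ?s ?p ?o }
    LIMIT 10
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["s"]["value"], row["p"]["value"], row["o"]["value"])

The same query pasted into the web form should return the same ten rows, which is a quick way to confirm the endpoint is behaving.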

Twenty Rules for Good Graphics

Filed under: Graphics,Marketing — Patrick Durusau @ 8:23 pm

Twenty Rules for Good Graphics

Rob J Hyndman outlines twenty (20) rules for producing good graphics.

Written for graphics in statistical publications but applicable to other graphics as well.

Communicating topic maps to others is hard enough without the burden of poor graphics.

Apache Flume – Architecture of Flume NG

Filed under: Flume — Patrick Durusau @ 8:22 pm

Apache Flume – Architecture of Flume NG by Arvind Prabhakar.

From the post:

Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. Flume is currently undergoing incubation at The Apache Software Foundation. More information on this project can be found at http://incubator.apache.org/flume. Flume NG is work related to new major revision of Flume and is the subject of this post.

Prior to entering the incubator, Flume saw incremental releases leading up to version 0.9.4. As Flume became adopted it became clear that certain design choices would need to be reworked in order to address problems reported in the field. The work necessary to make this change began a few months ago under the JIRA issue FLUME-728. This work currently resides on a separate branch by the name flume-728, and is informally referred to as Flume NG. At the time of writing this post Flume NG had gone through two internal milestones – NG Alpha 1, and NG Alpha 2 and a formal incubator release of Flume NG is in the works.

At a high-level, Flume NG uses a single-hop message delivery guarantee semantics to provide end-to-end reliability for the system. To accomplish this, certain new concepts have been incorporated into its design, while certain other existing concepts have been either redefined, reused or dropped completely.

In this blog post, I will describe the fundamental concepts incorporated in Flume NG and talk about its high-level architecture. This is the first in a series of blog posts by the Flume team that will go into further details of its design and implementation.

Log data from disparate sources is one likely use case for topic maps. See what you think of the new architecture for Apache Flume.

Good pointers to additional information as well.
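
To make the source/channel/sink vocabulary concrete, here is a toy Python sketch of the hand-off the post describes: a source pushes events into a channel, and a sink drains them toward a store one hop later. This is an illustration of the concept only, not Flume’s actual API, and every name in it is made up.

from queue import Queue

class Channel:
    """Toy stand-in for a Flume NG channel: buffers events between a source and a sink."""
    def __init__(self):
        self._q = Queue()

    def put(self, event):
        self._q.put(event)      # source side of the hop

    def take(self):
        return self._q.get()    # sink side of the hop

def source(lines, channel):
    """Pretend source: pushes log lines into the channel."""
    for line in lines:
        channel.put(line)

def sink(channel, count, store):
    """Pretend sink: drains events from the channel into a 'centralized store'."""
    for _ in range(count):
        store.append(channel.take())

if __name__ == "__main__":
    ch, store = Channel(), []
    logs = ["GET /index.html 200", "GET /missing 404"]
    source(logs, ch)             # hop 1: source -> channel
    sink(ch, len(logs), store)   # hop 2: channel -> sink
    print(store)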

Crunch for Dummies

Filed under: Crunch,Flow-Based Programming (FBP),Hadoop — Patrick Durusau @ 8:21 pm

Crunch for Dummies by Brock Noland

From the post:

This guide is intended to be an introduction to Crunch.

Introduction

Crunch is used for processing data. Crunch builds on top of Apache Hadoop to provide a simpler interface for Java programmers to process data. In Crunch you create pipelines, not unlike Unix pipelines, such as the command below:
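
The excerpt above stops short of the actual command, so as a stand-in here is the same “pipeline” idea expressed as chained Python generators (roughly cat piped into grep piped into a counter). It is purely illustrative and is not Crunch’s API.

from collections import Counter

def read_lines(path):
    # stage 1: emit raw lines (think: cat access.log)
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")

def grep(lines, needle):
    # stage 2: keep only matching lines (think: grep needle)
    return (line for line in lines if needle in line)

def count_field(lines, index):
    # stage 3: tally one field (think: cut | sort | uniq -c)
    return Counter(line.split()[index] for line in lines if line.strip())

if __name__ == "__main__":
    pipeline = count_field(grep(read_lines("access.log"), "404"), 0)
    print(pipeline.most_common(5))

Crunch gives you the same chain-of-stages feel, except the stages compile down to MapReduce jobs over a cluster instead of generators over a local file.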

Interesting coverage of Crunch.

I don’t know that I agree with the characterization:

… using Hadoop …. require[s] learning a complex process called MapReduce or a higher level language such as Apache Hive or Apache Pig.

True, to use Hadoop means learning MapReduce or Hive or Pig, but I don’t think of them as being all that complex. Besides, once you have learned them, the benefits are considerable.

But, to each his own.

You might also be interested in: Introducing Crunch: Easy MapReduce Pipelines for Hadoop.

Maltego

Filed under: Intelligence,Maltego — Patrick Durusau @ 8:20 pm

Maltego

From the website:

What is Maltego?

With the continued growth of your organization, the people and hardware deployed to ensure that it remains in working order is essential, yet the threat picture of your “environment” is not always clear or complete. In fact, most often it’s not what we know that is harmful – it’s what we don’t know that causes the most damage. This being stated, how do you develop a clear profile of what the current deployment of your infrastructure resembles? What are the cutting edge tool platforms designed to offer the granularity essential to understand the complexity of your network, both physical and resource based?

Maltego is a unique platform developed to deliver a clear threat picture to the environment that an organization owns and operates. Maltego’s unique advantage is to demonstrate the complexity and severity of single points of failure as well as trust relationships that exist currently within the scope of your infrastructure.

The unique perspective that Maltego offers to both network and resource based entities is the aggregation of information posted all over the internet – whether it’s the current configuration of a router poised on the edge of your network or the current whereabouts of your Vice President on his international visits, Maltego can locate, aggregate and visualize this information.

Maltego offers the user unprecedented information. Information is leverage. Information is power. Information is Maltego.

What does Maltego do?

  • Maltego is a program that can be used to determine the relationships and real world links between:
    • People
    • Groups of people (social networks)
    • Companies
    • Organizations
    • Web sites
    • Internet infrastructure such as:
      • Domains
      • DNS names
      • Netblocks
      • IP addresses
    • Phrases
    • Affiliations
    • Documents and files
  • These entities are linked using open source intelligence.
  • Maltego is easy and quick to install – it uses Java, so it runs on Windows, Mac and Linux.
  • Maltego provides you with a graphical interface that makes seeing these relationships instant and accurate – making it possible to see hidden connections.
  • Using the graphical user interface (GUI) you can see relationships easily – even if they are three or four degrees of separation away.
  • Maltego is unique because it uses a powerful, flexible framework that makes customizing possible. As such, Maltego can be adapted to your own, unique requirements.

I just encountered this today and have downloaded the community edition client. Have also registered for an account for the client.

More news as it develops.

The Simple Magic of Consistent Hashing

Filed under: Hashing — Patrick Durusau @ 8:18 pm

The Simple Magic of Consistent Hashing by Mathias Meyer.

From the post:

The simplicity of consistent hashing is pretty mind-blowing. Here you have a number of nodes in a cluster of databases, or in a cluster of web caches. How do you figure out where the data for a particular key goes in that cluster?

You apply a hash function to the key. That’s it? Yeah, that’s the whole deal of consistent hashing. It’s in the name, isn’t it?

The same key will always return the same hash code (hopefully), so once you’ve figured out how you spread out a range of keys across the nodes available, you can always find the right node by looking at the hash code for a key.

It’s pretty ingenious, if you ask me. It was cooked up in the lab chambers at Akamai, back in the late nineties. You should go and read the original paper right after we’re done here.

A must read, for a variety of reasons. One of which is to build expandable and robust data structures.

Another is to reach a deeper understanding of hashing, consistent or otherwise.
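
Here is a minimal Python sketch of the trick: put the nodes on a ring by hashing their names, then walk clockwise from a key’s hash to the first node. The node and key names are toy values, and real implementations add virtual nodes and replication; this is just the core idea from the paper.

import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes):
        # place each node on the ring at the position given by its hash
        self._ring = sorted((self._hash(n), n) for n in nodes)
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key):
        # walk clockwise from the key's position to the first node on the ring
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)
        return self._ring[idx][1]

ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
print(ring.node_for("user:42"))   # the same key always lands on the same node

The payoff is what happens when you add or remove a node: only the keys that fall between the new node and its neighbor move, instead of nearly everything as with naive modulo hashing.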

Question: Does consistency mean within a system or across systems?

Redis in Practice: Who’s Online?

Filed under: NoSQL,Redis — Patrick Durusau @ 8:17 pm

Redis in Practice: Who’s Online?

From the post:

Redis is one of the most interesting of the NOSQL solutions. It goes beyond a simple key-value store in that keys’ values can be simple strings, but can also be data structures. Redis currently supports lists, sets and sorted sets. This post provides an example of using Redis’ Set data type in a recent feature I implemented for Weplay, our social youth sports site.

See, having complex key values isn’t all that weird.
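
The “who’s online” pattern in the post boils down to one Redis set per time slice. A sketch with the redis-py client (the key names and expiry are my own choices, not the post’s):

import time
import redis

r = redis.Redis()  # assumes a Redis server on localhost

def mark_online(user_id):
    # one set per minute, e.g. online:2011-12-09-20:17
    slot = time.strftime("online:%Y-%m-%d-%H:%M")
    r.sadd(slot, user_id)
    r.expire(slot, 600)  # let stale slots fall out on their own

def who_is_online(minutes=5):
    # union the last few per-minute sets into one answer
    now = time.time()
    slots = [time.strftime("online:%Y-%m-%d-%H:%M", time.localtime(now - 60 * i))
             for i in range(minutes)]
    return r.sunion(slots)

Intersections work just as well, which is how you answer questions like “which of my friends are online right now?” with a single set operation.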

TinkerPop 2011 Winter release!

Filed under: Blueprints,Frames,Gremlin,Pipes,Rexster — Patrick Durusau @ 8:16 pm

TinkerPop 2011 Winter release!

Which includes:

New homepage design: http://tinkerpop.com

Blueprints 1.1 (Blueberry):
https://github.com/tinkerpop/blueprints/wiki/Release-Notes

Frames 0.6 (Truss):
https://github.com/tinkerpop/frames/wiki/Release-Notes

Gremlin 1.4 (Ain’t No Thing But a Chicken Wing):
https://github.com/tinkerpop/gremlin/wiki/Release-Notes

Pipes 0.9 (Sink):
https://github.com/tinkerpop/pipes/wiki/Release-Notes

Rexster 0.7 (Brian)
https://github.com/tinkerpop/rexster/wiki/Release-Notes

Rexster-Kibbles 0.7
http://rexster-kibbles.tinkerpop.com

You didn’t really want to spend all weekend holiday shopping and hanging out with relatives did you? 😉

Database Indexes for the Inquisitive Mind

Filed under: Database,Indexing — Patrick Durusau @ 8:14 pm

Database Indexes for The Inquisitive Mind by Nuno Job

From the post:

I used to be a developer advocate for an awesome database product called MarkLogic, a NoSQL Document Database for the Enterprise. Now it’s pretty frequent that people ask me about database stuff.

In here I’m going to try to explain some fun stuff you can do with indexes. Not going to talk about implementing them but just about what they solve.

The point here is to help you reason about the choices you have when you are implementing stuff to speed up your applications. I’m sure if you think an idea is smart and fun you’ll research what’s the best algorithm to implement it.

If you are curious about MarkLogic you can always check the Inside MarkLogic white-paper.

Very nice introduction to database indexes. There is more to learn, as much if not more than you would care to. 😉
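
To get a feel for what an index buys you, here is a toy inverted index in Python: the kind of structure that turns “scan every document” into “look up a word, intersect the posting sets.” Nothing MarkLogic-specific, just the general idea behind the post, with made-up documents.

from collections import defaultdict

docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick thinking saves the dog",
}

# build: word -> set of document ids containing it
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

def search(*words):
    # intersect posting sets instead of scanning every document
    sets = [index[w] for w in words]
    return set.intersection(*sets) if sets else set()

print(search("quick", "dog"))   # -> {3}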

December 8, 2011

RDFa 1.1

Filed under: RDFa,Semantic Web — Patrick Durusau @ 8:01 pm

RDFa 1.1

From the draft:

The last couple of years have witnessed a fascinating evolution: while the Web was initially built predominantly for human consumption, web content is increasingly consumed by machines which expect some amount of structured data. Sites have started to identify a page’s title, content type, and preview image to provide appropriate information in a user’s newsfeed when she clicks the “Like” button. Search engines have started to provide richer search results by extracting fine-grained structured details from the Web pages they crawl. In turn, web publishers are producing increasing amounts of structured data within their Web content to improve their standing with search engines.

A key enabling technology behind these developments is the ability to add structured data to HTML pages directly. RDFa (Resource Description Framework in Attributes) is a technique that allows just that: it provides a set of markup attributes to augment the visual information on the Web with machine-readable hints. In this Primer, we show how to express data using RDFa in HTML, and in particular how to mark up existing human-readable Web page content to express machine-readable data.

This document provides only a Primer to RDFa. The complete specification of RDFa, with further examples, can be found in the RDFa 1.1 Core [RDFA-CORE], the XHTML+RDFa 1.1 [XHTML-RDFA], and the HTML5+RDFa 1.1 [HTML-RDFA] specifications.

I am sure this wasn’t an intentional contrast, but compare this release with that of RDFa Lite 1.1.

Which one would you rather teach a room full of newbie (or even experienced) HTML hackers?

Don’t be shy, keep your hands up!

I don’t know that RDFa Lite 1.1 is “lite” enough but I think it is getting closer to a syntax that might actually be used.

RDFa Lite 1.1

Filed under: RDFa,Semantic Web — Patrick Durusau @ 8:00 pm

RDFa Lite 1.1 (new draft)

From the W3C:

One critique of RDFa is that it has too much functionality, leaving first-time authors confused about the more advanced features. RDFa Lite is a minimalist version of RDFa that helps authors easily jump into the structured data world. The goal was to outline a small subset of RDFa that will work for 80% of the Web authors out there doing simple data markup.

Well, it’s short enough.

Comments are being solicited so here’s your chance.

Still using simple identifiers for subjects, which may be sufficient in some cases. Depends. The bad part is that this doesn’t improve as you go up the chain to more complex forms of RDFa/RDF.

BTW, does anyone have a good reference for what it means to have a web of things?

Just curious what is going to be left on the cutting room floor from the Semantic Web and its “web of things?”

Will the Semantic Web be the Advertising Web that pushes content at me, whether I am interested or not?

Multilingual Graph Traversals

Filed under: Gremlin,Groovy,Java,Scala — Patrick Durusau @ 8:00 pm

OK the real title is: JVM Language Implementations. 😉 I like mine better.

From the webpage:

Gremlin is a style of graph traversing that can be hosted in any number of languages. The benefit of this is that users can make use of the programming language they are most comfortable with and still be able to evaluate Gremlin-style traversals. This model is different than, lets say, using SQL in Java where the query is evaluated by passing a string representation of the query to the SQL engine. On the contrary, with native Gremlin support for other JVM languages, there is no string passing. Instead, simple method chaining in Gremlin’s fluent style. However, the drawback of this model is that for each JVM language, there are syntactic variations that must be accounted for.

The examples below demonstrate the same traversal in Groovy, Scala, and Java, respectively.

Seeing is believing.
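
To see the “no string passing, just method chaining” point outside the JVM, here is a toy fluent traversal over an in-memory graph in Python. It is not Gremlin; it only mimics the style, where each step returns an object you can keep chaining on, and the graph data is invented.

class Traversal:
    def __init__(self, graph, frontier):
        self.graph, self.frontier = graph, frontier

    def out(self, label):
        # follow outgoing edges carrying the given label
        nxt = {t for v in self.frontier
                 for (lbl, t) in self.graph.get(v, []) if lbl == label}
        return Traversal(self.graph, nxt)

    def values(self):
        return sorted(self.frontier)

# vertex -> list of (edge label, target vertex)
graph = {
    "marko": [("knows", "josh"), ("knows", "vadas")],
    "josh":  [("created", "lop")],
}

print(Traversal(graph, {"marko"}).out("knows").out("created").values())   # ['lop']

The string-passing alternative would be handing a query text to an engine and getting rows back; the fluent form keeps the traversal inside the host language, which is exactly the trade-off the page describes.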

Cloudera Manager

Filed under: Hadoop — Patrick Durusau @ 7:59 pm

Cloudera Manager: End-to-End Administration for Apache Hadoop

From the post:

Cloudera Manager is the industry’s first end-to-end management application for Apache Hadoop. With Cloudera Manager, you can easily deploy and centrally operate a complete Hadoop stack. The application automates the installation process, reducing deployment time from weeks to minutes; gives you a cluster wide, real time view of nodes and services running; provides a single, central place to enact configuration changes across your cluster; and incorporates a full range of reporting and diagnostic tools to help you optimize cluster performance and utilization.

This looks very cool!

I need to get some shelving for commodity boxes this coming year so I can test this sort of thing. 😉

QL.IO

Filed under: Aggregation,DSL,JSON,SQL — Patrick Durusau @ 7:59 pm

QL.IO – A declarative, data-retrieval and aggregation gateway for quickly consuming HTTP APIs.

From the about page:

A SQL and JSON inspired DSL

SQL is quite a powerful DSL to retrieve, filter, project, and join data — see efforts like A co-Relational Model of Data for Large Shared Data Banks, LINQ, YQL, or unQL for examples.

ql.io combines SQL, JSON, and a few procedural style constructs into a compact language. Scripts written in this language can make HTTP requests to retrieve data, perform joins between API responses, project responses, or even make requests in a loop. But note that ql.io’s scripting language is not SQL – it is SQL inspired.

Orchestration

Most real-world client apps need to mashup data from multiple APIs in one go. Data mashup is often complicated as client apps need to worry about order of requests, inter-dependencies, error handling, and parallelization to reduce overall latency.

ql.io’s scripts are procedural in appearance but are executed out of order based on dependencies. Some statements may be scheduled in parallel and some in series based on a dependency analysis done at script compile time. The compilation is an on-the-fly process.

Consumer Centric Interfaces

APIs are designed for reuse, and hence they cater to the common denominator. Getting new fields added, optimizing responses, or combining multiple requests into one involve drawn out negotiations between API producing teams and API consuming teams.

ql.io lets API consuming teams move fast by creating consumer-centric interfaces that are optimized for the client – such optimized interfaces can reduce bandwidth usage and number of HTTP requests.

I can believe the “SQL inspired” part since it looks like keys/column headers are opaque. That is, you can specify a key/column header but you can’t specify the identity of the subject it represents.

So, if you don’t know the correct term, you are SOL. Which isn’t the state of being inspired.

Still, it looks like an interesting effort that could develop to be non-opaque with regard to keys and possibly values. (The next stage is how do you learn what properties a subject representative has for the purpose of subject recognition.)

Why I’m pretty excited about using Neo4j for a CMDB backend

Filed under: CMDB,Neo4j — Patrick Durusau @ 7:58 pm

Why I’m pretty excited about using Neo4j for a CMDB backend

From the post:

Skybase is my first open source configuration management database (CMDB) effort, but it’s not the first time I’ve built a CMDB. At work a bunch of us built–and continue to build–a proprietary, internal system CMDB called deathBURRITO as part of our deployment automation effort. We built deathBURRITO using Java, Spring, Hibernate and MySQL. deathBURRITO even has a robotic donkey (really) whose purpose we haven’t quite yet identified.

So far deathBURRITO has worked out well for us. Some of its features–namely, those that directly support deployment automation–have proven more useful than others. But the general consensus seems to be that deathBURRITO addresses an important configuration management (CM) gap, where previously we were “managing” CM data on a department wiki, in spreadsheets, in XML files and in Visio diagrams. While there’s more work to do, what we’ve done so far has been reasonably right-headed, and we’ve been able to evolve it as our needs have evolved.

That’s not to say that there’s nothing I would change. I think there’s an opportunity to do something better on the backend. That was indeed the impetus for Skybase.

Nothing hard and fast yet, but the start of a discussion about using Neo4j and Spring Data for a configuration management database.
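
For a concrete sense of what CMDB-as-graph looks like, here is a sketch using the official Neo4j Python driver. The driver postdates this post and has nothing to do with Skybase or Spring Data, and the labels, relationship types and connection details are all invented for illustration.

from neo4j import GraphDatabase

# connection details are placeholders
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def add_deployment(tx, app, host, env):
    # an application RUNS_ON a host, which sits IN an environment
    tx.run(
        "MERGE (a:Application {name: $app}) "
        "MERGE (h:Host {name: $host}) "
        "MERGE (e:Environment {name: $env}) "
        "MERGE (a)-[:RUNS_ON]->(h) "
        "MERGE (h)-[:IN]->(e)",
        app=app, host=host, env=env,
    )

with driver.session() as session:
    session.execute_write(add_deployment, "skybase", "web01", "production")
driver.close()

The appeal for a CMDB is that questions like “what breaks if web01 goes down?” become short path queries rather than joins across half a dozen relational tables.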

When used with information systems, topic maps are the documentation that everyone wishes they had but were too busy to write. Think about it. Everyone intends to document the database tables and their relationships to other tables. But it isn’t all that much fun and it is easy to put off. And put off. And put off. And pretty soon, you are either moving up in the company or moving to another company.

But a topic map isn’t dead documentation in a file drawer or bytes on decaying drive media. A topic map is the equivalent of a CMDB system that keeps your expertise with a prior system alive for as long as the system lives. And it gives you something to point to for management (they always like things you can point to) as a demonstration of ROI for their investment in the infrastructure.

Because topic maps can document not only what you were attempting to talk about (values in your data) but the way you were trying to talk about it (keys/column headers), you stand a much better chance at preserving the usefulness of the data long term.

Nutch Tutorial: Supplemental

Filed under: Nutch,Solr — Patrick Durusau @ 7:58 pm

If you have tried following the Nutch Tutorial you have probably encountered one or more problems getting Nutch to run. I have installed Nutch on an Ubuntu 10.10 system and suggest the following additions or modifications to the instructions you find there, as of 8 December 2011.

Steps 1 and 2, perform as written.

Step 3, at least the first part, has serious issues.

The example:

<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>

is misleading.

The content of <value></value> cannot contain spaces.

Therefore, <value>Patrick Durusau Nutch Spider</value> is wrong and produces:

Fetcher: No agents listed in ‘http.agent.name’ property.

Which then results in a Java exception that kills the process.

If I enter <value>pdurusau</value>, the process continues to the next error in step 3.

Correct instruction:

3. Crawl your first website:

In $HOME/nutch-1.4/conf, open the nutch-site.xml file, which has the following content the first time you open the file:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

</configuration>

Inside the configuration element you will insert:

<property>
<name>http.agent.name</name>
<value>noSpaceName</value>
</property>

Which will result in a nutch-site.xml file that reads:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
<name>http.agent.name</name>
<value>noSpaceName</value>
</property>

</configuration>

Save the nutch-site.xml file and we can move onto the next setup error.

Step 3 next says:

mkdir -p urls

You are already in the conf directory so that is where you create urls, yes?

No! That results in:

Input path does not exist: [$HOME]/nutch-1.4/runtime/local/urls

So you must create the urls directory under [$HOME]/nutch-1.4/runtime/local by:

cd $HOME/nutch-1.4/runtime/local

mkdir urls

Don’t worry about the “-p” option on mkdir. It is an option that allows you to…, well, run man mkdir at a *nix command prompt if you are really interested. It would take more space than I need to spend here to explain it clearly.

The nutch-default.xml file, located under [$HOME]/nutch-1.4/conf/, sets a number of default settings for Nutch. You should not edit this file but copy properties of interest to [$HOME]/nutch-1.4/conf/nutch-site.xml to create settings that override the default settings in nutch-default.xml.

If you look at nutch-default.xml or the rest of the Nutch Tutorial at the Apache site, you may be saying to yourself, but, but…, there are a lot of other settings and possible pitfalls.

Yes, yes that’s true.

I am setting up Nutch straight from the instructions as given, to encounter the same ambiguities that users new to it will encounter. My plan is to use Nutch with Solr (and other search engines as well) to explore data for my blog and to develop information for creating topic maps.

So, I will be covering the minimal set of options we need to get Nutch/Solr up and running but then expanding on other options as we need them.

Will pick up with corrections/suggestions tomorrow on the rest of the tutorial.

Suggestions and/or corrections/expansions are welcome!

PS: On days when I am correcting/editing user guides, expect fewer posts overall.

December 7, 2011

Monitoring Apache Solr

Filed under: Solr — Patrick Durusau @ 8:19 pm

Monitoring Apache Solr

From the post:

Apache Solr is an open source enterprise search service from the Lucene project. Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Tomcat.

Like any service or component in your architecture, you’ll want to monitor it to ensure that it’s available and gather performance data to help with tuning.

In this post, we’ll look at how we can monitor Solr, what performance metrics we might want to gather and how we can easily achieve this with Opsview.

We’ll use Opsview as it is built on Nagios and thus has access to a wide range of plugins, yet provides a more approachable user interface for configuring service checks.

Just in case you need to monitor a Solr service as part of your setup.
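
If you only need a quick availability check before standing up a full monitoring stack, Solr’s ping handler can be polled directly over HTTP. A hedged sketch in Python; the URL assumes a default single-core install, so adjust host, port and core name to your own setup.

import json
import urllib.request

SOLR_PING = "http://localhost:8983/solr/collection1/admin/ping?wt=json"

def solr_is_up(url=SOLR_PING, timeout=5):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            payload = json.load(resp)
        # the ping handler reports "OK" in its status field when healthy
        return payload.get("status") == "OK"
    except Exception:
        return False

if __name__ == "__main__":
    print("Solr up:", solr_is_up())

Opsview/Nagios checks do the same thing with thresholds, alerting and history layered on top, which is the part worth having once the service matters.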

Rattle: A Graphical User Interface for Data Mining using R

Filed under: Data Mining,R,Rattle — Patrick Durusau @ 8:18 pm

Rattle: A Graphical User Interface for Data Mining using R

From the webpage:

Rattle (the R Analytical Tool To Learn Easily) presents statistical and visual summaries of data, transforms data into forms that can be readily modelled, builds both unsupervised and supervised models from the data, presents the performance of models graphically, and scores new datasets.

I think I found Data Mining: Desktop Survival Guide before I located Rattle. Either way, both look like resources you will find useful.

SP-Sem-MRL 2012

Filed under: Conferences,Parsing,Semantics,Statistics — Patrick Durusau @ 8:13 pm

ACL 2012 Joint Workshop on Statistical Parsing and Semantic Processing of Morphologically Rich Languages (SP-Sem-MRL 2012)

Important dates:

Submission deadline: March 31, 2012 (PDT, GMT-8)
Notification to authors: April 21, 2012
Camera ready copy: May 5, 2012
Workshop: TBD, during the ACL 2012 workshop period (July 12-14, 2012)

From the website:

Morphologically Rich Languages (MRLs) are languages in which grammatical relations such as Subject, Predicate, Object, etc., are indicated morphologically (e.g. through inflection) instead of positionally (as in, e.g. English), and the position of words and phrases in the sentence may vary substantially. The tight connection between the morphology of words and the grammatical relations between them, and the looser connection between the position and grouping of words to their syntactic roles, pose serious challenges for syntactic and semantic processing. Furthermore, since grammatical relations provide the interface to compositional semantics, morpho-syntactic phenomena may significantly complicate processing the syntax–semantics interface. In statistical parsing, which has been a cornerstone of research in NLP and had seen great advances due to the widespread availability of syntactically annotated corpora, English parsing performance has reached a high plateau in certain genres, which is however not always indicative of parsing performance in MRLs, dependency-based and constituency-based alike . Semantic processing of natural language has similarly seen much progress in recent years. However, as in parsing, the bulk of the work has concentrated on English, and MRLs may present processing challenges that the community is as of yet unaware of, and which current semantic processing technologies may have difficulty coping with. These challenges may lurk in areas where parses may be used as input, such as semantic role labeling, distributional semantics, paraphrasing and textual entailments, or where inadequate pre-processing of morphological variation hurts parsing and semantic tasks alike.

This joint workshop aims to build upon the first and second SPMRL workshops (at NAACL-HLT 2010 and IWPT 2011, respectively) while extending the overall scope to include semantic processing where MRLs pose challenges for algorithms or models initially designed to process English. In particular, we seek to explore the use of newly available syntactically and/or semantically annotated corpora, or data sets for semantic evaluation that can contribute to our understanding of the difficulty that such phenomena pose. One goal of this workshop is to encourage cross-fertilization among researchers working on different languages and among those working on different levels of processing. Of particular interest is work addressing the lexical sparseness and out-of-vocabulary (OOV) issues that occur in both syntactic and semantic processing.

The exploration of non-English languages will replicate many of the outstanding entity recognition/data integration problems experienced in English. Considering that there are massive economic markets that speak non-English languages, the first to make progress on such issues will have a commercial advantage. How much of one I suspect depends on how well your software works in a non-English language.

DIM 2012 : IEEE International Workshop on Data Integration and Mining

Filed under: Conferences,Data Integration,Data Mining — Patrick Durusau @ 8:12 pm

DIM 2012 : IEEE International Workshop on Data Integration and Mining

Important Dates:

When Aug 8, 2012 – Aug 10, 2012
Where Las Vegas, Nevada, USA
Submission Deadline Mar 31, 2012
Notification Due Apr 30, 2012
Final Version Due May 14, 2012

From the website:

Given the emerging global Information-centric IT landscape that has tremendous social and economic implications, effectively processing and integrating humungous volumes of information from diverse sources to enable effective decision making and knowledge generation have become one of the most significant challenges of current times. Information Reuse and Integration (IRI) seeks to maximize the reuse of information by creating simple, rich, and reusable knowledge representations and consequently explores strategies for integrating this knowledge into systems and applications. IRI plays a pivotal role in the capture, representation, maintenance, integration, validation, and extrapolation of information; and applies both information and knowledge for enhancing decision-making in various application domains.

This conference explores three major tracks: information reuse, information integration, and reusable systems. Information reuse explores theory and practice of optimizing representation; information integration focuses on innovative strategies and algorithms for applying integration approaches in novel domains; and reusable systems focus on developing and deploying models and corresponding processes that enable Information Reuse and Integration to play a pivotal role in enhancing decision-making processes in various application domains.

The IEEE IRI conference serves as a forum for researchers and practitioners from academia, industry, and government to present, discuss, and exchange ideas that address real-world problems with real-world solutions. Theoretical and applied papers are both included. The conference program will include special sessions, open forum workshops, panels and keynote speeches.

Note the emphasis on integration. In topic maps we would call that merging.

I think that bodes well for the future of topic maps. Provided that we “steal a march,” so to speak.

We have spent years, decades for some of us, thinking about data integration issues. Let’s not hide our bright lights under a basket.

YaCy Search Engine

Filed under: Search Engines,Webcrawler — Patrick Durusau @ 8:11 pm

YaCy – Decentralized Web Search

Has anyone seen this?

From the homepage:

YaCy is a free search engine that anyone can use to build a search portal for their intranet or to help search the public internet. When contributing to the world-wide peer network, the scale of YaCy is limited only by the number of users in the world and can index billions of web pages. It is fully decentralized, all users of the search engine network are equal, the network does not store user search requests and it is not possible for anyone to censor the content of the shared index. We want to achieve freedom of information through a free, distributed web search which is powered by the world’s users.

Limited demo interface: http://search.yacy.net/

Interesting idea.

It would be more interesting if it used a language that permitted dynamic updating of software while it is running. Otherwise, you are going to have the YaCy search engine you installed and nothing more.

Reportedly Google improves its search algorithm many times every quarter. How many of those changes are ad-driven they don’t say.

The documentation for YaCy is slim at best, particularly on technical details. For example, it uses a NoSQL database. OK, a custom one or one of the standard ones? I could go on but it would not produce any answers. As I explore the software I will post what I find out about it.

Distributed User Interfaces: Collaboration and Usability

Filed under: Conferences,Interface Research/Design,Users — Patrick Durusau @ 8:10 pm

2nd Workshop on Distributed User Interfaces: Collaboration and Usability (CHI 2012 Workshop)

Important Dates:

  • Submission Deadline: January 13th, 2012
  • Author Notification: February 10th, 2012
  • Camera-Ready Deadline: April 1st, 2012
  • Workshop Date: May 5th or 6th, 2012 (to be confirmed)

From the website:

Distributed User Interfaces (DUIs) have recently become a new field of research and development in Human-Computer Interaction (HCI). The DUIs have brought about drastic changes affecting the way interactive systems are conceived. DUIs have gone beyond the fact that user interfaces are controlled by a single end user on the same computing platform in the same environment.

Traditional interaction is focused on the use of mobile devices such as, smartphones, tablets, laptops, and so on, tearing apart other environmental interaction resources such as large screens and multi-tactile displays, or tables. Under a collaborative scenario, users sharing common goals may take advantage of DUIs to carry out their tasks because they provide a shared environment where they are allowed to manipulate information in the same space at the same time. Under this hypothesis, collaborative DUIs scenarios open new challenges to usability evaluation techniques and methods.

Thus, the goal of this workshop is to promote the discussion about the emerging topic of DUIs, answering a set of key questions: how collaboration can be improved by using DUI? , in which situations a DUI is suitable to ease the collaboration among users?, how usability standards can be employed to evaluate the usability of systems based on DUIs?

Topics of Interest:

  • Conceptual models for DUIs
  • DUIs on ubiquitous environments
  • Distributed User Interface design
  • Public display interaction and DUIs
  • DUIs and coupled displays
  • DUIs and ambient intelligence
  • Human factors in DUIs design
  • Collaboration and DUIs
  • Usability evaluation in DUIs
  • DUIs on learning environment

If you aren’t already dealing with distributed topic map interfaces and collaboration issues, you will be.

USEWOD 2012 Data Challenge

Filed under: Contest,Linked Data,RDF,Semantic Web,Semantics — Patrick Durusau @ 8:08 pm

USEWOD 2012 Data Challenge

From the website:

The USEWOD 2012 Data Challenge invites research and applications built on the basis of USEWOD 2012 Dataset.

Accepted submissions will be presented at USEWOD2012, where a winner will be chosen. Examples of analyses and research that could be done with the dataset are the following (but not limited to those):

  • correlations between linked data requests and real-world events
  • types of structured queries
  • linked data access vs. conventional access
  • analysis of user agents visiting the sites
  • geographical analysis of requests
  • detection and visualisation of trends
  • correlations between site traffic and available datasets
  • etc. – let your imagination run wild!

USEWOD 2012 Dataset

The USEWOD dataset consists of server logs from two major web servers publishing datasets on the Web of linked data. In particular, the dataset contains logs from:

  • DBPedia: slices of log data spanning several months from the linked data twin of Wikipedia, one of the focal points of the Web of data. The logs were kindly made available to us for the challenge by OpenLink Software! Further details about this part of the dataset to follow.
  • SWDF: Semantic Web Dog Food is a constantly growing dataset of publications, people and organisations in the Web and Semantic Web area, covering several of the major conferences and workshops, including WWW, ISWC and ESWC. The logs contain two years of requests to the server from about 12/2008 until 12/2010.
  • Linked Open Geo Data: a dataset about geographical data.
  • Bio2RDF: Linked Data for life sciences.

Data sets are still under construction. Organizers advise that data sets should be available next week.

Your results should be reported as short papers and are due by 15 February 2012.

USEWOD 2012 : USAGE ANALYSIS AND THE WEB OF DATA

Filed under: Conferences,Semantic Web,Semantics — Patrick Durusau @ 8:06 pm

USEWOD 2012 : USAGE ANALYSIS AND THE WEB OF DATA

Location: Lyon, France (co-located with WWW2012)

Dates:

Release of Dataset for the USEWOD2012 Challenge: 15 December 2011
Paper submission deadline: 15 February 2012
Acceptance notification: 3 March 2012
Workshop and Prize for USEWOD Challenge: 16 or 17 April 2012

From the webpage:

The purpose of this workshop is to investigate new developments concerning the synergy between semantics and semantic-web technology on the one hand, and the analysis and mining of usage data on the other hand. As the first USEWOD workshop at WWW 2011 has shown, these two fields complement each other well. First, semantics can be used to enhance the analysis of usage data. Second, usage data analysis can enhance semantic resources as well as Semantic Web applications. Traces of users can be used to evaluate, adapt or personalise Semantic Web applications and logs can form valuable resources from which semantic knowledge can be extracted bottom-up.

The emerging Web of Data demands a re-evaluation of existing evaluation techniques: the Linked Data community is recognising that it needs to move beyond triple counts. Usage analysis is a key method for the evaluation of a datasets and applications. New ways of accessing information enabled by the Web of Data requires the development or adaptation of algorithms, methods, and techniques to analyse and interpret the usage of Web data instead of Web pages, a research endeavour that can profit from what has been learned in more than a decade of Web usage mining. The results can provide fine-grained insights into how semantic datasets and applications are being accessed and used by both humans and machines – insights that are needed for optimising the design and ultimately ensuring the success of semantic resources.

The primary goals of this workshop are to foster the emerging community of researchers from various fields sharing an interest in usage mining and semantics, to evaluate the developments of the past year, and to further develop a roadmap for future research in this direction.

Dr. Watson?

I got up thinking that there needs to be a project for automated authoring of a topic map and the name Dr. Watson suddenly occurred to me. After all, Dr. Watson was Sherlock Holmes’ sidekick, so it would not be like saying it could stand on its own. Plus there would be some name recognition and/or confusion with the real, or rather imaginary, Dr. Watson of Sherlock Holmes’ fame.

And there would be confusion with the Dr. Watson that is the internal debugger for Windows (MS, I never can remember if the ™ goes on Windows or MS. Not that anyone else would want to call themselves MS. 😉 ) Plus the Watson research center at IBM.

Well, I suspect being an automated, probabilistic topic map authoring system will be enough to distinguish it from the foregoing uses.

Any others that I should be aware of?

I say probabilistic because even with the TMDM’s string matching on URIs, it is only probable that two or more topics actually represent the same topic. It is always possible that a URI has been incorrectly used to identify the subject that a topic represents. And in such cases, the error perpetuates itself across a topic map.

So we start off with the realization that even string matching results in a probability of less than 1.0 (where 1.0 is absolute certainty) that two or more topics represent the same subject.

Since we are already being probabilistic, why not be openly so?
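
To make “openly probabilistic” concrete, here is a toy Python sketch: propose merges for topics that share a subject identifier, but attach a confidence below 1.0 instead of treating the match as certain. The structure and the numbers are invented purely to illustrate the idea, not a description of any existing topic map engine.

def propose_merges(topics, uri_confidence=0.95):
    """topics: dict of topic id -> set of subject identifier URIs.
    Returns (topic_a, topic_b, confidence) proposals instead of hard merges."""
    proposals = []
    ids = sorted(topics)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            shared = topics[a] & topics[b]
            if shared:
                # more shared identifiers -> more confidence, but never certainty
                confidence = 1 - (1 - uri_confidence) ** len(shared)
                proposals.append((a, b, round(confidence, 3)))
    return proposals

topics = {
    "t1": {"http://example.org/id/sherlock-holmes"},
    "t2": {"http://example.org/id/sherlock-holmes", "http://example.org/id/221b"},
    "t3": {"http://example.org/id/dr-watson"},
}
print(propose_merges(topics))   # [('t1', 't2', 0.95)]

A human (or Dr. Watson) then decides which proposals to accept, which is where the “sidekick” framing earns its keep.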

But, before we get into the weeds and details, the project has to have a cool name. (As in not an acronym that is cool and we make up a long name to fit the acronym.)

All those in favor of Dr. Watson, please signify by raising your hands (or the beer you are holding).

More to follow.

Information Field Theory

Filed under: Bayesian Models,Information Field Theory — Patrick Durusau @ 4:14 pm

Information Field Theory

May be something, may be nothing.

I saw a news flash about the use of this technique to combine 41,000 observations to create a magnetic map of the Milky Way. Subject to a lot of noise and smoothing of the data.

Which made me think that perhaps, just perhaps this technique could be used across a semantic field?

From the webpage:

Information field theory (IFT) is information theory, the logic of reasoning under uncertainty, applied to fields. A field can be any quantity defined over some space, e.g. the air temperature over Europe, the magnetic field strength in the Milky Way, or the matter density in the Universe. IFT describes how data and knowledge can be used to infer field properties. Mathematically it is a statistical field theory and exploits many of the tools developed for such. Practically, it is a framework for signal processing and image reconstruction.
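
For the mathematically inclined, the “free” case of IFT (Gaussian prior, linear measurement, Gaussian noise) reduces to a generalized Wiener filter. The sketch below is my gloss on the standard presentation, not a quotation from the site.

% Bayes' theorem applied to a field s given data d:
%   P(s|d) \propto P(d|s) P(s)
% With a linear measurement d = R s + n, prior covariance S and noise
% covariance N, the posterior mean m of the field is the Wiener filter:
\begin{align}
  m &= D\, j, &
  D &= \left(S^{-1} + R^{\dagger} N^{-1} R\right)^{-1}, &
  j &= R^{\dagger} N^{-1} d .
\end{align}

Everything interesting (and hard) happens once the relationship between signal and data stops being linear or Gaussian, which is where the field-theory machinery comes in.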

All the examples I found were in the physical sciences but I would check closely before claiming to be the first to use the technique in a social science context.

December 6, 2011

White House to open source Data.gov as open government data platform

Filed under: eGov,Government Data,Open Source — Patrick Durusau @ 8:10 pm

White House to open source Data.gov as open government data platform by Alex Howard.

From the post:

As 2011 comes to an end, there are 28 international open data platforms in the open government community. By the end of 2012, code from new “Data.gov-in-a-box” may help many more countries to stand up their own platforms. A partnership between the United States and India on open government has borne fruit: progress on making the open data platform Data.gov open source.

In a post this morning at the WhiteHouse.gov blog, federal CIO Steven VanRoekel (@StevenVDC) and federal CTO Aneesh Chopra (@AneeshChopra) explained more about how Data.gov is going global:

As part of a joint effort by the United States and India to build an open government platform, the U.S. team has deposited open source code — an important benchmark in developing the Open Government Platform that will enable governments around the world to stand up their own open government data sites.

The development is evidence that the U.S. and India are indeed still collaborating on open government together, despite India’s withdrawal from the historic Open Government Partnership (OGP) that launched in September. Chopra and VanRoekel explicitly connected the move to open source Data.gov to the U.S. involvement in the Open Government Partnership today. While we’ll need to see more code and adoption to draw substantive conclusions on the outcomes of this part of the plan, this is clearly progress.

The U.S. National Action Plan on Open Government, which represents the U.S. commitment to the OGP, included some details about this initiative two months ago, building upon a State Department fact sheet that was released in July. Back in August, representatives from India’s National Informatics Center visited the United States for a week-long session of knowledge sharing with the U.S. Data.gov team, which is housed within the General Services Administration.

“The secretary of state and president have both spent time in India over the past 18 months,” said VanRoekel in an interview today. “There was a lot of dialogue about the power of open data to shine light upon what’s happening in the world.”

The project, which was described then as “Data.gov-in-a-box,” will include components of the Data.gov open data platform and the India.gov.in document portal. Now, the product is being called the “Open Government Platform” — not exactly creative, but quite descriptive and evocative of open government platforms that have been launched to date. The first collection of open source code, which describes a data management system, is now up on GitHub.

During the August meetings, “we agreed upon a set of things we would do around creating excellence around an open data platform,” said VanRoekel. “We owned the first deliverable: a dataset management tool. That’s the foundation of an open source data platform. It handles workflow, security and the check in of data — all of the work that goes around getting the state data needs to be in before it goes online. India owns the next phase: the presentation layer.”

If the initiative bears fruit in 2012, as planned, the international open government data movement will have a new tool to apply toward open data platforms. That could be particularly relevant to countries in the developing world, given the limited resources available to many governments.

What’s next for open government data in the United States has yet to be written. “The evolution of data.gov should be one that does things to connect to web services or an API key manager,” said VanRoekel. “We need to track usage. We’re going to double down on the things that are proving useful.”

Interests which already hold indexes of government documents should find numerous opportunities for mapping into agency data as governments stand up open government platforms.

Lexical Analysis with Flex

Filed under: Flex,Lexical Analyzer — Patrick Durusau @ 8:09 pm

Lexical Analysis with Flex

From the introduction:

flex is a tool for generating scanners. A scanner is a program which recognizes lexical patterns in text. The flex program reads the given input files, or its standard input if no file names are given, for a description of a scanner to generate. The description is in the form of pairs of regular expressions and C code, called rules. flex generates as output a C source file, lex.yy.c by default, which defines a routine yylex(). This file can be compiled and linked with the flex runtime library to produce an executable. When the executable is run, it analyzes its input for occurrences of the regular expressions. Whenever it finds one, it executes the corresponding C code.

For when you have serious scanning tasks.

Lecture Fox

Filed under: CS Lectures,Mathematics — Patrick Durusau @ 8:07 pm

Lecture Fox

A nice collection of links to university lectures.

Has separate pages on computer science and math, but also physics and chemistry. The homepage is a varied collection of those subjects and others.

Good to see someone collecting links for lectures beyond the usual ones.

Trivia from one of the CS lectures: What language was started by the U.S. DoD in the mid to late 1970’s to consolidate more than 500 existing languages and dialects?

Try to answer before peeking! See Computer Science 164, Spring 2011, Berkeley. Btw, that is also where you will find the materials for Computer Science 164.

Common Lisp is the best language to learn programming

Filed under: Authoring Topic Maps,Lisp,Programming,Topic Maps — Patrick Durusau @ 8:06 pm

Common Lisp is the best language to learn programming

From the post:

Now that Conrad Barski’s Land of Lisp (see my review on Slashdot) has come out, I definitely think Common Lisp is the best language for kids (or anyone else) to start learning computer programming.

Not trying to start a language war but am curious about two resources cited in this post:

Common Lisp HyperSpec

and,

Common Lisp the language, 2nd edition

My curiosity?

How would you map these two resources into a single topic map on Lisp?

Is there any third resource, perhaps the “Land of Lisp” that you would like to add?

Any blogs, mailing list posts, etc.?

Would that topic map be any different if you decided to add Scheme or Haskell to your topic map?

If this were a “learning lisp” resource for beginning programmers, how would you limit the amount of information exposed?
