Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

September 13, 2011

Electronic Discovery Reference Model

Filed under: e-Discovery,Legal Informatics — Patrick Durusau @ 7:14 pm

Electronic Discovery Reference Model (EDRM)

From the webpage:

EDRM develops guidelines, sets standards and delivers resources to help e-discovery consumers and providers improve quality and reduce costs associated with e-discovery

EDRM consists of 9 projects, each designed to help reach those goals:

Data Set, Evergreen, IGRM (Information Governance Reference Model), Jobs, Metrics, Model Code of Conduct, Search, Testing, XML.

Definitely on your radar if you are working on topic maps and legal discovery.

I will be returning to the projects to treat them individually. The “Data Set” project alone may take longer than my usual post to simply summarize.

Hadoop Tuesdays!

Filed under: Hadoop — Patrick Durusau @ 7:13 pm

Hadoop Tuesdays with Joe McKendrick: 7-Part Live Webinar Series

I know, I know, the “registration” form is fairly lame. I was tempted to put down “hospitality/travel” as my industry just to see if I started getting free trip offers or something. 😉

Can’t say for sure until I have attended the sessions, but this doesn’t look like a deeply technical set of webinars. Still, it is worth knowing the “cliff notes” version of the Hadoop story circulating in government and business circles.

From the webpage:

Data experts, Informatica and Cloudera are jointly producing a 7-part webinar series, called Hadoop Tuesdays. Host Joe McKendrick is an independent industry analyst and contributing editor to Database Trends & Applications.

Why register for Hadoop Tuesdays webinar series?

  • You are interested in learning more about Hadoop but don’t know how to get started
  • You have goals for storing, processing and extracting value from unstructured data so that you can combine and unleash the value of both structured and unstructured data
  • You wish to form a roadmap for adding Hadoop to your environment.

What will be covered in the Hadoop Tuesdays webinar series?

  • What is Hadoop?
  • What are the most popular use cases driving Hadoop adoption?
  • What are the biggest expected benefits?
  • How should you evaluate Hadoop and fit it into your information architecture?

Analyzing Apache Logs with Riak

Filed under: Riak — Patrick Durusau @ 7:12 pm

Analyzing Apache Logs with Riak

From the post:

This article will show you how to do some Apache log analysis using Riak and MapReduce. Specifically it will give an example of how to extract URLs from Apache logs stored in Riak (the map phase) and provide a count of how many times each URL was requested (the reduce phase).

First steps with Riak and MapReduce.
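To make the map/reduce split concrete, here is a rough sketch (mine, not the post’s code) of submitting such a job to Riak’s HTTP MapReduce endpoint from Python. The bucket name, the log layout (one Apache log file per object, request path in the seventh whitespace-separated field) and the local node URL are all assumptions.

import json
import requests

RIAK_MAPRED_URL = "http://localhost:8098/mapred"   # assumed local Riak node

# JavaScript map phase: split one stored log file into lines and emit the request path.
MAP_JS = """
function(value) {
  var lines = value.values[0].data.split("\\n");
  var urls = [];
  for (var i = 0; i < lines.length; i++) {
    var parts = lines[i].split(" ");
    if (parts.length > 6) { urls.push(parts[6]); }
  }
  return urls;
}
"""

# JavaScript reduce phase: count URLs; handles re-reduce over partial count objects.
REDUCE_JS = """
function(values) {
  var counts = {};
  for (var i = 0; i < values.length; i++) {
    var v = values[i];
    if (typeof v === "string") { counts[v] = (counts[v] || 0) + 1; }
    else { for (var k in v) { counts[k] = (counts[k] || 0) + v[k]; } }
  }
  return [counts];
}
"""

job = {
    "inputs": "apache_logs",   # hypothetical bucket holding the log files
    "query": [
        {"map":    {"language": "javascript", "source": MAP_JS}},
        {"reduce": {"language": "javascript", "source": REDUCE_JS}},
    ],
}

resp = requests.post(RIAK_MAPRED_URL, data=json.dumps(job),
                     headers={"Content-Type": "application/json"})
print(resp.json())   # e.g. [{"/index.html": 42, "/about.html": 7}]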

The Science of Timing

Filed under: Marketing,Topic Maps — Patrick Durusau @ 7:12 pm

The Science of Timing

Webinar on when to tweet, blog, update Facebook, etc.

Just like other aspects of marketing there are patterns that are more successful than others.

Successful marketing of topic maps, whoever gets the contracts, is something I would really like to see.

What is it they say? A rising tide lifts all boats.

Reference Interview

Filed under: Librarian/Expert Searchers,Library — Patrick Durusau @ 7:11 pm

Reference Interview by Jimmy Ghaphery.

More for library students in my topic maps class but others may benefit as well.

The questions that librarians ask in a “reference interview” are the same sort of questions that will help you identify subjects that should go into a particular topic map.

Your users may not realize there are subjects that will make finding other subjects easier in the future, or that are used by other users in the same department, etc.

September 12, 2011

SAGA: A DSL for Story Management

Filed under: DSL,TMDM,TMRM,Topic Maps,Vocabularies — Patrick Durusau @ 8:29 pm

SAGA: A DSL for Story Management by Lucas Beyak and Jacques Carette (McMaster University).

Abstract:

Video game development is currently a very labour-intensive endeavour. Furthermore it involves multi-disciplinary teams of artistic content creators and programmers, whose typical working patterns are not easily meshed. SAGA is our first effort at augmenting the productivity of such teams.

Already convinced of the benefits of DSLs, we set out to analyze the domains present in games in order to find out which would be most amenable to the DSL approach. Based on previous work, we thus sought those sub-parts that already had a partially established vocabulary and at the same time could be well modeled using classical computer science structures. We settled on the ‘story’ aspect of video games as the best candidate domain, which can be modeled using state transition systems.

As we are working with a specific company as the ultimate customer for this work, an additional requirement was that our DSL should produce code that can be used within a pre-existing framework. We developed a full system (SAGA) comprised of a parser for a human-friendly language for ‘story events’, an internal representation of design patterns for implementing object-oriented state-transitions systems, an instantiator for these patterns for a specific ‘story’, and three renderers (for C++, C# and Java) for the instantiated abstract code.
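The “state transition system” view of story events is easy to illustrate. The sketch below is not SAGA’s language or generated code, just a minimal Python analogue of a story modeled as states and player-action transitions; the states and actions are made up for illustration.

from dataclasses import dataclass, field

@dataclass
class Story:
    state: str
    transitions: dict = field(default_factory=dict)   # (state, action) -> next state

    def add(self, state, action, next_state):
        self.transitions[(state, action)] = next_state

    def fire(self, action):
        # stay in the current state if the action is not defined there
        self.state = self.transitions.get((self.state, action), self.state)
        return self.state

story = Story(state="village")
story.add("village", "talk_to_elder", "quest_given")
story.add("quest_given", "find_amulet", "quest_done")

print(story.fire("talk_to_elder"))   # quest_given
print(story.fire("find_amulet"))     # quest_done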

I mention this only in part because of Jack Park’s long standing interest in narrative structures.

The other reason I mention this article is that it is a model for how to transition between vocabularies in a useful way.

Transitioning between vocabularies is nearly as constant a theme in computer science as data storage. Not to mention that disciplines, domains, professions, etc., have been transitioning between vocabularies for thousands of years. Some more slowly than others; some terms in legal vocabularies date back centuries.

We need vocabularies and data structures, but with the realization that none of them are final. If you want blind interchange of topic maps, I would strongly suggest that you use one of the standard syntaxes.

But with the realization that you will encounter data that isn’t in a standard topic map syntax. What subjects are represented there? How would you tell others about them? And those vocabularies are going to change over time, just as there were vocabularies before RDF and topic maps.

If you ask an honest MDM (master data management) advocate, they will tell you that the current MDM effort is not really all that different from MDM in the ’90s. And MDM may be what you need; that depends on your requirements.

The point being that there isn’t any place where a particular vocabulary or “solution” is going to freeze the creativity of users and even programmers, to say nothing of the rest of humanity. Change is the only constant, and those who aren’t prepared to deal with it will be the worse off for it.

QUDT – Quantities, Units, Dimensions and Data Types in OWL and XML

Filed under: Data Types,Dimensions,Ontology,OWL,Quantities,Units — Patrick Durusau @ 8:29 pm

QUDT – Quantities, Units, Dimensions and Data Types in OWL and XML

From background:

The QUDT Ontologies, and derived XML Vocabularies, are being developed by TopQuadrant and NASA. Originally, they were developed for the NASA Exploration Initiatives Ontology Models (NExIOM) project, a Constellation Program initiative at the AMES Research Center (ARC). The goals of the QUDT ontology are twofold:

  • to provide a unified model of measurable quantities, units for measuring different kinds of quantities, the numerical values of quantities in different units of measure, and the data structures and data types used to store and manipulate these objects in software;
  • to populate the model with the instance data (quantities, units, quantity values, etc.) required to meet the life-cycle needs of the Constellation Program engineering community.

If you are looking for measurements, this would be one place to start.

A Variational HEM Algorithm for Clustering Hidden Markov Models

Filed under: Clustering,Hidden Markov Model — Patrick Durusau @ 8:27 pm

A Variational HEM Algorithm for Clustering Hidden Markov Models by: Emanuele Coviello, Antoni B. Chan, and Gert R.G. Lanckriet.

Abstract:

The hidden Markov model (HMM) is a generative model that treats sequential data under the assumption that each observation is conditioned on the state of a discrete hidden variable that evolves in time as a Markov chain. In this paper, we derive a novel algorithm to cluster HMMs through their probability distributions. We propose a hierarchical EM algorithm that i) clusters a given collection of HMMs into groups of HMMs that are similar, in terms of the distributions they represent, and ii) characterizes each group by a “cluster center”, i.e., a novel HMM that is representative for the group. We present several empirical studies that illustrate the benefits of the proposed algorithm.

Warning: Heavy sledding, but the examples of improved hierarchical motion clustering, music tagging, and online handwriting recognition are quite compelling.

Linking linked data to U.S. law

Filed under: Law - Sources,Linked Data — Patrick Durusau @ 8:27 pm

Linking linked data to U.S. law

Bob DuCharme does an excellent job of covering resources that will help you create meaningful links to US court decisions, laws and regulations.

That will be useful for readers/researchers but I can’t shake the feeling that it is very impoverished linking.

You can link out to a court decision, law or regulation but you can’t say why, in any computer processable way, a link is being made.

Even worse, if I start from the court decision, law or regulation, all I can search for are invocations of that court decision, law or regulation, but I won’t know why it was being invoked.

There are specialized resources in the legal community (Shepard’s Citations) that alter that result, but the need for a general solution for more robust linking remains.

e-Discovery Zone

Filed under: e-Discovery,Legal Informatics — Patrick Durusau @ 8:26 pm

e-Discovery Zone

Vendor sponsored site but looks like a fairly rich collection of links to e-discovery (law/legal) materials.

Apache Camel

Filed under: Data Analysis,Data Engine,Data Integration — Patrick Durusau @ 8:25 pm

Apache Camel

New release as of 25 July 2011.

The Apache Camel site self describes as:

Apache Camel is a powerful open source integration framework based on known Enterprise Integration Patterns with powerful Bean Integration.

Camel lets you create the Enterprise Integration Patterns to implement routing and mediation rules in either a Java based Domain Specific Language (or Fluent API), via Spring based Xml Configuration files or via the Scala DSL. This means you get smart completion of routing rules in your IDE whether in your Java, Scala or XML editor.

Apache Camel uses URIs so that it can easily work directly with any kind of Transport or messaging model such as HTTP, ActiveMQ, JMS, JBI, SCA, MINA or CXF Bus API together with working with pluggable Data Format options. Apache Camel is a small library which has minimal dependencies for easy embedding in any Java application. Apache Camel lets you work with the same API regardless which kind of Transport used, so learn the API once and you will be able to interact with all the Components that is provided out-of-the-box.

Apache Camel has powerful Bean Binding and integrated seamless with popular frameworks such as Spring and Guice.

Apache Camel has extensive Testing support allowing you to easily unit test your routes.


….

So don’t get the hump, try Camel today! 🙂

Comments/suggestions?

I am going to be working through some of the tutorials and other documentation. Anything I should be looking for?

LinkedGeoData Release 2

LinkedGeoData Release 2

From the webpage:

The aim of the LinkedGeoData (LGD) project is to make the OpenStreetMap (OSM) datasets easily available as RDF. As such the main target audience is the Semantic Web community, however it may turn out to be useful to a much larger audience. Additionally, we are providing interlinking with DBpedia and GeoNames and integration of class labels from translatewiki and icons from the Brian Quinion Icon Collection.

The result is a rich, open, and integrated dataset which we hope to be useful for research and application development. The datasets can be publicly accessed via downloads, Linked Data, and SPARQL-endpoints. We have also launched an experimental “Live-SPARQL-endpoint” that is synchronized with the minutely updates from OSM whereas the changes to our store are republished as RDF.

More geographic data.

September 11, 2011

Solr architecture diagram

Filed under: Graphics,Solr,Visualization — Patrick Durusau @ 7:17 pm

Solr architecture diagram

From the post:

We at Cominvent have often had the need to visualize the internal architecture of Apache Solr in order to explain both the relationships of the components as well as the flow of data and queries. The result is this conceptual architecture diagram, clearly showing how Solr relates to the app-server, how cores relate to a Solr instance, how documents enter through an UpdateRequestHandler, through an UpdateChain and Analysis and into the Lucene index etc.

Very useful if you are using (or plan to use) Solr in connection with your topic mapping.

Also a reminder that graphic presentation can, though it doesn’t always, force you to be clear about structure and relationships.

Solr and Autocomplete

Filed under: Interface Research/Design,Solr — Patrick Durusau @ 7:16 pm

Rafał Kuć has a multi-part series on Solr and its autocomplete feature:

Solr and autocomplete (part 1)

Solr and autocomplete (part 2)

Solr and autocomplete (part 3)

Part 4 to appear.

Autocompletion is common enough that I suspect it has or will shortly become a user expectation for search interfaces.

Customizing the dictionary (in part 3) will help you provide better guidance to users than seen in general search engines.
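As a taste of what the series covers, here is a hedged sketch of one common autocomplete approach: querying Solr’s TermsComponent for terms matching a prefix. The /terms handler, the field name and the core URL are assumptions based on the stock example configuration, not Rafał’s exact setup.

import requests

SOLR_TERMS_URL = "http://localhost:8983/solr/terms"   # assumed /terms handler from the example config

def suggest(prefix, field="title", limit=10):
    params = {
        "terms.fl": field,        # field to pull completions from (assumed name)
        "terms.prefix": prefix,
        "terms.limit": limit,
        "wt": "json",
    }
    resp = requests.get(SOLR_TERMS_URL, params=params).json()
    flat = resp["terms"][field]                    # alternating term, count entries
    return list(zip(flat[::2], flat[1::2]))

print(suggest("hado"))   # e.g. [("hadoop", 12), ...]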

New Challenges in Distributed Information Filtering and Retrieval

New Challenges in Distributed Information Filtering and Retrieval

Proceedings of the 5th International Workshop on New Challenges in Distributed Information Filtering and Retrieval
Palermo, Italy, September 17, 2011.

Edited by:

Cristian Lai – CRS4, Loc. Piscina Manna, Building 1 – 09010 Pula (CA), Italy

Giovanni Semeraro – Dept. of Computer Science, University of Bari, Aldo Moro, Via E. Orabona, 4, 70125 Bari, Italy

Eloisa Vargiu – Dept. of Electrical and Electronic Engineering, University of Cagliari, Piazza d’Armi, 09123 Cagliari, Italy

Table of Contents:

  1. Experimenting Text Summarization on Multimodal Aggregation
    Giuliano Armano, Alessandro Giuliani, Alberto Messina, Maurizio Montagnuolo, Eloisa Vargiu
  2. From Tags to Emotions: Ontology-driven Sentimental Analysis in the Social Semantic Web
    Matteo Baldoni, Cristina Baroglio, Viviana Patti, Paolo Rena
  3. A Multi-Agent Decision Support System for Dynamic Supply Chain Organization
    Luca Greco, Liliana Lo Presti, Agnese Augello, Giuseppe Lo Re, Marco La Cascia, Salvatore Gaglio
  4. A Formalism for Temporal Annotation and Reasoning of Complex Events in Natural Language
    Francesco Mele, Antonio Sorgente
  5. Interaction Mining: the new Frontier of Call Center Analytics
    Vincenzo Pallotta, Rodolfo Delmonte, Lammert Vrieling, David Walker
  6. Context-Aware Recommender Systems: A Comparison Of Three Approaches
    Umberto Panniello, Michele Gorgoglione
  7. A Multi-Agent System for Information Semantic Sharing
    Agostino Poggi, Michele Tomaiuolo
  8. Temporal characterization of the requests to Wikipedia
    Antonio J. Reinoso, Jesus M. Gonzalez-Barahona, Rocio Muñoz-Mansilla, Israel Herraiz
  9. From Logical Forms to SPARQL Query with GETARUN
    Rocco Tripodi, Rodolfo Delmonte
  10. ImageHunter: a Novel Tool for Relevance Feedback in Content Based Image Retrieval
    Roberto Tronci, Gabriele Murgia, Maurizio Pili, Luca Piras, Giorgio Giacinto

On Clustering on Graphs with Multiple Edge Types

Filed under: Clustering,Graphs — Patrick Durusau @ 7:12 pm

On Clustering on Graphs with Multiple Edge Types by Matthew Rocklin and Ali Pinar.

Abstract:

We study clustering on graphs with multiple edge types. Our main motivation is that similarities between objects can be measured in many different metrics. For instance similarity between two papers can be based on common authors, where they are published, keyword similarity, citations, etc. As such, graphs with multiple edges is a more accurate model to describe similarities between objects. Each edge/metric provides only partial information about the data; recovering full information requires aggregation of all the similarity metrics. Clustering becomes much more challenging in this context, since in addition to the difficulties of the traditional clustering problem, we have to deal with a space of clusterings. We generalize the concept of clustering in single-edge graphs to multi-edged graphs and investigate problems such as: Can we find a clustering that remains good, even if we change the relative weights of metrics? How can we describe the space of clusterings efficiently? Can we find unexpected clusterings (a good clustering that is distant from all given clusterings)? If given the ground-truth clustering, can we recover how the weights for edge types were aggregated? In this paper, we discuss these problems and the underlying algorithmic challenges and propose some solutions. We also present two case studies: one based on papers on Arxiv and one based on CIA World Factbook.

From the introduction:

In many real-world problems, similarities between objects can be defined by many different relationships. For instance, similarity between two scientific articles can be defined based on authors, citations to, citations from, keywords, titles, where they are published, text similarity and many more. Relationships between people can be based on the nature of the relationship (e.g., business, family, friendships) a means of communication (e.g., emails, phone, in person), etc. Electronic files can be grouped by their type (Latex, C, html), names, the time they are created, or the pattern they are accessed. In these examples, there are multiple graphs that define relationships between the subjects. We may choose to reduce this multivariate information to construct a single composite graph. This is convenient as it enables application of many strong results from the literature. However, information being lost during this aggregation may be crucial, and we believe working on graphs with multiple edge types is a more precise representation of the problem, and thus will lead to more accurate analyses. Despite its importance, the literature on clustering graphs with multiple edge types is very rare….

Sounds familiar, doesn’t it? A good deal like associations in a topic map, where implicit information, such as roles and which role is being played, becomes explicit.

The clustering described in this paper is, of course, a way to group and identify collective subjects.

You will probably be interested in the Graclus software that the authors use for splitting graphs.
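To see why the weighting question matters, consider a naive sketch (mine, not the authors’ algorithm) that aggregates several edge types into one weighted graph and clusters it by thresholded connected components. Change the weights and the clustering changes, which is exactly the space of clusterings the paper explores; the edge types, weights and threshold below are all made up.

import numpy as np

n = 6
rng = np.random.default_rng(0)
edge_types = {                          # one similarity matrix per edge type
    "coauthor": rng.random((n, n)),
    "citation": rng.random((n, n)),
    "keywords": rng.random((n, n)),
}
weights = {"coauthor": 0.5, "citation": 0.3, "keywords": 0.2}

combined = sum(w * edge_types[t] for t, w in weights.items())
combined = (combined + combined.T) / 2              # make the similarity symmetric

def cluster(adj, threshold=0.6):
    """Union-find over edges whose combined weight clears the threshold."""
    parent = list(range(len(adj)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(len(adj)):
        for j in range(i + 1, len(adj)):
            if adj[i][j] >= threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(len(adj)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

print(cluster(combined))   # try different weights: the clustering changes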

Fast Clustering using MapReduce

Filed under: Clustering,MapReduce — Patrick Durusau @ 7:09 pm

Fast Clustering using MapReduce by Alina Ene, Sungjin Im, and Benjamin Moseley.

Abstract:

Clustering problems have numerous applications and are becoming more challenging as the size of the data increases. In this paper, we consider designing clustering algorithms that can be used in MapReduce, the most popular programming environment for processing large datasets. We focus on the practical and popular clustering problems, $k$-center and $k$-median. We develop fast clustering algorithms with constant factor approximation guarantees. From a theoretical perspective, we give the first analysis that shows several clustering algorithms are in $\mathcal{MRC}^0$, a theoretical MapReduce class introduced by Karloff et al. \cite{KarloffSV10}. Our algorithms use sampling to decrease the data size and they run a time consuming clustering algorithm such as local search or Lloyd’s algorithm on the resulting data set. Our algorithms have sufficient flexibility to be used in practice since they run in a constant number of MapReduce rounds. We complement these results by performing experiments using our algorithms. We compare the empirical performance of our algorithms to several sequential and parallel algorithms for the $k$-median problem. The experiments show that our algorithms’ solutions are similar to or better than the other algorithms’ solutions. Furthermore, on data sets that are sufficiently large, our algorithms are faster than the other parallel algorithms that we tested.

In the Introduction the authors note:

In the clustering problems that we consider in this paper, the goal is to partition the data into subsets, called clusters, such that the data points assigned to the same cluster are similar according to some metric

At first a trivial observation, but on reflection, perhaps not.

First, clustering is described as a means to collect up “data points” (I would say “subjects” maybe; read further) that are “similar” by some measure. That is a collective subject: all the members of a football team, the books in a library, the users in a chat room at some point in time.

Second, and I think this goes unsaid too often, if the degree of similarity is high enough, we may well conclude that a single “subject” is described by the clustering.

But the choice between a collective and a single subject has no permanent or universal resolution, so it is best to state which one is in operation.

Algorithms for MapReduce are an exciting area to watch.
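For a feel of the sequential building block such algorithms lean on, here is a minimal sketch of the classic greedy (farthest-point) heuristic for k-center run on a random sample. This is an illustration only: the sampling merely stands in for the paper’s map-side data reduction, and none of this is the authors’ MapReduce pipeline.

import random

def greedy_k_center(points, k):
    """Farthest-point heuristic, the classic 2-approximation for k-center."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    centers = [random.choice(points)]
    while len(centers) < k:
        # add the point farthest from its nearest current center
        farthest = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(farthest)
    return centers

data = [(random.random(), random.random()) for _ in range(10000)]
sample = random.sample(data, 200)      # sampling stands in for the map-side reduction
print(greedy_k_center(sample, k=5))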

The Open Graph Archive: A Community-Driven Effort

Filed under: Graphs — Patrick Durusau @ 7:08 pm

The Open Graph Archive: A Community-Driven Effort by Christian Bachmaier, Franz J. Brandenburg, Philip Effinger, Carsten Gutwenger, Jyrki Katajainen, Karsten Klein, Miro Spönemann, Matthias Stegmaier, and Michael Wybrow.

Abstract:

In order to evaluate, compare, and tune graph algorithms, experiments on well designed benchmark sets have to be performed. Together with the goal of reproducibility of experimental results, this creates a demand for a public archive to gather and store graph instances. Such an archive would ideally allow annotation of instances or sets of graphs with additional information like graph properties and references to the respective experiments and results. Here we examine the requirements, and introduce a new community project with the aim of producing an easily accessible library of graphs. Through successful community involvement, it is expected that the archive will contain a representative selection of both real-world and generated graph instances, covering significant application areas as well as interesting classes of graphs.

The prototype for the proposed archive can be found at: http://graphdrawing.org/grapharchive/.

Be aware the link takes you to options to see a prior or current version of the site. Choosing the “current” version takes you to a registration page. After registering, confirming, etc., you reach the new interface. Very clean and focused on presenting the database of graphs.

Try the URI for any graph entry. You will like the result.

If you are interested in graphs at all, this project merits your participation and support. Please forward to all graph advocates or others involved in graph research.

RavenDB Webinar #1

Filed under: NoSQL,RavenDB — Patrick Durusau @ 7:06 pm

RavenDB Webinar #1 was announced at: Hibernating Rhinos: Zero friction databases.

From the webpage:

The first ever RavenDB webinar aired last week, Thursday the 8th, and it was a great success. We announced it only about 12 hours in advance, yet 260+ people registered. Unfortunately the software we were using only allowed 100 people in – our apologies for all of you who wanted to participate but couldn’t get in, or heard of it too late.

More videos said to be coming!

CouchDB jQuery Plugin Reference

Filed under: CouchDB,JQuery,NoSQL — Patrick Durusau @ 7:04 pm

CouchDB jQuery Plugin Reference by Bradley Holt.

I’ve had a difficult time finding documentation on the CouchDB jQuery plugin that ships with CouchDB. So, I’ve decided to create my own reference and share it with you. This should cover almost the entire CouchDB API that is available through the version of the plugin that ships with CouchDB 1.1.0.

What’s your “favorite” lack of documentation?

Efficient P2P Ensemble Learning with Linear Models on Fully Distributed Data

Filed under: Ensemble Methods,Machine Learning,P2P — Patrick Durusau @ 7:02 pm

Efficient P2P Ensemble Learning with Linear Models on Fully Distributed Data by Róbert Ormándi, István Hegedűs, and Márk Jelasity.

Abstract:

Machine learning over fully distributed data poses an important problem in peer-to-peer (P2P) applications. In this model we have one data record at each network node, but without the possibility to move raw data due to privacy considerations. For example, user profiles, ratings, history, or sensor readings can represent this case. This problem is difficult, because there is no possibility to learn local models, yet the communication cost needs to be kept low. Here we propose gossip learning, a generic approach that is based on multiple models taking random walks over the network in parallel, while applying an online learning algorithm to improve themselves, and getting combined via ensemble learning methods. We present an instantiation of this approach for the case of classification with linear models. Our main contribution is an ensemble learning method which, through the continuous combination of the models in the network, implements a virtual weighted voting mechanism over an exponential number of models at practically no extra cost as compared to independent random walks. Our experimental analysis demonstrates the performance and robustness of the proposed approach.

Interesting. In a topic map context, I wonder about creating associations based on information that is not revealed to the peer making the association, or to the peer suggesting the association.
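The gossip learning idea from the abstract can be simulated in a few lines: each node holds a single example, several linear models take random walks over the nodes, each applies an online (perceptron-style) update where it lands, and predictions are taken by voting over the models. A toy, single-process sketch under those assumptions, not the authors’ P2P implementation; the data, dimensions and step counts are invented.

import random

random.seed(1)
DIM, NODES, MODELS, STEPS = 2, 50, 5, 2000

# one example per node: label is the sign of x0 + x1 - 1, features in [0, 1]
data = []
for _ in range(NODES):
    x = [random.random(), random.random()]
    y = 1 if x[0] + x[1] > 1 else -1
    data.append((x, y))

models = [[0.0] * DIM for _ in range(MODELS)]   # one linear model per "walker"
bias = [0.0] * MODELS

for _ in range(STEPS):
    m = random.randrange(MODELS)          # a model ...
    x, y = random.choice(data)            # ... hops to a random node
    score = sum(w * xi for w, xi in zip(models[m], x)) + bias[m]
    if y * score <= 0:                    # perceptron update on that node's example
        models[m] = [w + y * xi for w, xi in zip(models[m], x)]
        bias[m] += y

def predict(x):
    # ensemble by majority vote over the walking models
    votes = 0
    for m in range(MODELS):
        s = sum(w * xi for w, xi in zip(models[m], x)) + bias[m]
        votes += 1 if s > 0 else -1
    return 1 if votes > 0 else -1

print(predict([0.9, 0.8]), predict([0.1, 0.2]))   # expect 1 and -1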

September 10, 2011

The Language Problem: Jaguars & The Turing Test

Filed under: Ambiguity,Authoring Topic Maps,Indexing,Language — Patrick Durusau @ 6:10 pm

The Language Problem: Jaguars & The Turing Test by Gord Hotchkiss.

The post begins innocently enough:

“I love Jaguars!”

When I ask you to understand that sentence, I’m requiring you to take on a pretty significant undertaking, although you do it hundreds of times each day without really thinking about it.

The problem comes with the ambiguity of words.

If you appreciate discussions of language, meaning and the shortfalls of our computing companions, you will really like this article and the promised follow-up posts.

Not to mention bringing into sharp relief the issues that topic map authors (or indexers) face when trying to specify a subject that will be recognized and used by N unknown users.

I suppose that is really the tricky part, or at least part of it: the communication channel for an index or topic map is only one-way. There is no opportunity for the author to correct a reading/mis-reading. All of that lies with the user/reader alone.

GTD – Global Terrorism Database

Filed under: Authoring Topic Maps,Data,Data Integration,Data Mining,Dataset — Patrick Durusau @ 6:08 pm

GTD – Global Terrorism Database

From the homepage:

The Global Terrorism Database (GTD) is an open-source database including information on terrorist events around the world from 1970 through 2010 (with annual updates planned for the future). Unlike many other event databases, the GTD includes systematic data on domestic as well as international terrorist incidents that have occurred during this time period and now includes more than 98,000 cases.

While chasing down a paper that didn’t make the cut I ran across this data source.

Lacking an agreed upon definition of terrorism (see Chomsky for example), you may or may not find what you consider to be incidents of terrorism in this dataset.

Nevertheless, it is a dataset of events of popular interest and can be used to attract funding for your data integration project using topic maps.

TV Tropes

Filed under: Authoring Topic Maps,Data,Interface Research/Design — Patrick Durusau @ 6:06 pm

TV Tropes

Sam Hunting forwarded this to my attention.

From the homepage:

What is this about? This wiki is a catalog of the tricks of the trade for writing fiction.

Tropes are devices and conventions that a writer can reasonably rely on as being present in the audience members' minds and expectations. On the whole, tropes are not clichés. The word clichéd means "stereotyped and trite." In other words, dull and uninteresting. We are not looking for dull and uninteresting entries. We are here to recognize tropes and play with them, not to make fun of them.

The wiki is called “TV Tropes” because TV is where we started. Over the course of a few years, our scope has crept out to include other media. Tropes transcend television. They reflect life. Since a lot of art, especially the popular arts, does its best to reflect life, tropes are likely to show up everywhere.

We are not a stuffy encyclopedic wiki. We’re a buttload more informal. We encourage breezy language and original thought. There Is No Such Thing As Notability, and no citations are needed. If your entry cannot gather any evidence by the Wiki Magic, it will just wither and die. Until then, though, it will be available through the Main Tropes Index.

I rather like the definition of trope as “devices and conventions that a writer can reasonably rely on as present in the audience members’ minds and expectations.” I would guess under some circumstances we could call those “subjects” which we can include in a topic map. And then map the occurrences of those subjects in TV shows, for example.

As the site points out, it is called TV Tropes because it started with TV, but tropes have a much larger range than TV.

Being aware of and able to invoke (favorable) tropes in the minds of your users is one part of selling your topic map solution.

Solr Digest, Spring-Summer 2011, Part 1

Filed under: Authoring Topic Maps,Solr — Patrick Durusau @ 6:04 pm

Solr Digest, Spring-Summer 2011, Part 1

Don’t miss this issue of the Solr Digest.

It covers Solr releases 3.2, 3.3 and the upcoming 3.4 so there is no shortage of material. Part 2 is in the works.

Of particular interest to topic map authors will be the result grouping/field collapsing.

From the Apache wiki:

Field Collapsing and Result Grouping are different ways to think about the same Solr feature.

Field Collapsing collapses a group of results with the same field value down to a single (or fixed number) of entries. For example, most search engines such as Google collapse on site so only one or two entries are shown, along with a link to click to see more results from that site. Field collapsing can also be used to suppress duplicate documents.

Result Grouping groups documents with a common field value into groups, returning the top documents per group, and the top groups based on what documents are in the groups. One example is a search at Best Buy for a common term such as DVD, that shows the top 3 results for each category (“TVs & Video”,”Movies”,”Computers”, etc)

For example, collapsed results could be exported to a representation as a topic and either occurrences or associations of a particular type. Other uses will suggest themselves.
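A hedged example of what such a grouped query looks like from a client, using the result grouping parameters introduced with this feature. The core URL, the “category” field and the “name” field are assumptions for illustration, not a real schema.

import requests

SOLR_URL = "http://localhost:8983/solr/select"    # assumed default search handler

params = {
    "q": "dvd",
    "group": "true",
    "group.field": "category",    # hypothetical field to group/collapse on
    "group.limit": 3,             # top 3 documents per group, as in the Best Buy example
    "wt": "json",
}
resp = requests.get(SOLR_URL, params=params).json()

for grp in resp["grouped"]["category"]["groups"]:
    docs = grp["doclist"]["docs"]
    print(grp["groupValue"], [d.get("name") for d in docs])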

SearchWorkings

Filed under: ElasticSearch,Lucene,Mahout,Solr — Patrick Durusau @ 6:02 pm

SearchWorkings

From the About Us page:

SearchWorkings.org was created by a bunch of really passionate search technology professionals who realised that the world (read: other search professionals) doesn’t have a single point of contact or comprehensive resource where they can learn and talk about all the exciting new developments in the wonderful world of open source search solutions. These professionals all work at JTeam, a leading supplier of high-quality custom-built applications and end-to-end solutions provider, and moreover a market leader when it comes to search solutions.

A wide variety of materials: whitepapers and articles, forums (Lucene, Solr, ElasticSearch, Mahout), training videos, news, and blogs.

You do have to register/join (free) to get access to the good stuff.

A Uniform Fixpoint Approach to the Implementation of Inference Methods for Deductive Databases

Filed under: Database,Deductive Databases,Inference — Patrick Durusau @ 6:00 pm

A Uniform Fixpoint Approach to the Implementation of Inference Methods for Deductive Databases by Andreas Behrend.

Abstract:

Within the research area of deductive databases three different database tasks have been deeply investigated: query evaluation, update propagation and view updating. Over the last thirty years various inference mechanisms have been proposed for realizing these main functionalities of a rule-based system. However, these inference mechanisms have been rarely used in commercial DB systems until now. One important reason for this is the lack of a uniform approach well-suited for implementation in an SQL-based system. In this paper, we present such a uniform approach in form of a new version of the soft consequence operator. Additionally, we present improved transformation-based approaches to query optimization and update propagation and view updating which are all using this operator as underlying evaluation mechanism.

This one will take a while, and some discussions with people more familiar with deductive databases than I am.

But, having said that, it looks important. The approach has been validated for stock market data streams and management of airspace. Not to mention:

EU Project INFOMIX (IST-2001-33570)

Information system of University “La Sapienza” in Rome.

  • 14 global relations,
  • 29 integrity constraints,
  • 29 relations (in 3 legacy databases) and 12 web wrappers,

More than 24MB of data regarding students, professors and exams of the University.

Generic Multiset Programming for Language-Integrated Querying

Filed under: Multisets,SQL — Patrick Durusau @ 2:53 pm

More than one resource on generic multiset programming by Fritz Henglein and Ken Friis Larsen:

Paper: Generic Multiset Programming for Language-Integrated Querying

Video: Generic Multiset Programming for Language-Integrated Querying

From the introduction to the paper:

We introduce a library for generic multiset programming. It supports algebraic programming based on Codd’s original relational algebra with select (filter), project (map), cross product, (multi)set union and (multi)set difference as primitive operators and extends it with SQL-style functions corresponding to GROUP BY, SORT BY, HAVING, DISTINCT and aggregation functions.

It generalizes the querying core of SQL as follows: Multisets may contain elements of arbitrary first-order data types, including references (pointers), recursive data types and nested multisets. It contains an expressive embedded domain specific language for user-definable equivalence and ordering relations. And it allows in principle arbitrary user-defined functions for selection, projection and aggregation.

Under Contributions:

In this paper we provide a library for SQL-style programming with multisets that

  • supports all the classic features of the data query sublanguage of SQL;
  • admits multisets of any element type, including nested multisets and trees;
  • admits user-definable equivalences (equijoin conditions), predicates, and functions;
  • admits naïve programming with cross-products, but avoids spending quadratic time on computing them;
  • offers an object-oriented (object-method-parameters) syntax using infix binary operators;
  • and is easy to implement.

To demonstrate this we include the complete source code (without basic routines for generic discrimination reported elsewhere) and a couple of paradigmatic examples.

Looks interesting to me!
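As a rough feel for the operator set, here is a toy multiset algebra in Python rather than the authors’ Haskell, and without their generic-discrimination machinery (the naive cross product below is exactly the quadratic cost their library avoids). The example relation and field layout are made up.

from collections import Counter
from itertools import product

def select(pred, ms):   return [x for x in ms if pred(x)]
def project(f, ms):     return [f(x) for x in ms]
def cross(ms1, ms2):    return list(product(ms1, ms2))      # naive, quadratic
def union(ms1, ms2):    return ms1 + ms2                    # multiset union keeps duplicates

def difference(ms1, ms2):
    # multiset difference: remove each element of ms2 at most once from ms1
    remaining = Counter(ms2)
    out = []
    for x in ms1:
        if remaining[x] > 0:
            remaining[x] -= 1
        else:
            out.append(x)
    return out

def group_by(key, agg, ms):
    groups = {}
    for x in ms:
        groups.setdefault(key(x), []).append(x)
    return {k: agg(v) for k, v in groups.items()}

orders = [("alice", 30), ("bob", 10), ("alice", 5), ("carol", 10)]
big = select(lambda o: o[1] >= 10, orders)
print(project(lambda o: o[0], big))                                       # names of orders >= 10
print(group_by(lambda o: o[0], lambda g: sum(v for _, v in g), orders))   # total per customer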

September 9, 2011

High Wizardry in the Land of Scala

Filed under: Scala — Patrick Durusau @ 7:19 pm

High Wizardry in the Land of Scala by Daniel Spiewak.

Daniel is obviously not a decaf fan. 😉

Covers some advanced features of Scala:

  • Higher-Kinds
  • Type Classes
  • Type-Level Encoding
  • Continuations

He mentions a series of posts at Apocalisp, which starts with Type-Level Programming in Scala and has a listing of the posts that follow, sort of. The list is incomplete and not entirely consistent with the articles that follow. A complete listing would be handy; to save the author time I may contribute one.

I am wondering whether type encoding of data structures would be useful with complex subject identifiers.

BTW, is Ruby a better language for conferences? Daniel mentions that at Ruby conferences they have porn in their slides. All that you find here is a nasty looking math slide. Would Benjamin be the right person to ask?

Hibernate Search with Lucene

Filed under: Hibernate,Lucene — Patrick Durusau @ 7:17 pm

Hibernate Search with Lucene

From the post:

This post is in continuation of my last post – http://blogs.globallogic.com/introduction-to-lucene – in which I gave a brief introduction to Lucene.

There are many Web applications out there to provide access to data stored in a relational database, but what’s the easiest way to enable users to search through that data and find what they need? There are a number of query types that RDBMSs in general do not support without vendor extensions:

  • Fuzzy queries, in which “fuzzy” and “wuzzy” are considered matches
  • Word stemming queries, which consider “take,” “took,” and “taken” to be identical
  • Sound-like queries, which consider “cat” and “kat” to be identical
  • Synonym queries, which consider “jump,” “hop,” and “leap” to be identical
  • Queries on binary BLOB data types, such as PDF documents, Microsoft Word or Excel documents, or HTML and XML documents

Hibernate Search brings the power of full text search engines to the persistence domain model by combining Hibernate Core with the capabilities of the Apache Lucene™ search engine. Even though Hibernate Search is using Apache Lucene™ under the hood you can always fallback to the native Lucene APIs if the need arises.

These posts were written against Hibernate 3.4.1. Just so you know, Hibernate 4.0.0 Alpha2 is out (8 September 2011).

Introduces the basics of Hibernate Search.
