Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

January 25, 2012

RuleML 2012

Filed under: Conferences,RuleML — Patrick Durusau @ 3:31 pm

RuleML 2012: The 6th International Symposium on Rules: Research Based and Industry Focused

Important dates:

Abstract submission: March 25, 2012
Paper submission: April 1, 2012
Notification of acceptance/rejection: May 20, 2012
Camera-ready copy due: June 10, 2012
RuleML-2012 dates: August 27-29, 2012

The International Symposium on Rules, RuleML, has evolved from an annual series of international workshops since 2002, international conferences in 2005 and 2006, and international symposia since 2007. This year the RuleML Symposium will be held in conjunction with ECAI 2012, the 20th biennial European Conference on Artificial Intelligence, in Montpellier, France, August 27-29, 2012.

RuleML-2012@ECAI is a research-based, industry-focused symposium: its main goal is to build a bridge between academia and industry in the field of rules and semantic technology, and so to stimulate the cooperation and interoperability between business and research, by bringing together rule system providers, participants in rule standardization efforts, open source communities, practitioners, and researchers. The concept of the symposium has also advanced continuously in the face of extremely rapid progress in practical rule and event processing technologies. As a result, RuleML-2012 will feature hands-on demonstrations and challenges alongside a wide range of thematic tracks. It will thus be an exciting venue to exchange new ideas and experiences on all issues related to the engineering, management, integration, interoperation, and interchange of rules in distributed enterprise intranets and open distributed environments.

We invite you to share your ideas, results, and experiences: as an industry practitioner, rule system provider, technical expert and developer, rule user or researcher, exploring foundations, developing systems and applications, or using rule-based systems.

We invite high-quality submissions related to (but not limited to) one or more of the following topics:

  • Rules and Automated Reasoning
  • Logic Programming and Non-monotonic Reasoning
  • Int. Conference track on Pragmatic Web (see track description below)
  • Rule-Based Policies, Reputation and Trust
  • Rule-based Event Processing and Reaction Rules
  • Fuzzy Rules and Uncertainty
  • Rule Transformation, Extraction and Learning
  • Vocabularies, Ontologies, and Business rules
  • Rules in online-market research and online marketing
  • Rule Markup Languages and Rule Interchange
  • General Rule Topics

Late summer, Montpellier, France, an interesting meeting, what more would you want?

Searching and Browsing Linked Data with SWSE: the Semantic Web Search Engine

Filed under: Linked Data,RDF,Search Engines,Semantic Web — Patrick Durusau @ 3:30 pm

Searching and Browsing Linked Data with SWSE: the Semantic Web Search Engine by Aidan Hogan, Andreas Harth, Jürgen Umbrich, Sheila Kinsella, Axel Polleres and Stefan Decker.

Abstract:

In this paper, we discuss the architecture and implementation of the Semantic Web Search Engine (SWSE). Following traditional search engine architecture, SWSE consists of crawling, data enhancing, indexing and a user interface for search, browsing and retrieval of information; unlike traditional search engines, SWSE operates over RDF Web data (loosely also known as Linked Data) which implies unique challenges for the system design, architecture, algorithms, implementation and user interface. In particular, many challenges exist in adopting Semantic Web technologies for Web data: the unique challenges of the Web (in terms of scale, unreliability, inconsistency and noise) are largely overlooked by the current Semantic Web standards. Herein, we describe the current SWSE system, initially detailing the architecture and later elaborating upon the function, design, implementation and performance of each individual component. In so doing, we also give an insight into how current Semantic Web standards can be tailored, in a best-effort manner, for use on Web data. Throughout, we offer evaluation and complementary argumentation to support our design choices, and also offer discussion on future directions and open research questions. Later, we also provide candid discussion relating to the difficulties currently faced in bringing such a search engine into the mainstream, and lessons learnt from roughly six years working on the Semantic Web Search Engine project.

This is the paper that Ivan Herman mentions in Nice reading on Semantic Search.

It covers a lot of ground in fifty-five (55) pages but it doesn’t take long to hit an issue I wanted to ask you about.

At page 2, Google is described as follows:

In the general case, Google is not suitable for complex information gathering tasks requiring aggregation from multiple indexed documents: for such tasks, users must manually aggregate tidbits of pertinent information from various recommended heterogeneous sites, each such site presenting information in its own formatting and using its own navigation system. In effect, Google’s limitations are predicated on the lack of structure in HTML documents, whose machine interpretability is limited to the use of generic markup-tags mainly concerned with document rendering and linking. Although Google arguably makes the best of the limited structure available in such documents, most of the real content is contained in prose text which is inherently difficult for machines to interpret. Addressing this inherent problem with HTML Web data, the Semantic Web movement provides a stack of technologies for publishing machine-readable data on the Web, the core of the stack being the Resource Description Framework (RDF).

A couple of observations:

Although Google needs no defense from me, I would argue that Google never set itself the task of aggregating information from indexed documents. Historically speaking, IR has always been concerned with returning relevant documents and not returning irrelevant documents.

Second, the lack of structure in HTML documents (although the article mixes in sites with different formatting) is no deterrent to a human reader aggregating “tidbits of pertinent information….” I rather doubt that writing all the documents in valid Springer LaTeX would make that much difference on the “tidbits of pertinent information” score.

This is my first pass through the article and I suspect it will take three or more to become comfortable with it.

Do you agree/disagree that the task of IR is to retrieve documents, not “tidbits of pertinent information?”

Do you agree/disagree that HTML structure (or lack thereof) is that much of an issue for interpretation of documents?

Thanks!

Mule Studio – Getting Started

Filed under: Mule — Patrick Durusau @ 3:29 pm

Very rough notes on the basic introduction to Mule Studio.

Getting Started with Mule Studio

Under What’s Next, the first line reads:

Cloud computing is closer than ever before. Why not start by checking out our fifteen-minute Basic Tutorial?

What does “Cloud computing is closer than ever before.” have to do with Mule Studio? The implication is that the “fifteen-minute Basic Tutorial” is going to teach me about “cloud computing.”

Basic Studio Tutorial

Suggestion: Either call the software MuleStudio, or Studio, or Mule; or pick one, say “herein we use ….” and then stick with it. Consistent naming isn’t that hard.

BTW, documentation should be updated to reflect current directory structure used with examples.

Documentation says: \MuleStudio\Examples\SpellChecker\InXML.

MuleStudio\examples\SpellChecker\spellcheck.xml is the actual path

OK, so the directory is missing the InXML and OutXML directories.

Ah, it is expecting an empty InXML directory so it fails if you point to the directory structure as delivered.

So:

in MuleStudio\examples\SpellChecker, mkdir InXML, and

in MuleStudio\examples\SpellChecker, mkdir OutXML

I left spellcheck.xml in the SpellChecker directory (remember to copy, not move so you don’t lose the file, or you could create another one).

Works.

For a confidence building example I would not switch back and forth between Windows and Unix file/path syntax. Pick one and stay with it.

I would not suggest other things to explore during the first example. A staged follow-up could come after the first example, but only afterwards.

Oh, and fix the paths or say they have to be built in order for the first example to run. If possible configure the default window so the message box isn’t so small. Users need to see something happening.

One more thing: the example is vague about whether the file should be moved with the OS or whether it can be moved within MuleStudio. I used the OS just out of habit. I did look, and there was no obvious way to copy/move the file in the application.

The narrative could be smoother, and with the technical errors fixed it would be an adequate introduction to the software. It would be a better introduction if there were some motivation given for the example application, a “why would I care?” sort of thing. Leveraging the power of Google or something like that.

Watch for notes on the intermediate tutorial in a day or so.


Summary:

In the SpellChecker directory (under MuleStudio/examples) add InXML and OutXML directories, thus:

mkdir InXML

mkdir OutXML

Leave spellcheck.xml in the SpellChecker directory until told to copy it into InXML.

Blogging Prize – Mule Studio

Filed under: Mule — Patrick Durusau @ 3:28 pm

From my inbox:

Blog About Mule Studio, Get a T-Shirt

Are you as excited about Mule Studio as we are? If so, blog about it and send a link to your blog and your postal address to reply@mulesoft.com and we’ll send you a Mule T-Shirt.

Even if you don’t need the T-Shirt, download Mule Studio and take the tour.

Interfaces are always slightly different, the range and ease of operations offered vary, and documentation varies wildly, so if nothing else you will learn something in the process.

And, if you blog about it, etc., you will get a new T-Shirt.

Something to look forward to in the mailbox!


A couple of notes on getting started, not that you need them but someone else may:

Step 1 reads:

Before you unzip the muleStudio
package, ensure that it has the permissions required for installation.
To set these permissions, open a console and execute the following command:
chmod u+x muleStudio

The muleStudio folder or directory appears when the unzip operation completes.

Err? Permission to install?

Permission to install is a user privilege question, not setting the file to be executable.

On Linux (Ubuntu 10.10) I just tossed it into my /home/patrick/working directory where I keep all manner of software. It’s just me on the box so I don’t have to worry about making apps available to others.

But, after you unzip the file you do have to:

chmod u+x muleStudio*

BTW, the folder I got was: MuleStudio, so my path is /home/patrick/working/MuleStudio.

Step 2 Execute reads:

Unzip the muleStudio package, which is located in the following path:
/MuleStudio
Enter the following command in the console to launch muleStudio:
./muleStudio
Alternatively, double click the muleStudio file in the Linux graphic interface, as shown above

Err, but we just unzipped it, yes?

Let’s re-write steps 1 and 2:

Step 1:

Unzip the MuleStudio package for your system into a convenient location.

The folder or directory name will be MuleStudio.

Step 2:

Change to the MuleStudio directory.

Make the muleStudio* file executable with the command:

chmod u+x muleStudio*

Start the program by:

Double-clicking on muleStudio* in the graphic interface, or

entering the command:

./muleStudio*

That is trivial in terms of improving the use of MuleStudio but when clear writing becomes a habit, more difficult topics become easier for users.

Documents as geometric objects: how to rank documents for full-text search

Filed under: PageRank,Search Engines,Vector Space Model (VSM) — Patrick Durusau @ 3:27 pm

Documents as geometric objects: how to rank documents for full-text search by Michael Nielsen on July 7, 2011.

From the post:

When we type a query into a search engine – say “Einstein on relativity” – how does the search engine decide which documents to return? When the document is on the web, part of the answer to that question is provided by the PageRank algorithm, which analyses the link structure of the web to determine the importance of different webpages. But what should we do when the documents aren’t on the web, and there is no link structure? How should we determine which documents most closely match the intent of the query?

In this post I explain the basic ideas of how to rank different documents according to their relevance. The ideas used are very beautiful. They are based on the fearsome-sounding vector space model for documents. Although it sounds fearsome, the vector space model is actually very simple. The key idea is to transform search from a linguistic problem into a geometric problem. Instead of thinking of documents and queries as strings of letters, we adopt a point of view in which both documents and queries are represented as vectors in a vector space. In this point of view, the problem of determining how relevant a document is to a query is just a question of determining how parallel the query vector and the document vector are. The more parallel the vectors, the more relevant the document is.

This geometric way of treating documents turns out to be very powerful. It’s used by most modern web search engines, including (most likely) web search engines such as Google and Bing, as well as search libraries such as Lucene. The ideas can also be used well beyond search, for problems such as document classification, and for finding clusters of related documents. What makes this approach powerful is that it enables us to bring the tools of geometry to bear on the superficially very non-geometric problem of understanding text.

Very much looking forward to future posts in this series. There is no denying the power of the “vector space model,” but it leaves unasked the question: what is lost in the transition from linguistic to geometric space?
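Not from Nielsen’s post: a minimal Python sketch of the vector space model he describes, assuming simple whitespace tokenization and TF-IDF weighting; the documents and query below are invented for illustration.

import math
from collections import Counter

documents = [
    "einstein wrote papers on relativity",
    "general relativity describes gravity",
    "cooking pasta requires boiling water",
]
query = "einstein on relativity"

def tf_idf_vectors(texts):
    # weight each term by term frequency times inverse document frequency
    tokenized = [t.split() for t in texts]
    n = len(tokenized)
    df = Counter(term for doc in tokenized for term in set(doc))
    return [{term: count * math.log(n / df[term])
             for term, count in Counter(doc).items()}
            for doc in tokenized]

def cosine(a, b):
    # "how parallel" two sparse vectors are
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

vectors = tf_idf_vectors(documents + [query])
doc_vectors, query_vector = vectors[:-1], vectors[-1]
for score, doc in sorted(zip((cosine(query_vector, v) for v in doc_vectors),
                             documents), reverse=True):
    print(round(score, 3), doc)

Real engines layer refinements on top (normalization, boosts, link analysis), but the parallel-vectors intuition is the same.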

3rd Globals Challenge

Filed under: Contest,Globalsdb,NoSQL — Patrick Durusau @ 3:25 pm

3rd Globals Challenge

Contest starts: 10 Feb 12 18:00 EST
Contest ends: 17 Feb 12 18:00 EST

Topic mappers take note:

All applications must be built using Globals. However, you are also allowed to use additional technologies to supplement Globals (emphasis added; additional technologies are allowed, unlike in some linked data competitions)

The email I got reports:

  • A cash prize of USD $3,500 for the winning entry
  • A press release announcing the winning participant and solution
  • A chance to win a free registration for the InterSystems Global Summit

You might want to drop by Globals to grab a copy of the software and read up on the documentation.

You can also see the prior challenges. These are non-trivial events but that also means you will learn a lot in the process.

Nice reading on Semantic Search

Filed under: Linked Data,RDF,Semantic Web — Patrick Durusau @ 3:25 pm

Nice reading on Semantic Search by Ivan Herman.

From the post:

I had a great time reading a paper on Semantic Search[1]. Although the paper is on the details of a specific Semantic Web search engine (DERI’s SWSE), I was reading it as somebody not really familiar with all the intricate details of such a search engine setup and operation (i.e., I would not dare to give an opinion on whether the choice taken by this group is better or worse than the ones taken by the developers of other engines) and wanting to gain a good image of what is happening in general. And, for that purpose, this paper was really interesting and instructive. It is long (cca. 50 pages), i.e., I did not even try to understand everything at my first reading, but it did give a great overall impression of what is going on.

Interested to hear your take on Ivan’s comments on owl:sameAs.

The semantics of words, terms, and ontology classes are not stable over time and/or across users. If you doubt that statement, leaf through the Oxford English Dictionary for ten (10) minutes.

Moreover, the only semantics we “see” in words, terms or ontology classes are those we assign them. We can discuss the semantics of Hebrew words in the Dead Sea Scrolls but those are our semantics, not those of the original users of those words. May be close to what they meant, may not. Can’t say for sure because we can’t ask and would lack the context to understand the answer if we could.

Adding more terms to use as supplements to owl:sameAs just increases the chances for variation, and for error if anyone is going to enforce their vision of broadMatch on usages of that term by others.

Berlin Buzzwords 2012

Filed under: BigData,Conferences,ElasticSearch,Hadoop,HBase,Lucene,MongoDB,Solr — Patrick Durusau @ 3:24 pm

Berlin Buzzwords 2012

Important Dates (all dates in GMT +2)

Submission deadline: March 11th 2012, 23:59 MEZ
Notification of accepted speakers: April 6th, 2012, MEZ
Publication of final schedule: April 13th, 2012
Conference: June 4/5. 2012

The call:

Call for Submission Berlin Buzzwords 2012 – Search, Store, Scale — June 4 / 5. 2012

The event will comprise presentations on scalable data processing. We invite you to submit talks on the topics:

  • IR / Search – Lucene, Solr, katta, ElasticSearch or comparable solutions
  • NoSQL – like CouchDB, MongoDB, Jackrabbit, HBase and others
  • Large Data Processing – Hadoop itself, MapReduce, Cascading or Pig and relatives

Related topics not explicitly listed above are more than welcome. We are looking for presentations on the implementation of the systems themselves, technical talks, real world applications and case studies.

…(moved dates to top)…

High quality, technical submissions are called for, ranging from principles to practice. We are looking for real world use cases, background on the architecture of specific projects and a deep dive into architectures built on top of e.g. Hadoop clusters.

Here is your chance to experience summer in Berlin (Berlin Buzzwords 2012) and in Montreal (Balisage).

Seriously, both conferences are very strong and worth your attention.

Released Neo4j 1.6 GA “Jörn Kniv”!

Filed under: Graphs,Neo4j — Patrick Durusau @ 3:18 pm

Released Neo4j 1.6 GA “Jörn Kniv”!

From the Neo4j blog:

Three milestones later and we’re proud and happy to announce the release of Neo4j 1.6 GA.

We are excited about a host of great new features, all ready to be used. Let’s get to it.

Highlights

What features have been included in this release?

  • Cloud – Public beta on Heroku of the Neo4j Add-on
  • Cypher – Supports older Cypher versions, better pattern matching, better performance, improved api
  • Web admin – Full Neo4j Shell commands, including versioned Cypher syntax.
  • Kernel – Improvements, for instance the ability to ensure that key-value pairs for entities are unique.
  • Lucene upgrade – Now version 3.5.

Also, there have been many improvements behind-the-scenes:

  • Infrastructure – Our library repositories have moved to Amazon, providing significantly faster download times.
  • Quality – High availability features better logging and operational support.
  • Process – Better handling of breaking changes in our api and how we handle deprecated features.

If you want more info on all of this – sure you do – please keep reading. Here is a run down of the major new features in Neo4j 1.6.

January 24, 2012

LDIF – Linked Data Integration Framework (0.4)

Filed under: Hadoop,Heterogeneous Data,LDIF,Linked Data — Patrick Durusau @ 3:43 pm

LDIF – Linked Data Integration Framework (0.4)

Version 0.4 News:

Up till now, LDIF stored data purely in-memory which restricted the amount of data that could be processed. Version 0.4 provides two alternative implementations of the LDIF runtime environment which allow LDIF to scale to large data sets: 1. The new triple store backed implementation scales to larger data sets on a single machine with lower memory consumption at the expense of processing time. 2. The new Hadoop-based implementation provides for processing very large data sets on a Hadoop cluster, for instance within Amazon EC2. A comparison of the performance of all three implementations of the runtime environment is found on the LDIF benchmark page.

From the “About LDIF:”

The Web of Linked Data grows rapidly and contains data from a wide range of different domains, including life science data, geographic data, government data, library and media data, as well as cross-domain data sets such as DBpedia or Freebase. Linked Data applications that want to consume data from this global data space face the challenges that:

  1. data sources use a wide range of different RDF vocabularies to represent data about the same type of entity.
  2. the same real-world entity, for instance a person or a place, is identified with different URIs within different data sources.

This usage of different vocabularies as well as the usage of URI aliases makes it very cumbersome for an application developer to write SPARQL queries against Web data which originates from multiple sources. In order to ease using Web data in the application context, it is thus advisable to translate data to a single target vocabulary (vocabulary mapping) and to replace URI aliases with a single target URI on the client side (identity resolution), before starting to ask SPARQL queries against the data.

Up till now, there have not been any integrated tools that help application developers with these tasks. With LDIF, we try to fill this gap and provide an open-source Linked Data Integration Framework that can be used by Linked Data applications to translate Web data and normalize URIs while keeping track of data provenance.

With the addition of Hadoop-based processing, it is definitely worth your time to download and see what you think of it.

Ironic that the problem it solves:

  1. data sources use a wide range of different RDF vocabularies to represent data about the same type of entity.
  2. the same real-world entity, for instance a person or a place, is identified with different URIs within different data sources.

already existed, prior to Linked Data as:

  1. data sources use a wide range of different vocabularies to represent data about the same type of entity.
  2. the same real-world entity, for instance a person or a place, is identified differently within different data sources.

So the Linked Data drill is to convert data, which already has these problems, into Linked Data, which will still have these problems, and then solve the problem of differing identifications.

Yes?

Did I miss a step?

Web Data Commons

Filed under: Common Crawl,Data Source — Patrick Durusau @ 3:42 pm

Web Data Commons: Extracting Structured Data from the Common Web Crawl

From the post:

Web Data Commons will extract all Microformat, Microdata and RDFa data that is contained in the Common Crawl corpus and will provide the extracted data for free download in the form of RDF-quads as well as CSV-tables for common entity types (e.g. product, organization, location, …).

We are finished with developing the software infrastructure for doing the extraction and will start an extraction run for the complete Common Crawl corpus once the new 2012 version of the corpus becomes available in February. For testing our extraction framework, we have extracted structured data out of 1% of the currently available Common Crawl corpus dating October 2010. The results of this extraction run are provided below. We will provide the data from the complete 2010 corpus together with the data from the 2012 corpus in order to enable comparisons on how data provision has evolved within the last two years.

An interesting mining of open data.

The ability to perform comparisons on data over time is particularly interesting.
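As a hedged sketch of what consuming such an extraction might look like (the quad data below is invented, not actual Web Data Commons output), rdflib can load RDF-quads and group the triples by the page they came from:

from rdflib import ConjunctiveGraph

nquads = '''
<http://example.org/item/1> <http://schema.org/name> "Acme Widget" <http://example.org/page1> .
<http://example.org/item/1> <http://schema.org/price> "9.99" <http://example.org/page1> .
'''

g = ConjunctiveGraph()
g.parse(data=nquads, format="nquads")

# each named graph records the page the structured data was extracted from
for ctx in g.contexts():
    for s, p, o in ctx:
        print(ctx.identifier, s, p, o)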

Mining Bio4j data: finding topological patterns in PPI networks

Filed under: Cypher,Neo4j — Patrick Durusau @ 3:41 pm

Mining Bio4j data: finding topological patterns in PPI networks

From the post:

That’s where I came up with the idea of looking for topological patterns through a large sub-set of the Protein-Protein interactions network included in Bio4j, rather than focusing on a few proteins selected a priori.

I decided to mine the data in order to find circuits/simple cycles of length 3 where at least one protein is from the Swiss-Prot dataset:

Interesting use of the Neo4j Cypher query language to look for topological patterns.
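The post uses Cypher; as a rough analogue (not the author’s query), here is a NetworkX sketch that enumerates length-3 cycles in a small interaction graph and keeps those touching a Swiss-Prot node. The proteins, edges and attribute name are invented for illustration.

import networkx as nx

G = nx.Graph()
G.add_edges_from([("P1", "P2"), ("P2", "P3"), ("P1", "P3"),
                  ("P3", "P4"), ("P4", "P5"), ("P3", "P5")])
nx.set_node_attributes(G, {"P1": True, "P4": True}, name="swissprot")

# a 3-cycle is an edge (u, v) plus any common neighbour w
triangles = {frozenset((u, v, w))
             for u, v in G.edges()
             for w in set(G[u]) & set(G[v])}

for tri in triangles:
    if any(G.nodes[n].get("swissprot") for n in tri):
        print(sorted(tri))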

CS 101: Build a Search Engine

Filed under: CS Lectures,Search Engines — Patrick Durusau @ 3:41 pm

CS 101: Build a Search Engine

David Evans and Sebastian Thrun teach CS 101 by teaching students how to build a search engine.

There is an outline syllabus but no more detail at this time.

USAspending.gov

Filed under: Data Source,Government Data — Patrick Durusau @ 3:40 pm

USAspending.gov

This website is required by the “Federal Funding Accountability and Transparency Act (Transparency Act).”

The FAQ describes its purpose as:

To provide the public with information about how their tax dollars are spent. Citizens have a right and a need to understand where tax dollars are spent. Collecting data about the various types of contracts, grants, loans, and other types of spending in our government will provide a broader picture of the Federal spending processes, and will help to meet the need of greater transparency. The ability to look at contracts, grants, loans, and other types of spending across many agencies, in greater detail, is a key ingredient to building public trust in government and credibility in the professionals who use these agreements.

An amazing amount of data which can be searched or browsed in a number of ways.

It is missing one ingredient that would change it from an amazing information resource to a game-changing information resource: you.

The site can only report information known to the federal government and covered by the Transparency Act.

For example, it can’t report on family or personal relationships between various parties to contracts or even offer good (or bad) information on performance on contracts or methods used by contractors.

However, a topic map (links into this site are stable) could combine this information with other information quite easily.

I ran across this site in Analyzing US Government Contract Awards in R by Vik Paruchuri. A very good article that scratches the surface of mining this content.

Use Prefix Operators instead of Boolean Operators

Filed under: Boolean Operators,Logic,Prefix Operators — Patrick Durusau @ 3:38 pm

Use Prefix Operators instead of Boolean Operators by Chris Hostetter.

From the post:

I really dislike the so called “Boolean Operators” (“AND”, “OR”, and “NOT”) and generally discourage people from using them. It’s understandable that novice users may tend to think about the queries they want to run in those terms, but as you become more familiar with IR concepts in general, and what Solr specifically is capable of, I think it’s a good idea to try to “set aside childish things” and start thinking (and encouraging your users to think) in terms of the superior “Prefix Operators” (“+”, “-”).

Don’t hold back Chris! It’s not good for you. Tell us how you feel about “Boolean Operators.” 😉

Seriously, Chris makes a very good case for using “Prefix Operators” and you will learn about powerful searching in both Lucene and Solr.

Well worth studying in detail.
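To make the contrast concrete, here is a small sketch of my own (not Chris’s) that sends both forms of the same requirement to a local Solr instance; the endpoint URL, field name and terms are assumptions to adjust for your setup.

import requests

SOLR_SELECT = "http://localhost:8983/solr/select"   # assumed local endpoint

queries = {
    "boolean": "text:solr AND NOT text:elasticsearch",  # Boolean operators
    "prefix":  "+text:solr -text:elasticsearch",         # prefix operators
}

for label, q in queries.items():
    # both queries require "solr" and exclude "elasticsearch"; the prefix
    # form states the requirement clause by clause instead of globally
    resp = requests.get(SOLR_SELECT, params={"q": q, "wt": "json"})
    print(label, resp.json()["response"]["numFound"])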

How Google Code Search Worked

Filed under: Indexing,Regexes — Patrick Durusau @ 3:37 pm

Regular Expression Matching with a Trigram Index or How Google Code Search Worked by Russ Cox.

In the summer of 2006, I was lucky enough to be an intern at Google. At the time, Google had an internal tool called gsearch that acted as if it ran grep over all the files in the Google source tree and printed the results. Of course, that implementation would be fairly slow, so what gsearch actually did was talk to a bunch of servers that kept different pieces of the source tree in memory: each machine did a grep through its memory and then gsearch merged the results and printed them. Jeff Dean, my intern host and one of the authors of gsearch, suggested that it would be cool to build a web interface that, in effect, let you run gsearch over the world’s public source code. I thought that sounded fun, so that’s what I did that summer. Due primarily to an excess of optimism in our original schedule, the launch slipped to October, but on October 5, 2006 we did launch (by then I was back at school but still a part-time intern).

I built the earliest demos using Ken Thompson’s Plan 9 grep, because I happened to have it lying around in library form. The plan had been to switch to a “real” regexp library, namely PCRE, probably behind a newly written, code reviewed parser, since PCRE’s parser was a well-known source of security bugs. The only problem was my then-recent discovery that none of the popular regexp implementations – not Perl, not Python, not PCRE – used real automata. This was a surprise to me, and even to Rob Pike, the author of the Plan 9 regular expression library. (Ken was not yet at Google to be consulted.) I had learned about regular expressions and automata from the Dragon Book, from theory classes in college, and from reading Rob’s and Ken’s code. The idea that you wouldn’t use the guaranteed linear time algorithm had never occurred to me. But it turned out that Rob’s code in particular used an algorithm only a few people had ever known, and the others had forgotten about it years earlier. We launched with the Plan 9 grep code; a few years later I did replace it, with RE2.

Russ covers inverted indexes, tri-grams, and regexes, with pointers to working code and examples of how to use the code searcher locally, for example on the Linux source code.

Extremely useful article as an introduction to indexes and regexes.
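A toy Python version of the technique, assuming a literal substring in the query (real implementations like Russ’s extract the required trigrams from the regex itself); the file contents are invented.

import re
from collections import defaultdict

files = {
    "main.c": "int main(void) { return regexec(&re, line, 0, NULL, 0); }",
    "util.c": "static int helper(void) { return 0; }",
}

def trigrams(text):
    return {text[i:i + 3] for i in range(len(text) - 2)}

# inverted index: trigram -> files containing it
index = defaultdict(set)
for name, body in files.items():
    for tri in trigrams(body):
        index[tri].add(name)

def search(literal, pattern):
    # only files containing every trigram of the literal can possibly match,
    # so the expensive regex runs over a small candidate set
    candidates = set(files)
    for tri in trigrams(literal):
        candidates &= index.get(tri, set())
    return [name for name in candidates if re.search(pattern, files[name])]

print(search("regexec", r"regexec\(&\w+"))   # -> ['main.c']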

MongoDB Indexing in Practice

Filed under: Indexing,MongoDB — Patrick Durusau @ 3:36 pm

MongoDB Indexing in Practice

From the post:

With the right indexes in place, MongoDB can use its hardware efficiently and serve your application’s queries quickly. In this article, based on chapter 7 of MongoDB in Action, author Kyle Banker talks about refining and administering indexes. You will learn how to create, build and backup MongoDB indexes.

Indexing is closely related to topic maps, and the more you learn about indexes, the better topic maps you will be writing.

Take for example the treatment of “multiple keys” in this post.

What that means is that multiple entries in an index can point at the same document.

Not that big of a step to multiple ways to identify the same subject.

Granted, in Kyle’s example none of his “keys” really identify the subject; they are more isa, usedWith, usedIn type associations.
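A pymongo sketch of the multiple-keys point, assuming a mongod on localhost; the collection and field names are made up. Indexing the array field produces several index entries that all lead back to one document, much as several identifiers can lead back to one subject.

from pymongo import MongoClient

products = MongoClient("localhost", 27017).test_db.products

products.insert_one({
    "name": "MongoDB in Action",
    "tags": ["mongodb", "indexing", "nosql"],   # array field
})

products.create_index("tags")   # multikey index: one entry per tag value

# any of the indexed tag values resolves to the same document
for tag in ("mongodb", "indexing", "nosql"):
    print(tag, "->", products.find_one({"tags": tag})["name"])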

The Little Redis Book

Filed under: MongoDB,Redis — Patrick Durusau @ 3:35 pm

The Little Redis Book by Karl Seguin.

Weighs in at 29 pages and does a good job of creating an interest in knowing more about Redis.

Seguin is also the author of The Little MongoDB Book (which comes in at 32 pages).

Apache HBase 0.92.0 has been released

Filed under: HBase — Patrick Durusau @ 3:34 pm

Apache HBase 0.92.0 has been released by Jonathan Hsieh

More than 670 issues were addressed in this release but Jonathan highlights nine changes/improvements for your attention:

User Features:

  • HFile v2, a new more efficient storage format
  • Faster recovery via distributed log splitting
  • Lower latency region-server operations via new multi-threaded and asynchronous implementations.

Operator Features:

  • An enhanced web UI that exposes more internal state
  • Improved logging for identifying slow queries
  • Improved corruption detection and repair tools

Developer Features:

  • Coprocessors
  • Build support for Hadoop 0.20.20x, 0.22, 0.23.
  • Experimental: offheap slab cache and online table schema change

January 23, 2012

SolrCloud is Coming (and looking to mix in even more ‘NoSQL’)

Filed under: Solr,SolrCloud — Patrick Durusau @ 7:47 pm

SolrCloud is Coming (and looking to mix in even more ‘NoSQL’) by Mark Miller.

From the post:

The second phase of SolrCloud has been in full swing for a couple of months now and it looks like we are going to be able to commit this work to trunk very soon! In Phase1 we built on top of Solr’s distributed search capabilities and added cluster state, central config, and built-in read side fault tolerance. Phase 2 is even more ambitious and focuses on the write side. We are talking full-blown fault tolerance for reads and writes, near real-time support, real-time GET, true single node durability, optimistic locking, cluster elasticity, improvements to the Phase 1 features, and more.

Once we get Phase2 into trunk we will work on hardening and finishing a couple missing features – then SolrCloud should be ready to be part of the upcoming Lucene/Solr 4.0 release.

If you want to read more about SolrCloud and where we are with Phase 2, check out the new wiki page that we are working on at http://wiki.apache.org/solr/SolrCloud2 – feedback appreciated!

Occurs to me that tweaking SolrCloud (or just Solr) might make a nice short course for library students. If not to become Solr mavens, just to get a better feel for the range of possibilities.

Solr and Lucene Reference Guide updated for v3.5

Filed under: Lucene,Solr — Patrick Durusau @ 7:47 pm

Solr and Lucene Reference Guide updated for v3.5

From the post:

The free Solr Reference Guide published by Lucid Imagination has been updated to 3.5 – the current release version of Solr and Lucene. The changes weren’t major, but here are the key changes:

  • Support for the Hunspell stemmer
  • The new langid UpdateProcessor
  • Numeric types now support sortMissingFirst/Last
  • New parameter hl.q for use with highlighting
  • Field types supported by the StatsComponent now includes date and string fields

Almost 400 pages of rainy winter day reading.

OK, so you need a taste for that sort of thing. 😉

Mining Text Data

Filed under: Classification,Data Mining,Text Extraction — Patrick Durusau @ 7:46 pm

Mining Text Data by Charu Aggarwal and ChengXiang Zhai, Springer, February 2012, approximately 500 pages.

From the publisher’s description:

Text mining applications have experienced tremendous advances because of web 2.0 and social networking applications. Recent advances in hardware and software technology have led to a number of unique scenarios where text mining algorithms are learned.

Mining Text Data introduces an important niche in the text analytics field, and is an edited volume contributed by leading international researchers and practitioners focused on social networks & data mining. This book contains a wide swath in topics across social networks & data mining. Each chapter contains a comprehensive survey including the key research content on the topic, and the future directions of research in the field. There is a special focus on Text Embedded with Heterogeneous and Multimedia Data which makes the mining process much more challenging. A number of methods have been designed such as transfer learning and cross-lingual mining for such cases.

Mining Text Data simplifies the content, so that advanced-level students, practitioners and researchers in computer science can benefit from this book. Academic and corporate libraries, as well as ACM, IEEE, and Management Science focused on information security, electronic commerce, databases, data mining, machine learning, and statistics are the primary buyers for this reference book.

Not at the publisher’s site, but you can see the Table of Contents and chapter 4, A SURVEY OF TEXT CLUSTERING ALGORITHMS, and chapter 6, A SURVEY OF TEXT CLASSIFICATION ALGORITHMS, at: www.charuaggarwal.net/text-content.pdf.

The two chapters you can download from Aggarwal’s website will give you a good idea of what to expect from the text.

While an excellent survey work, with chapters written by experts in various sub-fields, it also suffers from the survey work format.

For example, for the two sample chapters, there are overlaps in the bibliographies of both chapters. That is not surprising given the closely related subject matter, but as a reader I would be interested in discovering which works are cited in both chapters, something that, given the back-of-the-chapter bibliography format, is possible only by repetitive manual inspection.

Although I rail against examples in standards, expanding the survey reference work format to include more details and examples would only increase its usefulness and possibly its life as a valued reference.

Which raises the question of whether a print format for survey works makes sense at all. The research landscape is changing quickly, and a shelf life of 2 to 3 years, if that long, seems a bit brief for the going rate for print editions. Printed versions of chapters, as smaller and more timely works on demand, would be a value-add proposition that Springer is in a unique position to bring to its customers.

DoD Lists Key Needed Capabilities

Filed under: Marketing,Military — Patrick Durusau @ 7:46 pm

DoD Lists Key Needed Capabilities

From the post:

The Pentagon has released a list of 30 war-fighting capabilities it says it needs to fight anywhere on the globe in the future.

The 75-page document — officially called the Joint Operational Access Concept (JOAC) — lays out how the services must work together to defeat anti-access threats. It also will help shape development of future weapons and equipment.

“It’s a way to look at whether we’re correctly developing joint capabilities, not just service capabilities, to be able to get to where we need,” Lt. Gen. George Flynn, director of joint force development on the Joint Staff, said of the document during a Jan. 20 briefing at the Pentagon.

The document goes a step beyond the traditional fighting spaces — air, land and sea — to include space and cyberspace.

Interesting document that should give you the opportunity to learn something about the military view of the world and find potential areas for discussion of semantic integration.

Percona Live DC 2012 Slides (MySQL)

Filed under: Database,MySQL — Patrick Durusau @ 7:44 pm

Percona Live DC 2012 Slides

I put the (MySQL) header for the benefit of hard core TM fans who can’t be bothered with MySQL posts. 😉

I won’t say what database system I originally learned databases on but I must admit that I became enchanted with MySQL years later.

For a large number of applications, including TM backends, MySQL is entirely appropriate.

Sure, when your company goes interplanetary you are going to need a bigger solution.

But in the mean time, get a solution that isn’t larger than the problem you are trying to solve.

BTW, MySQL installations have the same mapping for BI issues I noted in an earlier post today.

Thoughts on how you would fashion a generic solution that does not require conversion of data?

50 years of linguistics at MIT

Filed under: Linguistics — Patrick Durusau @ 7:44 pm

50 years of linguistics at MIT by Arnold Zwicky.

From the post:

The videos are now out — from the 50th-anniversary celebrations (“a scientific reunion”) of the linguistics program at MIT, December 9-11, 2011. The schedule of the talks (with links to slides for them) is available here, with links to other material: a list of attendees, a list of the many poster presentations, videos of the main presentations, personal essays by MIT alumni, photographs from the event, a list of MIT dissertations from 1965 to the present, and a 1974 history of linguistics at MIT (particularly interesting for the years before the first officially registered graduate students entered the program, in 1961).

The eleven YouTube videos (of the introduction and the main presentations) can be accessed directly here.

See Arnold’s post for the links.

R Regression Diagnostics Part 1

Filed under: R,Regression — Patrick Durusau @ 7:43 pm

R Regression Diagnostics Part 1 By Vik Paruchuri.

From the post:

Linear regression can be a fast and powerful tool to model complex phenomena. However, it makes several assumptions about your data, and quickly breaks down when these assumptions, such as the assumption that a linear relationship exists between the predictors and the dependent variable, break down. In this post, I will introduce some diagnostics that you can perform to ensure that your regression does not violate these basic assumptions. To begin with, I highly suggest reading this article on the major assumptions that linear regression is predicated on.

Just like any other tool, the more you know about it, the better use you will make of it.
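Vik’s post is in R; as a rough Python analogue (not his code) of the most basic check he describes, fit a line to deliberately non-linear synthetic data and plot residuals against fitted values: a curved band signals that the linearity assumption is violated. The data here is invented.

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = x ** 2 + rng.normal(0, 5, 200)      # true relationship is quadratic

model = sm.OLS(y, sm.add_constant(x)).fit()

plt.scatter(model.fittedvalues, model.resid, s=10)
plt.axhline(0, color="red")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()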

Scalaris

Filed under: Erlang,Key-Value Stores,Scalaris — Patrick Durusau @ 7:43 pm

Scalaris

From the webpage:

Scalaris is a scalable, transactional, distributed key-value store. It can be used for building scalable Web 2.0 services.

Scalaris uses a structured overlay with a non-blocking Paxos commit protocol for transaction processing with strong consistency over replicas. Scalaris is implemented in Erlang.

Following http://www.zib.de/CSR/Projects/scalaris I found:

Our work is similar to Amazon’s SimpleDB, but additionally supports full ACID properties. Dynamo, in contrast, restricts itself to eventual consistency only. As a test case, we chose Wikipedia, the free encyclopedia, that anyone can edit. Our implementation serves approx. 2,500 transactions per second with just 16 CPUs, which is better than the public Wikipedia.

Be forewarned that the documentation is in Google Docs, which does not like Firefox on Ubuntu.

Sigh, back to browser wars, again? Says it will work with Google Chrome.

Degrees of Semantic Precision

Filed under: Semantic Web,Semantics — Patrick Durusau @ 7:43 pm

Reading Mike Bergman’s posts on making the Semantic Web work tripped a realization that Linked Data and other Semantic Web proposals are about creating a particular degree of semantic precision.

And I suspect that is the key to the lack of adoption of Linked Data, etc.

Think about the levels of semantic precision that you use during the day. With family and children, one level of semantic precision; another level of precision with your co-workers; yet another level as you deal with merchants, public servants and others during the day. And you can switch in conversation from one level to another, such as when your child interrupts a conversation with your spouse.

To say nothing of the levels of semantic precision that vary from occupation to occupation, with ontologists/logicians at the top, followed closely by computer scientists, and then doctors, lawyers, computer programmers and a host of others. All of whom also use varying degrees of semantic precision during the course of a day.

We communicate with varying degrees of semantic precision and the “semantic” Web reflects that practice.

I say lower-case “semantic” Web because the web had semantics long before current efforts to prescribe only one level of semantic precision.

Semantic Web – Sweet Spot(s) and ‘Gold Standards’

Filed under: OWL,RDF,UMBEL,Wikipedia,WordNet — Patrick Durusau @ 7:43 pm

Mike Bergman posted a two-part series on how to make the Semantic Web work:

Seeking a Semantic Web Sweet Spot

In Search of ‘Gold Standards’ for the Semantic Web

Both are worth your time to read but the second sets the bar for “Gold Standards” for the Semantic Web as:

The need for gold standards for the semantic Web is particularly acute. First, by definition, the scope of the semantic Web is all things and all concepts and all entities. Second, because it embraces human knowledge, it also embraces all human languages with the nuances and varieties thereof. There is an immense gulf in referenceability from the starting languages of the semantic Web in RDF, RDFS and OWL to this full scope. This gulf is chiefly one of vocabulary (or lack thereof). We know how to construct our grammars, but we have few words with understood relationships between them to put in the slots.

The types of gold standards useful to the semantic Web are similar to those useful to our analogy of human languages. We need guidance on structure (syntax and grammar), plus reference vocabularies that encompass the scope of the semantic Web (that is, everything). Like human languages, the vocabulary references should have analogs to dictionaries, thesauri and encyclopedias. We want our references to deal with the specific demands of the semantic Web in capturing the lexical basis of human languages and the connectedness (or not) of things. We also want bases by which all of this information can be related to different human languages.

To capture these criteria, then, I submit we should consider a basic starting set of gold standards:

  • RDF/RDFS/OWL — the data model and basic building blocks for the languages
  • Wikipedia — the standard reference vocabulary of things, concepts and entities, plus other structural guidances
  • WordNet — lexical language references as an aid to natural language processing, and
  • UMBEL — the structural reference for the connectedness of things for basic coherence and inference, plus a vocabulary for mapping amongst reference structures and things.

Each of these potential gold standards is next discussed in turn. The majority of discussion centers on Wikipedia and UMBEL.

There is one criterion that Mike leaves out: the choice of a majority of users.

Use by a majority of users is a sweet spot that brooks no argument.

January 22, 2012

Big Data Success in Government (Are you a “boutique” organization?)

Filed under: BigData,Government,Marketing — Patrick Durusau @ 7:42 pm

Big Data Success in Government by Alex Olesker.

From the post:

On January 19, Carahsoft hosted a webinar on Big Data success in government with Bob Gourley and Omer Trajman of Cloudera. Bob began by explaining the current state of Big Data in the government. There are 4 areas of significant activity in Big Data. Federal integrators are making large investments in research and development of solutions. Large firms like Lockheed Martin as well as boutique organizations have made major contributions. The Department of Defense and the Intelligence Community have been major adopters of Big Data solutions to handle intelligence and information overload. Typically, they use Big Data technology to help analysts “connect the dots” and “find a needle in a haystack.” The national labs under the Department of Energy have been developing and implementing Big Data solutions for research as well, primarily in the field of bioinformatics, the application of computer science to biology. This ranges from organizing millions of short reads to sequence a genome to better tracking of patients and treatments. The last element in government use of Big Data are the Office of Management and Budget and the General Services Administration, which primarily ensure the sharing of lessons and solutions.

Just background reading that may give you some ideas on where in government to pitch semantic integration using topic maps or other technologies, such as graph databases.

Remember that no matter how the elections turn out this year, the wheels are turning for “consolidation” of government offices and IT is going to be in demand to make that “consolidation” work.

You may be a “boutique organization,” and unable to afford a member of Congress but most agencies have small contractor officers (I don’t think they call them boutique officers) who are supposed to parcel out some work to smaller firms. Doesn’t hurt to call.
