Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

June 10, 2011

“Kiruna Stol” 1.4 – Milestone 4

Filed under: Graphs,Neo4j — Patrick Durusau @ 6:34 pm

“Kiruna Stol” 1.4 – Milestone 4

From the post:

We’re on the fast track to the next major Neo4j release, “Kiruna Stol”. With today’s milestone, we’ve added some brand new features, some experimental aspects that we’re looking for feedback on, and of course numerous enhancements to everyone’s favorite graph database.

I guess a new human-friendly query language (Cypher), a batch-oriented REST API, memory usage improvements, not to mention documentation, qualify as being on the fast track!

There goes my weekend! 😉 It will be time well spent.
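If you want to poke at the new batch-oriented REST API before the weekend is over, here is a minimal Python sketch. It assumes a local Neo4j server on the default port; the /db/data/batch job format follows the Neo4j REST documentation, and the node properties are invented.

# Minimal sketch: three jobs sent to Neo4j's batch REST endpoint
# in a single round trip. Assumes a local server on port 7474;
# node properties are invented for the example.
import json
from urllib import request

jobs = [
    {"method": "POST", "to": "/node", "body": {"name": "topic-a"}, "id": 0},
    {"method": "POST", "to": "/node", "body": {"name": "topic-b"}, "id": 1},
    # "{0}" and "{1}" refer back to the results of jobs 0 and 1.
    {"method": "POST", "to": "{0}/relationships",
     "body": {"to": "{1}", "type": "RELATED_TO"}, "id": 2},
]

req = request.Request("http://localhost:7474/db/data/batch",
                      data=json.dumps(jobs).encode("utf-8"),
                      headers={"Content-Type": "application/json"})
print(request.urlopen(req).read().decode("utf-8"))

One round trip instead of three is the whole point of the batch API.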

PragPub

Filed under: Computer Science,Programming — Patrick Durusau @ 6:34 pm

PragPub

Edited by Michael Swaine, so this is a thinking-oriented publication with a fairly wide range.

That it is also entertaining is just an added bonus.

Not topic map specific but it never hurts to improve one’s thinking skills. (Or at least that is what I was told as a child.)

Lucene Revolution 2011

Filed under: Conferences,Lucene — Patrick Durusau @ 6:33 pm

Lucene Revolution 2011

Materials from Lucene Revolution 2011 are now online.

I must admit that Searching The United States Code with Solr/Lucene caught my eye first. 😉

That presentation and the others are worth a close reading!

5 Important Factors for Pricing Data in the Information (Overload) Age

Filed under: Marketing — Patrick Durusau @ 6:33 pm

5 Important Factors for Pricing Data in the Information (Overload) Age

Nick Ducoff, co-founder and CEO of Infochimps, identifies five (5) factors to consider in pricing data.

Selling the results of the application of topic maps to data could follow his advice quite easily.

Improvements in Bio4j Go Tools (Graph visualization)

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 6:32 pm

Improvements in Bio4j Go Tools (Graph visualization)

From the website:

A new version of Bio4j Go Tools viewer is available, it includes improvements in the graph visualization of GO annotation results.
These are the new features:

  • Load GO annotation results from URL: There’s no need anymore to upload the XML file with the results every time you want to see the graph visualization. Just enter the publicly accessible URL of the file and the server will directly get the file for you.
  • Restrict the visualization to only one GO sub-ontology at a time: Terms belonging to different sub-ontologies (cellular component, biological process, molecular function) are not mixed up anymore.
  • Choice of layout algorithms: You can choose between two different layout algorithms for the visualization (Yifan Hu and Fruchterman-Reingold).
  • Customizable layout algorithm time: Range of 1-10 minutes.

A tutorial is also linked from this page that demonstrates the features of Bio4j.

June 9, 2011

Ontologies As Low-Lying Subjects

Filed under: Mapping,Ontology — Patrick Durusau @ 6:35 pm

While writing up a call for papers on “integration” of ontologies, it occurred to me that ontologies are really low-lying subjects for topic maps.

Any text corpus or database is going to require extraction of its content as a first step.

Your second step is going to be processing that extracted content to identify subjects.

Your third step is going to be creating topics and associations between topics, along with the properties of topics.

Your fourth step, depending on the purpose of your topic map, will be to create pointers back into the content for users (occurrences).

And finally, your fifth step, is going to be fashioning the interface your users will use for the topic map.

Compare those steps to topic mapping ontologies:

Your first step isn’t extraction of the data because, while ontologies may exist in some format, they are free-standing sets of subjects.

Your second step won’t be to identify subjects because the ontology already has subjects identified. (Yes, there are other subjects you could identify but this is the low-lying fruit version).

You avoid the third step because subjects in an ontology already have properties and relationships to other subjects.

You don’t need pointers because the entire ontology fits into your topic map, so no fourth step.

You have a familiar interface, the ontology itself, which leaves you with no fifth step.

Well, that’s a slight exaggeration. 😉

You do need the third step where subjects in the ontology get properties in addition to the ones they have in their respective ontologies. Those added properties enable the same subjects in different ontologies to merge. If their respective properties are also subjects in the ontology, that is, they can themselves have properties, you should be able to merge those properties as well.
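A minimal sketch of that third step, assuming subjects carry sets of identifiers and merge whenever those sets overlap. The identifiers and properties below are invented:

# Sketch: merging subjects from two ontologies on shared identifiers.
def merge_subjects(subjects):
    """Single pass: union subjects whose identifier sets overlap."""
    merged = []
    for subj in subjects:
        ids = set(subj["identifiers"])
        for existing in merged:
            if ids & existing["identifiers"]:
                existing["identifiers"] |= ids
                existing["properties"].update(subj["properties"])
                break
        else:
            merged.append({"identifiers": ids,
                           "properties": dict(subj["properties"])})
    return merged

onto_a = {"identifiers": ["http://example.org/a#Protein"],
          "properties": {"label": "Protein"}}
onto_b = {"identifiers": ["http://example.org/b#protein",
                          "http://example.org/a#Protein"],  # the added property
          "properties": {"definition": "a biological macromolecule"}}
print(merge_subjects([onto_a, onto_b]))

Real topic map merging is subtler (scope, typed names, properties that are themselves subjects), but overlap-on-identifiers is exactly the low-lying fruit at issue.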

I realize that the originators of some ontologies may disagree with the mappings, but that really isn’t the appropriate question. The question is whether users find the mapping useful for some particular purpose. I am not sure what other test one would have.

Paper: A Study of Practical Deduplication

Filed under: Deduplication,Marketing,Topic Maps — Patrick Durusau @ 6:34 pm

Paper: A Study of Practical Deduplication

From the post:

With BigData comes BigStorage costs. One way to store less is simply not to store the same data twice. That’s the radically simple and powerful notion behind data deduplication. If you are one of those who got a good laugh out of the idea of eliminating SQL queries as a rather obvious scalability strategy, you’ll love this one, but it is a powerful feature and one I don’t hear talked about outside the enterprise. A parallel idea in programming is the once-and-only-once principle of never duplicating code.

Someone asked the other day about how to make topic maps profitable.

Well, selling a solution to issues like duplication of data would be one of them.

You do know that the kernel of the idea for topic maps arose out of a desire to avoid paying 2X, 3X, 4X, or more for the same documentation on military equipment. Yes? Ultimately it didn’t fly because of the markup that contractors get on documentation, which then funds their hiring of military retirees. That doesn’t mean the original idea was a bad one.

Now, applying a topic map to military documentation systems and demonstrating the duplication of content, perhaps using one of Lars Marius Garshol’s similarity measures, that sounds like a rocking topic map application. Particularly in budget cutting times.
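To get a feel for how little machinery a first demonstration needs, here is a toy sketch: hash fixed-size chunks of two documents and report the overlap. The documents are invented, and Lars Marius’s measures are far more refined than this:

# Toy deduplication check: shared chunk hashes across two documents.
import hashlib

def chunk_hashes(data, size=64):
    # hash fixed-size chunks; shared hashes flag duplicated content
    return {hashlib.sha1(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)}

doc_a = b"Maintenance procedure for the M1 radio unit. " * 40
doc_b = (b"Maintenance procedure for the M1 radio unit. " * 30
         + b"Revised appendix text. " * 10)

a, b = chunk_hashes(doc_a), chunk_hashes(doc_b)
print("shared chunks: %.0f%%" % (100.0 * len(a & b) / len(a | b)))

Run something like that over a contractor’s documentation sets and the duplication percentage makes the budget argument for you.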

CUBRID Contest

Filed under: Topic Maps — Patrick Durusau @ 6:34 pm

CUBRID Contest

Followed a tweet by Alex Popescu and wound up here:

CUBRID it! is an online programming contest which requires finding and coding the best solution for a SQL related problem we will be proposing.

This programming contest is organized by the CUBRID Open Source community once every year.

This year the contest will be held between June 1st and June 22nd.

CUBRID it! is open to all talented developers who are willing to challenge their skills and compare themselves to other passionate developers from all around the world.

In the end, it’s all about coding and having fun, together with other CUBRID fans!

See the website for the full details but the problem:

Given all the tables in a database (but only the user tables, not also the system tables), determine:

  • The value which occurs the most times (is duplicated the most times) in the database and is “non-numeric”! (“non-numeric” means that the respective value does not contain only digits – it must contain at least one non-digit character; for example, “1298” is “numeric”, but “12,98” is “non-numeric” (and also “12.98”), because it contains the “,” character; the same goes for “-100” – it is considered “non-numeric” as well, because of the “-” sign).
  • The number of occurrences for the value found above.

just sounds like a topic map problem, doesn’t it?
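It does. A brute-force sketch of the problem in Python, using sqlite3 as a stand-in for CUBRID (the contest of course wants a CUBRID solution; the table-discovery query is SQLite-specific and the sample data is invented):

# Find the most-duplicated "non-numeric" value across all user tables.
import re
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (name TEXT, zip TEXT);
    INSERT INTO people VALUES ('smith', '30301'), ('smith', '12,98');
    CREATE TABLE orders (customer TEXT, amount TEXT);
    INSERT INTO orders VALUES ('smith', '-100'), ('smith', '1298');
""")

counts = Counter()
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]  # user tables
for table in tables:
    for row in conn.execute("SELECT * FROM %s" % table):
        for value in map(str, row):
            if not re.fullmatch(r"\d+", value):  # "non-numeric" per the rules
                counts[value] += 1

value, occurrences = counts.most_common(1)[0]
print(value, occurrences)  # -> smith 4

The topic map flavor is in that last step: the same value recurring across different tables and columns is the same subject showing up in different contexts.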

CouchDB 1.1 Feature Guide

Filed under: CouchDB,NoSQL — Patrick Durusau @ 6:34 pm

CouchDB 1.1 Feature Guide

From Alex Popescu’s myNoSQL, news of a feature guide for CouchDB 1.1 and related links.

June 8, 2011

Everything is Connected – Building a collaborative environment on Neo4J

Filed under: Collaboration,Neo4j — Patrick Durusau @ 10:26 am

Everything is Connected – Building a collaborative environment on Neo4J

Abstract:

Neil Ellis will be giving a talk on the value of using the Neo4J graph database to build a collaborative infrastructure. He will explore how the choice of storage mechanism affects application design and ultimately what functionality systems provide for their users. Finally, by looking at the value of serendipity in collaborative systems he will hopefully convince you that Neo4J is more than just an alternative to an RDBMS.

An interesting presentation on serendipity and graph databases, though light on technical content about Neo4J. It makes a number of interesting observations, so I do think the presentation is worth the time to watch.

I have written asking that the slides be posted.

structr – First Draft User Guide

Filed under: Graphs,Neo4j — Patrick Durusau @ 10:25 am

structr – First Draft User Guide

This is a CMS that is based on Neo4J.

Suggestions/comments?

Topic Maps: “Hello World” Example

Filed under: Examples,Topic Maps — Patrick Durusau @ 10:24 am

Topic Maps: “Hello World” Example

In a tweet on 6 June 2011, Inge Henriksen reminded us of this “Hello World” example for topic maps.

Thanks Inge!

Justice Department … E-Discovery Review and Advanced Text Analytics

Filed under: e-Discovery,Legal Informatics — Patrick Durusau @ 10:23 am

United States Justice Department Implements Relativity for E-Discovery Review and Advanced Text Analytics

From the announcement:

…Relativity Analytics powers functionality such as clustering, the automatic grouping of documents by similar concepts, as well as concept search, and the ability for end users to train the system to group documents based on concepts and issues they define.

Relativity is being deployed in EOUSA’s Litigation Technology Service Center (LTSC) to provide electronic discovery services for all U.S. Attorneys’ Offices, which include over 6,000 attorneys nationwide. EOUSA will use Relativity Analytics to empower U.S. Attorney teams to do more with limited resources by allowing them to quickly locate key documents and increase their review speeds through enormous data sets in compressed time frames.

I like the idea of training the system to group documents. Not that far from interactive merging based on user criteria. It would be more useful to colleagues if portions of documents could be grouped, so they don’t have to wade through whole documents for the relevant bits.
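As a rough analogue of grouping documents by concept, here is a TF-IDF plus k-means sketch. Relativity’s internals are not public, so this only illustrates the general technique; the documents and cluster count are invented and scikit-learn is assumed to be installed:

# Group documents by term similarity: a stand-in for concept clustering.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "invoice for contract renewal and payment terms",
    "payment schedule attached to the renewal invoice",
    "deposition transcript of the lead witness",
    "witness testimony from the second deposition",
]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for label, doc in sorted(zip(labels, docs)):
    print(label, doc)

Training on user-defined concepts, as described above, would replace the unsupervised step with a classifier trained on labeled examples.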

There is a lot of e-discovery management software on the market but two quick points:

1) The bar for good software goes up every year, and,

2) Topic maps have unique features that could make them players in this ever expanding market.

Microsoft Research Watch: AI, NoSQL and Microsoft’s Big Data Future

Filed under: Artificial Intelligence,BigData,NoSQL — Patrick Durusau @ 10:23 am

Microsoft Research Watch: AI, NoSQL and Microsoft’s Big Data Future

From the post:

Probase is a Microsoft Research project described as an “ongoing project that focuses on knowledge acquisition and knowledge serving.” Its primary goal is to “enable machines to understand human behavior and human communication.” It can be compared to Cyc, DBpedia or Freebase in that it is attempting to compile a massive collection of structured data that can be used to power artificial intelligence applications.

It’s powered by a new graph database called Trinity, which is also a Microsoft Research project. Trinity was spotted today by MyNoSQL blogger Alex Popescu, and that led us to Probase. Neither project seems to be available to the public yet.

These and other projects shed some light on Microsoft’s search and big data ambitions.

Doesn’t hurt to keep track of what people with a proven track record of making money, if not always producing useful software, are up to.

BTW, when you read the article quoting MS on Probase where it says:

…as evidences that can add to or modify the claims and beliefs in Probase. This means Probase is able to integrate information of varied quality from heterogeneous data sources.

Whoa! There is a long step from treating statements about a commonly identified subject (statement X is about the birth certificate of Barack Obama) to integrating information from heterogeneous data sources that identify a subject differently (a hospital record of B.H. Obama).

Looking forward to learning more about Trinity.

Knoodl!

Filed under: OWL,RDF — Patrick Durusau @ 10:22 am

Knoodl!

From the webpage:

Knoodl is a tool for creating, managing, and analyzing RDF/OWL descriptions. Its many features support collaboration in all stages of these activities. Knoodl’s key component is a semantic software architecture that supports Emergent Analytics. Knoodl is hosted in the Amazon EC2 cloud and can be used for free. It may also be licensed for private use as MyKnoodl.

Mapping to or between the components of RDF/OWL descriptions as subjects will require analysis, as will simple use of RDF/OWL descriptions. In either case this could be a useful tool.

Assignment Ops – TempleScript

Filed under: TMQL — Patrick Durusau @ 10:21 am

A posting on assignment ops by Robert Barta to PASTEBIN, which I reproduce here for your reading pleasure:

# templescript assignment ops

“Robert” => name \ rho # unconditional, replaces all names

“rho” => nickname \ rho # unconditional, replaces only nicknames

“der-rho-ist” +> nickname \ rho # unconditional, adds

() => name \ rho # unconditional, removes all names

#–

“Robert” ||> name \ rho # conditional, assigns only if no name existed

This reminds me of a post I need to finish on suggested comparison operations.
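For those of us who think in code, here is one reading of those ops as operations on a topic’s set of typed names. This is my interpretation of the snippet, not a TempleScript specification:

# My reading of the assignment ops above; not a TempleScript spec.
rho = {"names": [("name", "Rho"), ("nickname", "old-nick")]}

def replace(topic, typ, *values):
    # "X" => typ replaces all names of that type; () => typ removes them
    topic["names"] = [n for n in topic["names"] if n[0] != typ]
    topic["names"].extend((typ, v) for v in values)

def add(topic, typ, value):
    # "X" +> typ adds without replacing
    topic["names"].append((typ, value))

def assign_if_absent(topic, typ, value):
    # "X" ||> typ assigns only if no name of that type exists
    if not any(t == typ for t, _ in topic["names"]):
        topic["names"].append((typ, value))

replace(rho, "name", "Robert")        # "Robert" => name \ rho
add(rho, "nickname", "der-rho-ist")   # "der-rho-ist" +> nickname \ rho
print(rho["names"])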

June 7, 2011

Topic Map Travel Alert!

Filed under: Marketing,Topic Maps — Patrick Durusau @ 6:55 pm

Government harassment and intimidation of Bradley Manning supporters was brought to my attention earlier today. Since I have announced a number of conferences that involve travel to destinations outside the United States, for U.S. citizens and residents, I wanted to bring it to your attention.

Best advice: Do not attempt to return to the United States with electronic devices.

What is particularly disheartening about this story is that it illustrates the continuing incompetence of the United States with regard to IT issues. The agents in question may as well be using paper towel tubes to look for clues about the Wikileaks security breach.

With a topic map, one of the first subjects to represent would be the deeply flawed system design that enabled access to diplomatic cables over an extended time period. The contractor, sysadmins and others would definitely be nodes in that map. Not to mention that a physical audit of all the equipment with access to that data would be in order, rather than a premature focus on one possible suspect. It isn’t all that hard to imagine other compromised hardware, given that thousands of people had access to the same data.

I would exclude from such a topic map, as noise, the reports/discussion about Private Manning, because that is a diversion of attention from the poor security practices and the accountability for those practices (none to speak of) that led to this breach, whoever was responsible.

Given how unperturbed the DoD seems to be about the leak of the State Department cables, one has to wonder how seriously the chain of command communicated the alleged need to maintain security with regard to these cables. That could be another set of nodes in such a map.

A far cry from the “quick, we need a suspect and therefore the suspect must be guilty, that’s why we are abusing the suspect,” approach taken in this case. Stopping security leaks is different from saving face in the aftermath of security leaks. Maybe that’s the difference.

Marketing Topic Maps to Geeks

Filed under: Marketing,RDF,Topic Maps — Patrick Durusau @ 6:54 pm

Another aspect of the “oh woe is topic maps” discussion is the lack of interest in topic maps by geeks. There are open source topic map projects, presentations at geeky conferences, demos, etc., but no real geek swell for topic maps. But the same isn’t true for ontologies, RDF, description logic (ok, maybe less for DL), etc.

In retrospect, that isn’t all that surprising. Take a gander inside any of the software project categories at sourceforge.org. Any of those projects could benefit from more participation but every year sees more projects in the same categories and oft times covering the same capabilities.

Does any of that say to you: There is an answer and it has to be my answer? I won’t bother with collecting the stats for the lack of code reuse, another aspect of this issue. It is too well known to belabor.

Topic maps made the fatal mistake of saying answers are supplied by users and not developers. If you don’t think that was a mistake, take a look at any RDF vocabulary and tell me it was written by a typical user community. Almost without exception (I am sure there must be some somewhere), RDF vocabularies are written by experts and imposed on users. Hence their popularity, among experts anyway.

Topic maps inverted the usual world view to say that since users are the source of the semantics in the texts they read, we should start with their views. Imposing world views is always more popular than learning them, particularly among the geek community. They know what users should be doing and they damned well better do it.

Oh, the other mistake that topic maps made was to say there was more than one world view. Multiple world views that could be aligned together. The ontologists scotched that idea decades ago, although they haven’t been able to agree on the one world view that should be in place. I suppose there may be (in small letters) multiple world views, but that set is composed of the correct World View and numerous incorrect world views.

That would certainly be the position of US intelligence and diplomatic circles, who map into the correct World View all “incorrect world views,” which may account for their notable lack of successes over the last fifty or so years.

We should market topic maps to audiences who are interested in their own goals, not the goals of others, even geeks.

Goals differ from group to group. Some groups want to engage in disruptive behavior, other groups wish to prevent disruptive behavior, some want to advance research, still others want to be patent trolls.

Topic maps: Advance your goals with military-grade IT. (How’s that for a new topic map slogan?)

Marketing Indexing

Filed under: Indexing,Marketing,Topic Maps — Patrick Durusau @ 6:25 pm

The episodic “oh, woe is topic maps! We aren’t as successful as ..(insert some semantic technology)..” posts are back on topicmapmail@infoloom.com. I don’t dispute that topic maps could improve its market share. I remember the “we’re #2, so we try harder” advertising campaign and take our present position as a reason to try harder, not to bewail our fate as ordained.

Let’s talk about how to market something closely related to topic maps, indexing.

I come to you with this great new idea, indexing. Instead of starting on page 1 and going through page n every time a reader wants to find information, the index points right to it. A real time saver.

You get excited and so we discuss two different marketing approaches:

1) We can do presentations, papers, demos, etc., on the theory of indexing, models of indexing, write software that does indexing (with a lot of effort), etc.

or,

2) We present a publisher/reviewer/reader with a book without an index alongside a copy of the same book with an index, plus a list of ten subjects to find in the book.

Show of hands. Which one do you think would be more effective?
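For the software-minded, the same contrast in miniature: a toy inverted index answers “where is this subject?” by lookup instead of scanning page 1 through page n. The page texts are invented:

# Toy inverted index: lookup instead of a page-by-page scan.
from collections import defaultdict

pages = {
    1: "topic maps and subject identity",
    2: "indexing as a navigation aid",
    3: "subject identity in merged indexes",
}

index = defaultdict(set)
for page, text in pages.items():
    for word in text.split():
        index[word].add(page)

print(sorted(index["subject"]))  # -> [1, 3]

Option 2 is that print statement, live, in front of the reviewer.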

Machine Learning and Knowledge Discovery for Semantic Web

Filed under: Machine Learning,Semantic Web — Patrick Durusau @ 6:20 pm

Machine Learning and Knowledge Discovery for Semantic Web

Description:

Machine Learning and Semantic web are covering conceptually different sides of the same story – Semantic Web’s typical approach is top-down modeling of knowledge and proceeding down towards the data while Machine Learning is almost entirely data-driven bottom-up approach trying to discover the structure in the data and express it in the more abstract ways and rich knowledge formalisms. The talk will discuss possible interaction and usage of Machine Learning and Knowledge discovery for Semantic Web with emphases on ontology construction. In the second half of the talk we will take a look at some research using machine learning for Semantic Web and demos of the corresponding prototype systems.

Slides.

The presentation runs 80+ minutes but three quick points:

First, the “Semi-Automatic Data-Driven Ontology Construction” tool, http://ontogen.ijs.si, viewed from a slightly different point of view, could be converted into a topic map authoring tool for working with data.

Second, the “jaguar” search example at 39:29 was particularly compelling. Definitely improves the usefulness of the search results but still working at the document level. The document level is the wrong level for search, unless you just like wasting time repeating what other people have already done.

Third, there are lots of other tools and resources at: http://ailab.ijs.si/. I am going to be slowly mining this site but if you encounter something really interesting, please make a comment or drop me a note.

Definitely a group to watch.

Sterling: Isolated Storage on Windows Phone 7

Filed under: Database,Software,Topic Map Software — Patrick Durusau @ 6:18 pm

Sterling: Isolated Storage on Windows Phone 7

Not topic map specific but if you need a backend for a topic map on Windows Phone 7, this might be of interest.

The launch of Windows Phone 7 provided an estimated 1 million Silverlight developers with the opportunity to become mobile coders practically overnight.

Applications for Windows Phone 7 are written in the same language (C# or Visual Basic) and on a framework that’s nearly identical to the browser version of Silverlight 3, which includes the ability to lay out screens using XAML and edit them with Expression Blend. Developing for the phone provides its own unique challenges, however, including special management required when the user switches applications (called “tombstoning”) combined with limited support for state management.

Sterling is an open source database project based on isolated storage that will help you manage local objects and resources in your Windows Phone 7 application as well as simplify the tombstoning process. The object-oriented database is designed to be lightweight, fast and easy to use, solving problems such as persistence, cache and state management. It’s non-intrusive and works with your existing type definitions without requiring that you alter or map them.

In this article, Windows Phone 7 developers will learn how to leverage the Sterling library to persist and query data locally on the phone with minimal effort, along with a simple strategy for managing state when an application is deactivated during tombstoning.

I use a basic cell phone about once a month. Someone else will have to comment on topic map apps on cell phones. 😉

June 6, 2011

OBML 2011 – 3rd Workshop on Ontologies in Biomedicine and Life Sciences

Filed under: Biomedical,Conferences,Ontology — Patrick Durusau @ 2:00 pm

OBML 2011 – 3rd Workshop on Ontologies in Biomedicine and Life Sciences

Important Dates

Submission of papers June 30, 2011
Notification of review results August 10, 2011
Deadline for revised versions September 9, 2011
Workshop October 6-7, 2011

Goals of the OBML

The series “Ontologies in Biomedicine and Life Sciences” (OBML workshop) was initiated by the workgroup for OBML of the German Society for Computer Science in 2009. The OBML aims to bring together scientists who are working in this area to exchange ideas and discuss new results, to start collaborations and to initiate new projects. The OBML workshop is held once annually and deals with all fundamental aspects of biomedical ontologies as well as additional “hot” topics.

Submissions are requested especially for the following topics:

  • Ontologies and terminologies in biology, medicine, and clinical research;
  • Ontologies for knowledge representation, methods of reasoning, integration and interoperability of ontologies;
  • Methods and tools for the construction and management of ontologies; and 
  • Applications of the Semantic Web in biomedicine and the life sciences.

The focus of OBML 2011 is phenotype ontologies in medicine and biomedical research.

“Integration” and “interoperability,” it sounds like they are singing the topic map song! 😉

GTC 2012

Filed under: Conferences,GPU,Graphic Processors — Patrick Durusau @ 1:59 pm

GTC (GPU Technology Conference) 2012

Important Dates

GTC 2012 in San Jose, May 14-17, 2012

Session proposals have closed but poster proposals are open until June 27, 2011. Both will re-open September 27, 2011.

From the website:

GTC advances awareness of high performance computing, and connects the scientists, engineers, researchers, and developers who use GPUs to tackle enormous computational challenges.

GTC 2012 will feature the latest breakthroughs and the most amazing content in GPU-enabled computing. Spanning 4 full days of world-class education delivered by some of the greatest minds in GPU computing, GTC will showcase the dramatic impact that parallel computing is having on scientific research and commercial applications.

BTW, hundreds of hours of video are available from GTC 2010 at this website.

If you are concerned with scaling topic maps and other semantic technologies or just high performance computing in general, the 2010 recordings look like a good place to start while awaiting the 2012 conference.

2nd Workshop on the Multilingual Semantic Web

Filed under: Conferences,Cross-lingual,Linked Data,Multilingual — Patrick Durusau @ 1:58 pm

2nd Workshop on the Multilingual Semantic Web

Collocated with the 10th International Semantic Web Conference (ISWC2011) in Bonn, Germany.

Important Dates

August 15th – submission deadline
September 5th – notification
September 10th – camera-ready deadline
October 23rd or 24th – workshop

Abstract:

Given the substantial growth of Web users that create and update knowledge all over the world in languages other than English, multilingualism has become an issue of major interest for the Semantic Web community. This process has been accelerated due to initiatives such as the Linked Data project, which encourages not only governments and public institutes to make their data available to the public, but also private organizations in domains such as medicine, geography, music etc. These actors often publish their data sources in their respective languages, and as such, in order to make this information interoperable and accessible to members of other linguistic communities, multilingual knowledge representation, access and translation are an impending need.

Items of special focus:

  • representation of multilingual information and language resources in Semantic Web and Linked Data formats
  • cross-lingual discovery and representation of mappings between multilingual Linked Data vocabularies and datasets
  • cross-lingual querying of knowledge repositories and Linked Data
  • machine translation and localization strategies for the Semantic Web

The first three are classic topic map fare and the last one isn’t that much of a reach.

Detection, Representation, and Exploitation of Events in the Semantic Web (DeRiVE 2011)

Filed under: Challenges,Conferences,Dataset,Semantic Web — Patrick Durusau @ 1:57 pm

Detection, Representation, and Exploitation of Events in the Semantic Web (DeRiVE 2011)

Full Day Workshop in conjunction with the 10th International Semantic Web Conference 2011 23/24 October 2011, Bonn, Germany

Important Dates

Deadline for paper submission: 8 August 2011 23:59 (11:59pm) Hawaii time
Notification of Acceptance: 29 August 2011 23:59 (11:59pm) Hawaii time
Camera-ready version: 8 September 2011
Workshop: 23 or 24 October 2011

Abstract:

The goal of DeRiVE 2011 is to strengthen the participation of the semantic web community in the recent surge of research on the use of events as a key concept for representing knowledge and organizing and structuring media on the web. The workshop invites contributions to three central questions, and the goal is to formulate answers to these questions that advance and reflect the current state of understanding of events in the semantic web. Each submission will be expected to address at least one question explicitly, and, if possible, include a system demonstration. We have released an event challenge dataset for use in the preparation of contributions, with the goal of supporting a shared understanding of their impact. A prize will be awarded for the best use(s) of the dataset; but the use of other datasets will also be allowed.

See the CFP for questions papers must address.

Also note the anticipated release of a dataset:

We will release a dataset of event data. In addition to regular papers, we invite everybody to submit a Data Challenge paper describing work on this dataset. We welcome analyses, extensions, alignments or modifications of the dataset, as well as applications and demos. The best Data Challenge paper will get a prize.

The dataset consists of over 100,000 events from three sources: the music website Last.fm, and the entertainment websites upcoming.yahoo.com and eventful.com. All three are represented in the LODE schema. Next to events, they contain artists, venues and location and time information. Some links between the instances of the three datasets are provided.

Suggestions for modeling events in topic maps?
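One obvious starting point, sketched below: treat each event as a topic whose identifiers carry the per-site URLs, so the cross-site links the dataset provides drive merging, with associations for venue, artist and time. All identifiers and values here are invented, loosely echoing the fields mentioned above:

# Sketch: an event as a topic with associations. Values are invented.
event = {
    "identifiers": ["http://last.fm/event/1234",
                    "http://eventful.com/events/E0-abc"],  # same gig, two sites
    "names": ["Example Band at The Venue"],
    "occurrences": [("source", "http://last.fm/event/1234")],
    "associations": [
        ("held-at", "venue:the-venue"),
        ("performed-by", "artist:example-band"),
        ("at-time", "2011-07-01T20:00"),
    ],
}

Merging on shared identifiers would then collapse the Last.fm, Upcoming and Eventful records for one event into a single topic.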

Apache Lucene 3.2 / Solr 3.2

Filed under: Indexing,Lucene,Search Engines,Solr — Patrick Durusau @ 1:54 pm

Apache Lucene 3.2 / Solr 3.2 released!

From the website:

Lucene can be downloaded from http://www.apache.org/dyn/closer.cgi/lucene/java/ and Solr can be downloaded from http://www.apache.org/dyn/closer.cgi/lucene/solr/

Highlights of the Lucene release include:

  • A new grouping module, under lucene/contrib/grouping, enables search results to be grouped by a single-valued indexed field
  • A new IndexUpgrader tool fully converts an old index to the current format.
  • A new Directory implementation, NRTCachingDirectory, caches small segments in RAM, to reduce the I/O load for applications with fast NRT reopen rates.
  • A new Collector implementation, CachingCollector, is able to gather search hits (document IDs and optionally also scores) and then replay them. This is useful for Collectors that require two or more passes to produce results.
  • Index a document block using IndexWriter’s new addDocuments or updateDocuments methods. These experimental APIs ensure that the block of documents will forever remain contiguous in the index, enabling interesting future features like grouping and joins.
  • A new default merge policy, TieredMergePolicy, which is more efficient due to being able to merge non-contiguous segments. See http://s.apache.org/merging for details.
  • NumericField is now returned correctly when you load a stored document (previously you received a normal Field back, with the numeric value converted to a string).
  • Deleted terms are now applied during flushing to the newly flushed segment, which is more efficient than having to later initialize a reader for that segment.

Highlights of the Solr release include:

  • Ability to specify overwrite and commitWithin as request parameters when using the JSON update format.
  • TermQParserPlugin, useful when generating filter queries from terms returned from field faceting or the terms component.
  • DebugComponent now supports using a NamedList to model Explanation objects in its responses instead of Explanation.toString.
  • Improvements to the UIMA and Carrot2 integrations.
  • Highlighting performance improvements.
  • A test-framework jar for easy testing of Solr extensions.
  • Bugfixes and improvements from Apache Lucene 3.2.

DiscoverText Free Tutorial Webinar

Filed under: Classifier,Indexing,Searching,Software — Patrick Durusau @ 1:53 pm

DiscoverText Free Tutorial Webinar

Tuesday June 7 at 12:00 PM EST (Noon)

From the webinar announcement:

This Webinar introduces new and existing DiscoverText users to the basic document ingest, search & code features, takes your questions, and demonstrates our newest tool, a machine-learning classifier that is currently in beta testing. This is also a chance to preview our “New Navigation” and advanced filters.

DiscoverText’s latest additions to our “Do it Yourself” platform can be easily trained to perform customized mood, sentiment and topic classification. Any custom classification scheme or topic model can be created and implemented by the user. You can also generate tag clouds and drill into the most frequently occurring terms or use advanced search and filters to create “buckets” of text.

The system makes it possible to capture, share and crowd source text data analysis in novel ways. For example, you can collect text content off Facebook, Twitter & YouTube, as well as other social media or RSS feeds. Dataset owners can assign their “peers” to coding tasks. It is simple to measure the reliability of two or more coders’ choices. A distinctive feature is the ability to adjudicate coder choices for training purposes or to report validity by code, coder or project.

So please join us Tuesday June 7 at 12:00 PM EST (Noon) for an interactive Webinar. Find out why sorting thousands of items from social media, email and electronic document repositories is easier than ever. Participants in the Webinar will be invited to become beta testers of the new classification application.

I haven’t tried the software, free version or otherwise, but will try to attend the webinar and report back.

June 5, 2011

HyperGraphDB

Filed under: Graphs,Hypergraphs,Topic Map Software,Topic Maps — Patrick Durusau @ 3:23 pm

HyperGraphDB has changed in appearance since my last visit.

From the website:

HyperGraphDB is a general purpose, open-source data storage mechanism based on a powerful knowledge management formalism known as directed hypergraphs. While a persistent memory model designed mostly for knowledge management, AI and semantic web projects, it can also be used as an embedded object-oriented database for Java projects of all sizes. Or a graph database. Or a (non-SQL) relational database.

Read Alex Popescu’s HyperGraphDB interview with Borislav Iordanov for a high-level overview.

Watch Borislav Iordanov’s HyperGraphDB Presentation at StrangeLoop 2010.

Feature Summary

  • Powerful data modeling and knowledge representation.
  • Graph-oriented storage.
  • N-ary, higher order relationships (edges) between graph nodes.
  • Graph traversals and relational-style queries.
  • Customizable indexing.
  • Customizable storage management.
  • Extensible, dynamic DB schema through custom typing.
  • Out of the box Java OO database.
  • Fully transactional and multi-threaded, MVCC/STM.
  • P2P framework for data distribution.

HyperGraphDB implements TopicMaps 1.0, TuProlog and a number of other models/standards.

Definitely worth taking out for a spin!
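If n-ary, higher-order edges are new to you, here is a toy directed hypergraph in Python. It sketches the formalism only, not HyperGraphDB’s API, which is Java and far richer:

# Toy directed hypergraph: an edge links any number of atoms, and
# edges are themselves atoms, so they can be linked in turn.
atoms = {}

def add(key, value=None, targets=()):
    atoms[key] = {"value": value, "targets": tuple(targets)}
    return key

add("patrick", value="Patrick")
add("topicmaps", value="Topic Maps")
add("neo4j", value="Neo4j")
# one edge relating three atoms -- not expressible as a single binary edge
add("writes-about", targets=("patrick", "topicmaps", "neo4j"))
# edges are atoms too, so higher-order statements are just more edges
add("since-2010", targets=("writes-about",))
print(atoms["writes-about"]["targets"])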

MLDemos

Filed under: Machine Learning,Visualization — Patrick Durusau @ 3:22 pm

MLDemos

From the website:

During my PhD I’ve come across a number of machine learning algorithms for classification, regression and clustering. While there is a great number of libraries, source code and binaries for different algorithms, it is always difficult to get a good grasp of what they do. Moreover, one ends up spending a great amount of time just getting the algorithm to display the results in an understandable way. Change the algorithm and you will have to do the work all over again. Some people have tried, and succeeded, to combine several algorithms into a single multi-purpose library, making their libraries extremely useful (you will find many of their names in the acknowledgements below), but still they didn’t solve the problem of visualization and ease of use. Matlab is an easy answer to that, but while extremely easy to use for generating and displaying single instances of data processing (data, results, models), Matlab is annoyingly slow and cumbersome when it comes to creating an interactive GUI. While preparing the exercise sessions for the Applied Machine Learning class at EPFL, I decided to combine the code snippets, example programs and libraries I had at hand into a seamless application where the different algorithms could be compared and studied easily.

This is an awesome piece of work! You can change the parameters and get immediate feedback on the impact of those changes.

Some minor issues (Windows XP, version 0.3.2):

The files in the /help directory have “open with” information set to Adobe Acrobat.

As far as I can tell, the “help” files don’t appear under the help menu or elsewhere.

There is no base directory, so files unpack into whatever directory is selected. Suggest creating an mldemos directory as the target.

Exiting MLDemos is treated as an error and generates an error report for Microsoft.

Documentation, however brief, about the various algorithms and their parameters would be a welcome addition. Perhaps keyed to one or more leading texts on machine learning. That sounds like something that should be contributed by an interested user doesn’t it? 😉

Kasabi

Filed under: Dataset,Graphs,Marketing,Topic Maps — Patrick Durusau @ 3:21 pm

Kasabi

A dataset collection, curation and interface website that is currently in a public beta.

Summarized in part as:

Search, Browse, Explore

You can browse through the catalog to find datasets based on their category, or search via keywords. From each dataset’s homepage you can quickly find useful information about its provenance, licensing and a snapshot of useful metrics such as when the dataset was last updated.

Using the Explore tools will get you deeper into the dataset: drilling down into detailed documentation and sample data.

Datasets and APIs

Every dataset in Kasabi has a range of core APIs listed right on the dataset homepage or discoverable through the search and browse tools. Choose the API that best supports what you need to do, whether it’s a search over the data or more complex queries. Subscribe to an API to immediately gain access using your API key. Your dashboard lists all your subscribed APIs, and each has a useful reference card of parameters and response formats available from its homepage. Need more detailed docs? We have those too.

Contribute APIs

Can’t find an API that matches your application? In Kasabi, you can contribute your own using our API building tools. These tools let developers create customised RESTful APIs that capture ways of querying or navigating across a dataset, producing results in a variety of built-in and custom formats. All contributed APIs are listed in the catalog, along with automatically generated documentation, allowing them to be shared with the Kasabi community.

The Contribute APIs looks quite interesting, particularly since all the datasets are stored as separate graph databases.

A bit more from the FAQ on custom APIs:

A custom API allows you tailor access to the dataset. This custom access will then be suited to your particular application or user community. By creating and maintaining a custom API over the data, you won’t be constrained by the default APIs provided by Kasabi or the data owner.

By allowing the developer community to share its skills in ways other than just creating applications, Kasabi lets us broaden the definition of data curation to cover APIs and access as well as the data itself.

Only fifty-nine (59) datasets as of June 4, 2011, with a definite UK flavor, but I expect that will grow fairly quickly. The usual suspects, the CIA World Factbook, BBC, New York Times, DBpedia, are all present. More than enough information to make topic map interfaces interesting. The principal advantage of topic map interfaces is the ability to specify a basis for a mapping, thereby enabling other researchers to follow or not, as they choose.
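For the curious, the subscribe-and-call flow described above boils down to a keyed HTTP request. A sketch follows; the endpoint path and parameter names are placeholders of my own, so check the API’s reference card on its homepage for the real ones:

# Sketch of calling a subscribed Kasabi API. The host, path and
# parameter names below are placeholders, not the documented API.
from urllib import parse, request

API = "http://api.kasabi.example/datasets/cia-factbook/apis/search"
params = parse.urlencode({"query": "population", "apikey": "YOUR-API-KEY"})
print(request.urlopen(API + "?" + params).read().decode("utf-8"))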
