Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

December 12, 2010

Outside.in Hyperlocal News API

Filed under: Data Source,Dataset — Patrick Durusau @ 6:00 pm

Outside.in Hyperlocal News API

From the website:

The Outside.in API lets you easily integrate hyperlocal news in your sites and applications by providing recent news stories and blog posts for any neighborhood, ZIP code, city, or state in the United States.

A news aggregation site that offers free developer accounts (daily limits on accesses).

Follows > 54,000 RSS feeds.
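For developers who want to experiment, a request against a hyperlocal API of this kind usually amounts to a keyed HTTP GET for a location. A minimal sketch, assuming a placeholder endpoint and key parameter (the real URL structure and authentication come from the Outside.in documentation):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class HyperlocalNewsClient {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint and parameters; consult the actual API documentation.
        String zip = "30601";
        String url = "http://hyperlocal-api.example.com/v1/zipcodes/" + zip
                + "/stories?dev_key=YOUR_KEY&limit=10";

        BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream()));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line); // raw response listing recent stories for the ZIP code
        }
        in.close();
    }
}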

Questions:

  1. What subjects would a topic map for the postal code where you live include? What information would you use from this service? (2-3 pages)
  2. What subjects would a topic map for the region where you live include? What information would you use from this service? (2-3 pages)
  3. What subjects would a topic map for the country where you live include? What information would you use from this service? (2-3 pages)

If it sounds like you weren’t given enough room for all the subjects you would want to include, consider that no topic map, dictionary, encyclopedia, etc., is ever complete.

Editorial choices always have to be made. This is an exercise to give you an opportunity to make those choices and then discuss them with your classmates. (Instead of your director or perhaps a board of supervisors.)

Daylife Developer

Filed under: Data Source,Dataset,Software — Patrick Durusau @ 5:54 pm

Daylife Developer

News aggregation and analysis service.

Offers free developer access to their API, capped at 5,000 calls per day.

From the website:

Have an idea for the next big news application? Build a great app using the Daylife API, then we’ll market it to our clients and give you 70% of the proceeds from any sales. Learn more.

I started to not mention this site so I could keep the 70% to myself but there is room for more than one great news app using topic maps. 😉

Oh, but that means creating an app.

An app that uses topic maps to deliver substantively different and useful aggregation of news.

Both of those are critical requirements.

The app must be substantively different, delivering a unique value-add from the use of topic maps. Something the user can’t get somewhere else.

The app must be useful, delivering a value-add that some community finds useful. A community willing to pay for that usefulness.

See you at Daylife Developer?

******
PS: Send pointers to similar resources to: patrick@durusau.net.

The more resources become available, including aggregation services, the greater the opportunity for topic maps!

Krati – A persistent high-performance data store

Filed under: Data Structures,NoSQL — Patrick Durusau @ 5:53 pm

Krati – A persistent high-performance data store

From the website:

Krati is a simple persistent data store with very low latency and high throughput. It is designed for easy integration with read-write-intensive applications with little effort in tuning configuration, performance and JVM garbage collection….

Simply put, Krati

  • supports varying-length data array
  • supports key-value data store access
  • performs append-only writes in batches
  • has write-ahead redo logs and periodic checkpointing
  • has automatic data compaction (i.e. garbage collection)
  • is memory-resident (or OS page cache resident) yet persistent
  • allows single-writer and multiple readers

Or you can think of Krati as

  • Berkeley DB JE backed by hash-based indexing rather than B-tree
  • A hashtable with disk persistency at the granularity of update batch

If you use Krati as part of a topic map application please share your experience.
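To make the feature list concrete, here is a minimal sketch of the key-value usage pattern a store like Krati targets. The interface below is my own stand-in, not Krati’s actual API; consult the Krati documentation for the real classes and constructors.

import java.util.HashMap;
import java.util.Map;

// Stand-in for a Krati-style store: byte[] keys/values, batched writes, explicit sync.
interface KeyValueStore {
    void put(byte[] key, byte[] value) throws Exception; // buffered, append-only write
    byte[] get(byte[] key);                              // served from memory/page cache
    void sync() throws Exception;                        // flush the batch, write a checkpoint
}

class InMemoryStore implements KeyValueStore {
    private final Map<String, byte[]> data = new HashMap<String, byte[]>();
    public void put(byte[] key, byte[] value) { data.put(new String(key), value); }
    public byte[] get(byte[] key) { return data.get(new String(key)); }
    public void sync() { /* a real store would flush its redo log here */ }
}

public class KratiStyleExample {
    public static void main(String[] args) throws Exception {
        KeyValueStore store = new InMemoryStore();
        store.put("topic:42".getBytes(), "subject identifier data".getBytes());
        store.sync(); // persist the batch
        System.out.println(new String(store.get("topic:42".getBytes())));
    }
}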

Szl – A Compiler and Runtime for the Sawzall Language

Filed under: Data Mining,Software — Patrick Durusau @ 5:52 pm

Szl – A Compiler and Runtime for the Sawzall Language

From the website:

Szl is a compiler and runtime for the Sawzall language. It includes support for statistical aggregation of values read or computed from the input. Google uses Sawzall to process log data generated by Google’s servers.

Since a Sawzall program processes one record of input at a time and does not preserve any state (values of variables) between records, it is well suited for execution as the map phase of a map-reduce. The library also includes support for the statistical aggregation that would be done in the reduce phase of a map-reduce.

The reading of one record at a time reminds me of the record linkage work that was developed in the late 1950s in medical epidemiology.

Of course, there the records were converted into a uniform presentation, losing their original equivalents to column headers, etc. So the technique began with semantic loss.

I suppose you could say it was a lossy semantic integration technique.

Of course, that’s true for any semantic integration technique that doesn’t preserve the original language of a data set.

I will have to dig out some record linkage software to compare to Szl.
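The per-record, stateless model is easy to mimic outside of Sawzall. A rough Java sketch of the same pattern, with an invented log format (Sawzall itself expresses the emit/aggregate split far more compactly):

import java.util.HashMap;
import java.util.Map;

public class PerRecordAggregation {
    public static void main(String[] args) {
        // Each record is processed independently; no state is carried between records,
        // which is what makes the map phase trivially parallelizable.
        String[] logRecords = { "GET /index 200", "GET /index 404", "GET /about 200" };

        Map<String, Integer> statusCounts = new HashMap<String, Integer>(); // "reduce" side
        for (String record : logRecords) {
            String status = record.split(" ")[2];      // "map": read one record, emit a value
            Integer current = statusCounts.get(status);
            statusCounts.put(status, current == null ? 1 : current + 1); // aggregate
        }
        System.out.println(statusCounts); // {200=2, 404=1}
    }
}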

December 11, 2010

Project Voldemort

Filed under: NoSQL — Patrick Durusau @ 7:45 pm

Project Voldemort

From the website:

Voldemort is not a relational database, it does not attempt to satisfy arbitrary relations while satisfying ACID properties. Nor is it an object database that attempts to transparently map object reference graphs. Nor does it introduce a new abstraction such as document-orientation. It is basically just a big, distributed, persistent, fault-tolerant hash table.

Depending upon your requirements, this could be a useful component.
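If you want to kick the tires, the client usage from Voldemort’s own quick-start goes roughly as follows; the class names and bootstrap URL are from their documentation as I recall it, so verify against the current release:

import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.client.StoreClientFactory;
import voldemort.versioning.Versioned;

public class VoldemortExample {
    public static void main(String[] args) {
        // Bootstrap from a running Voldemort node (single-node default configuration assumed).
        StoreClientFactory factory = new SocketStoreClientFactory(
                new ClientConfig().setBootstrapUrls("tcp://localhost:6666"));
        StoreClient<String, String> client = factory.getStoreClient("test");

        client.put("some-key", "some-value");              // write
        Versioned<String> value = client.get("some-key");  // read, with version metadata
        System.out.println(value.getValue());
    }
}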

Sensei

Filed under: Indexing,Lucene,NoSQL — Patrick Durusau @ 3:35 pm

Sensei

From the website:

Sensei is a distributed database that is designed to handle the following type of query:


SELECT f1,f2…fn FROM members
WHERE c1 AND c2 AND c3.. GROUP BY fx,fy,fz…
ORDER BY fa,fb…
LIMIT offset,count

Relies on zoie and hence Lucene for indexing.

Another comparison for the development of TMQL, which of course will need to address semantic sameness.

Cascalog

Filed under: Cascalog,Clojure,Hadoop,TMQL — Patrick Durusau @ 3:23 pm

Cascalog

From the website:

Cascalog is a tool for processing data on Hadoop with Clojure in a concise and expressive manner. Cascalog combines two cutting edge technologies in Clojure and Hadoop and resurrects an old one in Datalog. Cascalog is high performance, flexible, and robust.

Most query languages, like SQL, Pig, and Hive, are custom languages — and this leads to huge amounts of accidental complexity. Constructing queries dynamically by doing string manipulation is an impedance mismatch and makes usual programming techniques like abstraction and composition difficult.

Cascalog queries are first-class within Clojure and are extremely composable. Additionally, the Datalog syntax of Cascalog is simpler and more expressive than SQL-based languages.

Follow the getting started steps, check out the tutorial, and you’ll be running Cascalog queries on your local computer within 5 minutes.

Seems like I have heard the term Datalog in TMQL discussions. 😉

I wonder what it would be like to define TMQL operators in Cascalog so that all the other capabilities of Cascalog are also available?

When the next draft appears that will be an interesting question to explore.

Accidental Complexity

Filed under: Clojure,Data Mining,Software — Patrick Durusau @ 3:22 pm

Nathan Marz in Clojure at Backtype uses the term accidental complexity.

accidental complexity: Complexity caused by the tool to solve a problem rather than the problem itself

According to Nathan, Clojure helps avoid accidental complexity, something that would be useful in any semantic integration system.

The presentation is described as:

Clojure has led to a significant reduction in complexity in BackType’s systems. BackType uses Clojure all over the backend, from processing data on Hadoop to a custom database to realtime workers. In this talk Nathan will give a crash course on Clojure and using it to build data-driven systems.

Very much worth the time to view it, even more than once.

Decomposer

Filed under: Matrix,Search Engines,Vectors — Patrick Durusau @ 1:19 pm

Decomposer

From the website:

Matrix algebra underpins the way many Big Data algorithms and data structures are composed: full-text search can be viewed as doing matrix multiplication of the term-document matrix by the query vector (giving a vector over documents where the components are the relevance score), computing co-occurrences in a collaborative filtering context (people who viewed X also viewed Y, or ratings-based CF like the Netflix Prize contest) amounts to squaring the user-item interaction matrix, calculating users who are k-degrees separated from each other in a social network or web-graph can be found by looking at the k-fold product of the graph adjacency matrix, and the list goes on (and these are all cases where the linear structure of the matrix is preserved!)
….
Currently implemented: Singular Value Decomposition using the Asymmetric Generalized Hebbian Algorithm outlined in Genevieve Gorrell & Brandyn Webb’s paper and there is a Lanczos implementation, both single-threaded, and in the contrib/hadoop subdirectory, as a hadoop map-reduce (series of) job(s). Coming soon: stochastic decomposition.

This code is in the process of being absorbed into the Apache Mahout Machine Learning Project.

Useful for learning to use search technology but also for recognizing, at a very fundamental level, the limitations of that technology.

Document and query vectors are constructed without regard to the semantics of their components.

Using co-occurrence, for example, doesn’t give a search engine greater access to the semantics of the terms in question.

It simply makes the vectors longer, so matches are less frequent and, hopefully, less frequent means more precise.

That may or may not be the case. It also doesn’t account for the case where the vectors are different but the subject in question is the same.
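The first claim in the quoted paragraph, full-text search as a term-document matrix multiplied by a query vector, can be reproduced in a few lines. A toy sketch with a dense matrix and raw term counts (real engines use sparse structures and weighting such as tf-idf):

public class TermDocumentScoring {
    public static void main(String[] args) {
        // Rows = terms, columns = documents; entries = term counts in each document.
        double[][] termDoc = {
            { 2, 0, 1 },   // term "topic"
            { 1, 1, 0 },   // term "map"
            { 0, 3, 1 }    // term "semantics"
        };
        double[] query = { 1, 1, 0 }; // the query contains "topic" and "map"

        // Relevance = termDoc^T * query: one score per document.
        int docs = termDoc[0].length;
        double[] scores = new double[docs];
        for (int d = 0; d < docs; d++) {
            for (int t = 0; t < termDoc.length; t++) {
                scores[d] += termDoc[t][d] * query[t];
            }
        }
        // scores = [3.0, 1.0, 1.0]: document 0 matches the query best.
        System.out.println(java.util.Arrays.toString(scores));
    }
}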

Search Potpourri: Madonna

Filed under: Humor,Search Engines — Patrick Durusau @ 11:41 am

Today’s search term: Madonna

1-9: You peeked! Yes, the material girl

10. http://www.madonna.edu One of these is not like the other

11-16: more material girl

17. where to have fun with a material girl madonna inn

18-28: even more material girl

29. A general purpose differential equation solver Berkeley Madonna

30. back to material girl

It may be my Roman Catholic background or perhaps the season, but I would have expected Madonna, as in Madonna and child, to have been in the top 30.

Would you believe that when I used the phrase Madonna and Child that not only did I get the more traditional Madonna but:

10. Material girl launches a clothing line

11. Video of Madonna and child, as in the Material girl and her child

Attention All Search Engines!

You are already blocking results, so why not serve them up that way?

Give one, maybe two links with some text for each block. Then users can choose one of those links or the block.

Sort of like “more like this” but better able to offer the user meaningful alternatives.

Plus, you can charge more to be the link that shows up in the block as opposed to simply being higher in an aggregation of err…links.

No charge. I use search engines a good bit and every improvement makes my life easier.

December 10, 2010

Decoding Searcher Intent: Is “MS” Microsoft Or Multiple Sclerosis? – Post

Filed under: Authoring Topic Maps,Interface Research/Design,Search Engines,Searching — Patrick Durusau @ 7:35 pm

Decoding Searcher Intent: Is “MS” Microsoft Or Multiple Sclerosis? is a great post from searchengineland.com.

Although focused on user behavior, as a guide to optimizing content for search engines, the same analysis is relevant for construction of topic maps.

A topic map for software help files is very unlikely to treat “MS” as anything other than Microsoft.

Even if those files might contain a reference to Multiple Sclerosis, written as “MS.”

Why?

Because every topic map will concentrate its identification of subjects and relationships between subjects where there is the greatest return on investment.

Just as we have documentation rot now, there will be topic map rot as some subjects near the boundary of what is being maintained.

And some subjects won’t be identified or maintained at all.

Perhaps another class of digital have-nots.

Questions:

  1. Read the post and prepare a one page summary of its main points.
  2. What other log analysis would you use in designing a topic map? (3-5 pages, citations)
  3. Should a majority of user behavior/expectations drive topic map design? (3-5 pages, no citations)

Semantically Equivalent Facets

Filed under: Authoring Topic Maps,Facets,Topic Map Software,Topic Map Systems,Topic Maps — Patrick Durusau @ 3:32 pm

I failed to mention semantically equivalent facets in either Identifying Subjects With Facets or Facets and “Undoable” Merges.

Sorry! I assumed it was too obvious to mention.

That is, if you are using facet-based navigation with a topic map, it will return/navigate the facet you ask for and also return/navigate any semantically equivalent facet.

One of the advantages of using a topic map to underlie a facet system is that users get the benefit of something familiar, a set of facet axes they recognize, while at the same time getting the benefit of navigating semantically equivalent facets without knowing about it.

I suppose I should say that declared semantically equivalent facets are included in navigation.

Declared semantic equivalence doesn’t just happen, nor is it free.

Keeping that in mind will help you ask questions when sales pitches or project proposals gloss over the hard questions: What return will you derive from an investment in semantic technologies? And when?
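As for what declared equivalence might look like in code, a minimal sketch: an explicit table of equivalent facet values that the navigation layer consults before filtering. The declarations are illustrative, not drawn from any particular topic map engine.

import java.util.*;

public class EquivalentFacets {
    public static void main(String[] args) {
        // Declared equivalences for values of a "subject area" facet.
        Map<String, Set<String>> equivalent = new HashMap<String, Set<String>>();
        equivalent.put("semantic integration",
                new HashSet<String>(Arrays.asList("semantic integration",
                        "data integration", "information integration")));

        // The user asks for one facet value; navigation expands it to the declared set.
        String requested = "semantic integration";
        Set<String> expanded = equivalent.containsKey(requested)
                ? equivalent.get(requested)
                : Collections.singleton(requested);

        System.out.println("Navigating facet values: " + expanded);
    }
}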

Facets and “Undoable” Merges

After writing Identifying Subjects with Facets, I started thinking about the merge of the subjects matching a set of facets, so that the user could observe all the associations in which the members of that subject participated.

If merger is a matter of presentation to the user, then the user should be able to remove one of the members that makes up a subject from the merge, which results in the removal of the associations in which that member of the subject participated.

No more or less difficult than the inclusion/exclusion based on the facets, except this time it involves removal on the basis of roles in associations. That is, the playing of a role, being a role, etc., are treated as facets of a subject.

Well, except that an individual member of a collective subject is being manipulated.

This capability would enable a user to manipulate what members of a subject are represented in a merge. Not to mention being able to unravel a merge one member of a subject at a time.

An effective visual representation of such a capability could be quite stunning.

Identifying Subjects With Facets

If facets are aspects of subjects, then for every group of facets, I am identifying the subject that has those facets.

If I have the facets height, weight, sex, age, street address, city, state, country, and email address, then at the outset my subject is the subject that has all those characteristics, with whatever values.

We could call that subject: people.

Not the way I usually think about it but follow the thought out a bit further.

For each facet where I specify a value, the subject identified by the resulting value set is both different from the starting subject and, more importantly, has a smaller set of members in the data set.

Members that make up the collective that is the subject we have identified.

Assume we have narrowed the set of people down to a group subject that has ten members.

Then, we select merge from our application and it merges these ten members.

Sounds damned odd, to merge what we know are different subjects?

What if by merging those different members we can now find these different individuals have a parent association with the same children?

Or have a contact relationship with a phone number associated with an individual or group of interest?

Robust topic map applications will offer users the ability to navigate and explore subject identities.

Subject identities that may not always be the ones you expect.

We don’t live in a canned world. Does your semantic software?
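To make the narrowing step above concrete, a small sketch: each record is a bundle of facet values, and every constraint added shrinks the member set of the collective subject. The data and facet names are invented for the example.

import java.util.*;

public class FacetNarrowing {
    public static void main(String[] args) {
        // Each map is one member (a person) described by facet values.
        List<Map<String, String>> people = new ArrayList<Map<String, String>>();
        people.add(record("age", "34", "city", "Atlanta", "country", "US"));
        people.add(record("age", "34", "city", "Austin",  "country", "US"));
        people.add(record("age", "51", "city", "Atlanta", "country", "US"));

        // Constraints define the (collective) subject: all members with these facet values.
        Map<String, String> constraints = record("age", "34", "country", "US");

        List<Map<String, String>> members = new ArrayList<Map<String, String>>();
        for (Map<String, String> p : people) {
            if (p.entrySet().containsAll(constraints.entrySet())) {
                members.add(p); // still a member after this round of narrowing
            }
        }
        // Two members remain; "merging" them would present their associations together.
        System.out.println(members.size() + " members of the collective subject");
    }

    static Map<String, String> record(String... kv) {
        Map<String, String> m = new LinkedHashMap<String, String>();
        for (int i = 0; i < kv.length; i += 2) m.put(kv[i], kv[i + 1]);
        return m;
    }
}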

Trends in Large-Scale Subject Repositories

Filed under: Data Source — Patrick Durusau @ 11:40 am

Trends in Large-Scale Subject Repositories Authors: Jessica Adamick, Rebecca Reznik-Zellen

Abstract:

Noting a lack of broad empirical studies on subject repositories, the authors investigate subject repository trends that reveal common practices despite their apparent isolated development. Data collected on year founded, subjects, software, content types, deposit policy, copyright policy, host, funding, and governance are analyzed for the top ten most-populated subject repositories. Among them, several trends exist such as a multi- and interdisciplinary scope, strong representation in the sciences and social sciences, use of open source repository software for newer repositories, acceptance of pre- and post-prints, moderated deposits, submitter responsibility for copyright, university library or departmental hosting, and discouraged withdrawal of materials. In addition, there is a loose correlation between repository size and age. Recognizing the diversity of all subject repositories, the authors recommend that tools for assessment and evaluation be developed to guide subject repository management to best serve their respective communities.

A useful review of some of the leading subject repositories.

Crack the subject identity nut, reliably and in a cost-effective manner, for any of these repositories, and your advertising woes are over.

Search Potpourri: …My Breast Fell Out

Filed under: Humor,Search Engines,Search Potpourri — Patrick Durusau @ 8:51 am

Today’s search term: breast

  1. Breast Implants $2500
  2. Breast – Wikipedia, the free encyclopedia
  3. Naked and Funny. Opps! My Breast Fell Out (video)
  4. Feel My Breasts (video)
  5. Breasts – sexual or for breastfeeding babies?
  6. AfraidtoAsk.com >> SIZE & SHAPE
  7. Show prosthetic breast, woman told at airport
  8. Fake Doctor Jailed For Giving Breast Exams In Bars

I suppose one could argue this result set offers something for everyone.

But, the rest of the results, at least up to the first 50, were as uneven as the first ten.

Hardly encouraging, say, for someone seeking serious medical information.

Clustering similar entities into collections would be one way to improve upon this result.

The related search function does that to a degree. But only to a degree.

More detailed navigation would be a good thing.

Perhaps high level collection views that can be “zoomed” into for more detailed browsing.

*****
Send your favorite search term(s)/phrases and a suggested search engine (remains anonymous for reporting) to: patrick@durusau.net.

Before anyone complains the search term was unfair, there were search engines that returned less varied results.

Scala in Depth

Filed under: Scala,Software — Patrick Durusau @ 7:18 am

Scala in Depth Author: Josh Suereth

Abstract:

Scala is a unique and powerful new programming language for the JVM. Blending the strengths of the Functional and Imperative programming models, Scala is a great tool for building highly concurrent applications without sacrificing the benefits of an OO approach. While information about the Scala language is abundant, skilled practitioners, great examples, and insight into the best practices of the community are harder to find. Scala in Depth bridges that gap, preparing you to adopt Scala successfully for real world projects. Scala in Depth is a unique new book designed to help you integrate Scala effectively into your development process. By presenting the emerging best practices and designs from the Scala community, it guides you through dozens of powerful techniques example by example. There’s no heavy-handed theory here, just lots of crisp, practical guides for coding in Scala.

For example:

  • Discover the “sweet spots” where object-oriented and functional programming intersect.
  • Master advanced OO features of Scala, including type member inheritance, multiple inheritance and composition.
  • Employ functional programming concepts like tail recursion, immutability, and monadic operations.
  • Learn good Scala style to keep your code concise, expressive and readable.

As you dig into the book, you’ll start to appreciate what makes Scala really shine. For instance, the Scala type system is very, very powerful; this book provides use case approaches to manipulating the type system and covers how to use type constraints to enforce design constraints. Java developers love Scala’s deep integration with Java and the JVM Ecosystem, and this book shows you how to leverage it effectively and work around the rough spots.

There is little doubt that concurrent programming is a dawning reality. Which languages will be the best for concurrent programming in general (if there is such a case) or for topic maps in particular isn’t as clear.

Only time and usage can answer those questions.

Efficient Spectral Neighborhood Blocking for Entity Resolution

Filed under: Entity Resolution — Patrick Durusau @ 6:54 am

Efficient Spectral Neighborhood Blocking for Entity Resolution Authors: Liangcai Shu, Aiyou Chen, Ming Xiong, Weiyi Meng

Abstract:

In many telecom and web applications, there is a need to identify whether data objects in the same source or different sources represent the same entity in the real-world. This problem arises for subscribers in multiple services, customers in supply chain management, and users in social networks when there lacks a unique identifier across multiple data sources to represent a real-world entity. Entity resolution is to identify and discover objects in the data sets that refer to the same entity in the real world. We investigate the entity resolution problem for large data sets where efficient and scalable solutions are needed. We propose a novel unsupervised blocking algorithm, namely SPectrAl Neighborhood (SPAN), which constructs a fast bipartition tree for the records based on spectral clustering such that real entities can be identified accurately by neighborhood records in the tree. There are two major novel aspects in our approach: 1) We develop a fast algorithm that performs spectral clustering without computing pairwise similarities explicitly, which dramatically improves the scalability of the standard spectral clustering algorithm; 2) We utilize a stopping criterion specified by Newman-Girvan modularity in the bipartition process. Our experimental results with both synthetic and real-world data demonstrate that SPAN is robust and outperforms other blocking algorithms in terms of accuracy while it is efficient and scalable to deal with large data sets.

Entity resolution is to identify and discover objects in the data sets that refer to the same entity in the real world.

Modulo my usual qualms about the “real world” language, this sounds useful for the construction of topic maps.
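For orientation, blocking itself is a simple idea, even though SPAN’s spectral construction of the blocks is not. The sketch below is a naive key-based blocker, not the SPAN algorithm; it only shows why blocking matters: candidate comparisons happen within blocks rather than across the whole data set.

import java.util.*;

public class NaiveBlocking {
    public static void main(String[] args) {
        String[] names = { "John Smith", "Jon Smith", "Jane Smyth", "J. Smith", "Mary Jones" };

        // Block on a cheap key (first letter of the surname); SPAN instead builds a
        // bipartition tree with spectral clustering to form much better blocks.
        Map<String, List<String>> blocks = new HashMap<String, List<String>>();
        for (String name : names) {
            String[] parts = name.split(" ");
            String key = parts[parts.length - 1].substring(0, 1).toUpperCase();
            if (!blocks.containsKey(key)) blocks.put(key, new ArrayList<String>());
            blocks.get(key).add(name);
        }

        // Pairwise comparison only inside each block.
        for (List<String> block : blocks.values()) {
            for (int i = 0; i < block.size(); i++) {
                for (int j = i + 1; j < block.size(); j++) {
                    System.out.println("compare: " + block.get(i) + " <-> " + block.get(j));
                }
            }
        }
    }
}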

Questions:

  1. How would you suggest integrating this methodology into a topic map construction process? (3-5 pages, no citations)
  2. How would you suggest integrating rule based entity identification into this methodology? (3-5 pages, no citations)
  3. Is precision of identification an operational requirement? (3-5 pages, no citations)

December 9, 2010

Schema Design for Riak (Take 2)

Filed under: NoSQL,Riak,Schema — Patrick Durusau @ 5:48 pm

Schema Design for Riak (Take 2)

Useful exercise in schema design in a NoSQL context.

No great surprise that a focus on data and application requirements supplies the keys (sorry) to a successful deployment.

Amazing how often that gets repeated, at least in presentations.

Equally amazing how often that gets ignored in implementations (at least to judge from how often it is repeated in presentations).

Still, we all need reminders so it is worth the time to review the slides.

Basho Riak: An Open Source Scalable Data Store

Filed under: MapReduce,NoSQL,Riak — Patrick Durusau @ 5:45 pm

Basho Riak: An Open Source Scalable Data Store

From the website:

Riak is a Dynamo-inspired key/value store that scales predictably and easily. Riak also simplifies development by giving developers the ability to quickly prototype, test, and deploy their applications.

A truly fault-tolerant system, Riak has no single point of failure. No machines are special or central in Riak, so developers and operations professionals can decide exactly how fault-tolerant they want and need their applications to be.

The video from the Ga Tech NoSQL conference in 2009 is worth watching.

Their implementation of MapReduce is targeted (it doesn’t have to be run against the entire data set), can be set up as a stream (store and send through MapReduce), or can be used with relationships represented as links.
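If memory serves, a MapReduce job is submitted to Riak over HTTP as a JSON document naming the inputs and the query phases. The sketch below is a rough reconstruction; treat the /mapred path, the default port, and the built-in JavaScript function names as assumptions to check against the Riak documentation.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class RiakMapReduceSketch {
    public static void main(String[] args) throws Exception {
        // Targeted MapReduce: inputs name specific bucket/key pairs rather than the whole store.
        String job = "{\"inputs\":[[\"articles\",\"a1\"],[\"articles\",\"a2\"]],"
                + "\"query\":[{\"map\":{\"language\":\"javascript\",\"name\":\"Riak.mapValuesJson\"}},"
                + "{\"reduce\":{\"language\":\"javascript\",\"name\":\"Riak.reduceSort\"}}]}";

        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://localhost:8098/mapred").openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        OutputStream out = conn.getOutputStream();
        out.write(job.getBytes("UTF-8"));
        out.close();
        System.out.println("HTTP " + conn.getResponseCode()); // 200 with JSON results on success
    }
}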

CS Abstraction – Bridging Data Models – JSON and COBOL

Filed under: TMRM,Topic Maps — Patrick Durusau @ 12:01 pm

I was reading Ullman’s Foundations of Computer Science on abstraction when it occurred to me:

A topic map legend is an abstraction that bridges some set of abstractions (read data models), to enable us to navigate and possibly combine data from them.

with the corollary:

Any topic map legend is itself an abstraction that is subject to being bridged for navigation or data combination purposes.

The first statement recognizes that there are no Ur abstractions that will dispel all others. Never have been, never will be.

If the history of CS teaches anything, it is the ephemeral nature of modeling.

The latest hot item is JSON but it was COBOL some, well, more years ago than I care to say. Nothing against JSON but in five years or less, it will either be fairly common or footnoted in dissertations.

The important thing is that we will have data stored in JSON for a very long time, whether it gains in popularity or not.

We could say everyone will convert to the XXX format of years hence, but in fact that never happens.

Legacy systems need the data (some at defense facilities still require punched data entry, and it is simply not economical to re-write/debug a new system), and then there is the cost of the conversion, the cost of verification, etc.

The corollary recognizes that once a topic map of a set of data models is written, the topic map itself becomes a data model for navigation/aggregation purposes.

Otherwise we fall into the same trap as the data model paradigms that posit they will be the data model that dispels all others.

There are no cases where that has happened, either in digital times or in the millennia of data models that preceded digital times.

The emphasis on subject identity in topic maps facilitates the bridging of data models and having a useful result when we do.
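A toy sketch of that bridging: the same customer appears once in a JSON document and once in a COBOL-style fixed-width record, and a (very small) legend maps both representations onto one subject identifier. The field layouts and identifier scheme are invented for the example.

import java.util.*;

public class BridgingDataModels {
    public static void main(String[] args) {
        // The same subject, in two data models.
        String json = "{\"customer_id\": \"C-1001\", \"name\": \"Ada Lovelace\"}";
        String cobolRecord = "C-1001    LOVELACE ADA        "; // fixed-width: id in cols 1-10, name in 11-30

        // A (very small) legend: each extractor yields the subject identifier for its model.
        String idFromJson = json.replaceAll(".*\"customer_id\"\\s*:\\s*\"([^\"]+)\".*", "$1");
        String idFromCobol = cobolRecord.substring(0, 10).trim();

        // Both proxies map to one subject; the original models stay intact and navigable.
        Map<String, List<String>> subjects = new HashMap<String, List<String>>();
        for (String[] proxy : new String[][] { { idFromJson, "json" }, { idFromCobol, "cobol" } }) {
            if (!subjects.containsKey(proxy[0])) subjects.put(proxy[0], new ArrayList<String>());
            subjects.get(proxy[0]).add(proxy[1]);
        }
        System.out.println(subjects); // {C-1001=[json, cobol]}
    }
}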

What data models would you like to bridge today?

Foundations of Computer Science

Filed under: Subject Identity,TMRM,Topic Maps — Patrick Durusau @ 11:55 am

Foundations of Computer Science

Introduction to theory in computer science by Alfred V. Aho and Jeffrey D. Ullman. (Free PDF of the entire text)

The turtle on the cover is said to be a reference to the turtle on which the world rests.

This particular turtle serves as the foundation for:

I point out this work because of its emphasis on abstraction.

Topic maps, at their best, are abstractions that bridge other abstractions and make use of information recorded in those abstractions.

*****
PS: The “rules of thumb” for programming in the introduction are equally applicable to writing topic maps. You will not encounter many instances of them being applied but they remain good guidance.

Mining of Massive Datasets – eBook

Mining of Massive Datasets

Jeff Dalton of Jeff’s Search Engine Caffè reports a new data mining book by Anand Rajaraman and Jeffrey D. Ullman (yes, that Jeffrey D. Ullman, think “dragon book”).

A free eBook no less.

Read Jeff’s post on your way to get a copy.

Look for more comments as I read through it.

Has anyone written a comparison of the recent search engine titles? Just curious.


Update: A new version is out in hard copy and the e-book remains available. See: Mining Massive Data Sets – Update

Developing High Quality Data Models – Book

Filed under: Data Models,Data Structures,Ontology — Patrick Durusau @ 11:39 am

Developing High Quality Data Models by Dr. Matthew West is due out in January of 2011. (Pre-order: Elsevier, Amazon)

From the website:

Anyone charged with developing a data model knows that there is a wide variety of potential problems likely to arise before achieving a high quality data model. With dozens of attributes and millions of rows, data modelers are always in danger of inconsistency and inaccuracy. The development of the data model itself could result in difficulties presenting accurate data. The need to improve data models begins with getting it right in the first place.

Developing High Quality Data Models uses real-world examples to show you how to identify a number of data modeling principles and analysis techniques that will enable you to develop data models that consistently meet business requirements. A variety of generic data model patterns that exemplify the principles and techniques discussed build upon one another to give a powerful and integrated generic data model with wide applicability across many disciplines. The principles and techniques outlined in this book are applicable in government and industry, including but not limited to energy exploration, healthcare, telecommunications, transportation, military defense, and so on.

Table of Contents:

Preface
Chapter 1- Introduction
Chapter 2- Entity Relationship Model Basics
Chapter 3- Some types and uses of data models
Chapter 4- Data models and enterprise architecture
Chapter 5- Some observations on data models and data modeling
Chapter 6- Some General Principles for Conceptual, Integration and Enterprise Data Models
Chapter 7- Applying the principles for attributes
Chapter 8- General principles for relationships
Chapter 9- General principles for entity types
Chapter 10- Motivation and overview for an ontological framework
Chapter 12- Classes
Chapter 13- Intentionally constructed objects
Chapter 14- Systems and system components
Chapter 15- Requirements specifications
Chapter 16- Concluding Remarks
Chapter 17- The HQDM Framework Schema

I first became familiar with the work of Dr. West from Ontolog. You can visit his publications page to see why I am looking forward to this book.

Citation of and comments on this work will follow as soon as access and time allow.

December 8, 2010

Semantic Web – Journal Issue 1/1-2

Filed under: OWL,RDF,Semantic Web — Patrick Durusau @ 8:18 pm

Semantic Web

The first issue of Semantic Web is openly viewable and now online.

In their introductory remarks the editors focus in part on the journal’s subtitle:

The journal’s subtitle – Interoperability, Usability, Applicability – reflects the wide scope of the journal, by putting an emphasis on enabling new technologies and methods. Interoperability refers to aspects such as the seamless integration of data from heterogeneous sources, on-the-fly composition and interoperation of Web services, and next-generation search engines. Usability encompasses new information retrieval paradigms, user interfaces and interaction, and visualization techniques, which in turn require methods for dealing with context dependency, personalization, trust, and provenance, amongst others, while hiding the underlying computational issues from the user. Applicability refers to the rapidly growing application areas of Semantic Web technologies and methods, to the issue of bringing state-of-the-art research results to bear on real-world applications, and to the development of new methods and foundations driven by real application needs from various domains.

Skimming the table of contents I can see lots of opportunity for comments and rejoinders.

For the present I simply commend this new journal and its contents to you for your reading pleasure.

Barriers to Entry in Search Getting Smaller – Post

Filed under: Indexing,Interface Research/Design,Search Engines,Search Interface,Searching — Patrick Durusau @ 9:49 am

Barriers to Entry in Search Getting Smaller

Jeff Dalton, Jeff’s Search Engine Caffè, makes a good argument that the barriers to entering the search market are getting smaller.

Jeff observes that blekko can succeed with a small number of servers only because its search demand is low.

True, but how many intra-company or litigation search engines are going to have web-sized user demands?

Start-ups need not try to match Google in its own space, but can carve out interesting and economically rewarding niches of their own.

Particularly if those niches involve mapping semantically diverse resources into useful search results for their users.

For example, biomedical researchers probably have little interest in catalog entries that happen to match gene names. Or any of the other common mis-matches offered by entire web search services.

In some ways, search the entire web services have created their own problem and then attempted to solve it.

My research interests are in information retrieval broadly defined so a search engine limited to library schools, CS programs (their faculty and students), the usual suspects for CS collections, library/CS/engineering organizations, with semantic mapping, would suit me just find.

Note that the semantic mis-match problem persists even with a narrowing of resources, but the benefit of each mapping is incrementally greater.

Questions:

  1. What resources are relevant to your research interests? (3-5 pages, web or other citations)
  2. Create a Google account to create your own custom search engine and populate it with your resources.
  3. Develop and execute 20 queries against your search engine and Google, Bing and one other search engine of your choice. Evaluate and report the results of those queries.
  4. Would semantic mapping such as we have discussed for topic maps be more or less helpful with your custom search engine versus the others you tried? (3-5 pages, no citations)

Aspects of Topic Maps

Writing about Bobo: Fast Faceted Search With Lucene made me start to think about the various aspects of topic maps.

Authoring of topic maps is something that was never discussed in the original HyTime-based topic map standard and, despite several normative syntaxes, even now it is mostly a matter of either you have a topic map or you don’t. Depending upon your legend.

Which is helpful given the unlimited semantics that can be addressed with topic maps but looks awfully hand-wavy to, ahem, outsiders.

Subject Identity, or should I say: when two subject representatives are deemed, for some purpose, to represent the same subject. (That’s clearer. ;-)) This lies at the heart of topic maps and the rest of the paradigm supports or is a consequence of this principle.

There is no one way to identify any subject and users should be free to use the identification that suits them best. Where subjects include the data structures that we build for users. Yes, IT doesn’t get to dictate what subjects can be identified or how. (Probably should have never been the case but that is another issue.)

Merging of subject representatives. Merging is an aspect of recognizing that two or more subject representatives represent the same subject. What happens then is implementation, data model and requirement specific. (A minimal sketch of identifier-based merging follows after the other aspects below.)

A user may wish to see separate representatives just prior to merger so merging can be audited or may wish to see only merged representatives for some subset of subjects or may have other requirements.

Interchange of topic maps. Not exclusively the domain of syntaxes/data models but an important purpose for them. It is entirely possible to have topic maps for which no interchange is intended or desirable. Rumor has it that there are topic maps at the Y-12 facility at Oak Ridge, for example. Interchange was not their purpose.

Navigation of the topic map. The post that provoked this one is a good example. I don’t need specialized or monolithic software to navigate a topic map. It hampers topic map development to suggest otherwise.

Querying topic maps. Topic maps have been slow to develop a query language and that effort has recently re-started. Graph query languages, that are already fairly mature, may be sufficient for querying topic maps.

Given the diversity of subject identity semantics, I don’t foresee a one size fits all topic maps query language.

Interfaces for topic maps. However one resolves/implements other aspects of topic maps, due regard has to be paid to the issue of interfaces. Efforts thus far range from web portals to “look, it’s a topic map!” type interfaces.

In defense of current efforts, human-computer interfaces are poorly understood. Not surprising, since the human-codex interface isn’t completely understood and we have been working at that one considerably longer.
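Returning to merging for a moment, here is a minimal sketch of the principle (not any particular engine’s merging rules): two topic-like records are deemed to represent the same subject when their declared identifiers intersect, and a merge is then a union of what each says. The identifiers are invented for the example.

import java.util.*;

public class MergeByIdentifier {
    static boolean sameSubject(Set<String> idsA, Set<String> idsB) {
        // One possible rule: a shared identifier means the same subject. Other rules are possible.
        Set<String> overlap = new HashSet<String>(idsA);
        overlap.retainAll(idsB);
        return !overlap.isEmpty();
    }

    public static void main(String[] args) {
        Set<String> a = new HashSet<String>(Arrays.asList(
                "http://example.org/ms", "wikipedia:Multiple_sclerosis"));
        Set<String> b = new HashSet<String>(Arrays.asList(
                "wikipedia:Multiple_sclerosis", "mesh:D009103"));

        if (sameSubject(a, b)) {
            Set<String> merged = new HashSet<String>(a); // union of identifiers
            merged.addAll(b);
            System.out.println("Merged representative identifiers: " + merged);
        }
    }
}

Real merging rules are richer (subject locators versus subject identifiers, scope, and so on), but intersect-then-union is the basic shape.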

Questions:

  1. What other aspects to topic maps would you list?
  2. Would you sub-divide any of these aspects? If so, how?
  3. What suggestions do you have for one or more of these aspects?

Bayesian Model Selection and Statistical Modeling – Review

Filed under: Authoring Topic Maps,Bayesian Models,Software — Patrick Durusau @ 9:47 am

Bayesian Model Selection and Statistical Modeling by Tomohiro Ando, reviewed by Christian P. Robert.

If you are planning on using Bayesian models in your topic maps activities, read this review first.

You will thank the reviewer later.

Webinar: Revolution R is 100% R and More
9 AM Pacific 8 December 2010 (today)

Filed under: Authoring Topic Maps,R,Software — Patrick Durusau @ 7:59 am

Webinar: Revolution R is 100% R and More

Apologies for the short notice but this webinar may be of interest to those using R to mine data sets as part of topic map construction.

It was in my morning sweep of resources and was just posted yesterday.

I have a scheduling conflict but the webinar is said to be available for asynchronous viewing.

December 7, 2010

Bobo: Fast Faceted Search With Lucene

Filed under: Facets,Information Retrieval,Lucene,Navigation,Subject Identity — Patrick Durusau @ 8:52 pm

Bobo: Fast Faceted Search With Lucene

From the website:

Bobo is a Faceted Search implementation written purely in Java, an extension of Apache Lucene.

While Lucene is good with unstructured data, Bobo fills in the missing piece to handle semi-structured and structured data.

Bobo Browse is an information retrieval technology that provides navigational browsing into a semi-structured dataset. Beyond the result set from queries and selections, Bobo Browse also provides the facets from this point of browsing.

Features:

  • No need for cache warm-up for the system to perform
  • multi value sort – sort documents on fields that have multiple values per doc, e.g. tokenized fields
  • fast field value retrieval – over 30x faster than IndexReader.document(int docid)
  • facet count distribution analysis
  • stable and small memory footprint
  • support for runtime faceting
  • result merge library for distributed facet search

I had to go look up the definition of facet. Merriam-Webster (I remember when it was just Webster) says:

any of the definable aspects that make up a subject (as of contemplation) or an object (as of consideration)

So a faceted search could search/browse, in theory at any rate, based on any property of a subject, even those I don’t recognize.

Different languages being the easiest example.

I could have aspects of a hotel room described in both German and Korean, both describing the same facets of the room.
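A tiny sketch of that last point: display labels in German and Korean map to one underlying facet value, so either language navigates to the same subset of rooms. The value tables are made up for the example.

import java.util.*;

public class MultilingualFacets {
    public static void main(String[] args) {
        // Display labels in two languages for the same facet value of "bed type".
        Map<String, String> labelToFacetValue = new HashMap<String, String>();
        labelToFacetValue.put("Doppelbett", "double-bed");   // German
        labelToFacetValue.put("더블 침대", "double-bed");      // Korean

        // Rooms indexed by the canonical facet value.
        Map<String, List<Integer>> roomsByBedType = new HashMap<String, List<Integer>>();
        roomsByBedType.put("double-bed", Arrays.asList(101, 204, 310));

        // Whichever label the user browses with, the same facet (and rooms) comes back.
        for (String label : new String[] { "Doppelbett", "더블 침대" }) {
            System.out.println(label + " -> " + roomsByBedType.get(labelToFacetValue.get(label)));
        }
    }
}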

Questions:

  1. How would you choose the facets for a subject to be included in faceted browsing? (3-5 pages, no citations)
  2. How would you design and test the presentation of facets to users? (3-5 pages, no citations)
  3. Compare the current TMQL proposal (post-Barta) with the query language for facet searching. If a topic map were treated (post-merging) as faceted subjects, which one would you prefer and why? (3-5 pages, no citations)