Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

January 23, 2011

geocommons

Filed under: Dataset,Geographic Information Retrieval,Mapping,Maps — Patrick Durusau @ 9:27 pm

geocommons

A very impressive resource for mapping data against a common geographic background.

Works for a lot of reasons, not the least of which is the amount of effort that has gone into the site and its tools.

But I think having a common frame of reference, that is, geographic locations, simplifies the problem addressed by topic maps.

That is, the data is seen through the common lens of geographic boundaries and/or locations.

To make it closer to the problem faced by topic maps, what if geographic locations had to be brought into focus, before data could be mapped against them?

That seems to me to be the harder problem.

Multi-Relational Graph Structures: From Algebra to Application

Filed under: Data Structures,Graphs,Neo4j — Patrick Durusau @ 4:55 pm

Multi-Relational Graph Structures: From Algebra to Application

An important review of graph structures and of how research on them has developed over the last couple of decades.

Doesn’t answer the question of what will be the hot application that puts topic maps on every desktop.

Does bring us a little closer to an application that would merit that kind of usage.

A Path Algebra for Mapping Multi-Relational Networks to Single-Relational Networks

Filed under: Data Structures,Graphs,Neo4j,Networks — Patrick Durusau @ 4:54 pm

A Path Algebra for Mapping Multi-Relational Networks to Single-Relational Networks

A proposal for re-using existing algorithms, designed for single-relational networks, with multi-relational networks, by mapping multi-relational networks onto single-relational networks.
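
As a rough, hypothetical sketch of that mapping (not the paper's own formalism): give each relation type its own adjacency matrix, then derive a single-relational network by combining the matrices along whatever relation types, or paths of relation types, you care about.

    import numpy as np

    # Hypothetical multi-relational graph over three nodes:
    # 0 = alice, 1 = acme, 2 = boston. One adjacency matrix per relation type.
    works_for = np.array([[0, 1, 0],
                          [0, 0, 0],
                          [0, 0, 0]])   # alice -> acme
    located_in = np.array([[0, 0, 0],
                           [0, 0, 1],
                           [0, 0, 0]])  # acme -> boston

    # Map the multi-relational graph onto a single-relational one by composing
    # relation types along a path: works_for . located_in => "works_in".
    works_in = (works_for @ located_in) > 0

    print(works_in[0, 2])  # True: alice "works_in" boston

Single-relational algorithms (PageRank, shortest paths, and so on) can then run unchanged on the derived matrix.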

It makes me wonder: could heterogeneous identifications be mapped onto a single identifier in a similar way?

Or would there be too much information loss?

Depends on the circumstances and goals.

Distributed Graph Databases and the Emerging Web of Data

Filed under: Data Structures,Graphs,Neo4j — Patrick Durusau @ 4:52 pm

Distributed Graph Databases and the Emerging Web of Data

Marko A. Rodriguez on distributed graph databases.

I follow his presentation up to the point where he says: “Directed multi-relational graph: heterogeneous set of links” (page 6 of 79) and then “The multi-relational graph is a very natural representation of the world” (page 22 of 79).

I fully agree that a multi-relational graph is a good start, but what about heterogeneous ways to identify the subjects represented by nodes and links (edges)?

I suppose that goes hand in hand with using URIs as the single identifiers for the subjects represented by nodes and edges.

Presuming one identifier is one way to resolve heterogeneous identification but not a very satisfactory one, at least to me.

Rogue – Some Details

Filed under: MongoDB,NoSQL — Patrick Durusau @ 1:41 pm

Rogue: A Type-Safe Scala DSL for querying MongoDB is a blog post from Foursquare that gives some examples and details on Rogue. More to follow.

January 22, 2011

Advanced HBase – Post

Filed under: HBase,NoSQL — Patrick Durusau @ 7:14 pm

Advanced HBase by Lars George from Alex Popescu’s MyNoSQL blog.

40 Fascinating Blogs for the Ultimate Statistics Geek – Post

Filed under: Data Mining,Statistics — Patrick Durusau @ 1:29 pm

40 Fascinating Blogs for the Ultimate Statistics Geek

A varied collection of blogs on statistics.

Whether for data mining, modeling, or interpreting the data mining/modeling of others, you are going to need statistics.

Blogs are not a replacement for a good statistics book and a copy of Mathematica, but they are a place to start.

Heads they win, tales we lose: Discovery tools will never deliver on their promise – Post

Filed under: Library,Searching — Patrick Durusau @ 10:01 am

Heads they win, tales we lose: Discovery tools will never deliver on their promise is an enlightening tale about a problem topic maps may or may not solve.

From the post:

As strange as it may sound, the future is not in unified databases powering discovery tools, Matt told me yesterday. He can’t foresee a time when the major database vendors will find it profitable to combine their metadata for our benefit. Instead, the future is in hybrid systems that combine discovery and federation. As I see it, libraries will have to decide if they care whether their EBSCO products or their ProQuest products are seamlessly integrated, choose the discovery layer that matches the company of their choice, and then federate in the content from the other database providers. Federated search is dead; long live federated search. And I’m sure the thinking at EBSCO is that we’ll be paying someone for a discovery tool, and that someone should be them.

(Matt in that quote is Matt Andros, Vice President of Field Sales at EBSCO, a major online content vendor.)

One wonders what it would look like to have a local, federated overlay for viewing a vendor’s resources.

Create metadata about their metadata/data. They have to show it in order for you to use it.

Making Linked Data work isn’t the problem – Post

Filed under: Linked Data,Semantic Web,Topic Maps — Patrick Durusau @ 7:13 am

Making Linked Data work isn’t the problem

Georgi Kobilarov captures an essential question when he says:

But neither Linked Open Data nor the Semantic Web have really took off from there yet. I know many people will disagree with me and point to the famous Linked Open Data cloud diagram, which shows a large (and growing) number of data sets as part of the Linked Data Web. But where are the showcases of problems being solved?

If you can’t show me problems being solved then something is wrong with the solution. “we need more time” is rarely the real issue, esp. when there is some inherent network effect in the system. Then there should be some magic tipping point, and you’re just not hitting it and need to adjust your product and try again with a modified approach.

My point here is not that I want to propose any particular direction or change, but instead I want to stress what I believe is an issue in the community: too few people are actually trying to understand the problem that Linked Data is supposed to be the solution to. If you don’t understand the problem you can not develop a solution or improve a half-working one. Why? Well, what do you do next? Which part to work on? What to change? There is no ground for those decisions if you don’t have at least a well informed guess (or better some evidence) about the problem to solve. And you can’t evaluate your results.

You could easily substitute topic maps in place of linked data in that quote.

Questions:

Putting global claims to one side, write a 5 – 8 page paper, with citations, answering the following questions:

  1. What specific issue in your library would topic maps help solve? As opposed to what other solutions?
  2. Would topic maps require more or less resources than other solutions?
  3. Would topic maps offer any advantages over other solutions?
  4. How would you measure/estimate the answers in #2 and #3 for a proposal to your library board/director?

(Feel free to suggest and answer other questions I have overlooked.)

January 21, 2011

Feldspar: A System for Finding Information by Association

Filed under: Associations,Query Language,TMQL,Visual Query Language — Patrick Durusau @ 5:28 pm

Feldspar: A System for Finding Information by Association

…use non-specific requirements to find specific things.

Uses associations to build queries.

The associations are developed by Google Desktop.

Very cool!

Workshop on Human-Computer Interaction and Information Retrieval

Workshop on Human-Computer Interaction and Information Retrieval

From the website:

Human-computer Information Retrieval (HCIR) combines research from the fields of human-computer interaction (HCI) and information retrieval (IR), placing an emphasis on human involvement in search activities.

The HCIR workshop has run annually since 2007. The workshop unites academic researchers and industrial practitioners working at the intersection of HCI and IR to develop more sophisticated models, tools, and evaluation metrics to support activities such as interactive information retrieval and exploratory search. It provides an opportunity for attendees to informally share ideas via posters, small group discussions and selected short talks.

Workshop participants present interfaces (including mockups, prototypes, and other early-stage designs), research results from user studies of interfaces, and system demonstrations related to the intersection of Human Computer Interaction (HCI) and Information Retrieval (IR). The intent of the workshop is not archival publication, but rather to provide a forum to build community and to stimulate discussion, new insight, and experimentation on search interface design.

Proceedings from 2007 to date are available.

I would point to the workshops separately, or even to some of the papers, but the site helpfully returns its base URL for all resources.

Good weekend or even weekday reading!

GraphDB-Bench

Filed under: Graphs,Software — Patrick Durusau @ 4:11 pm

GraphDB-Bench

From the website:

GraphDB-Bench is an extensible graph database benchmarking tool. Its goal is to provide an easy-to-use library for defining and running application/domain-specific benchmarks against different graph database implementations. To achieve this the core code-base has been kept relatively simple, through extensive use of lower layers in the TinkerPop stack.

Should be useful for experimenting with different graph database implementations, as well as for generating test graphs with different topologies.
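
GraphDB-Bench itself sits on the TinkerPop (Java) stack, so purely as an illustration of what generating test graphs with different topologies can look like, here is a sketch using Python's networkx (my example, not part of GraphDB-Bench):

    import networkx as nx

    n = 1000  # number of vertices, an arbitrary choice for illustration

    # Three common test topologies: random, small-world, and scale-free.
    graphs = {
        "random (Erdos-Renyi)": nx.gnp_random_graph(n, p=0.01),
        "small world (Watts-Strogatz)": nx.watts_strogatz_graph(n, k=10, p=0.1),
        "scale free (Barabasi-Albert)": nx.barabasi_albert_graph(n, m=5),
    }

    for name, g in graphs.items():
        print(name, g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges")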

Hey Jude Flowchart – Post – Topic Map Challenge

Filed under: Examples,Exercises,Topic Maps — Patrick Durusau @ 11:38 am

Hey Jude Flowchart from Flowingdata.com.

An amusing visualization of a popular song from my youth.

As well as an opportunity for a topic map challenge!

Create a topic map of Hey Jude, using this flowchart as your starting point.

You can include other subjects but points are awarded for subjects derived from the lyrics as represented in this flowchart.

Feel free to suggest more contemporary songs if you like, but be prepared to lead the effort to topic map them!

GRAPHITE: A Visual Query System for Large Graphs

Filed under: Graphs,Software,Visual Query Language,Visualization — Patrick Durusau @ 6:45 am

GRAPHITE: A Visual Query System for Large Graphs

Watch the video, then imagine not having to convert from one data model to another but being able to treat aspects of data models as subjects.

Such that every user queries the graph using their own data model but can retrieve information entered by others, using different data models.
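
A minimal, hypothetical sketch of that idea (the vocabularies and relation names below are invented for illustration): keep one shared set of edges and map each user's query vocabulary onto the shared relation names before the query runs.

    # Shared graph: (subject, relation, object) triples entered by different users.
    edges = [
        ("alice", "employed_by", "acme"),
        ("bob", "employed_by", "acme"),
    ]

    # Each user's data model maps its own relation names onto the shared ones.
    vocabularies = {
        "hr_user": {"works_for": "employed_by"},
        "graph_user": {"employment": "employed_by"},
    }

    def query(user, relation, obj):
        """Answer a query phrased in the user's own vocabulary."""
        shared = vocabularies[user].get(relation, relation)
        return [s for (s, r, o) in edges if r == shared and o == obj]

    print(query("hr_user", "works_for", "acme"))      # ['alice', 'bob']
    print(query("graph_user", "employment", "acme"))  # same answer, different model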

For a more formal treatment, see: GRAPHITE: A Visual Query System for Large Graphs.

Graph Exploration System (GUESS)

Filed under: Graphs,Visualization — Patrick Durusau @ 6:20 am

Graph Exploration System (GUESS)

Graph visualization software with a number of very interesting features.

If you like the video, grab a copy of the software from: GUESS: The Graph Exploration System.

Gremlin: A Graph Traversal Language (Tutorial 1)

Filed under: Graphs,Gremlin — Patrick Durusau @ 5:51 am

Gremlin: A Graph Traversal Language (Tutorial 1)

Warning: Watch only after several very strong cups of coffee! 😉

I started watching this video without my headset on and was struck by the frequent backspace/error correction.

It gave the impression of someone hurriedly demonstrating commands to another users in real time.

Then I put my headset on. Wow!

It has to be one of the fastest-spoken videos I have ever seen!

Quite amazing.

You won’t remember every command, but by the end of the video you will want to start practicing with Gremlin!

January 20, 2011

IMMM 2011: The First International Conference on Advances in Information Mining and Management

Filed under: Conferences,Data Mining,Information Retrieval,Searching — Patrick Durusau @ 7:40 pm

IMMM 2011: The First International Conference on Advances in Information Mining and Management.

July 17-22, 2011 – Bournemouth, UK

See the Call for Papers for details but general areas include:

  • Mining mechanisms and methods
  • Mining support
  • Type of information mining
  • Pervasive information retrieval
  • Automated retrieval and mining
  • Mining features
  • Information mining and management
  • Mining from specific sources
  • Data management in special environments
  • Mining evaluation
  • Mining tools and applications

Important deadlines:
Submission (full paper): March 1, 2011
Notification: April 10, 2011
Registration: April 25, 2011
Camera ready: April 28, 2011

Record Linkage: Similarity Measures and Algorithms

Filed under: Record Linkage,Similarity — Patrick Durusau @ 11:30 am

Record Linkage: Similarity Measures and Algorithms. Authors: Nick Koudas, Sunita Sarawagi, Divesh Srivastava.

A little dated (2006) but still a very useful review of similarity measures under the rubric of record linkage.
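
For a taste of the measures the survey covers, here is a small, self-contained Python sketch of one of them, Jaccard similarity over character trigrams, applied to candidate record pairs (my example, not the authors' code):

    def trigrams(s):
        """Character trigrams of a lowercased, padded string."""
        s = " " + s.lower() + " "
        return {s[i:i + 3] for i in range(len(s) - 2)}

    def jaccard(a, b):
        """Jaccard similarity of the two strings' trigram sets (0.0 to 1.0)."""
        ga, gb = trigrams(a), trigrams(b)
        return len(ga & gb) / len(ga | gb) if ga | gb else 1.0

    # Toy record pair: likely the same person despite the spelling differences.
    print(jaccard("Jon A. Smith", "John Smith"))  # higher score
    print(jaccard("Jon A. Smith", "Mary Jones"))  # lower score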

Crowdsourcing for Search Evaluation

Filed under: Subject Identity — Patrick Durusau @ 6:22 am

Crowdsourcing for Search Evaluation

An interesting workshop held in connection with the 33rd Annual ACM SIGIR Conference.

Think of this in terms of crowdsourcing subject identification instead of search evaluation, and its relevance to topic maps becomes clearer.

Comments to follow on some of the specific papers.

HBase 0.90.0 Released: Over 1000 Fixes and Improvements – Post

Filed under: HBase,NoSQL — Patrick Durusau @ 6:21 am

HBase 0.90.0 Released: Over 1000 Fixes and Improvements

From Alex Popescu, news that HBase 0.90.0 has been released!

HBase homepage

80-50 Rule?

Filed under: Crowd Sourcing,Interface Research/Design,Search Interface,Uncategorized — Patrick Durusau @ 6:18 am

Watzlawick [1] recounts the following experiment:

That there is no necessary connection between fact and explanation was illustrated in a recent experiment by Bavelas (20): Each subject was told he was participating in an experimental investigation of “concept formation” and was given the same gray, pebbly card about which he was to “formulate concepts.” Of every pair of subjects (seen separately but concurrently) one was told eight out of ten times at random that what he said about the card was correct; the other was told five out of ten times at random that what he said about the card was correct. The ideas of the subject who was “rewarded” with a frequency of 80 per cent remained on a simple level, while the subject who was “rewarded” only at a frequency of 50 per cent evolved complex, subtle, and abstruse theories about the card, taking into consideration the tiniest detail of the card’s composition. When the two subjects were brought together and asked to discuss their findings, the subject with the simpler ideas immediately succumbed to the “brilliance” of the other’s concepts and agreed the other had analyzed the card correctly.

I repeat this account because it illustrates the impact that “reward” systems can have on results.

Whether the “rewards” are members of a crowd or experts.

Questions:

  1. Should you randomly reject searches when training users to search for subjects?
  2. What literature supports your conclusion in #1? (3-5 pages)

This study does raise the interesting question of whether conferences should track and randomly reject authors to encourage innovation.

1. Watzlawick, Paul, Janet Beavin Bavelas, and Don D. Jackson. 1967. Pragmatics of Human Communication: A Study of Interactional Patterns, Pathologies, and Paradoxes. New York: Norton.

January 19, 2011

Quantum Mechanics of Topic Maps

Filed under: Category Theory,Mapping,Maps,Topic Maps — Patrick Durusau @ 6:47 pm

I ran across Alfred Korzybski’s dictum “…the map is not the territory…” the other day.

I’ve repeated it and have heard others repeat it.

Not to mention it being quoted in any number of books on mapping and mapping technologies.

It’s a natural distinction, between the artifact of a map and the territory it is mapping.

But it is important to note that Korzybski did not say “…a map cannot be a territory….”

Like the wave/particle duality in quantum mechanics, maps can be maps or they can be territories.

Depends upon the purpose with which we are viewing them.

A rather wicked observer effect that changes the formal properties of a map vis-a-vis a territory to being the properties of a territory vis-a-vis a map.

Maps (that is, syntaxes/data models) try to avoid that observer effect by proclaiming themselves to be the best of all possible maps, in the tradition of Dr. Pangloss.

They may be the best map for some situation, but they remain subject to being viewed as a territory, should the occasion arise.

(If that sounds like category theory to you, give yourself a gold star.)

The map-as-territory principle is what enables the viewing of subject representatives in different maps as representatives of the same subjects.

Otherwise, we must await the arrival of the universal mapping strategy.

It is due to arrive on the same train as the universal subject identifier for all subjects, for all situations and time periods.

NCIBI – National Center for Integrative Biomedical Informatics

Filed under: Bioinformatics,Biomedical,Heterogeneous Data,Merging — Patrick Durusau @ 2:13 pm

NCIBI – National Center for Integrative Biomedical Informatics

From the website:

The National Center for Integrative Biomedical Informatics (NCIBI) is one of seven National Centers for Biomedical Computing (NCBC) within the NIH Roadmap. The NCBC program is focused on building a universal computing infrastructure designed to speed progress in biomedical research. NCIBI was founded in September 2005 and is based at the University of Michigan as part of the Center for Computational Medicine and Bioinformatics (CCMB).

Note the use of integrative in the name of the center.

They “get” that part.

They are in fact working on mappings to support integration of data even as I write these lines.

There is a lot to be learned from their strategies for integration and about the integration issues they face in this domain. This site is a good starting place for both.

MIMI Merge Process

Filed under: Bioinformatics,Biomedical,Data Source,Merging — Patrick Durusau @ 2:01 pm

Michigan Molecular Interactions

From the website:

MiMI provides access to the knowledge and data merged and integrated from numerous protein interactions databases. It augments this information from many other biological sources. MiMI merges data from these sources with “deep integration” (see The MiMI Merge Process section) into its single database. A simple yet powerful user interface enables you to query the database, freeing you from the onerous task of having to know the data format or having to learn a query language. MiMI allows you to query all data, whether corroborative or contradictory, and specify which sources to utilize.

MiMI displays results of your queries in easy-to-browse interfaces and provides you with workspaces to explore and analyze the results. Among these workspaces is an interactive network of protein-protein interactions displayed in Cytoscape and accessed through MiMI via a MiMI Cytoscape plug-in.

MiMI gives you access to more information than you can get from any one protein interaction source such as:

  • Vetted data on genes, attributes, interactions, literature citations, compounds, and annotated text extracts through natural language processing (NLP)
  • Linkouts to integrated NCIBI tools to: analyze overrepresented MeSH terms for genes of interest, read additional NLP-mined text passages, and explore interactive graphics of networks of interactions
  • Linkouts to PubMed and NCIBI’s MiSearch interface to PubMed for better relevance rankings
  • Querying by keywords, genes, lists or interactions
  • Provenance tracking
  • Quick views of missing information across databases.

I found the site looking for tracking of provenance after merging and then saw the following description of merging:

MIMI Merge Process

Protein interaction data exists in a number of repositories. Each repository has its own data format, molecule identifier, and supplementary information. MiMI assists scientists searching through this overwhelming amount of protein interaction data. MiMI gathers data from well-known protein interaction databases and deep-merges the information.

Utilizing an identity function, molecules that may have different identifiers but represent the same real-world object are merged. Thus, MiMI allows the user to retrieve information from many different databases at once, highlighting complementary and contradictory information.

There are several steps needed to create the final MiMI dataset. They are:

  1. The original source datasets are obtained, and transformed into the MiMI schema, except KEGG, NCBI Gene, Uniprot, Ensembl.
  2. Molecules that can be rolled into a gene are annotated to that gene record.
  3. Using all known identifiers of a merged molecule, sources such as Organelle DB or miBLAST, are queried to annotate specific molecular fields.
  4. The resulting dataset is loaded into a relational database.

Because this is an automated process, and no curation occurs, any errors or misnomers in the original data sources will also exist in MiMI. For example, if a source indicates that the organism is unknown, MiMI will as well.

If you find that a molecule has been incorrectly merged under a gene record, please contact us immediately. Because MiMI is completely automatically generated, and there is no data curation, it is possible that we have merged molecules with gene records incorrectly. If made aware of the error, we can and will correct the situation. Please report any problems of this kind to mimi-help@umich.edu.

Tracking provenance is going to be a serious requirement for mission-critical, financial, and medical topic map usage.
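
A toy sketch of those two ideas together, merging records that share any identifier while recording which source each piece came from. Nothing below is MiMI's actual code or schema; the records and sources are invented.

    # Invented records from different sources; identifier sets overlap when the
    # records describe the same real-world molecule.
    records = [
        {"source": "SourceA", "ids": {"P04637", "TP53_HUMAN"}, "name": "p53"},
        {"source": "SourceB", "ids": {"TP53_HUMAN"}, "name": "Tumor protein 53"},
        {"source": "SourceC", "ids": {"Q99999"}, "name": "Unrelated protein"},
    ]

    # Identity function: records sharing any identifier denote the same molecule.
    merged = []  # each entry: {"ids", "names", "provenance"}
    for rec in records:
        group = next((m for m in merged if m["ids"] & rec["ids"]), None)
        if group is None:
            group = {"ids": set(), "names": set(), "provenance": set()}
            merged.append(group)
        group["ids"] |= rec["ids"]
        group["names"].add(rec["name"])
        group["provenance"].add(rec["source"])  # provenance survives the merge

    for m in merged:
        print(sorted(m["ids"]), sorted(m["names"]), sorted(m["provenance"]))

Contradictory assertions can then be shown side by side, with their sources, instead of being silently overwritten.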

CEUR-WS

Filed under: Computer Science,Conferences — Patrick Durusau @ 1:48 pm

CEUR-WS

From the website:

CEUR-WS.org: fast and cost-free provision of online proceedings for scientific workshops

As of 2011-01-19, 692 proceedings, with 132 of those from meetings in 2010.

An excellent source of research, both recent and not so recent.

Scrapy

Filed under: Data Mining,Searching,Software — Patrick Durusau @ 1:34 pm

Scrapy

From the website:

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Another tool to assist with data gathering for topic map authoring.
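
A minimal spider along the lines of the current Scrapy tutorial; the URL and CSS selectors are placeholders, not a real site:

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example"
        start_urls = ["http://example.com/listing"]  # placeholder URL

        def parse(self, response):
            # Extract structured items from the page (selectors are hypothetical).
            for row in response.css("div.item"):
                yield {
                    "title": row.css("a::text").get(),
                    "link": row.css("a::attr(href)").get(),
                }
            # Follow pagination links, if any.
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)

Run it with something like scrapy runspider example_spider.py -o items.json and the yielded items land in a file ready for topic map authoring.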

Curation is the New Search is the New Curation – Post

Filed under: Indexing,Search Engines,Search Interface,Searching — Patrick Durusau @ 1:22 pm

Curation is the New Search is the New Curation

Paul Kedrosky sees a return to curation as the next phase in searching, in part because search algorithms can be gamed. But read the post; he has an interesting take on the problem.

The one comment I would add is that curation will mean not everything is curated.

Should it be?

What criteria would you use for excluding material from curation in your index of (insert your favorite topic)?

Proposition: It is an error to think everything that can be searched is worth indexing (or curating).

Topic-based Index Partitions for Efficient and Effective Selective Search

Filed under: Clustering,Search Interface,Searching — Patrick Durusau @ 11:10 am

Topic-based Index Partitions for Efficient and Effective Selective Search. Authors: Anagha Kulkarni and Jamie Callan.

Abstract:

Indexes for large collections are often divided into shards that are distributed across multiple computers and searched in parallel to provide rapid interactive search. Typically, all index shards are searched for each query. This paper investigates document allocation policies that permit searching only a few shards for each query (selective search) without sacrificing search quality. Three types of allocation policies (random, source-based and topic-based) are studied. K-means clustering is used to create topic-based shards. We manage the computational cost of applying these techniques to large datasets by defining topics on a subset of the collection. Experiments with three large collections demonstrate that selective search using topic-based shards reduces search costs by at least an order of magnitude without reducing search accuracy.

What is unclear to me is whether mapping shards across independent and distinct collections that each have topic-based shards would be as effective.

That would depend on the similarity of the shards, but that is measurable. Not to mention mappable by a topic map.

It would be interesting if large collections started offering topic-based shard APIs to their contents.

Such that a distributed query could search shards that have been mapped as being relevant to a particular query.
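
Not the paper's pipeline, but a small scikit-learn sketch of the topic-based allocation idea: cluster (a sample of) the documents with k-means over tf-idf vectors, treat each cluster as a shard, and route queries to the nearest shard(s). The documents and query are invented.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    documents = [
        "protein interaction networks in yeast",
        "gene expression and regulation",
        "distributed index sharding for search engines",
        "search engine ranking and retrieval",
    ]

    # Vectorize and cluster a (here: tiny) sample of the collection.
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(documents)
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

    # Each cluster becomes a shard; documents go to their nearest centroid.
    for doc, shard in zip(documents, kmeans.predict(X)):
        print("shard", shard, ":", doc)

    # At query time, search only the shard(s) closest to the query vector.
    query = vectorizer.transform(["protein signaling pathways"])
    print("query routed to shard", kmeans.predict(query)[0])

The cross-collection mapping question raised above would then amount to comparing shard centroids across collections, which is the measurable similarity mentioned.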

Axioms of Context

Filed under: Context,Semantics — Patrick Durusau @ 10:17 am

Tefko Saracevic, in his keynote address The notion of context in “Information Interaction in Context” at the Third Information Interaction in Context Symposium (IIiX’10), offered the following five (5) axioms of context:

  • Axiom 1: One cannot not have a context in information interaction. Every interaction is conducted within a context. Because context-less information interaction is impossible, it is not possible not to have a context.
  • Axiom 2: Every interaction has a content and relationship aspect – context is the latter and classifies the former. It means that all interactions, apart from information derived from meaning of words or terms describing the content, have more information to be derived from context.
  • Axiom 3: The nature of information interaction is asymmetric; it involves differing processes and interpretation by parties involved. Contexts are asymmetric as well. Systems context is primarily about meanings; user context is primarily about situations.
  • Axiom 4: Context is multilayered. It extends beyond users or systems. In interactions it is customary to consider direct context, but context extends indirectly to broader social context also.
  • Axiom 5: Context is not self-revealing, nor is it self-evident. Context may be difficult to formulate and synthesize. But plenty can go wrong when not taken into consideration in interactions.

Unfortunately, only an abstract of Saracevic’s keynote is reported in the proceedings.

I think his fifth axiom, Context is not self-revealing, nor is it self-evident, is the one most relevant for topic maps.

Which subjects we mean to identify depends upon contexts we may only dimly sense, mostly because they are so familiar.

In a Bible encoding project several years ago, none of our messages made the context clear because we shared the context in which those messages took place.

Anyone who stumbled upon those messages at the time, or later, would have a hard time deciding what was being talked about and why.

We have always had the capacity to say more about context, but topic maps enable us to build mappings based on those statements of context.

The contexts that give our words meaning and identify the subjects of discussion.

Text Analytics: Yesterday, Today and Tomorrow

Filed under: Text Analytics — Patrick Durusau @ 9:23 am

Text Analytics: Yesterday, Today and Tomorrow by Tony Russell-Rose and his colleagues Vladimir Zelevinsky and Michael Ferretti.

Nothing particularly new, but a highly entertaining account of text analytics and its increasing importance.

Part 1 you could allow managers to view without assistance.

For Parts 2 and 3, well, you had better be there to provide some contextual information.
