Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

January 14, 2011

Topic Maps – Human-oriented Semantics? – A Quibble

Filed under: Marketing,TMDM,Topic Maps — Patrick Durusau @ 4:09 pm

Topic Maps – Human-oriented Semantics?

As promised, I have a quibble about the presentation that Lars made this morning. 😉

When talking about topic maps as semantic technology, Lars suggested, or at least I heard him suggest, that topic maps help the person inside the Chinese room in John Searle’s famous example.

Lars then proceeded to use an example of a topic map where the content was written in Japanese.

The point being to show that you could know something about the content, or at least about relationships between the content, whether you could read it or not.

All of which is true, but my quibble is that such an understanding is on the part of the audience to the presentation and not of the machine/person inside the Chinese room.

Even with a topic map as input, we still don’t know what, if anything, is understood by a person or machine inside the Chinese room.

All we ever know is that we got the correct response to our input.

The presentation elided the transition from the Chinese room to the audience for the presentation. Quite different, at least in my view.

I did not allow that to distract me from an otherwise excellent presentation but I thought I should mention it. 😉

NoSQL benchmarks and performance evaluations – Post

Filed under: NoSQL — Patrick Durusau @ 6:52 am

NoSQL benchmarks and performance evaluations

From Alex Popescu’s MyNoSQL blog, a gathering of NoSQL evaluations.

Used with caution, this could be useful information.

Communicating Across the Academic Divide – Post

Filed under: Marketing,Semantic Diversity — Patrick Durusau @ 5:57 am

Communicating Across the Academic Divide

Myra H. Strober writes:

However, while doing research for my new book, Interdisciplinary Conversations: Challenging Habits of Thought, I found an even more fundamental barrier to interdisciplinary work: Talking across disciplines is as difficult as talking to someone from another culture. Differences in language are the least of the problems; translations may be tedious and not entirely accurate, but they are relatively easy to accomplish. What is much more difficult is coming to understand and accept the way colleagues from different disciplines think—their assumptions and their methods of discerning, evaluating, and reporting “truth”—their disciplinary cultures and habits of mind.

I rather like the line: Talking across disciplines is as difficult as talking to someone from another culture.

That is the problem in a nutshell, isn’t it?

What most solution proposers fail to recognize is that solutions to the problem are cultural artifacts themselves.

There is no place to stand outside of culture.

So we are always trying to talk to people from other cultures. Constantly.

Even as we try to solve the problem of talking to people from other cultures.

Realizing that does not make talking across cultures any easier.

It may help us realize that the equivalent of talking louder isn’t likely to assist in talking across cultural divides.

One of the reasons why I like topic maps is that it is possible, although not easy, to capture subject identifications from different cultures.

How well a topic map does that depends on the skill of its author and those contributing information to the map.

January 13, 2011

Infosomics

Filed under: Examples,Marketing,Topic Maps — Patrick Durusau @ 7:21 pm

Reading The Newsonomics of 2011 news metrics to watch I was reminded that topic maps lack a notion of infosomics.

That is a metric, any metric, to measure the benefit that a user derives from the use of a topic map.

I have heard lots of anecdotal stories but no hard numbers.

Consider the listing of search engines you will find at: Choose the Best Search for Your Information Need.

A useful listing, and no doubt similar advice exists for search appliances, but none of it results in any hard numbers.

For example, say I am responsible for tech support for a particular software package. There is a collection of prior tech support requests with answers, manuals and other materials. Not to mention tech support staff who have general support training and training on this product in particular.

What I want to know is what measurable metrics (reduced length of support calls, fewer repeated calls from the same customer about the same issue, higher customer satisfaction) I can expect from using a topic map.

The same sort of metrics that I haven’t seen (overlooked?) for any of the search appliances.

The best case scenario would be a vendor with multiple help desk operations that were basically equivalent: set up one office with a topic map solution while the other office uses its current solution, then use automated monitoring to derive the metrics.

I prefer that sort of metric to “…someday we will all be one giant linked graph/topic map/insert your solution” type claims.

The latter being hard to evaluate in any meaningful way.
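For what it is worth, once call logs exist the metrics in question are trivial to compute. A minimal sketch in Python, where the record layout and all the numbers are hypothetical:

```python
# Hypothetical call log records: (customer_id, issue_id, duration_minutes).
def support_metrics(calls):
    """Return (average call duration, repeat-call rate) for a call log."""
    avg_duration = sum(d for _, _, d in calls) / len(calls)
    seen, repeats = set(), 0
    for customer, issue, _ in calls:
        if (customer, issue) in seen:   # same customer, same issue again
            repeats += 1
        else:
            seen.add((customer, issue))
    return avg_duration, repeats / len(calls)

# Compare a baseline office against one using a topic map (invented data).
baseline = [("c1", "i1", 30), ("c1", "i1", 25), ("c2", "i2", 40)]
with_map = [("c1", "i1", 12), ("c2", "i2", 15), ("c3", "i3", 10)]
print(support_metrics(baseline))
print(support_metrics(with_map))
```

Run over automated monitoring output from two such offices, numbers like these would be the hard metrics I keep asking for.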

Visualize This: Where the public gets its news – Post

Filed under: Examples,Topic Maps,Visualization — Patrick Durusau @ 5:29 pm

Visualize This: Where the public gets its news

Flowing Data posted a rather poor graph of a survey on where people get their national and international news.

The data was also posted so readers could create their own visualizations.

I rather like the idea of the challenge to improve the work posted.

I wonder what it would take for that to work for topic maps?

That is, how could data be presented so that it would not be too burdensome to create different topic map representations of the data?

Thinking such an exercise could serve both as a demonstration of the diversity of interests in the topic maps community and as an example of merging the resulting topic maps together.

To make the parts into a whole.

Suggestions?

Contra XTM?

Filed under: Topic Map Software,Topic Maps,XTM — Patrick Durusau @ 3:10 pm

I was reading a topic map paper that complained about difficulties processing XTM with XML tools.

In fact, the article says, you need a topic map engine to process XTM effectively.

Was that a surprise?

What if I ran across a SQL database dump with tables containing foreign keys, etc.?

I would bet that I need a SQL database engine to process it effectively.

Would that be a surprise?

XTM is, and was, an interchange syntax for topic maps.

That means people can interchange XTM topic maps with the expectation of a defined set of semantics, for processing with, wait for it, a topic map engine.

I write this because I think XML is under-recognized as a declarative semantic format and too casually viewed as a basis for processing.

There are cases where XML can be used as a basis for processing, I don’t know, tweets for example. 😉

Seriously, a file being written in XML (think word processing formats) doesn’t automatically make XML tools the best processing choice.

XTM is one of those cases, but that wasn’t a surprise.
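To make the point concrete, here is a sketch of why generic XML tools fall short. The fragment below is an illustrative XTM 2.0 snippet (identifiers invented); standard XML tooling parses it happily without ever noticing that the two topics identify the same subject:

```python
import xml.etree.ElementTree as ET

XTM = """<topicMap xmlns="http://www.topicmaps.org/xtm/" version="2.0">
  <topic id="t1">
    <subjectIdentifier href="http://example.org/composer/puccini"/>
    <name><value>Puccini</value></name>
  </topic>
  <topic id="t2">
    <subjectIdentifier href="http://example.org/composer/puccini"/>
    <name><value>Giacomo Puccini</value></name>
  </topic>
</topicMap>"""

NS = "{http://www.topicmaps.org/xtm/}"
topics = ET.fromstring(XTM).findall(NS + "topic")
print(len(topics))  # 2 -- an XML tool sees two topic elements

# A topic map engine must notice the shared subject identifier and merge
# t1 and t2 into a single topic; generic XML tooling will not.
identifiers = {t.find(NS + "subjectIdentifier").get("href") for t in topics}
print(len(identifiers))  # 1 -- one subject, so one topic after merging
```

The merging semantics live in the topic map engine, not in the markup.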

Introduction to MongoDB

Filed under: MongoDB,NoSQL — Patrick Durusau @ 1:12 pm

Introduction to MongoDB by Justin Jenkins.

Maybe I am just getting jaded or tired, perhaps both, but the hello world examples in introductions have worn thin.

They can be used to illustrate elementary operations, very elementary operations, but once you have seen one elementary operation, you have seen them all.

Just once I would like to see code to crack the passwords protecting the White House switchboard, or a script for real-time TCP/IP packet replacement. Maybe not exactly those, but you get the drift.*

Something with some bite to it.

Perhaps in addition to the hello world examples.

The introduction by Jenkins is serviceable enough, but for real details, see: MongoDB, the MongoDB homesite.

*****
* The equivalent for topic maps would be an example of how to make leaked information dangerous rather than simply annoying.

For example, a topic map could merge currently secret (or public) information about an individual to assist in the evaluation of a leak. Or to decide on how to exploit it. Without every analyst having to dig up the same information.

Document Indexing – Wrong Level?

Filed under: Indexing,Search Engines — Patrick Durusau @ 8:16 am

I was reading the Jaccard distance treatment in Mining Massive Datasets by Anand Rajaraman and Jeffrey D. Ullman and something that keeps nagging at me became clearer.

Is document indexing the wrong level for indexing?

Take a traditional research paper as an example.

You would give me low marks if I handed in a paper with the following as one of my footnotes:

# Principia Mathematica, Volume 1

But that is a perfectly acceptable result for a search engine. I am pointed to an entire document as relevant to my search.

True enough but hardly very helpful.

Search engines can take me to a document but that still leaves all the hard work to me.

Not that I mind the hard work but that hard work is done over and over again, as each user encounters the document.

Seems terribly inefficient to have the same work done each time the document is returned.

Say, for example, that I am searching for the proof that 1 + 1 = 2. I should be able to create a representative for that subject that points every searcher to the same location, as opposed to them digging out that bit of information for themselves.

I have heard that bit of information assigned various locations in Principia Mathematica. I am acquiring a reprint so I can verify the location for myself and will post it.

Topic maps help because they are about subject indexing which I take to be different from document indexing.

A document index only tells you that somewhere in a document, one or more terms relevant to your search may be found. Not terribly helpful.

A subject index, on the other hand, particularly if made using a topic map, not only isolates the location of a subject but can also provide additional information about it.
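A toy contrast between the two, with hypothetical identifiers and with the exact location in Principia left unverified, as above:

```python
# A document-level index: term -> documents that mention it somewhere.
document_index = {"1 + 1 = 2": ["Principia Mathematica, Volume 1"]}

# A subject-centric index: one proxy per subject, collecting a precise
# occurrence plus other statements about the subject. The identifiers
# and the related subject are invented for illustration.
subject_index = {
    "http://example.org/subject/proof-1+1=2": {
        "names": ["proof that 1 + 1 = 2"],
        "occurrences": [("Principia Mathematica, Volume 1",
                         "location unverified")],
        "see_also": ["http://example.org/subject/peano-axioms"],
    },
}

def document_lookup(term):
    return document_index.get(term, [])   # "look somewhere in here"

def subject_lookup(subject_id):
    return subject_index.get(subject_id)  # the occurrence plus context

print(document_lookup("1 + 1 = 2"))
print(subject_lookup("http://example.org/subject/proof-1+1=2")["see_also"])
```

The first answer leaves the hard work to every reader; the second does it once.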

Scaling Jaccard Distance for Document Deduplication: Shingling, MinHash and Locality-Sensitive Hashing – Post

Filed under: Data Mining,Similarity — Patrick Durusau @ 5:42 am

Scaling Jaccard Distance for Document Deduplication: Shingling, MinHash and Locality-Sensitive Hashing

Bob Carpenter of Ling-Pipe Blog points out the treatment of Jaccard distance in Mining Massive Datasets by Anand Rajaraman and Jeffrey D. Ullman.

Worth a close look.
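The quantity itself is simple; shingling and MinHash are about computing it at scale. A small sketch of the exact version:

```python
def shingles(text, k=3):
    """Character k-shingles: the set representation used before MinHash."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B| for two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

print(jaccard_distance(shingles("topic maps"), shingles("topic map")))  # 0.125
```

MinHash signatures approximate this distance when there are too many documents to intersect shingle sets pairwise; locality-sensitive hashing then avoids comparing most pairs at all.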

Topic Maps – Human-oriented Semantics?
(streaming video – 14.01.2011 12:00 GMT)

Filed under: Marketing,Topic Maps — Patrick Durusau @ 5:41 am

Topic Maps – Human-oriented Semantics?

Lars Marius Garshol has a topic maps presentation tomorrow in Sogndal.

I won’t be in Norway tomorrow but there will be a streaming video of the presentation.

Despite the 7 AM start time on the East Coast of the US I plan on attending.

Lars has either authored or contributed to every aspect of the topic maps effort for the past decade.

I will find something to disagree with in his presentation, for old times sake if nothing else, but I am sure it will be an interesting presentation.

Please spread the word about this presentation.

January 12, 2011

Winners of Mozilla Open Data Competition announced – Post

Filed under: Visualization — Patrick Durusau @ 8:44 pm

Winners of Mozilla Open Data Competition announced

From the Revolutions: News about R, statistics and the world of open source from the staff of Revolution Analytics blog, a report on the Mozilla Open Data Competition, “How Do People Use Firefox.”

Thirty-two entries, some of which are reviewed here. All of them worth reviewing.

Chris Harrison’s Graphics – Post

Filed under: Graphical Models,Graphics,Visualization — Patrick Durusau @ 3:41 pm

Chris Harrison’s Graphics reported by Bob Carpenter at LingPipe Blog.

Visualizations are like slide presentations.

They can be painful but you do encounter those that simply work.

These are ones that just work.

It is possible to visualize a topic map as a graph, yawn, but when was the last time you saw a graph outside of math class?

True, all maps are graphs but I would be willing to bet most people would not name a map as an example of a graph.

Why?

Because a map, at least a well done one, assists its reader in accomplishing some task of interest to them. Using the map is a means to that goal, not an end unto itself.

Hmmm, maps with nodes and edges connecting those nodes,…, I know, how about Disney World Maps!

Those are maps of physical locations.

Questions:

  1. What are some of the characteristics of any one or more of the Disney maps? (3-5 pages, no citations)
  2. Find five examples of maps that are not maps of physical locations.
  3. What is different/same about the maps in #1 versus #2? (3-5 pages, no citations)

*****
PS: Depending on the status of diplomatic cables (hopefully from a number of countries), consider that a graph between the cables could be interesting.

More interesting would be photos of the folks mentioned, arranged by events or contacts they share in the US. Has characteristics of a graph but perhaps more immediately compelling.

Say showing photos of all the School of the Americas graduates clustered together, like in a high school yearbook or police mug photo book.

Or showing those same photos with US officials.

To facilitate human recognition of additional subjects to pursue.

ACM Digital Library for Computing Professionals

Filed under: Computer Science,Digital Library,Library — Patrick Durusau @ 2:59 pm

ACM Digital Library for Computing Professionals

The ACM has released a new version of its digital library and is offering a free three-month trial of it.

From the announcement:

  • Reorganized author profile pages that present a snapshot of author contributions and metrics of author influence by monitoring publication and citation counts and download usage volume
  • Broadened citation pages for individual articles with tabs for metadata and links to facilitate exploration and discovery of the depth of content in the DL
  • Enhanced interactivity tools such as RSS feeds, bibliographic exports, and social network channels to retrieve data, promote user engagement, and introduce user content
  • Redesigned binders for creating personal, annotatable collections of bibliographies or reading lists, and sharing them with ACM and non-ACM members, or exporting them into standard authoring tools like self-generated virtual PDF publications
  • Expanded table-of-contents opt-in service for all publications in the DL—from ACM and other publishers—that alerts users via email and RSS feeds to new issues of journals, magazines, newsletters, and proceedings.

I mention it here for a couple of reasons:

1) For resources on computing, whether contemporary or older materials, I can’t think of a better starting place for research. I am here more often than not.

2) It sets a benchmark for what is available in terms of digital libraries. If you are going to use topic maps to build a digital library, what would you do better?

Information Theory, Inference, and Learning Algorithms

Filed under: Inference,Information Theory,Machine Learning — Patrick Durusau @ 11:46 am

Information Theory, Inference, and Learning Algorithms by David J.C. MacKay; the full text of the 2005 printing is available for download. Software is also available.

From a review that I read (http://dx.doi.org/10.1145/1189056.1189063), MacKay treats machine learning as the other side of the coin from information theory.

Take the time to visit MacKay’s homepage.

There you will find his book Sustainable Energy – Without the Hot Air. Highly entertaining.

International Workshop on Semantic Technologies for Information-Integrated Collaboration (STIIC 2011)

Filed under: Conferences,Semantic Diversity,Semantics — Patrick Durusau @ 10:03 am

International Workshop on Semantic Technologies for Information-Integrated Collaboration (STIIC 2011) as part of the 2011 International Conference on Collaboration Technologies and Systems (CTS 2011), May 23 – 27, 2011, The Sheraton University City Hotel, Philadelphia, Pennsylvania, USA.

From the announcement:

Information-integrated collaboration networks have become an important part of today’s complex enterprise systems – this becomes obvious if we consider, as a prominent example, the high dynamics of network-centric systems, which need to react to changes at the level of their information and communication space by providing flexible mechanisms to manage a wide variety of information resources, heterogeneous, decentralized, and constantly evolving. Semantic technologies promise to deliver innovative and effective solutions to this problem, facilitating the realization of information integration mechanisms that allow collaboration systems to provide the added value they are expected to.

Two fundamental problems are inherent to the design of integrated collaboration solutions: (i) semantic inaccessibility, caused by the failure to explicitly specify the semantic content of the information contained within the subsystems that must share information in order to collaborate effectively; and (ii) logical disconnectedness: caused by the failure to explicitly represent constraints between the information managed by the different collaborating subsystems.

Mainstream EAI technologies deal with information and information management tasks at the syntactic level. Data protocols and standards that are used to facilitate seamless information exchange and ‘plug and play’ interoperability do not take into account the meaning of the underlying information and the view of the individual stakeholders on the information exchanged. What is lacking are mechanisms that have the ability to capture, store, and manage the meaning of the data and artifacts that need to be shared for collaborative problem solving, decision support, planning, and execution.

Important Dates:

Paper submissions: January 24, 2011

Acceptance notification: February 11, 2011

Camera ready papers and registration due: March 1, 2011

Conference dates: May 23 – 27, 2011

I rather like the line:

What is lacking are mechanisms that have the ability to capture, store, and manage the meaning of the data and artifacts that need to be shared for collaborative problem solving, decision support, planning, and execution.

Sorta says it all, doesn’t it?

TM++ Topic Maps Engine – News

Filed under: Topic Map Software — Patrick Durusau @ 8:23 am

TM++ Topic Maps Engine has moved to using the Microsoft(R) Visual Studio(R) 2010 IDE for development.

January 11, 2011

1st International Workshop on Semantic Publication (SePublica 2011)

Filed under: Conferences,Ontology,OWL,RDF,Semantic Web,SPARQL — Patrick Durusau @ 7:24 pm

1st International Workshop on Semantic Publication (SePublica 2011) in connection with 8th Extended Semantic Web Conference (ESWC 2011), May 29th or 30th, Hersonissos, Crete, Greece.

From the Call for Papers:

The CHALLENGE of the Semantic Web is to allow the Web to move from a dissemination platform to an interactive platform for networked information. The Semantic Web promises to “fundamentally change our experience of the Web”.

In spite of improvements in the distribution, accessibility and retrieval of information, little has changed in the publishing industry so far. The Web has succeeded as a dissemination platform for scientific and non-scientific papers, news, and communication in general; however, most of that information remains locked up in discrete documents, which are poorly interconnected to one another and to the Web.

The connectivity tissues provided by RDF technology and the Social Web have barely made an impact on scientific communication nor on ebook publishing, neither on the format of publications, nor on repositories and digital libraries. The worst problem is in accessing and reusing the computable data which the literature represents and describes.

No, I am not going to say that topic maps are the magic bullet that will solve all those issues or the ones listed in their Questions and Topics of Interest.

What I do think topic maps bring to the table is an awareness that semantic interoperability isn’t primarily a format or computational problem.

Every new (and impliedly universal) format or model simply compounds the semantic interoperability problem.

By creating yet more formats and/or models between which semantic interoperability has to be designed.

Starting with the question of what subjects need to be identified and how they are identified now could lead to a viable, local semantic interoperability solution.

What more could a client want?

Local semantic interoperability solutions can form the basis for spreading semantic interoperability, one solution at a time.

*****
PS: Forgot the important dates:

Paper/Demo Submission Deadline: February 28, 23:59 Hawaii Time

Acceptance Notification: April 1

Camera Ready Version: April 15

SePublica Workshop: May 29 or May 30 (to be announced)

12th IEEE International Conference on Information Reuse and Integration (IEEE IRI-2011)

Filed under: Conferences,Information Integration,Information Reuse — Patrick Durusau @ 7:02 pm

12th IEEE International Conference on Information Reuse and Integration (IEEE IRI-2011)

From the announcement:

Given the emerging global Information-centric IT landscape that has tremendous social and economic implications, effectively processing and integrating humongous volumes of information from diverse sources to enable effective decision making and knowledge generation have become one of the most significant challenges of current times. Information Reuse and Integration (IRI) seeks to maximize the reuse of information by creating simple, rich, and reusable knowledge representations and consequently explores strategies for integrating this knowledge into systems and applications. IRI plays a pivotal role in the capture, representation, maintenance, integration, validation, and extrapolation of information; and applies both information and knowledge for enhancing decision-making in various application domains.

This conference explores three major tracks: information reuse, information integration, and reusable systems. Information explores theory and practice of optimizing representation; information integration focuses on innovative strategies and algorithms for applying integration approaches in novel domains; and reusable systems focus on developing and deploying models and corresponding processes that enable Information Reuse and Integration to play a pivotal role in enhancing decision-making processes in various application domains.

All three tracks depend on subject identity, whether explicitly recognized or not. Would be nice to have topic map representatives at the conference.

Important Dates:

Paper submission deadline February 15, 2011

Notification of acceptance April 15, 2011

Camera-ready paper due May 1, 2011

Presenting author registration due May 1, 2011

Advance (discount) registration for general public and other co-author June 30, 2011

Hotel reservation (special discount rate) closing date July 15, 2011

Conference events August 3-5, 2011

Just picking at random from prior proceedings, I noticed:

Inconsistency: the good, the bad, and the ugly by Du Zhang from the 9th annual meeting.

Definitely a topic map sort of conference.

Dynamic Semantic Publishing for any Blog (Part 1 + 2) – Post(s)

Filed under: Entity Extraction,Semantic Web,Semantics — Patrick Durusau @ 5:10 pm

Dynamic Semantic Publishing for any Blog (Part 1)

Benjamin Nowack outlines how he would replicate the dynamic semantic publishing approach used by the BBC in their coverage of the 2010 World Cup.

Dynamic Semantic Publishing for any Blog (Part 2) will disappoint anyone interested in developing dynamic semantic publishing solutions.

Block level overview that repeats what anyone interested in semantic technologies already knows.

Extended infomercial.

Save your time and look elsewhere for substantive content on semantic publishing.

Linked Data Extraction with Zemanta and OpenCalais

Filed under: Entity Extraction,Linked Data — Patrick Durusau @ 1:53 pm

Linked Data Extraction with Zemanta and OpenCalais

Benjamin Nowack’s review at BNODE of Named Entity Extraction APIs by Zemanta and OpenCalais.

You can brew your own entity extraction routines and likely will for specialized domains. For more general work, or just to become familiar with entity extraction and its limitations, the APIs Benjamin reviews are a good starting place.

Every Subject A Topic?

Filed under: Authoring Topic Maps,Graphs,Networks — Patrick Durusau @ 10:16 am

The obvious answer to the question Every Subject A Topic? is no, but I wanted to write up a specific use case I saw discussed today.

I was watching Understanding Graph Databases with Darren Wood, part of the NoSQL Tapes earlier today.

Wood mentioned that in intelligence work a node that has a lot of connections to other nodes, really isn’t that interesting.

For example, when modeling telephone calls, the fact that everyone calls the local pizza place isn’t all that interesting.

On the other hand, a node with few connections, especially a connection that bridges subgraphs, could be very interesting.

I thought about that in terms of modeling say campaign finances with a topic map.

I could have a topic that represents Democrats, one that represents Republicans and one for each of the other parties.

Plus create an association with each of those topics for each donation.

But that gets noisy when you think about it from the perspective of the resulting graph.

Some options come to mind:

  1. Preserve the information but as part of each donation represented as a topic.
  2. Create a topic that is just the number of donations and the sum donated.
  3. A variant on #2, except aggregated by zip code, to enable a map coloring of donations by zip code.

Will have to think about different ways to create a topic map on the same data.

To establish a baseline for comparing modeling choices.

Finishing up ODF edits this month but perhaps something in the February time frame.

Banned in China!

Filed under: Data Analysis — Patrick Durusau @ 7:21 am

Data Analysis Using Regression and Multilevel/Hierarchical Models (ISBN-13: 9780521686891) has been banned in China due to politically sensitive materials in the text.

What politically sensitive materials in the text set off the censors is unknown at this point but apparently queries are pending.

There is only one response to censorship or attempts at censorship:

Order a copy of the work in question and urge others to do so as well. (Or assist in the dissemination of the materials.)

I say that without qualification or limitation.

Censorship, whether political (insulting the government), national security (diplomatic cables for example), religious (cartoons), or otherwise, is the refuge of the insecure.

If something bothers you, don’t look.

*****
PS: I just ordered my copy, how about you?

International Conference on Theory and Practice of Digital Libraries

Filed under: Conferences,Library,Library software,Topic Maps — Patrick Durusau @ 6:40 am

International Conference on Theory and Practice of Digital Libraries – Call for papers in four general areas:

Foundations: Technology and Methodologies

  • Digital libraries: architectures and infrastructures
  • Metadata standards and protocols in digital library systems
  • Interoperability in digital libraries, data and information integration
  • Distributed and collaborative information spaces
  • Systems, algorithms, and models for digital preservation
  • Personalization in digital libraries
  • Information access: retrieval and browsing
  • Information organization
  • Information visualization
  • Multimedia information management and retrieval
  • Multilinguality in digital libraries
  • Knowledge organization and ontologies in digital libraries

Digital Humanities

  • Digital libraries in cultural heritage
  • Computational linguistics: text mining and retrieval
  • Organizational aspects of digital preservation
  • Information policy and legal aspects (e.g., copyright laws)
  • Social networks and networked information
  • Human factors in networked information
  • Scholarly primitives

Research Data

  • Architectures for large-scale data management (e.g., Grids, Clouds)
  • Cyberinfrastructures: architectures, operation and evolution
  • Collaborative information environments
  • Data mining and extraction of structure from networked information
  • Scientific data curation
  • Metadata for scientific data, data provenance
  • Services and workflows for scientific data
  • Data and knowledge management in virtual organizations

Applications and User Experience

  • Multi-national digital library federations (e.g., Europeana)
  • Digital Libraries in eGovernment, elearning, eHealth, eScience, ePublishing
  • Semantic Web and Linked Data
  • User studies for and evaluation of digital library systems and applications
  • Personal information management and personal digital libraries
  • Enterprise-scale knowledge and information management
  • User behavior and modeling
  • User mobility and context awareness in information access
  • User interfaces for digital libraries

Topic maps have a contribution to make in these areas. Don’t be shy!

Important dates

Abstract submission (full and short papers): March 21, 2011

Research paper submission: March 28, 2011 (midnight HAST, GMT -10hrs)

Notification of acceptance: May 23, 2011

Submission of final version: June 6, 2011

******
PS: Note the call for demos on all the same areas. Demo submission – Due March 28, 2011; Notification of acceptance – May 23, 2011; Submission of final version – June 6, 2011

January 10, 2011

Walking Towards A Topic Map

Filed under: Graphs,Machine Learning — Patrick Durusau @ 6:55 pm

Improving graph-walk-based similarity with reranking: Case studies for personal information management Authors: Einat Minkov, William W. Cohen Keywords: graph walk, learning, semistructured data, PIM

Abstract:

Relational or semistructured data is naturally represented by a graph, where nodes denote entities and directed typed edges represent the relations between them. Such graphs are heterogeneous, describing different types of objects and links. We represent personal information as a graph that includes messages, terms, persons, dates, and other object types, and relations like sent-to and has-term. Given the graph, we apply finite random graph walks to induce a measure of entity similarity, which can be viewed as a tool for performing search in the graph. Experiments conducted using personal email collections derived from the Enron corpus and other corpora show how the different tasks of alias finding, threading, and person name disambiguation can be all addressed as search queries in this framework, where the graph-walk-based similarity metric is preferable to alternative approaches, and further improvements are achieved with learning. While researchers have suggested to tune edge weight parameters to optimize the graph walk performance per task, we apply reranking to improve the graph walk results, using features that describe high-level information such as the paths traversed in the walk. High performance, together with practical runtimes, suggest that the described framework is a useful search system in the PIM domain, as well as in other semistructured domains. (emphasis in original)

OK, so I lied. The title of the post isn’t the title of the article. Sue me. 😉

Although, on the other hand you will find that for the authors, relatedness and similarity are used interchangeably (footnote 4), which I found to be rather odd.

My point being that creation of a topic map can be viewed as a process of refinement.

Based on some measure of similarity, you can decide that enough information has been identified or gathered together about a subject and simply stop.

There may well be additional information that could be refined out of a graph about a subject but there is no rule that compels you to do so.
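The graph-walk idea in the abstract can be sketched in a few lines. The graph below is a toy version of the paper's setting (the node names, edge data, and walk parameters are my own illustration, not from the paper): messages, persons and terms as nodes, with sent-to and has-term relations as edges, and similarity estimated by where short random walks from a start node tend to land.

```python
import random
from collections import defaultdict

# Toy heterogeneous graph: (source, relation, target) triples.
edges = [
    ("msg1", "sent-to", "alice"),
    ("msg1", "has-term", "budget"),
    ("msg2", "sent-to", "alice"),
    ("msg2", "has-term", "budget"),
    ("msg3", "sent-to", "bob"),
    ("msg3", "has-term", "lunch"),
]

graph = defaultdict(list)
for src, rel, dst in edges:
    graph[src].append(dst)
    graph[dst].append(src)  # walks may follow edges in either direction

def walk_similarity(start, steps=3, trials=5000, seed=0):
    """Estimate similarity to `start` by counting which nodes short
    random walks visit; more visits = more related."""
    rng = random.Random(seed)
    hits = defaultdict(int)
    for _ in range(trials):
        node = start
        for _ in range(steps):
            node = rng.choice(graph[node])
            if node != start:
                hits[node] += 1
    total = sum(hits.values())
    return {n: c / total for n, c in sorted(hits.items(), key=lambda kv: -kv[1])}

scores = walk_similarity("msg1")
```

Here msg2 shows up as related to msg1 (shared recipient and term) while msg3 never does, since nothing connects them. The paper's reranking step would then reorder such candidates using features of the paths traversed, which this sketch omits.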

Engineering basic algorithms of an in-memory text search engine

Filed under: Data Structures,Indexing,Search Engines — Patrick Durusau @ 4:37 pm

Engineering basic algorithms of an in-memory text search engine

Authors: Frederik Transier, Peter Sanders
Keywords: Inverted index, in-memory search engine, randomization

Abstract:

Inverted index data structures are the key to fast text search engines. We first investigate one of the predominant operations on inverted indexes, which asks for intersecting two sorted lists of document IDs of different lengths. We explore compression and performance of different inverted list data structures. In particular, we present Lookup, a new data structure that allows intersection in expected time linear in the smaller list.

Based on this result, we present the algorithmic core of a full text data base that allows fast Boolean queries, phrase queries, and document reporting using less space than the input text. The system uses a carefully choreographed combination of classical data compression techniques and inverted-index-based search data structures. Our experiments show that inverted indexes are preferable over purely suffix-array-based techniques for in-memory (English) text search engines.

A similar system is now running in practice in each core of the distributed data base engine TREX of SAP.

An interesting comparison of inverted indexes with suffix-arrays.
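The "expected time linear in the smaller list" claim is easy to see with a hash-based sketch. The posting lists below are toy data, and a Python set stands in for the paper's Lookup structure (which is a compressed bucketed array, not a hash set, but the access pattern is the same):

```python
def intersect_small_large(small, large_set):
    """Intersect two posting lists in expected time linear in the
    smaller list, by probing a precomputed lookup structure built
    over the larger one."""
    return [doc_id for doc_id in small if doc_id in large_set]

# Posting lists: sorted document IDs per term (toy data).
postings = {
    "inverted": [2, 4, 7, 11, 19, 23],
    "index":    [4, 5, 11, 30],
}
large, small = postings["inverted"], postings["index"]
large_set = set(large)  # built once per list, reused across queries
result = intersect_small_large(small, large_set)
# result == [4, 11]
```

The interesting engineering in the paper is making that lookup structure compact; a plain hash set would give away the space savings that make in-memory indexing attractive in the first place.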

I am troubled by the "reconstruct the input" aspects of the paper.

While it is understandable and in some cases, more efficient, for data to be held in a localized data structure, my question is what do we do when data exceeds local storage capacity?

Think about the data held by Lexis/Nexis, for example. Where would we put it while creating a custom data structure for its access?

There are data sets, important data sets, that have to be accessed in place.

And those data sets need to be addressed using topic maps.

*****
You may recall from the TAO paper by Steve Pepper the illustration of topics, associations and occurrences floating above a data set.

While topic map formats have been useful in many ways, they have distracted from the vision of topic maps as an information overlay as opposed to yet-another-format.

Formats are just that, formats. Pick one.

Efficient set intersection for inverted indexing

Filed under: Data Structures,Information Retrieval,Sets — Patrick Durusau @ 4:08 pm

Efficient set intersection for inverted indexing

Authors: J. Shane Culpepper, Alistair Moffat
Keywords: Compact data structures, information retrieval, set intersection, set representation, bitvector, byte-code

Abstract:

Conjunctive Boolean queries are a key component of modern information retrieval systems, especially when Web-scale repositories are being searched. A conjunctive query q is equivalent to a |q|-way intersection over ordered sets of integers, where each set represents the documents containing one of the terms, and each integer in each set is an ordinal document identifier. As is the case with many computing applications, there is tension between the way in which the data is represented, and the ways in which it is to be manipulated. In particular, the sets representing index data for typical document collections are highly compressible, but are processed using random access techniques, meaning that methods for carrying out set intersections must be alert to issues to do with access patterns and data representation. Our purpose in this article is to explore these trade-offs, by investigating intersection techniques that make use of both uncompressed “integer” representations, as well as compressed arrangements. We also propose a simple hybrid method that provides both compact storage, and also faster intersection computations for conjunctive querying than is possible even with uncompressed representations.

The treatment of set intersection caught my attention.

Unlike document sets, topic maps have restricted sets of properties or property values that will form the basis for set intersection (merging in topic maps lingo).

Topic maps also differ in that identity bearing properties are never ignored, whereas in searching a reverse index, terms can be included in the index that are ignored in a particular query.

What impact those characteristics will have on set intersection for topic maps remains a research question.
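The bitvector side of the trade-off the authors explore can be sketched as follows (toy document IDs; the paper's hybrid additionally switches between compressed list and bitvector representations per list depending on density, which this sketch omits). An integer bitmask makes intersection a single AND:

```python
def to_bitvector(doc_ids):
    """Pack a sorted list of document IDs into an integer bitmask."""
    bv = 0
    for d in doc_ids:
        bv |= 1 << d
    return bv

def from_bitvector(bv):
    """Unpack a bitmask back into a sorted list of document IDs."""
    out, d = [], 0
    while bv:
        if bv & 1:
            out.append(d)
        bv >>= 1
        d += 1
    return out

a = [2, 4, 7, 11, 19, 23]   # documents containing term 1
b = [4, 5, 11, 30]          # documents containing term 2
both = from_bitvector(to_bitvector(a) & to_bitvector(b))
# both == [4, 11]
```

For topic maps, the analogous operation would intersect sets of topics keyed by identity-bearing properties rather than documents keyed by terms; whether the same density trade-offs apply there is, as noted above, a research question.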

Large Scale Data Mining Using Genetics-Based Machine Learning

Filed under: Data Mining,Machine Learning — Patrick Durusau @ 3:16 pm

Large Scale Data Mining Using Genetics-Based Machine Learning

Authors: Jaume Bacardit, Xavier Llorà

Tutorial on data mining with genetics-based machine learning algorithms.

Usual examples of exploding information from genetics to high energy physics.

While those are good examples, it really isn’t necessary to go there in order to get large scale data sets.

Imagine constructing a network for all the entities and their relationships in a single issue of the New York Times.

That data isn’t as easy to acquire or process as genetic databases or results from the Large Hadron Collider.

But that is a question of ease of access and processing, not of whether the data is large scale.

The finance pages alone have listings for all the major financial institutions in the country. What about mapping their relationships to each other?

Or for that matter, mapping the phone calls, emails and other communications between the stock trading houses? Broken down by subjects discussed.

Important problems, as often as not, have data that is difficult to acquire. That doesn’t make them any less important.
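The New York Times exercise above reduces, at its simplest, to building a co-occurrence network. A minimal sketch, assuming entity extraction has already been done (the articles and entity names below are hypothetical; extracting entities from real newspaper text is exactly the hard part being glossed over):

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical articles, each reduced to its set of extracted entities.
articles = [
    {"Goldman Sachs", "SEC", "Morgan Stanley"},
    {"Goldman Sachs", "Morgan Stanley"},
    {"SEC", "Citigroup"},
]

# Edge weight = number of articles in which two entities co-occur.
cooccur = defaultdict(int)
for entities in articles:
    for pair in combinations(sorted(entities), 2):
        cooccur[pair] += 1

top = max(cooccur, key=cooccur.get)
# top == ("Goldman Sachs", "Morgan Stanley")
```

Even this toy version makes the point: the scale comes not from the number of articles but from the number of pairwise relationships, which grows quadratically with the entities in each one.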

NoSQL Tapes

Filed under: Cassandra,CouchDB,Graphs,MongoDB,Neo4j,Networks,NoSQL,OrientDB,Social Networks — Patrick Durusau @ 1:33 pm

NoSQL Tapes: A filmed compilation of interviews, explanations & case studies

From the email announcement by Tim Anglade:

Late last year, as the NOSQL Summer drew to a close, I got the itch to start another NOSQL community project. So, with the help of vendors Scality and InfiniteGraph, I toured around the world for 77 days to meet and record video interviews with 40+ NOSQL vendors, users and dudes-you-can-trust.

….

My original goals were to attempt to map a comprehensive view of the NOSQL world, its origins, its current trends and potential future. NOSQL knowledge seemed to me to be heavily fragmented and hard to reconcile across projects, vendors & opinions. I wanted to try to foster more sharing in our community and figure out what people thought ‘NOSQL’ meant. As it happens, I ended up learning quite a lot in the process (as I’m sure even seasoned NOSQLers on this list will too).

I’d like to take this opportunity to thank everybody who agreed to participate in this series: 10gen, Basho, Cloudant, CouchOne, FourSquare, Ben Black, RethinkDB, MarkLogic, Cloudera, SimpleGeo, LinkedIn, Membase, Ryan Rawson, Cliff Moon, Gemini Mobile, Furuhashi-san, Luca Garulli, Sergio Bossa, Mathias Meyer, Wooga, Neo4J, Acunu (and a few other special guests I’m keeping under wraps for now); I couldn’t have done it without them and learned by leaps & bounds for every hour I spent with each of them.

I’d also like to thank my two sponsors, Scality & InfiniteGraph, from the bottom of my heart. They were supportive in a way I didn’t think companies could be and gave me total control of the shape & content of the project. I’d encourage you to check them out if you haven’t done so already.

As always, I’ll be glad to take any comments or suggestions you may have either by email (tim@nosqltapes.com) or on Twitter (@timanglade).

Simply awesome!

Memoirs of a Graph Addict: Despair to Redemption

Filed under: Data Structures,Graphs,Neo4j — Patrick Durusau @ 11:24 am

Memoirs of a Graph Addict: Despair to Redemption Author: Marko A. Rodriguez

Alex Popescu says this slide deck covers:

  • graph structures
  • graph databases
  • graph applications
  • TinkerPop product suite

Which is true but omits that Marko also covers:

  • 2014 – 75% decrease in world population – rise of dynamically distributed democracy
  • 2018 – Eudaemonic Engine: Seeking Virtue Through Circuitry
  • 2023 – Universal Computer: A Single Computational Substrate
  • 2030 – “Man learns to encode themselves into the URI namespace…”

The graph parts are useful; your mileage may vary on the rest.
