Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

November 30, 2010

Apache Mahout – Website

Filed under: Classification,Clustering,Data Mining,Mahout,Pattern Recognition,Software — Patrick Durusau @ 8:54 pm

Apache Mahout

From the website:

Apache Mahout’s goal is to build scalable machine learning libraries. With scalable we mean:

Scalable to reasonably large data sets. Our core algorithms for clustering, classification and batch based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm. However we do not restrict contributions to Hadoop based implementations: Contributions that run on a single node or on a non-Hadoop cluster are welcome as well. The core libraries are highly optimized to allow for good performance also for non-distributed algorithms.

Current capabilities include:

  • Collaborative Filtering
  • User and Item based recommenders
  • K-Means, Fuzzy K-Means clustering
  • Mean Shift clustering
  • Dirichlet process clustering
  • Latent Dirichlet Allocation
  • Singular value decomposition
  • Parallel Frequent Pattern mining
  • Complementary Naive Bayes classifier
  • Random forest decision tree based classifier
  • High performance Java collections (previously Colt collections)

A topic maps class will only have enough time to show some examples of using Mahout. Perhaps an informal group?
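
For a taste of Mahout outside Hadoop, here is a minimal user-based recommender against Mahout’s Taste API. A sketch only: the ratings file name, neighborhood size, and similarity choice are placeholders of mine, not anything Mahout prescribes.

    import java.io.File;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class RecommenderSketch {
      public static void main(String[] args) throws Exception {
        // ratings.csv (hypothetical): one "userID,itemID,preference" per line.
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Top 3 recommendations for user 1.
        for (RecommendedItem item : recommender.recommend(1, 3)) {
          System.out.println(item.getItemID() + " : " + item.getValue());
        }
      }
    }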

RDF Extension for Google Refine

Filed under: Data Mining,RDF,Software — Patrick Durusau @ 1:09 pm

RDF Extension for Google Refine

From the website:

This project adds a graphical user interface (GUI) for exporting data of Google Refine projects in RDF format. The export is based on mapping the data to a template graph using the GUI.

See my earlier post on Google Refine 2.0.

BTW, if you don’t know the folks at DERI (Digital Enterprise Research Institute), take a few minutes (it will stretch into hours) to explore their many projects. (I will be doing a separate post on DERI projects of particular interest for topic maps soon.)

Peer-to-Peer Networks?

Filed under: Marketing,Topic Maps — Patrick Durusau @ 11:49 am

Take some paper and write down your definition of a “peer-to-peer” network. No more than a paragraph and certainly not more than a page.

Then answer the following questions:

  1. Is sharing data essential to your definition?
  2. Are libraries peers if they have common users?
  3. Are books/journals peers if they have common readers?
  4. How should we deal with semantic inconsistency between peers?

Suggestion: Don’t confuse how something is done with it being done. Technique is important in terms of performance and other trade-offs but the question here is one of underlying principles. Once those are uncovered, then we can discuss how best to put them into practice.

For example, would you consider ants and bees to have social networks? Perhaps even peer-to-peer networks? Leaving aside swarm theory for the moment, just think about how you think information is conveyed in a colony/hive.

Hans Rosling: Let my dataset change your mindset

Filed under: Marketing,Topic Maps — Patrick Durusau @ 11:23 am

Hans Rosling: Let my dataset change your mindset

Not strictly relevant to topic maps per se, but it does illustrate the sort of demonstration of power that would make an excellent marketing tool for topic maps.

As a “text” person, I tend to think of things I would want with a text, but those aren’t likely to be exciting for the average customer for topic maps.

What about “interchangeable” mashups? People do mashups all the time (or so I am told) but re-use is about zero unless you are willing to simply accept the other mashup.

Even if you watch to see what is being mashed together, that doesn’t mean the same rule will be applied tomorrow. Nor will your reliance on it stop should the rule change.

Questions:

  1. What is your favorite mashup? What is the most popular mashup?
  2. What facts would you want to add to either mashup to make it re-usable? (3-5 pages, citations)
  3. How would you demonstrate graphically the impact a topic map had on your favorite mashup? (Presentation, mock-up acceptable, real time much better)

PS: If anyone is working on this, I would be more than happy to volunteer some time to assist.

PPS: More of a long-term idea than an immediate project, but the “Plum Book” lists all the positions that are appointed by Presidents. It is published every four (4) years and is available as HTML starting with 1996.

That presents very amusing possibilities when combined with campaign finance disclosures. It could answer important questions like: have the prices of positions, adjusted for inflation, fallen by 2014?

November 29, 2010

Cloud9: a MapReduce library for Hadoop

Filed under: Hadoop,MapReduce,Software — Patrick Durusau @ 1:40 pm

Cloud9: a MapReduce library for Hadoop

From the website:

Cloud9 is a MapReduce library for Hadoop designed to serve as both a teaching tool and to support research in data-intensive text processing. MapReduce is a programming model for expressing distributed computations on massive datasets and an execution framework for large-scale data processing on clusters of commodity servers. Hadoop provides an open-source implementation of the programming model. The library itself is available on github and distributed under the Apache License.

See Data-Intensive Text Processing with MapReduce by Lin and Dyer for more details on MapReduce.

A guide to using the Cloud9 library is available, including its use on particular data sets, such as Wikipedia.
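
To make the programming model concrete, here is the canonical word-count job in the plain Hadoop MapReduce API (not Cloud9-specific; the job wiring and paths follow the standard tutorial arrangement):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
      public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context ctx) throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            ctx.write(word, ONE); // emit (token, 1) for every token
          }
        }
      }

      public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) sum += v.get(); // values arrive grouped by token
          ctx.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }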

Data-Intensive Information Processing Apps

Filed under: MapReduce,Software — Patrick Durusau @ 10:02 am

Data-Intensive Information Processing Applications

University of Maryland course by Jimmy Lin, (jimmylin@umd.edu) and Nitin Madnani, (nmadnani@umiacs.umd.edu).

From the course site:

This course is about scalable approaches to processing large amounts of information (terabytes and even petabytes). We focus mostly on MapReduce, which is presently the most accessible and practical means of computing at this scale, but will discuss other approaches as well.

Includes slides, data sets, etc.

Google App Engine

Filed under: Google App Engine,Software — Patrick Durusau @ 9:44 am

Google App Engine 1.4.0 pre-release is out!

I ran across this today.

From the webpage:

Every Google App Engine application gets enough free resources to serve approximately 5 million monthly page views.

I should be able to fit into that limitation. 😉

Is anyone working on a topic map application using the Google App Engine?

TREC Entity Track: Plans for Entity 2011

Filed under: Conferences,Entity Extraction,TREC — Patrick Durusau @ 8:50 am

TREC Entity Track: Plans for Entity 2011


Known datasets of interest: ClueWeb09, DBPedia Ontology, Billion Triple Dataset

It’s not too early to get started for next year!

TREC Entity Track: Report from TREC 2010

Filed under: Conferences,Entity Extraction,TREC — Patrick Durusau @ 8:17 am

TREC Entity Track: Report from TREC 2010

A summary of the results for the TREC Entity Track (related entity finding (REF) search task on the WWW) for 2010.

Lucene / Solr for Academia: PhD Thesis Ideas
(Disambiguated Indexes/Indexing)

Filed under: Lucene,Software,Solr — Patrick Durusau @ 5:57 am

Lucene / Solr for Academia: PhD Thesis Ideas

Excellent opportunity to make suggestions that could result not only in more academic work but also in advancement of useful open source software.

My big idea (I don’t mind if you borrow/steal it for implementation):

We all know how traditional indexes work. They gather up single tokens and then point back to the locations in documents where they are found.

So they can’t distinguish among differing uses of the same string. That is one aspect of the original indexing problem that led to topic maps.

What if indexers could be given configuration files that said: when indexing http://www.medicalsite.org, create a tuple for indexing that includes site=www.medicalsite.org, type=medical, etc., and index the entire tuple as a single entry.

Then enable indexers to index by members of the tuples, so that if I decide that all uses of a term of type=medical mean the same subject, I can produce an index that represents that choice.
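
Here is a minimal sketch of the idea against current Lucene APIs (class names differ in the 3.x line current as I write; the field names and sample values are mine):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    public class TupleIndexSketch {
      public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

        // Index the token as a tuple, not a bare string.
        Document doc = new Document();
        doc.add(new StringField("term", "cold", Field.Store.YES));
        doc.add(new StringField("site", "www.medicalsite.org", Field.Store.YES));
        doc.add(new StringField("type", "medical", Field.Store.YES));
        writer.addDocument(doc);
        writer.close();

        // "Merge": treat every use of the term with type=medical as the same subject.
        BooleanQuery.Builder q = new BooleanQuery.Builder();
        q.add(new TermQuery(new Term("term", "cold")), BooleanClause.Occur.MUST);
        q.add(new TermQuery(new Term("type", "medical")), BooleanClause.Occur.MUST);

        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        System.out.println(searcher.search(q.build(), 10).totalHits);
      }
    }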

Sounds a lot like merging, doesn’t it?

I don’t know of any index that does what I just described, but I don’t know all indexes, so if I have overlooked something, please sing out.

If successful, this would be an entirely different way of authoring topic maps against large information stores.

Not to mention creating the opportunity to monetize indexes as separate from the information resources themselves. The Readers’ Guide to Periodical Literature is a successful example of that approach as a product.

Hmmm, needs a name, how about: Disambiguated Indexes/Indexing?

Suggestions?

November 28, 2010

Names, Identifiers, LOD, and the Semantic Web

Filed under: LOD,Names,RDF,Semantic Web,Subject Identifiers — Patrick Durusau @ 5:28 pm

I have been watching the identifier debate in the LOD community, with its revisionist takes, personal accounts, and other views on what the problem is, whether there is a problem, and how to solve the problem if there is one.

I have a slightly different question: What happens when we have a name/identifier?

Short of being present when someone points to or touches an object (or themselves, or you, if they are the TSA) and says a name or identifier, what happens?

Try this experiment. Take a sheet of paper and write: George W. Bush.

Now write 10 facts about George W. Bush.

Please circle the ones you think must match to identify George W. Bush.

So, even though you knew the name George W. Bush, isn’t it fair to say that the circled facts are what you would use to identify George W. Bush?

Here’s the fun part: get a colleague or co-worker to do the same experiment. (Substitute Lady Gaga if your friends don’t know enough facts about George W. Bush.)

Now compare several sets of answers for the same person.

Working from the same name, you most likely listed different facts and different ones you would use to identify that subject.

Even though most of you would agree that some or all of the facts listed go with that person.

It sounds like even though we use identifiers/names, those just clue us in on facts, some of which we use to make the identification.

That’s the problem isn’t it?

A name or identifier can make us think of different facts (possibly identifying different subjects) and even if the same subject, we may use different facts to identify the subject.

Assuming we arrive at a set of facts (an RDF graph, whatever), we need to know: which facts identify the subject?

And a subject may have different identifying properties, depending on the context of identification.
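
The experiment is easy to model in code. A toy sketch in modern Java, with invented facts: a subject is a bag of facts, and identification is agreement on whichever facts a party circles as essential.

    import java.util.Map;
    import java.util.Objects;
    import java.util.Set;

    public class Identification {
      // Two descriptions identify the same subject only if they agree
      // on every fact circled as essential.
      static boolean sameSubject(Map<String, String> a, Map<String, String> b, Set<String> essential) {
        for (String key : essential) {
          if (!Objects.equals(a.get(key), b.get(key))) return false;
        }
        return true;
      }

      public static void main(String[] args) {
        Map<String, String> mine = Map.of("name", "George W. Bush", "office", "43rd President", "state", "Texas");
        Map<String, String> yours = Map.of("name", "George W. Bush", "party", "Republican", "state", "Texas");
        // Different choices of essential facts, different identification results:
        System.out.println(sameSubject(mine, yours, Set.of("name", "state")));  // true
        System.out.println(sameSubject(mine, yours, Set.of("name", "office"))); // false
      }
    }

Same name, different circled facts, different answers. That is the experiment in miniature.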

Questions:

  1. How to specify essential facts for identification as opposed to the extra ones?
  2. How to answer #1 for an RDF graph?
  3. How do you make others aware of your answer in #2?

Comments/suggestions?

Common Logic, ISO/IEC 24707

Filed under: Logic,Ontology — Patrick Durusau @ 11:12 am

Common Logic, ISO/IEC 24707

Every time I have a question about ISO/IEC 24707 I have to either find it on my hard drive or search for the publicly available text.

Don’t know that it will help to write it down here, but it might. 😉

Questions (of course):

  1. Annotated bibliography of citation/use of ISO/IEC 24707. (within last year)
  2. Not a logic course but do you have a preference among the syntaxes? (discussion)
  3. Are ontologies necessarily logical?

Ontologies, Semantic Data Integration, Mono-ontological (or not?)

Filed under: Marketing,Medical Informatics,Ontology,Semantic Web,Topic Maps — Patrick Durusau @ 10:21 am

Ontologies and Semantic Data Integration

Somewhat dated, 2005, but still interesting.

I was particularly taken with:

First, semantics are used to ensure that two concepts, which might appear in different databases in different forms with different names, can be described as truly equivalent (i.e. they describe the same object). This can be obscured in large databases when two records that might have the same name actually describe two different concepts in two different contexts (e.g. ‘COLD’ could mean ‘lack of heat’, ‘chronic obstructive lung disorder’ or the common cold). More frequently in biology, a concept has many different names during the course of its existence, of which some might be synonymous (e.g. ‘hypertension’ and ‘high blood pressure’) and others might be only closely related (e.g. ‘Viagra’, ‘UK92480’ and ‘sildenafil citrate’).

In my view you could substitute “topic map” everywhere the author says “ontology,” well, except in one respect.

With a topic map, you and I can have the same binding points for information about particular subjects and yet not share the same ontological structure.

Let me repeat that: With a topic map we can share (and update) information about subjects, even though we don’t share a common ontology.

You may have a topic map that reflects a political history of the United States over the last 20 years and in part it exhibits an ontology that reflects elected offices and their office holders.

For the same topic map, to which I contribute information concerning those office holders, I might have a very different ontology, involving offices in The Hague.

The important fact is that we could both contribute information about the same subjects and benefit from the information entered by others.

To put it another way, the difference is whether a system is mono-ontological or not.

Questions:

  1. Is “mono-ontological” another way of saying “ontologically/logically” consistent? (3-5 pages, citations if you like)
  2. What are the advantages of mono-ontological systems? (3-5 pages, citations)
  3. What are the disadvantages of mono-ontological systems? (3-5 pages, citations)

Computable Category Theory

Filed under: Category Theory — Patrick Durusau @ 8:01 am

Computable Category Theory

From the Preface (in the book):

This book should be helpful to computer scientists wishing to understand the computational significance of theorems in category theory and the constructions carried out in their proofs. Specialists in programming languages should be interested in the use of a functional programming language in this novel domain of application, particularly in the way in which the structure of programs is inherited from that of the mathematics. It should also be of interest to mathematicians familiar with category theory – they may not be aware of the computational significance of the constructions arising in categorical proofs.

In general, we are engaged in a bridge-building exercise between category theory and computer programming. Our efforts are a first attempt at connecting the abstract mathematics with concrete programs, whereas others have applied categorical ideas to the theory of computation.

The website has a PDF of a book by the same title and source code for its programs.

This may not be useful for the “meatball” semantics of everyday practice, but developing tools for use in the design of information systems is another matter.
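
For a feel of the bridge being built, here is the smallest possible fragment in Java rather than the book’s functional language: types as objects, functions as morphisms, with the identity and associativity laws checked on a sample value. Purely illustrative.

    import java.util.function.Function;

    public class CategoryLaws {
      public static void main(String[] args) {
        Function<Integer, Integer> f = x -> x + 1;
        Function<Integer, Integer> g = x -> x * 2;
        Function<Integer, Integer> h = x -> x - 3;
        Function<Integer, Integer> id = Function.identity();
        int x = 10;
        // Identity law: f . id == f == id . f
        System.out.println(f.compose(id).apply(x) == f.apply(x)
            && id.compose(f).apply(x) == f.apply(x));
        // Associativity: h . (g . f) == (h . g) . f
        System.out.println(h.compose(g.compose(f)).apply(x)
            == h.compose(g).compose(f).apply(x));
      }
    }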

Questions:

  1. Suggest updated works for each section in “Accompanying Texts.”
  2. Annotated bibliography of works listed in #1.
  3. Instances of use of category theory in library science? Should there be? (3-5 pages, citations)

Computational Information Design

Filed under: Interface Research/Design,Visualization — Patrick Durusau @ 7:24 am

Computational Information Design by Ben Fry develops what is required for information visualization.

Particularly relevant given the developer-driven designs/presentations I have seen.

I am sure the developers in question found them very intuitive.

Or at least I hope they did.

It would be really sad to have an interface no one found intuitive.

Questions:

  1. Annotated bibliography of citations of Fry’s dissertation. Focus on what is interesting/useful about the citing material.
  2. For collection development, would it be helpful to have a graphic overview by publication date for the collection? What other graphics would you suggest? (3-5 pages, no citations)
  3. Annotated bibliography of ten (10) works on visualization of information.

November 27, 2010

Pattern Recognition

Filed under: Authoring Topic Maps,Pattern Recognition — Patrick Durusau @ 10:00 pm

Pattern Recognition by Robi Polikar.

Survey of pattern recognition.

Any method that augments your “recognition” of subjects in texts relies on some form of “pattern recognition.”

The suggested reading at the end of the article is very helpful.

Questions:

  1. Reports of use of any of the pattern recognition techniques in library research? (2-3 pages, citations)
  2. Pick one of the reported techniques. What type of topic map would it be used with? Why? (3-5 pages, citations)
  3. Demonstrate the use of one of the reported techniques on a data set. (project/class presentation)

Introduction to Category Theory in Scala

Filed under: Category Theory,Scala — Patrick Durusau @ 9:58 pm

Introduction to Category Theory in Scala.

Jack Park and I have been bouncing posts about category theory resources off of each other for years.

This one looks like a keeper.

It may be the sort of series that acts as a bridge between an abstraction (category theory) and the real world of programming.

They are related you know.

Successful Data Integration Projects Require A Diverse Approach

Filed under: Data Integration,Marketing,Topic Maps — Patrick Durusau @ 9:57 pm

Successful Data Integration Projects Require A Diverse Approach (may require registration)

Apologies, but even though you may have to register (it’s free), I thought this story was worth mentioning.

If only for the observation that ETL (extract, transform, load) is “…a lot like throwing a bomb when all that’s needed is a bullet.” I have a less generous explanation but perhaps another time.

My point here is that data integration is a hot topic and topic maps can be part of the solution set.

No, I am not going to do one of those “…window of opportunity is closing…” routines because:

1) The MDM (master data management) folks have been trying to crack this nut since the 1980s.

2) The Semantic Web effort, with a decade of hard work, has managed to re-invent the vocabulary problem in URIs. (I still think we should send the W3C a fruit basket.)

3) Every solution is itself an opportunity for subject identity integration with other solutions. (It is a self-perpetuating business opportunity. Next to having an addictive product, the best kind.)

Making topic maps relevant to data integration is going to require that we move away from the file format = topic maps approach.

Customers should understand that topic maps put them in charge of managing their data, with their identifications. (With the potential to benefit from other identifications of the same subjects.)

That is the real diversity in data integration.

November 26, 2010

The Data Science Venn Diagram – Post

Filed under: Data Mining,Education,Humor — Patrick Durusau @ 1:41 pm

The Data Science Venn Diagram by Drew Conway is a must see!

Not only is it amusing, it is also a good way to judge the skill set needed for data science.

Balisage Contest:

Print this in color for Balisage next year.

Put diagrams on both sides of bulletin board.

Contestant and colleague have to mark the location of the contestant at the same time.

Results displayed to audience. 😉

Person who comes closest to matching the colleague’s evaluation wins a prize (to be determined).

Principal Components Analysis

Filed under: Image Recognition,Principal Component Analysis (PCA) — Patrick Durusau @ 11:54 am

A Tutorial on Principal Components Analysis by Lindsay I. Smith.

From Chapter 3:

Finally we come to Principal Components Analysis (PCA). What is it? It is a way of identifying patterns in data, and expressing the data in such a way as to highlight their similarities and differences. Since patterns in data can be hard to find in data of high dimension, where the luxury of graphical representation is not available, PCA is a powerful tool for analysing data.

The other main advantage of PCA is that once you have found these patterns in the data, you can compress the data, i.e. by reducing the number of dimensions, without much loss of information. This technique is used in image compression, as we will see in a later section.

One of the main application areas for PCA is image analysis and recognition.

Lindsay starts off with a review of the mathematics needed to work through the rest of the material.
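
If you want to follow along in code rather than on paper, here is the tutorial’s small worked 2-D data set run through the covariance and eigenvector steps, sketched with Apache Commons Math (the library is my choice, not the tutorial’s):

    import java.util.Arrays;
    import org.apache.commons.math3.linear.Array2DRowRealMatrix;
    import org.apache.commons.math3.linear.EigenDecomposition;
    import org.apache.commons.math3.linear.RealMatrix;
    import org.apache.commons.math3.stat.correlation.Covariance;

    public class PcaSketch {
      public static void main(String[] args) {
        // The worked example data from the tutorial
        // (rows = observations, columns = variables x and y).
        double[][] data = {
          {2.5, 2.4}, {0.5, 0.7}, {2.2, 2.9}, {1.9, 2.2}, {3.1, 3.0},
          {2.3, 2.7}, {2.0, 1.6}, {1.0, 1.1}, {1.5, 1.6}, {1.1, 0.9}
        };
        // Covariance is computed about the means, so no separate centering step here.
        RealMatrix cov = new Covariance(new Array2DRowRealMatrix(data)).getCovarianceMatrix();
        EigenDecomposition eig = new EigenDecomposition(cov);
        double[] values = eig.getRealEigenvalues();
        // The eigenvector with the largest eigenvalue is the first principal component.
        int first = 0;
        for (int i = 1; i < values.length; i++) if (values[i] > values[first]) first = i;
        System.out.println("eigenvalues: " + Arrays.toString(values));
        System.out.println("first principal component: " + eig.getEigenvector(first));
      }
    }

Compression is then projection of the mean-centered data onto the leading eigenvectors, dropping the rest.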

Topic maps are a natural fit for pairing up the results of image recognition, for example, and other data. More on that anon.

Publication, Publication

Filed under: Marketing,Topic Maps — Patrick Durusau @ 11:33 am

Publication, Publication is the location of Gary King’s paper by the same title, with updates to the paper and related resources.

The paper details how to take students through replication of an existing published article, in order to teach students how to produce professional-quality papers by comparing the original with their attempt to replicate it.

I mention it because topic maps are sorely lacking in a publication record. At least in professional, trade and popular literature.

Maybe due to the lack of a marketing person, something that was mentioned recently. 😉

But, like a blog, it could not be just one piece replicated a bunch of times or just an occasional blurb.

It would have to be a steady drumbeat of both theoretical (yes, I used the “T” word) and practical pieces.

I will be looking for takers in professional and trade literature, not in replies here.

Questions (of course there are questions, this is a topic maps class):

  1. What journal or magazine (print/online) would you suggest for a topic map article? What should that article focus on and why? (2-3 pages)
  2. What federal or state agency do you think needs a topic map? Be specific with references to resources illustrating the problem to be solved. (3-5 pages, citations)
  3. What is the most compelling story line for topic maps in general? For a specific case? Write 1 pagers for both.
  4. Replicate a paper (chosen with instructor, semester project).

Ensemble Based Systems in Decision Making

Filed under: Classifier Fusion,Data Fusion — Patrick Durusau @ 11:11 am

Ensemble Based Systems in Decision Making

Authors: Robi Polikar

Keywords: Multiple classifier systems, classifier combination, classifier fusion, classifier selection, classifier diversity, incremental learning, data fusion

Abstract:

In matters of great importance that have financial, medical, social, or other implications, we often seek a second opinion before making a decision, sometimes a third, and sometimes many more. In doing so, we weigh the individual opinions, and combine them through some thought process to reach a final decision that is presumably the most informed one. The process of consulting “several experts” before making a final decision is perhaps second nature to us; yet, the extensive benefits of such a process in automated decision making applications have only recently been discovered by the computational intelligence community.

Also known under various other names, such as multiple classifier systems, committee of classifiers, or mixture of experts, ensemble based systems have been shown to produce favorable results compared to those of single-expert systems for a broad range of applications and under a variety of scenarios. Design, implementation and application of such systems are the main topics of this article. Specifically, this paper reviews conditions under which ensemble based systems may be more beneficial than their single classifier counterparts, algorithms for generating individual components of the ensemble systems, and various procedures through which the individual classifiers can be combined. We discuss popular ensemble based algorithms, such as bagging, boosting, AdaBoost, stacked generalization, and hierarchical mixture of experts; as well as commonly used combination rules, including algebraic combination of outputs, voting based techniques, behavior knowledge space, and decision templates. Finally, we look at current and future research directions for novel applications of ensemble systems. Such applications include incremental learning, data fusion, feature selection, learning with missing features, confidence estimation, and error correcting output codes; all areas in which ensemble systems have shown great promise.

Ironic that the second paragraph of the abstract starts off with the very semantic diversity that bedevils effective information retrieval and navigation.

Excellent survey article on ensemble systems.
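
For a flavor of the simplest combination rule in the survey, here is a bare-bones majority-vote combiner in modern Java, generic over whatever classifiers you care to plug in. Entirely illustrative.

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.function.Function;

    // Each classifier maps a feature vector F to a label L; the ensemble
    // returns whichever label receives the most votes.
    public class MajorityVote<F, L> {
      private final List<Function<F, L>> classifiers;

      public MajorityVote(List<Function<F, L>> classifiers) {
        this.classifiers = classifiers;
      }

      public L classify(F features) {
        Map<L, Integer> votes = new HashMap<>();
        for (Function<F, L> c : classifiers) {
          votes.merge(c.apply(features), 1, Integer::sum); // tally each vote
        }
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
      }

      public static void main(String[] args) {
        // Three toy "classifiers" labeling an integer as even or odd,
        // one deliberately wrong: the ensemble outvotes it.
        List<Function<Integer, String>> voters = List.of(
            n -> n % 2 == 0 ? "even" : "odd",
            n -> n % 2 == 0 ? "even" : "odd",
            n -> "odd");
        MajorityVote<Integer, String> ensemble = new MajorityVote<>(voters);
        System.out.println(ensemble.classify(4)); // prints "even"
      }
    }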

Questions:

  1. Read and summarize this article. (1-2 pages)
  2. Choose a data set (list to be posted for class). Outline the choices or evaluations you would make in assembling an ensemble system. (3-5 pages, no citations)
  3. Build an ensemble system to assist with building a topic map for a specific data set (Project)

Scalable reduction of large datasets to interesting subsets

Filed under: OWL,RDF,Semantic Web — Patrick Durusau @ 11:04 am

Scalable reduction of large datasets to interesting subsets

Authors: Gregory Todd Williams, Jesse Weaver, Medha Atre, James A. Hendler

Keywords: Billion Triples Challenge, Scalability, Parallel, Inferencing, Query, Triplestore

Abstract:

With a huge amount of RDF data available on the web, the ability to find and access relevant information is crucial. Traditional approaches to storing, querying, and reasoning fall short when faced with web-scale data. We present a system that combines the computational power of large clusters for enabling large-scale reasoning and data access with an efficient data structure for storing and querying the accessed data on a traditional personal computer or other resource-constrained device. We present results of using this system to load the 2009 Billion Triples Challenge dataset, materialize RDFS inferences, extract an “interesting” subset of the data using a large cluster, and further analyze the extracted data using a personal computer, all in the order of tens of minutes.

I wonder about the use of the phrase “…web-scale data.”

If a billion triples is a real challenge, then what happens when RDF/RDFa is deployed across an entity- and inference-rich body of material like legal texts? Or property descriptions? Or the ownership rights based on property descriptions?

In any event, the prep of the data for inferencing illustrates a use case for topic maps:

Information about people is represented in different ways in the BTC2009 dataset, including the use of the FOAF, SIOC, DBpedia, and AKT ontologies. We create a simple upper ontology to bring together concepts and properties pertaining to people. For example, we define the class up:Person which is defined as a superclass to existing person classes, e.g., foaf:Person. We do the same for relevant properties, e.g., up:full name is a superproperty of akt:full-name. Note that “up” is the namespace prefix for our upper ontology.

What subject, represented by akt:full-name, was responsible for the mapping in question? How does that translate to other ontologies? Oh, sorry, there is no place to record that mapping.
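
For concreteness, the upper-ontology glue the authors describe amounts to a handful of RDFS triples. A sketch with Apache Jena (the library choice is mine; the “up” namespace and the akt URI below are stand-ins, not the paper’s actual URIs):

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.Property;
    import org.apache.jena.rdf.model.Resource;
    import org.apache.jena.vocabulary.RDFS;

    public class UpperOntologyGlue {
      public static void main(String[] args) {
        Model m = ModelFactory.createDefaultModel();
        String UP = "http://example.org/up#"; // placeholder namespace for the paper's "up"

        // up:Person as a superclass of foaf:Person.
        Resource upPerson = m.createResource(UP + "Person");
        Resource foafPerson = m.createResource("http://xmlns.com/foaf/0.1/Person");
        m.add(foafPerson, RDFS.subClassOf, upPerson);

        // up:full_name as a superproperty of akt:full-name (URI assumed).
        Property upFullName = m.createProperty(UP + "full_name");
        Property aktFullName = m.createProperty("http://www.aktors.org/ontology/portal#full-name");
        m.add(aktFullName, RDFS.subPropertyOf, upFullName);

        // Nothing in these triples records who asserted the mappings, on what
        // basis, or in what context; that is the gap noted above.
        m.write(System.out, "TURTLE");
      }
    }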

Questions:

  1. How do you evaluate the claims of “…web-scale data?” (3-5 pages, citations)
  2. Does creating ad-hoc upper ontologies scale? Yes/No/Why? (3-5 pages, citations)
  3. How does interchange of ad-hoc upper ontologies work? (3-5 pages, citations)

Infer.NET

Filed under: Bioinformatics,Inference,Machine Learning — Patrick Durusau @ 11:02 am

Infer.NET

From the website:

Infer.NET is a framework for running Bayesian inference in graphical models. It can also be used for probabilistic programming as shown in this video.

You can use Infer.NET to solve many different kinds of machine learning problems, from standard problems like classification or clustering through to customised solutions to domain-specific problems. Infer.NET has been used in a wide variety of domains including information retrieval, bioinformatics, epidemiology, vision, and many others.

I should not have been surprised, but a .NET language is required to use Infer.NET.

I would appreciate comments from anyone who uses Infer.NET for inferencing to assist in the authoring of topic maps.

Managing Terabytes of Web Semantics Data

Filed under: OWL,RDF,Semantic Web — Patrick Durusau @ 11:00 am

Managing Terabytes of Web Semantics Data

Authors: Michele Catasta, Renaud Delbru, Nickolai Toupikov, and Giovanni Tummarello

Abstract:

A large amount of semi structured data is now made available on the Web in the form of RDF, RDFa and Microformats. In this chapter, we discuss a general model for the Web of Data and, based on our experience in Sindice.com, we discuss how this is reflected in the architecture and components of a large scale infrastructure. Aspects such as data collection, processing, indexing, ranking are touched, and we give an ample example of an application built on top of said infrastructure.

Appears as Chapter 6 in R. De Virgilio et al. (eds.), Semantic Web Information Management, © Springer-Verlag Berlin Heidelberg 2010.

Hopefully not too repetitious with the other Sindice.com material I have been posting.

It is a good overview of the area, in addition to specifics about Sindice.com.

Semantic Now?

Filed under: Navigation,OWL,RDF,Semantic Web,Topic Maps — Patrick Durusau @ 10:58 am

Visit Semantic Web, then return here (or use a separate browser window).

I went to the Semantic Web page of the W3C looking for a prior presentation and was struck by the “semantic now” nature of the page.

It isn’t clear how to access older material.

I have to confess to having only a passing interest in self-promotional puff pieces, including logos.

I assume that is true for many of the competent researchers working with the W3C. (There are a lot of them, this is not a criticism of their work.)

So, where is the interface that enables quick access to substantial materials, including older standards, statements and presentations?

*****
I understand at least some of the W3C site is described in RDF. To what degree of detail and precision, I don’t know. It would make a starting point for a topic map of the site.

The other necessary component, and where this page falls down, would be useful navigation choices. That would be the harder problem.

Let me know if you are interested in cracking this nut.

Another Take on the Semantic Web?

Filed under: OWL,RDF,Semantic Web — Patrick Durusau @ 10:56 am

Bob Ferris constructs a take on the SW at: On Resources, Information Resources and Documents.

Whatever you think of Bob’s vision of the SW, the fundamental problem is one of requiring universal use of a flat identifier (URI).

Which leaves us with string comparison. Different string, different thing being identified.

Some of the better SW software now evaluates RDF graphs for identification of entities.

Not all that different from how we identify entities.

That departs from the URI = identifier basis of the SW, but to be useful, the departure was inevitable.

Two more challenges face the SW (where topic maps can help, there are others):

1) How to communicate to other users what parts of an RDF graph to match for identity purposes? (including matching on subparts)

2) How to communicate to other users when non-isomorphic RDF graphs are semantically equivalent?

More on those issues anon.

Mechanical Turk and Jump Starting Topic Maps

Filed under: Authoring Topic Maps,Topic Maps — Patrick Durusau @ 9:47 am

Is anyone using the Mechanical Turk for topic map authoring purposes?

Would require breaking authoring into small tasks and perhaps capturing some information in the background.

Could be a refinement step to follow automatic data extraction or evaluation.

Use the LAMP stack for data collection.

Once an authoring framework was in place, it would just be a question of populating it.

Would appreciate notes from anyone taking this approach to creating topic maps.

*****
Before anyone complains that this would not be as precise as the brooding-intellect approach to topic map authoring: yes, yes, you are right.

Just as printers rely on Danielle Steel and similar authors for their livelihood, semantic technologies, including topic maps, need to get their intellectual skirts dirty.

November 25, 2010

LingPipe Blog

Filed under: Data Mining,Natural Language Processing,Text Analytics — Patrick Durusau @ 11:07 am

LingPipe Blog: Natural Language Processing and Text Analytics

Blog for the LingPipe Toolkit.

If you want to move beyond hand-authored topic maps, NLP and other techniques are in your future.

Imagine using LingPipe to generate entity profiles that you then edit (or not) and market for particular data resources.
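
To make that concrete, here is roughly what entity extraction with LingPipe looks like, following the pattern of its named-entity tutorial (the model file name is the tutorial’s pretrained news model; the sample sentence is mine):

    import java.io.File;
    import com.aliasi.chunk.Chunk;
    import com.aliasi.chunk.Chunker;
    import com.aliasi.chunk.Chunking;
    import com.aliasi.util.AbstractExternalizable;

    public class NerSketch {
      public static void main(String[] args) throws Exception {
        // Load a pretrained named-entity chunker from disk.
        Chunker chunker = (Chunker) AbstractExternalizable.readObject(
            new File("ne-en-news-muc6.AbstractCharLmRescoringChunker"));
        String text = "Patrick Durusau writes about topic maps in Covington, Georgia.";
        Chunking chunking = chunker.chunk(text);
        for (Chunk c : chunking.chunkSet()) {
          // Each chunk is a candidate subject: surface string plus entity type.
          System.out.printf("%s\t%s%n", text.substring(c.start(), c.end()), c.type());
        }
      }
    }

Each (string, type) pair is raw material for an entity profile, and eventually a topic.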

On entity profiles, see: Sig.ma.

Fuzzy Table

Filed under: Fuzzy Matching,Hadoop,High Dimensionality — Patrick Durusau @ 10:29 am

Tackling Large Scale Data In Government.

OK, but I cite the post because of its coverage of Fuzzy Table:

FuzzyTable is a large-scale, low-latency, parallel fuzzy-matching database built over Hadoop. It can use any matching algorithm that can compare two often high-dimensional items and return a similarity score. This makes it suitable not only for comparing fingerprints but other biometric modalities, images, audio, and anything that can be represented as a vector of features.

Hmmm, “anything that can be represented as a vector of features?”

Did someone mention subject identity? 😉
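
The “vector of features” framing is easy to make concrete. Here is a minimal similarity scorer of the sort such a system could plug in (cosine similarity; entirely illustrative, not FuzzyTable code):

    public final class Cosine {
      // Cosine similarity of two feature vectors, in [-1, 1].
      public static double similarity(double[] a, double[] b) {
        if (a.length != b.length) throw new IllegalArgumentException("dimension mismatch");
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
          dot += a[i] * b[i];
          na += a[i] * a[i];
          nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
      }

      public static void main(String[] args) {
        double[] x = {1, 0, 2, 3};
        double[] y = {1, 1, 2, 2};
        System.out.println(similarity(x, y)); // higher score = more similar
      }
    }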

Worth a very close read. Software release coming.
