Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

November 19, 2010

“…an absolute game changer”

Filed under: Linked Data,LOD,Marketing,Semantic Web — Patrick Durusau @ 1:27 pm

Aldo Bucchi writes that http://uriburner.com/c/DI463N is:

Single most powerful demo available. Really looking fwd to what’s coming next.

Let’s see how this shifts gears in terms of Linked Data comprehension.
Even in its current state, this is an absolute game changer.

I know this was not easy. My hat goes off to the team for their focus.

Now, just let me send this link out to some non-believers that have
been holding back my evangelization pipeline 😉

I may count as one of the “non-believers.” 😉

Before Aldo throws open the floodgates on his “evangelization pipeline,” let me observe:

The elderly gentleman appears in: Tropical grassland, Desert, Temperate grassland, Coniferous forest, Flooded grassland, Mountain grassland, Broadleaf forest, Tropical dry forest, Rainforest, Taiga, Tundra, Urban, Tropical coniferous forests, Mountains, Coastal, and Wetlands.

So he must get around a lot.

Only the BBC appears in Estuaries.

Granted, it is a clever presentation of subjects that share a common locale, and it works fairly responsively, but that hardly qualifies as a “…game changer…”

This project is a good experiment on making information more accessible.

Why aren’t the facts enough?

All Identifiers, All The Time – LOD As An Answer?

Filed under: Linked Data,LOD,RDA,Semantic Web,Subject Identity — Patrick Durusau @ 6:25 am

I am still musing over Thomas Neidhart’s comment:

To understand this identifier you would need implicit knowledge about the structure and nature of every possible identifier system in existence, and then you still do not know who has more information about it.

Aside from the question of why this system should succeed when universal identifier systems have failed without exception in the past, there are other questions.

Such as why would any system need to encounter every possible identifier system in existence?

That is, the LOD effort has set up a strawman (apologies for the sexism) that it then proceeds to blow down.

If a subject has multiple identifiers in a set and my system recognizes only one out of three, what harm has come of the subject having the other two identifiers?

There is no processing overhead since, by admission, the system does not recognize the other identifiers, so it doesn’t process them.

The advantage is that some other system may recognize the subject on the basis of the other identifiers.

This post is a good example of that practice.

I had a category “Linked Data,” but I added a category this morning, “LOD,” just in case people search for it that way.

Why shouldn’t our computers adapt to how we use identifiers (multiple ones for the same subjects) rather than our attempting (and failing) to adapt to universal identifiers to make it easy for our computers?
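
A rough sketch of the point (the identifiers, index contents, and function names are all hypothetical): a lookup that simply skips identifiers it does not recognize pays no penalty for their presence, while leaving them in place for systems that do recognize them.

```python
# Minimal sketch: a subject carries several identifiers; each system
# resolves only the ones it knows about and silently skips the rest.
# All identifiers and index contents here are hypothetical.

subject_identifiers = {
    "http://dbpedia.org/resource/Central_Air_Force_Museum",
    "http://rdf.freebase.com/ns/m.0g_2bv",
    "local-catalog:museum/4711",
}

# This system only knows about its own local catalog scheme.
local_index = {
    "local-catalog:museum/4711": {"label": "Central Air Force Museum"},
}

def resolve(identifiers, index):
    """Return records for whichever identifiers this system recognizes."""
    return [index[i] for i in identifiers if i in index]

print(resolve(subject_identifiers, local_index))
# The two unrecognized identifiers cost nothing here, but they remain
# available for some other system that does recognize them.
```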

November 18, 2010

A Direct Mapping of Relational Data to RDF

Filed under: Ambiguity,RDF,Semantic Web,Subject Identity — Patrick Durusau @ 7:15 pm

A Direct Mapping of Relational Data to RDF

A major step towards putting relational data “on the web.”

Identifying what that data means and providing a basis for reconciling it with other data remains to be addressed.
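
For the flavor of it, here is a toy sketch in the spirit of a direct mapping (the table name, columns, and base IRI are hypothetical, and the actual specification is considerably more careful): each row becomes a subject, each column a predicate, each cell an object.

```python
# Toy sketch of a direct-style mapping from a relational row to triples.
# BASE, the table, and the row data are hypothetical.

BASE = "http://example.org/db/"

def row_to_triples(table, pk_col, row):
    subject = f"<{BASE}{table}/{pk_col}={row[pk_col]}>"
    triples = []
    for column, value in row.items():
        predicate = f"<{BASE}{table}#{column}>"
        triples.append(f'{subject} {predicate} "{value}" .')
    return triples

row = {"id": 7, "name": "John Smith", "born": "1902-03-01"}
for triple in row_to_triples("people", "id", row):
    print(triple)

# As the post notes, nothing above says what "name" or "born" mean, or how
# to reconcile them with the columns of some other database.
```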

InfoGrid: The Web Graph Database

Filed under: Database,Graphs,Infogrid — Patrick Durusau @ 7:04 pm

InfoGrid: The Web Graph Database

From the website:

InfoGrid is a Web Graph Database with many additional software components that make the development of REST-ful web applications on a graph foundation easy.

This looks like a very good introduction to graph databases.

Questions:

  1. Suggest any other introductions to graph databases you think would be suitable for library school students.
  2. Of the tutorials on graph databases you found, what would you change or do differently?
  3. What examples would you find compelling as a library school student for graph databases?

URIs and Identity

Filed under: Ambiguity,RDF,Semantic Web,Subject Identity,Topic Maps — Patrick Durusau @ 6:55 pm

If I read Halpin and others correctly, URIs identify the subjects they identify, except when they identify some other subject and it isn’t possible to know which of any number of subjects is being identified.

That is what I (and others) take as “ambiguity.”

Some readers have taken my comments on URIs to be critical of RDF, which wasn’t my intent.

What I object to is the sentiment that everyone should use only URIs and then cherry pick any RDF graph that may result for identity purposes.

For example, in a family tree, there may be an entry: John Smith.

For which we can create: http://myfamilytree.smith.com/john_smith

That may resolve to an RDF graph but what properties in that graph identify a particular John Smith?

A “uniform” syntax for that “identifier” isn’t helpful if we all reach various conclusions about what properties in the graph to use for identification.

Or if we have different tests to evaluate the values of those properties.

Even with an RDF graph and rules for which properties to evaluate, we may still have ambiguity.

But rules for evaluation of RDF graphs for identity lessen the ambiguity.

All within the context, format, data model of RDF.

That does detract from URIs as identifiers, but URIs as identifiers are no more viable than any other single token as an identifier.

Sets of key/value pairs, which are made up of tokens, have the potential to lessen ambiguity, but not banish it.
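
To make that concrete, here is a minimal sketch of such an evaluation rule, with hypothetical properties and a made-up rule about which of them count for identity:

```python
# Minimal sketch: descriptions reduced to sets of key/value pairs, plus an
# explicit rule stating which properties count for identity. The property
# names, values, and the rule itself are hypothetical.

desc_a = {"name": "John Smith", "born": "1902-03-01", "birthplace": "Natchez"}
desc_b = {"name": "John Smith", "born": "1902-03-01", "occupation": "farmer"}
desc_c = {"name": "John Smith", "born": "1967-11-23"}

IDENTITY_KEYS = ("name", "born")  # the declared evaluation rule

def same_subject(x, y, keys=IDENTITY_KEYS):
    """Same subject iff the descriptions agree on every identity key they
    both carry, and they share at least one such key."""
    shared = [k for k in keys if k in x and k in y]
    return bool(shared) and all(x[k] == y[k] for k in shared)

print(same_subject(desc_a, desc_b))  # True  - agree on name and born
print(same_subject(desc_a, desc_c))  # False - same name, different born
```

Ambiguity does not vanish (two John Smiths could share a birthday), but with the rule written down everyone at least disagrees about the same thing.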

The Positive Matching Index: A new similarity measure with optimal characteristics

Filed under: Binary Distance,Similarity,Subject Identity — Patrick Durusau @ 7:57 am

The Positive Matching Index: A new similarity measure with optimal characteristics Authors: Daniel Andrés Dos Santos, Reena Deutsch Keywords: Binary data, Association coefficient, Jaccard index, Dice index, Similarity

Abstract:

Despite the many coefficients accounting for the resemblance between pairs of objects based on presence/absence data, no one measure shows optimal characteristics. In this work the Positive Matching Index (PMI) is proposed as a new measure of similarity between lists of attributes. PMI fulfills the Tulloss’ theoretical prerequisites for similarity coefficients, is easy to calculate and has an intrinsic meaning expressable into a natural language. PMI is bounded between 0 and 1 and represents the mean proportion of positive matches relative to the size of attribute lists, ranging this cardinality continuously from the smaller list to the larger one. PMI behaves correctly where alternative indices either fail, or only approximate to the desirable properties for a similarity index. Empirical examples associated to biomedical research are provided to show out performance of PMI in relation to standard indices such as Jaccard and Dice coefficients.

An index for people who don’t think a single measure for identity (URIs) is enough, say those in the natural sciences?
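
For readers who want to experiment, here is a quick sketch of the standard Jaccard and Dice coefficients the paper compares against, computed from presence/absence data (the attribute lists are invented, and PMI itself is not reproduced here):

```python
# Jaccard and Dice similarity over presence/absence data: the shared
# attributes are the "positive matches." The example lists are made up.

def jaccard(x, y):
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)

def dice(x, y):
    x, y = set(x), set(y)
    return 2 * len(x & y) / (len(x) + len(y))

attributes_a = {"frog_a", "frog_b", "toad_c", "newt_d"}
attributes_b = {"frog_a", "toad_c", "newt_e"}

print(round(jaccard(attributes_a, attributes_b), 3))  # 0.4
print(round(dice(attributes_a, attributes_b), 3))     # 0.571
```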

November 17, 2010

RDA: Resource Description and Access

Filed under: Cataloging,Classification,RDA,Subject Identity,Topic Maps — Patrick Durusau @ 11:06 am

RDA: Resource Description and Access

From the website:

RDA: Resource Description and Access is the new standard for resource description and access designed for the digital world. Built on the foundations established by AACR2, RDA provides a comprehensive set of guidelines and instructions on resource description and access covering all types of content and media. (emphasis in original)

In case you are interested in the draft of 2008 version, just to get the flavor of it, see: http://www.rdatoolkit.org/constituencyreview.

More to follow on RDA and topic maps.

Normalized Kernels as Similarity Indices (and algorithm bias)

Filed under: Clustering,Kernel Methods,Similarity — Patrick Durusau @ 8:20 am

Normalized Kernels as Similarity Indices Author: Julien Ah-Pine Keywords: Kernels normalization, similarity indices, kernel PCA based clustering

Abstract:

Measuring similarity between objects is a fundamental issue for numerous applications in data-mining and machine learning domains. In this paper, we are interested in kernels. We particularly focus on kernel normalization methods that aim at designing proximity measures that better fit the definition and the intuition of a similarity index. To this end, we introduce a new family of normalization techniques which extends the cosine normalization. Our approach aims at refining the cosine measure between vectors in the feature space by considering another geometrical based score which is the mapped vectors’ norm ratio. We show that the designed normalized kernels satisfy the basic axioms of a similarity index unlike most unnormalized kernels. Furthermore, we prove that the proposed normalized kernels are also kernels. Finally, we assess these different similarity measures in the context of clustering tasks by using a kernel PCA based clustering approach. Our experiments employing several real-world datasets show the potential benefits of normalized kernels over the cosine normalization and the Gaussian RBF kernel.

Points out that some methods don’t result in an object being found to be most similar to…itself. What an odd result.

Moreover, it is possible for vectors that represent different scores to be treated as identical.
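
For orientation, here is the cosine normalization that the paper's new family extends, sketched with a made-up polynomial kernel and data; note that after normalization every object is maximally similar to itself.

```python
# Cosine normalization of a kernel: k'(x, y) = k(x, y) / sqrt(k(x, x) * k(y, y)).
# The polynomial kernel and the vectors are just an illustration.
import math

def poly_kernel(x, y, degree=2):
    return (sum(a * b for a, b in zip(x, y)) + 1) ** degree

def cosine_normalized(k, x, y):
    return k(x, y) / math.sqrt(k(x, x) * k(y, y))

x, y = [1.0, 2.0], [2.0, 4.0]                  # same direction, different norms
print(poly_kernel(x, y))                        # 121.0 - raw value, hard to interpret
print(cosine_normalized(poly_kernel, x, x))     # 1.0   - most similar to itself
print(cosine_normalized(poly_kernel, x, y))     # ~0.96 - close to, but below, 1
```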

Questions:

  1. What axioms of similarity indexes should we take notice of? (3-5 pages, citations)
  2. What methods treat vectors with different scores as identical? (3-5 pages, citations)
  3. Are geometric based similarity indices measuring semantic or geometric similarity? Are those the same concepts or different concepts? (10-15 pages, citations, you can make this a final paper if you like.)

Hard-Coding Bias in Google “Algorithmic” Search Results

Filed under: Access Points,Subject Headings,Subject Identity,Topic Maps — Patrick Durusau @ 7:31 am

Hard-Coding Bias in Google “Algorithmic” Search Results.

Not that I want to get into analysis of hard-coding or not in search results but it is an interesting lead into issues a bit closer to home.

To what extent does subject identification have built-in biases that impact user communities?

Or less abstractly, how would we go about discovering and perhaps countering such bias?

For countering the bias you can guess that I would suggest topic maps. 😉

The more pressing question, and one that is relevant to topic map design, is how to discover our own biases.

What seems perfectly natural to me, with a background in law, biblical studies, networking technologies, markup technologies, and now semantic technologies, may not seem so to other users.

To make matters worse, how do you ask a user about information they did not find?

Questions:

  1. How would you survey users to discover biases in subject identification? (3-5 pages, no citations)
  2. How would you discover what information users did not find? (3-5 pages, no citations)
  3. Class project: Design and test a survey for bias in a particular subject identification. (assuming permission from a library)

PS: There are biases in algorithms as well but we will cover those separately.

Maintainability, Auditability, eXtensibility – MAX Reconciliation

Filed under: Marketing,Topic Maps — Patrick Durusau @ 6:57 am

Maintainability, Auditability, eXtensibility – MAX Reconciliation

Don’t take this the wrong way, Google Refine 2.0 is a nice piece of work.

But for:

Maintainability

Auditability

eXtensibility

or MAX Reconciliation,

you are going to need something more.

I am not going to pitch an Ur data model to “bind them and in the darkness rule them.”

You and your users understand their subjects and can create the best subject identity model for their data.

I am suggesting that you leave room to make explicit the implicit knowledge that identifies your subjects.

How much of that you want to make explicit is a design choice.

The more subject identifications you make explicit, the easier it becomes to achieve reliable reconciliation that can be audited and extended.

None of those may be values for your project, but if they are…, well, you know what comes next.

November 16, 2010

Reducing Ambiguity, LOD, Ookaboo, TMRM

Filed under: Ambiguity,Subject Identity,TMRM,Topic Maps — Patrick Durusau @ 9:25 pm

While reading Resource Identity and Semantic Extensions: Making Sense of Ambiguity and In Defense of Ambiguity it occurred to me that reducing ambiguity has a hidden assumption.

That hidden assumption is the intended audience for whom I wish to reduce ambiguity.

For example, Ookaboo solves the problem of multiple vocabularies for its intended audience thusly:

Our strategy for dealing with multiple subject terminologies is to use what we call a reference set, which in this case is

http://ookaboo.com/o/pictures/topic/2021903/Central_Air_Force_Museum#it
http://dbpedia.org/resource/Central_Air_Force_Museum
http://rdf.freebase.com/ns/m.0g_2bv

If we want to assert foaf:depicts we assert foaf:depicts against all of these. The idea is that not all clients are going to have the inferencing capabilities that I wish they’d have, so I’m trying to assert terms in the most “core” databases of the LOD cloud.

In a case like this we may have YAGO, OpenCyc, UMBEL and other terms available. Relationships like this are expressed as

<:Whatever> <ontology2:aka>
<http://mpii.de/yago/resource/Central_Air_Force_Museum> .

<ontology2:aka>, not dereferencable yet, means (roughly) that “some people use term X to refer to substantially the same thing as term Y.” It’s my own answer to the <owl:sameAs> problem and deliberately leaves the exact semantics to the reader. (It’s a lossy expression of the data structures that I use for entity management)

This is very like a TMRM solution since it gathers different identifications together, in hopes that at least one will be understood by a reader.

This is very unlike a TMRM solution because it has no legend to say how to compare these “values,” much less their “key.”

The lack of a legend makes integration in legal, technical, medical or intelligence applications, ah, difficult.
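
For concreteness, here is a minimal sketch of the difference: a reference set that gathers identifiers, plus the sort of explicit legend the LOD version lacks, saying which keys to compare and how (the comparison rule and the second record's label are hypothetical):

```python
# Minimal sketch: an Ookaboo-style reference set plus a written-down legend.
# The comparison rule and the second record's label are hypothetical.

museum = {
    "subject_identifiers": {
        "http://ookaboo.com/o/pictures/topic/2021903/Central_Air_Force_Museum#it",
        "http://dbpedia.org/resource/Central_Air_Force_Museum",
        "http://rdf.freebase.com/ns/m.0g_2bv",
    },
    "label": "Central Air Force Museum",
}

other = {
    "subject_identifiers": {"http://dbpedia.org/resource/Central_Air_Force_Museum"},
    "label": "Air Force Museum (hypothetical alternate record)",
}

# The legend: two proxies represent the same subject when their
# subject_identifiers overlap; labels play no role in the comparison.
def should_merge(a, b):
    return bool(a["subject_identifiers"] & b["subject_identifiers"])

print(should_merge(museum, other))  # True - one shared identifier
```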

Still, it is encouraging to see the better Linked Data applications moving in the direction of the TMRM.

In Defense of Ambiguity

Filed under: OWL,RDF,Semantic Web,Subject Identity — Patrick Durusau @ 5:49 pm

In Defense of Ambiguity by Patrick J. Hayes and Harry Halpin was cited in David Booth’s article, so like any academic, I had to go read the cited paper. 😉

Highly recommended.

The authors conclude:

Regardless of the details, the use of any technology in Web architecture to distinguish between access and reference, including our proposed ex:refersTo and ex:describedBy, does nothing more than allow the author of a URI to explain how they would like the URI to be used. Ultimately, there is nothing that Web architecture can do to prevent a URI from being used to refer to some thing non-accessible. However, at least having a clear and coherent device, such as a few RDF predicates, would allow the distinction to be made so the author could give guidance on what they believe best practice for their URI would be. This would vastly improve the situation from where it is today, where this distinction is impossible. The philosophical case for the distinction between reference and access is clear. The main advantage of Web architecture is that there is now a de facto universal identification scheme for accessing networked resources. With the Semantic Web, we can now extend this scheme to the wide world outside the Web by use of reference. By keeping the distinction between reference and access clear, the lemons of ambiguity can be turned into lemonade. Reference is inherently ambiguous, and ambiguity is not an error of communication, but fundamental to the success of communication both on and off the Web.

Sounds like the distinction between subject locators and identifiers that topic maps made long before this paper was written.

Resource Identity and Semantic Extensions: Making Sense of Ambiguity

Filed under: OWL,RDF,Semantic Web,Subject Identity — Patrick Durusau @ 5:29 pm

Resource Identity and Semantic Extensions: Making Sense of Ambiguity. David Booth’s paper was cited by Bernard Vatant, so I had to go take a look.

Bernard says: “The best analysis of the issue I’ve read so far.” I have to agree.

From the paper’s conclusion:

In general, a URI’s resource identity will necessarily be ambiguous. But this is not the end of the world. Rather, it means that while it may be unambiguous enough for one application, another application may require finer distinctions and thus consider it ambiguous. However, this ambiguity of resource identity can be precisely constrained by the use of URI declarations. Finally, a standard process is proposed for determining a URI’s resource identity.

Ambiguity is part and parcel of any system; the real question is how much of it you can tolerate.

For some systems that is quite a bit; for others (air traffic control comes to mind), as little as possible.

Other identifiers are ambiguous as well.

Successful integration of data across systems depends on how well we deal with that ambiguity.

November 15, 2010

Extensible Reconciliation

Filed under: Marketing,Topic Maps — Patrick Durusau @ 7:34 pm

Your boss tells you another group in your company is “reconciling” data from a “different” perspective.

His boss wants both sets of data reconciled with each other. And to have the separate views as well.

How hard could it be?

You have free software in the form of Google Refine 2.0.

This is beginning to sound like a Bollywood horror movie.

You have no explicit basis for reconciling your data, no documented rules* and the other project is in the same shape.

It will take endless meetings to thrash out an implicit mapping that enables the “reconciliation” of the “reconciliations.”

Which works until either group encounters new data that needs to be “reconciled.”

If you could only treat data structures as first-class subjects, which have sets of key/value pairs, then reconciliation and new data would not be such a pain.

Then your reconciliation would be extensible.

Well, it is extensible now, but it is painful and error prone.

Unfortunately that thought comes as you are getting another Botox shot so you can sit through another “reconciliation” meeting.

*No documented rules. To say “When you see X, do Y.” is a recipe, not a rule. Rules imply some modicum of understanding.

Auditable Reconciliation

Filed under: Marketing,Topic Maps — Patrick Durusau @ 5:38 pm

You made it back from your down time and your employer is glad to see you back.

“Some of the “reconciled” data we have been getting looks odd. Can you audit the data to make sure it is being reconciled correctly? Thanks!”

You remember that all you have are bare tokens.

AnHai Doan’s observation about after the fact mappings:

…the manual creation of semantic mappings has long been known to be extremely laborious and error-prone. For example, a recent project at the GTE telecommunications company sought to integrate 40 databases that have a total of 27,000 elements (i.e., attributes of relational tables) [LC00]. The project planners estimated that, without the database creators, just finding and documenting the semantic mappings among the elements would take more than 12 person years.

is ringing in your ears.

Mapping and creating sets of key/value pairs has to be an augmented process, but the existence of sets of key/value pairs enables auditing of the “reconciled data.”

Sets of key/value pairs you don’t have.

*****
PS: Sets of key/value pairs = subject proxies, with rules for “reconciliation” to use Googleease.

To say “key/value pairs” does not presume any particular methodology for storage or processing. Pick one. Let usefulness be your guide.
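
A minimal sketch of the PS above (field names, data, and the rule are all hypothetical): because the reconciliation rule is written down and applied mechanically, every merge can be traced back to the rule and the records that triggered it.

```python
# Subject proxies as sets of key/value pairs, a documented rule, and an
# audit trail of every merge the rule produced. Everything here is made up.

proxies = [
    {"id": "sysA-17", "name": "ACME Corp.", "duns": "150483782"},
    {"id": "sysB-03", "name": "Acme Corporation", "duns": "150483782"},
    {"id": "sysB-09", "name": "ACME Corp.", "duns": "884213990"},
]

RULE = "merge records whose 'duns' values are equal"

def reconcile(records):
    merged, audit = {}, []
    for rec in records:
        key = rec["duns"]                      # the documented rule, applied
        if key in merged:
            audit.append((rec["id"], merged[key]["id"], RULE))
        else:
            merged[key] = rec
    return list(merged.values()), audit

result, audit_trail = reconcile(proxies)
for entry in audit_trail:
    print(entry)
# Odd-looking "reconciled" data can now be traced to the rule and the pair
# of records that produced it - which is what an audit needs.
```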

Maintainable Reconciliation

Filed under: Marketing,Topic Maps — Patrick Durusau @ 5:24 pm

You have downloaded Google Refine 2.0 and are busy “reconciling” data.

Say you are one of the lucky ones and this is for your employer. 😉

Now you need to work on another project or even take some downtime.

So, how do you hand off maintenance of “reconciling” data?

A reconciliation on which your employer now relies.

You recognize the data to be reconciled but there are two problems:

  1. The raw data has implied properties you are using to reconcile the data. That means there isn’t anything for you to point anyone to as a basis for reconciliation. Just as well because:
  2. The rules for reconciling the data exist only in your head. So the properties being implicit isn’t such an issue; the rules for handling them aren’t written down either.

Flatland identity as far as the eye can see.

If you had a defined set of properties (key/value pairs) as the basis for reconciliation, you could also say how to carry out the reconciliation.

And your data would be maintainable.

Best of luck with your downtime.

*****
PS: BTW, if you think documenting the names and locations of the data you are integrating counts as documentation, think again. What happens when new data comes along? Data your boss is going to expect to be integrated.

Analysis of Amphibian Biodiversity Data

Filed under: Authoring Topic Maps,Bioinformatics,Similarity — Patrick Durusau @ 3:14 pm

Analysis of Amphibian Biodiversity Data.

Traditional citation: Hayek, L.-A. C. 1994. Analysis of amphibian biodiversity data. Pp. 207-269. In: Measuring and monitoring biological diversity. Standard methods for amphibians. W. R. Heyer et al., eds. (Smithsonian Institution, Washington, D. C.).

Important for two reasons:

  1. it gathers together forty-six (46) similarity measures (yes, 46 of them)
  2. illustrates that reading broadly is useful in topic maps work

Questions:

  1. From Hayek, which measures would you want to use building your topic map? Why? (3-5 pages, no citations)
  2. What measures developed after Hayek would you want to use? (specific to your data) (3-5 pages, citations)
  3. Just curious, we talk about algorithms “measuring” similarity. Pick two things, books, articles, whatever that you think are “similar.” Would any of these algorithms say they were similar? (3-5 pages, no citations. Yes, it is a hard question.)

Towards Index-based Similarity Search for Protein Structure Databases

Filed under: Bioinformatics,Biomedical,Indexing,Similarity — Patrick Durusau @ 5:00 am

Towards Index-based Similarity Search for Protein Structure Databases Authors: Orhan Çamoǧlu, Tamer Kahveci, Ambuj K. Singh Keywords: Protein structures, feature vectors, indexing, dataset join

Abstract:

We propose two methods for finding similarities in protein structure databases. Our techniques extract feature vectors on triplets of SSEs (Secondary Structure Elements) of proteins. These feature vectors are then indexed using a multidimensional index structure. Our first technique considers the problem of finding proteins similar to a given query protein in a protein dataset. This technique quickly finds promising proteins using the index structure. These proteins are then aligned to the query protein using a popular pairwise alignment tool such as VAST. We also develop a novel statistical model to estimate the goodness of a match using the SSEs. Our second technique considers the problem of joining two protein datasets to find an all-to-all similarity. Experimental results show that our techniques improve the pruning time of VAST 3 to 3.5 times while keeping the sensitivity similar.

Unless you want to do a project on bioinformatics indexing and topic maps, this paper probably isn’t of much interest.

I include it as an illustration of fashioning a domain-specific index and, for those who are interested, of what subjects and their definitions lurk therein.

Questions (for those who want to pursue both topic maps and bioinformatics):

  1. Isolate all the “we chose” aspects of the paper. What results would have been different with other choices? The “we obtained best results…” is unsatisfying. In what sense “best results?”
  2. What aspects of this process would be amenable to use of a topic map?
  3. What about the results (if anything) would have to be different to make these results meaningful in a topic map to be merged with results by other researchers?

SecondString

Filed under: Searching,Software,String Matching — Patrick Durusau @ 4:56 am

SecondString is a Java library of string matching techniques.

The Levenshtein distance test mentioned in the LIMES post is an example of a string matching technique.

The results are not normalized, so compare results across the techniques cautiously.
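
To see why that matters, here is a quick sketch of the Levenshtein distance and one common way to normalize it to a 0–1 scale (this is not SecondString's API, just the idea):

```python
# Raw Levenshtein distance is an edit count, so scores from string pairs of
# different lengths are not directly comparable; dividing by the longer
# length puts them on a common 0-1 scale.

def levenshtein(s, t):
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution
        prev = curr
    return prev[-1]

def normalized_similarity(s, t):
    return 1 - levenshtein(s, t) / max(len(s), len(t), 1)

print(levenshtein("commode", "toilet"))                      # 6 raw edits
print(round(normalized_similarity("commode", "toilet"), 3))  # 0.143
print(round(normalized_similarity("color", "colour"), 3))    # 0.833
```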

Questions:

  1. Suggest 1 – 2 survey articles on string matching for the class. (The Navarro article cited in Wikipedia on the Levenshtein distance is almost ten years old and despite numerous exclusions, still runs 58 pages. Excellent article but needs updating with more recent material.)
  2. What one technique would you use in constructing your topic map? Why? (2-3 pages, citing examples of why it would be the best for your data)

November 14, 2010

How Complexity Leads to Simplicity – TED Talk

Filed under: Graphs,Maps,Networks — Patrick Durusau @ 11:17 am

How Complexity Leads to Simplicity.

Eric Berlow, in a little over 3 minutes, demonstrates the use of an ordered network to discern simplicity in complex graphs.

Any number of factors contribute to the identification of a subject.

Ordered networks let us analyze those factors so we can isolate those that:

  • are easiest to recognize
  • have the most power of discrimination
  • are recognized by the largest group of people
  • …, etc., from a certain point of view.

What factors we choose will depend upon our goals and requirements.

Ordered networks may help us make those choices.

Questions:

  1. Future law librarians may want to look at: Looking Back: an Ordered Network Model of Legal Precedent by Stephen R. Haptonstahl
  2. Create an ordered graph for a subject and its context. (30-50 nodes, labeled graphs for class discussion, jpeg format.*)
  3. What factors would you choose to identify your subject? What are the consequences of those choices? (discussion)

*I would suggest Graphviz as graph software. You can check under resources for visual editors. You can use other software if you like.

I will walk through creation of a smallish ordered network with Graphviz.
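
As a preview of that walk-through, here is a tiny sketch using the Python bindings for Graphviz (pip install graphviz); the subject and contributing factors are made up, the point is only the mechanics of a small labeled graph:

```python
# Tiny sketch: a labeled, directed graph of hypothetical factors that
# contribute to identifying one subject, emitted as DOT and as a JPEG.
from graphviz import Digraph

g = Digraph(comment="Factors identifying a subject",
            graph_attr={"rankdir": "BT"})

g.node("subject", "John Smith (the farmer)")
for factor in ["name", "birth date", "birthplace", "occupation", "parcel of land"]:
    g.node(factor, factor)
    g.edge(factor, "subject", label="helps identify")

print(g.source)          # the DOT text, if you prefer to hand-edit it
g.format = "jpeg"
g.render("subject_factors", cleanup=True)   # writes subject_factors.jpeg
```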

Linked Data Tutorial

Filed under: Linked Data,Semantic Web,Semantics — Patrick Durusau @ 9:36 am

Linked Data Tutorial: “A Practical Introduction” by Dr. Michael Hausenblas.

A quick overview of “Linked Data” but not too quick to avoid pointing out some of its issues.

As slide 6 says of the principles: “Many things (deliberately?) kept blurry”

That blurriness carries over into the implementation and semantics of Linked Data.

Linking everything together in a higgledy-piggledy manner will lead to…, I assume, everything being linked together in a higgledy-piggledy manner.

Once linked together perhaps that will drive refinement of the linking into something useful.

Questions:

  1. List examples of the use of Linked Data in libraries. (3-5 pages, citations/links)
  2. How would you use Linked Data in a library? (3-5 pages, no citations)
  3. What would you change about Linked Data practice or standards? (3-5 pages, citations)
  4. Finding aid on Linked Data for librarians. (3-5 pages, citations)

Orient: The Database For The Web – Presentation

Filed under: NoSQL,OrientDB,Software — Patrick Durusau @ 9:02 am

Orient: The Database For The Web

Nice slide deck if you need something for the company CTO.

Perhaps to justify a NoSQL conference or further investigation into NoSQL as an option.

I was deeply amused by slide 19’s claim of “Ø Config.”

Maybe true if I am running it on my laptop during a conference presentation.

A bit more thought required for use in or with a topic map system.

Orient is an impressive bit of software and is likely to be used or encountered by topic mappers.

Questions:

  1. Uses of OrientDB in library contexts? (3-5 pages, citations/links)
  2. Download and install OrientDB. How do you evaluate its claim of “Ø Config?” (3-5 pages, no citations)
  3. Extra credit: As librarians you will be asked to evaluate vendor claims about software. Develop a finding aid on software evaluation for librarians faced with that task. (3-5 pages, citations)

November 13, 2010

LIMES – LInk discovery framework for MEtric Spaces

Filed under: Linked Data,Semantic Web,Software — Patrick Durusau @ 7:46 am

LIMES – LInk discovery framework for MEtric Spaces

From the website:

LIMES is a link discovery framework for the Web of Data. It implements time-efficient approaches for large-scale link discovery based on the characteristics of metric spaces. It is easily configurable via a web interface. It can also be downloaded as standalone tool for carrying out link discovery locally.

LIMES detects “duplicates” in a single source or between sources by use of string metrics.

The current version of LIMES supports exclusively the string metrics Levenshtein, QGrams, BlockDistance and Euclidean as implemented by the SimMetrics library. Further metrics will be included in following versions.

An interesting approach to use as a topic map authoring aid.

Questions:

  1. Using the online LIMES interface, develop and run five (5) link discovery requests. Name and save the result files. Upload them to your class project directory. Be prepared to discuss your requests and results in class.
  2. Sign up to be discussion leader for one of the algorithms supported by LIMES. Prepare a two (2) page summary for the class on your algorithm.
  3. What suggestions would you have for the project on its current UI?
  4. Use LIMES to augment your topic map authoring. Comments? (3-5 pages, no citations)
  5. In an actual run, I got the following as owl:sameAs – http://bio2rdf.org/mesh:D016889 and http://data.linkedct.org/page/condition/4398. Your evaluation? You may follow any links you find to make your evaluation. (2-3 pages, include URLs for other locations that you visit)

JUNG Graph Implementation

Filed under: Graphs,Software,Visualization — Patrick Durusau @ 6:48 am

JUNG Graph Implementation

From the website:

The Java Universal Network/Graph Framework is a software library that provides a common and extensible language for the modeling, analysis, and visualization of data that can be represented as a graph or network. It is written in Java, which allows JUNG-based applications to make use of the extensive built-in capabilities of the Java API, as well as those of other existing third-party Java libraries.

JUNG can be used to process property graphs in Gremlin.

The JUNG team notes that some classes may run into memory issues since JUNG was designed to work with in-memory graphs.

Still, it looks like an effective tool for experimenting with exploration and delivery of information as visualized graphs.

November 12, 2010

I See What You Mean

Filed under: Authoring Topic Maps,Marketing,Topic Maps — Patrick Durusau @ 6:28 pm

A recent email from Andrew S. Townley reminded me of a story I heard from my father decades ago.

Rural Louisiana, USA, early 1930s. A friend had just completed a new house and asked the local plumber to come install the “commode.” When the plumber started gathering up his tool kit, the friend protested that he didn’t need to bring “all that” with him. That he had done this many times before. The plumber persisted on the grounds it was better to be prepared so he would not have to return for additional tools.

When they arrived at the new house, the plumber found he was to install what was known to him as a “toilet.”

Repeating the term “commode” over and over again would not have helped, nor in a modern context, would having a universal URI for “commode.”

What would help, and what topic maps offer, is a representative for the subject that both “commode” and “toilet” name. A representative that contains properties that authors thought identify the subject it represents.

That enables either party to the conversation to do two very important things:

  • Search for subjects in the way most familiar to them.
  • Examine the properties of the subject to see if it is the subject they were seeking.

One more important thing, if they are editing a topic map:

  • Add additional properties that identify the subject in yet another way.
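
To make the story concrete, a toy sketch of such a representative (the property names and values are illustrative only):

```python
# One representative for the subject that both "commode" and "toilet" name.
# The names and identifying properties below are illustrative only.

representative = {
    "names": {"commode", "toilet"},
    "identifying_properties": {
        "kind": "plumbing fixture",
        "installed_in": "bathroom",
    },
}

def find(term, representatives):
    """Search in whichever vocabulary is most familiar to the searcher."""
    return [r for r in representatives if term in r["names"]]

# Plumber and homeowner find the same subject, each using their own word.
print(find("commode", [representative]) == find("toilet", [representative]))  # True

# Editing the map: add yet another way of naming the same subject.
representative["names"].add("water closet")
```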

Understanding what others mean, in my experience, comes from asking the other person to explain what they mean in different ways until I finally stumble upon one and say: “I see what you mean!”

Topic maps are a way to bring “I see what you mean” to information systems.

*****
I am glossing over the fact that representatives contain properties of all sorts, not just those that identify a subject, and that which properties identify a subject are declared.

What is critical to this post is that different people identify the same subjects differently and assign them different properties.

Intellectual credit for this post goes to Michel Biezunski. Michel and I had a conversation years ago where Michel was touting the phrase: “See What I Mean” or SWIM. I think my formulation fits the story better but you decide which phrasing works best for you.

LOD, Semantic Ambiguity and Topic Maps

Filed under: Authoring Topic Maps,Linked Data,Semantic Web,Topic Maps — Patrick Durusau @ 6:23 pm

The semantic ambiguity of linked data has been a hot topic of discussion of late.

Not only of what linked data links to but of linked data itself!

If you have invested a lot in linked data efforts, don’t panic!

Topic maps, even using XTM/CTM syntaxes, to say nothing of more exotic models, can reduce any semantic ambiguity using occurrences.

If and when it is necessary.

Quite serious, “if and when necessary.”

Err, “if and when necessary” meaning when it is important enough for someone to pay for the disambiguation.

Ambiguity between buyers and sellers of women’s shoes or lingerie probably abounds, but unless someone is willing to pay the freight for disambiguation, it isn’t my concern.

Linked data is exposing the ambiguity of the Semantic Web.

Being unable to solve the semantic ambiguity it exposes, linked data is creating opportunities for topic maps!

Maybe we should send the W3C a fruit basket or something?

As Time Goes by: Discovering Eras in Evolving Social Networks

Filed under: Clustering,Data Mining,Evolutionary — Patrick Durusau @ 6:21 pm

As Time Goes by: Discovering Eras in Evolving Social Networks Authors: Michele Berlingerio, Michele Coscia, Fosca Giannotti, Anna Monreale, Dino Pedreschi

Abstract:

Within the large body of research in complex network analysis, an important topic is the temporal evolution of networks. Existing approaches aim at analyzing the evolution on the global and the local scale, extracting properties of either the entire network or local patterns. In this paper, we focus instead on detecting clusters of temporal snapshots of a network, to be interpreted as eras of evolution. To this aim, we introduce a novel hierarchical clustering methodology, based on a dissimilarity measure (derived from the Jaccard coefficient) between two temporal snapshots of the network. We devise a framework to discover and browse the eras, either in top-down or a bottom-up fashion, supporting the exploration of the evolution at any level of temporal resolution. We show how our approach applies to real networks, by detecting eras in an evolving co-authorship graph extracted from a bibliographic dataset; we illustrate how the discovered temporal clustering highlights the crucial moments when the network had profound changes in its structure. Our approach is finally boosted by introducing a meaningful labeling of the obtained clusters, such as the characterizing topics of each discovered era, thus adding a semantic dimension to our analysis.

Deeply interesting work.

Questions:

  1. Is it a fair assumption that terms used by one scholar will be used the same way by scholars that cite them? (discussion)
  2. If you think #1 is true, then does entity resolution, etc., however you want to talk about recognition of subjects, apply from the first scholar outwards? If so, how far? (discussion)
  3. If you think #1 is false, why? (discussion)
  4. How would you go about designing a project to identify usages of terms in a body of literature? Such that you could detect changes in usage? What questions would you have to ask? (3-5 pages, citations)

PS: Another way to think about this area is: Do terms have social lives? Is that a useful way to talk about them?

Classification and Novel Class Detection in Data Streams with Active Mining

Filed under: Uncategorized — Patrick Durusau @ 5:03 pm

Classification and Novel Class Detection in Data Streams with Active Mining Authors: Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, Bhavani Thuraisingham

Abstract:

We present ActMiner, which addresses four major challenges to data stream classification, namely, infinite length, concept-drift, concept-evolution, and limited labeled data. Most of the existing data stream classification techniques address only the infinite length and concept-drift problems. Our previous work, MineClass, addresses the concept-evolution problem in addition to addressing the infinite length and concept-drift problems. Concept-evolution occurs in the stream when novel classes arrive. However, most of the existing data stream classification techniques, including MineClass, require that all the instances in a data stream be labeled by human experts and become available for training. This assumption is impractical, since data labeling is both time consuming and costly. Therefore, it is impossible to label a majority of the data points in a high-speed data stream. This scarcity of labeled data naturally leads to poorly trained classifiers. ActMiner actively selects only those data points for labeling for which the expected classification error is high. Therefore, ActMiner extends MineClass, and addresses the limited labeled data problem in addition to addressing the other three problems. It outperforms the state-of-the-art data stream classification techniques that use ten times or more labeled data than ActMiner.

I would have liked this article better had it not said that the details of the test data could be found in another article.

Specifically: Masud, M.M., Gao, J., Khan, L., Han, J., Thuraisingham, B.M.: “Integrating novel class detection with classification for concept-drifting data streams.” In: Buntine, W., Grobelnik, M., Mladenić, D., Shawe-Taylor, J. (eds.) ECML PKDD 2009. LNCS, vol. 5782, pp. 79–94. Springer, Heidelberg (2009)

Which directed me to: “Integrating Novel Class Detection with Classification for Concept-Drifting Data Streams,” http://www.utdallas.edu/?mmm058000/reports/UTDCS-13-09.pdf

I leave it as an exercise for the readers to guess the names of the authors of the last paper.

Otherwise interesting research marred by presentation in dribs and drabs.

Now that I have all three papers I will have to see what questions arise, other than questionable publishing practices.

Searching with Tags: Do Tags Help Users Find Things?

Filed under: Uncategorized — Patrick Durusau @ 7:03 am

Searching with Tags: Do Tags Help Users Find Things? Authors: Margaret E.I. Kipp and D. Grant Campbell

Abstract:

This study examines the question of whether tags can be useful in the process of information retrieval. Participants searched a social bookmarking tool specialising in academic articles (CiteULike) and an online journal database (Pubmed). Participant actions were captured using screen capture software and they were asked to describe their search process. Users did make use of tags in their search process, as a guide to searching and as hyperlinks to potentially useful articles. However, users also made use of controlled vocabularies in the journal database to locate useful search terms and of links to related articles supplied by the database.

Good review of the literature, such as it is, on use of user supplied tagging for searching.

Worth reading on the question raised about the use of tags but there is another question lurking in the background.

The authors say in various forms:

The ability to discover useful resources is of increasing importance where web searches return 300 000 (or more) sites of unknown relevance and is equally important in the realm of digital libraries and article databases. The question of the ability to locate information is an old one and led directly to the creation of cataloguing and classification systems for the organisation of knowledge. However, such systems have not proven to be truly scalable when dealing with digital information and especially information on the web.

Since at least 1/3 of the web is pornography, and that is not usually relevant to scientific, technical or medical searching, we can reduce the searching problem by 1/3 right there. I don’t know the percentage for shopping, email archives, etc., but when you come down to the “core” literature for a field, it really isn’t all that large, is it?

Questions:

  1. Do search applications need to “scale” to web size or just enough to cover “core” literature? (discussion)
  2. For library science, how would you go about constructing a list of the “core” literature? (3-5 pages, no citations)
  3. If you use tagging, describe your experience with assigning tags. (3-5 pages, no citations)
  4. If you use tagging for searching purposes, describe your experience (3-5 pages, no citations)

First seen at: ResourceBlog
