Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

June 27, 2015

Subjects For Less Obscure Topic Maps?

Filed under: Marketing,Topic Maps — Patrick Durusau @ 3:31 pm

A new window into our world with real-time trends

From the post:

Every journey we take on the web is unique. Yet looked at together, the questions and topics we search for can tell us a great deal about who we are and what we care about. That’s why today we’re announcing the biggest expansion of Google Trends since 2012. You can now find real-time data on everything from the FIFA scandal to Donald Trump’s presidential campaign kick-off, and get a sense of what stories people are searching for. Many of these changes are based on feedback we’ve collected through conversations with hundreds of journalists and others around the world—so whether you’re a reporter, a researcher, or an armchair trend-tracker, the new site gives you a faster, deeper and more comprehensive view of our world through the lens of Google Search.

Real-time data

You can now explore minute-by-minute, real-time data behind the more than 100 billion searches that take place on Google every month, getting deeper into the topics you care about. During major events like the Oscars or the NBA Finals, you’ll be able to track the stories most people are searching for and where in the world interest is peaking. Explore this data by selecting any time range in the last week from the date picker.

Follow @GoogleTrends for tweets about new data sets and trends.

See GoogleTrends at: https://www.google.com/trends/

This has been in a browser tab for several days. I could not decide if it was eye candy or something more serious.

After all, we are talking about searches ranging from the expert to the vulgar.

I went and visited today’s results at Google Trends, and found:

  • #5 Crater of Diamonds State Park, Arkansas
  • #17 Ted 2, Jurassic World
  • #22 World’s Ugliest Dog Contest [It doesn’t say if Trump entered or not.]
  • #35 Episcopal Church
  • #48 Grace Lee Boggs
  • #59 Raquel Welch
  • #68 Dodge, Mopar, Dodge Challenger
  • #79 Xbox One, Xbox, Television
  • #86 Escobar: Paradise Lost, Pablo Escobar, Benicio del Toro
  • #98 Islamic State of Iraq and the Levant

I was glad to see Raquel Welch was in the top 100 but saddened that she was outscored by the Episcopal Church. That has to sting.

When I think of topic maps that I can give you as examples, they involve taxes, Castrati, and other obscure topics. My favorite use case is an ancient text annotated with commentaries and comparative linguistics based on languages no longer spoken.

I know what interests me but not what interests other people.

Thoughts on using Google Trends to pick “hot” topics for topic mapping?

June 26, 2015

Topic Maps For Sharing (or NOT!)

Filed under: Cybersecurity,Topic Maps — Patrick Durusau @ 1:14 pm

This is one slide (#38) out of several, but I saw it posted by PBBsRealm (Brad M) and thought it was worth transcribing part of it:

From the slide:

Why is Cyber Security so Hard?

No common taxonomy

  • Information is power; sharing is seen as loss of power

[Searching on several phrases and NERC (North American Electric Reliability Corporation), I have been unable to find the entire slide deck.]

Did you catch the line:

Information is power; sharing is seen as loss of power

You can use topic maps for sharing, but how much sharing you choose to do is up to you.

For example, assume your department is responsible for mapping data for ETL operations. Each analyst is using state-of-the-art software to create mappings from field to field. In the process of creating those mappings, each analyst learns enough about those fields to make sure the mapping is correct.

Now one or more of your analysts leave for other positions. All the ad hoc knowledge they had of the data fields has been lost. With a topic map, you could have been accumulating power as each analyst discovered information about each data field.

If management requests the mapping you are using, you output the standard field to field mapping, with none of the extra information that you have accumulated for each field in a topic map. The underlying descriptions remain solely in your possession.
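
A minimal sketch of the idea, assuming nothing about any particular tool (the field names, notes, and structure below are invented for illustration): keep each analyst’s notes attached to the field-to-field mapping, while the deliverable export remains the bare mapping.

```python
# Minimal sketch: field-to-field mappings with analyst notes kept alongside,
# while the export for management is only the bare mapping.
# All field names and notes below are hypothetical examples.

field_mappings = [
    {
        "source": "CUST_NO",          # field in the legacy system
        "target": "customer_id",      # field in the warehouse
        "notes": [                    # ad hoc knowledge worth keeping
            "CUST_NO is reused after 7 years; disambiguate by region code.",
            "Values starting with '9' are test accounts.",
        ],
    },
    {
        "source": "DOB",
        "target": "birth_date",
        "notes": ["Stored as DDMMYYYY in files before 2003, YYYYMMDD after."],
    },
]

def export_standard_mapping(mappings):
    """Return only the bare source -> target mapping for management."""
    return {m["source"]: m["target"] for m in mappings}

print(export_standard_mapping(field_mappings))
# {'CUST_NO': 'customer_id', 'DOB': 'birth_date'}
```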

With topic maps, you can share a little or a lot, your call.

PS: You can also encrypt the values you use for merging in your topic map. That could enable different levels of merging within a single map, based upon security clearance. An example would be a topic map resource accessible by people with varying security clearances. (CIA/NSA take note.)
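
One way to sketch that, purely as an illustration (the keys, values, and clearance levels are made up): merge on keyed hashes of identity values, so only readers holding the key for a given clearance level can compute, and therefore see, those merges.

```python
# Sketch: clearance-dependent merging by hashing identity values with a
# per-clearance-level key. Keys and values here are made-up examples.
import hmac, hashlib

CLEARANCE_KEYS = {"public": b"public-key", "secret": b"secret-key"}

def merge_key(value: str, clearance: str) -> str:
    """Keyed hash of an identity value; differs per clearance level."""
    return hmac.new(CLEARANCE_KEYS[clearance], value.encode(), hashlib.sha256).hexdigest()

# Two topics sharing an identity value merge only for readers whose
# clearance key lets them compute matching hashes.
topic_a = merge_key("agent-codename-falcon", "secret")
topic_b = merge_key("agent-codename-falcon", "secret")
topic_c = merge_key("agent-codename-falcon", "public")

print(topic_a == topic_b)  # True: same value, same clearance key -> merge
print(topic_a == topic_c)  # False: different clearance, no merge visible
```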

June 21, 2015

People Don’t Want Something Truly New,…

Filed under: Marketing,Topic Maps — Patrick Durusau @ 8:05 pm

People Don’t Want Something Truly New, They Want the Familiar Done Differently by Nir Eyal.

From the post:

I’ll admit, the bento box is an unlikely place to learn an important business lesson. But consider the California Roll — understanding the impact of this icon of Japanese dining can make all the difference between the success or failure of your product.

If you’ve ever felt the frustration of customers not biting, then you can sympathize with Japanese restaurant owners in America during the 1970s. Sushi consumption was all but non-existent. By all accounts, Americans were scared of the stuff. Eating raw fish was an aberration and to most, tofu and seaweed were punch lines, not food.

Then came the California Roll. While the origin of the famous maki is still contested, its impact is undeniable. The California Roll was made in the USA by combining familiar ingredients in a new way. Rice, avocado, cucumber, sesame seeds, and crab meat — the only ingredient unfamiliar to the average American palate was the barely visible sliver of nori seaweed holding it all together.

That is the success story of introducing Americans to sushi: from almost no consumption at all to a $2.25 billion annual market.

How would you answer the question:

What’s the “California Roll” for topic maps?

June 19, 2015

Addicted: An Industry Matures / Hooked: How to Build Habit-Forming Products

Filed under: Marketing,Topic Maps — Patrick Durusau @ 12:31 pm

Addicted: An Industry Matures by Ted McCarthy.

From the post:

Perhaps nothing better defines our current age than to say it is one of rapid technological change. Technological improvements will continue to provide more to individuals and society, but also to demand more: demand (and leak) more of our data, more time, more attention and more anxieties. While an increasingly vocal minority have begun to rail against certain of these demands, through calls to pull our heads away from our screens and for corporations and governments to stop mining user data, a great many in the tech industry see no reason to change course. User data and time are requisite in the new business ecosystem of the Internet; they are the fuel that feeds the furnace.

Among those advocating for more fuel is Nir Eyal and his recent work, Hooked: How to Build Habit-Forming Products. The book — and its accompanying talk — has attracted a great deal of attention here in the Bay Area, and it’s been overwhelmingly positive. Eyal outlines steps that readers — primarily technology designers and product managers — can follow to make ‘habit-forming products.’ Follow his prescribed steps, and rampant entrepreneurial success may soon be yours.

Since first seeing Eyal speak at Yelp’s San Francisco headquarters last fall, I’ve heard three different clients in as many industries refer to his ideas as “amazing,” and some have hosted reading groups to discuss them. His book has launched to Amazon’s #1 bestseller spot in Product Management, and hovers near the same in Industrial & Product Design and Applied Psychology. It is poised to crack into the top 1000 sellers across the entire site, and reviewers have offered zealous praise: Eric Ries, a Very Important tech Person indeed, has declared the book “A must read for everyone who cares about driving customer engagement.”

And yet, no one offering these reviews has pointed what should be obvious: that Eyal’s model for “hooking” users is nearly identical to that used by casinos to “hook” their own; that such a model engenders behavioral addictions in users that can be incredibly difficult to overcome. Casinos may take our money, but these products can devour our time; and while we’re all very aware of what the casino owners are up to, technology product development thus far has managed to maintain an air of innocence.

While it may be tempting to dismiss a book seemingly written only for, and read only by, a small niche of $12 cold pressed juice-drinking, hoodie and flip flop-wearing techies out on the west coast, one should consider the ways in which those techies are increasingly creating the worlds we all inhabit. Technology products are increasingly determining the news we read, the letters we send, the lovers we meet and the food we eat — and their designers are reading this book, and taking note. I should know: I’m one of them.

I start with Ted McCarthy’s introduction because it is how I found out about Hooked: How to Build Habit-Forming Products by Nir Eyal. It certainly sounded like a book that I must read!

I was hoping to find reviews sans moral hand-wringing but even Hooked: How To Make Habit-Forming Products, And When To Stop Flapping by Wing Kosner gets in on the moral concern act:

In the sixth chapter of the book, Eyal discusses these manipulations, but I think he skirts around the morality issues as well as the economics that make companies overlook them. The Candy Crush Saga game is a good example of how his formulation fails to capture all the moral nuance of the problem. According to his Manipulation Matrix, King, the maker of Candy Crush Saga, is an Entertainer because although their product does not (materially) improve the user’s life, the makers of the game would happily use it themselves. So, really, how bad can it be?

Consider this: Candy Crush is a very habit-forming time-waster for the majority of its users, but a soul-destroying addiction for a distinct minority (perhaps larger, however, than the 1% Eyal refers to as a rule of thumb for user addiction.) The makers of the game may be immune to the game’s addictive potential, so their use of it doesn’t necessarily constitute a guarantee of innocuousness. But here’s the economic aspect: because consumers are unwilling to pay for casual games, the makers of these games must construct manipulative habits that make players seek rewards that are most easily attained through in-app purchases. For “normal” players, these payments may just be the way that they pay to play the game instead of a flat rate up-front or a subscription, and there is nothing morally wrong with getting paid for your product (obviously!) But for “addicted” players these payments may be completely out of scale with any estimate of the value of a casual game experience. King reportedly makes almost $1 million A DAY from Candy Crush, all from in app purchases. My guess is that there is a long tail going on with a relative few players being responsible for a disproportionate share of that revenue.

This is in Forbes.

I don’t read Forbes for moral advice. 😉 I don’t consult technologists either. For moral advice, consult your local rabbi, priest or imam.

Here is an annotated introduction to Hooked, if you want to get a taste of what awaits before ordering the book. If you visit the book’s website, you will be offered a free Hooked workbook. And you can follow Nir Eyal on Twitter: @nireyal. Whatever else can be said about Nir Eyal, he is a persistent marketeer!

Before you become overly concerned about the moral impact of Hooked, recall that legions of marketeers have labored for generations to produce truly addictive products, some with “added ingredients” and others, more recently, not. Creating addictive products isn’t as easy as “read the book” and the rest of us will start wearing bras on our heads. (Apologies to Scott Adams and especially to Dogbert.)

Implying that you can make all of us into addictive product mavens, however, is a marketing hook that few of us can resist.

Enjoy!

June 17, 2015

BBC Trials Something Topic Map-Like

Filed under: Media,Navigation,Topic Maps — Patrick Durusau @ 3:33 pm

BBC trials a way to explain complex backstories in its shows by Nick Summers.

From the post:

Most of the BBC’s programming is only available for 30 days on iPlayer, so trying to keep up with long-running and complicated TV shows can be a pain. Want to remember how River Song fits into the Doctor Who universe, but don’t have the DVD box sets to hand? Your best option is normally to browse Wikipedia or some Whovian fan sites. To tackle the problem, the BBC is experimenting with a site format called “Story Explorer,” which could explain storylines and characters for some of its most popular shows. Today, the broadcaster is launching a version for its Home Front radio drama with custom illustrations, text descriptions and audio snippets. More importantly, the key events are laid out as simple, vertical timelines so that you can easily track the show’s wartime chronology.

With three seasons, sixteen interlocking storylines and 21 hours of audio, Story Explorer could be a valuable resource for new and lapsed Home Front fans. It’s been released as part of BBC Taster, a place where the broadcaster can share some of its more creative and forward-thinking ideas with the public. There’s a good chance it won’t be taken any further, although the BBC is already asking on its blog whether license fee payers would like an “informative, attractive and scalable” version “linked through to the rest of the BBC and the web.” Sort of like a multimedia Wikipedia for BBC shows, then. The broadcaster has suggested that the same format could be used to support shows like Doctor Who, Casualty, Luther, Poldark, Wolf Hall and The Killing. It sounds like a pretty good idea to us — an easy way for younger Who fans to recap early seasons would go down a storm.

This is one of those times when you wonder why you don’t live in the UK. Isn’t the presence of the BBC reason enough for immigration?

There are all those fascists at the ports of entry, to say nothing of the lidless eyes and their operators that follow you around. But still, there is the BBC, at the cost of living in a perpetual security state.

Doesn’t the idea of navigating through a series, with links to other BBC resources and, one presumes, British Library and British Museum resources, sound quite topic map-like? Rather than forcing viewers to rely upon fan sites with their trolls and fanatics? (Sorry, no pun intended.)

Of course, if the BBC had an effective (read: user-friendly) topic map authoring tool on its website, then fans could contribute content, linked to programs or even scenes, at their own expense, to be lightly edited by staff, in order to grow viewership around BBC offerings.

I suspect some nominal payment could be required to defray the cost of editing comments. Most of the people I know would pay for the right to “have their say,” even if the reading of other people’s content was free.

Should the BBC try that suggestion, I hope it works very well for them. All I ask in return is that they market the BBC more heavily to cable providers in the American South. Thanks!


For a deeper background on Story Explorer, see: Home Front Story Explorer: Putting BBC drama on the web by Tristan Ferne.

Check out this graphic from Tristan’s post:

[Image: BBC-world]

Doesn’t that look like a topic map to you?

Well, except that I would have topics to represent the relationships (associations) and include the “real world” (gag, how I hate that phrase) as well as those shown.

June 16, 2015

A Topic Map Irony

Filed under: Topic Maps — Patrick Durusau @ 12:55 pm

I have been working for weeks to find a “killer” synopsis of topic maps for a presentation later this summer. I have re-read all the old ones, mine, yours, theirs and a number of imaginary ones. None of them really seems to be the uber topic map synopsis.

After my latest exchange with a long-time correspondent, in which my most recent suggestion came up lame, a topic map irony occurred to me:

For all of the flogging of semantic diversity in promotion of topic maps, it never occurred to me that looking for one (1) uber explanation of topic maps was going in the wrong direction.

The aspects of “topic maps” that are important to one audience are very unlikely to be important to another.

What if their requirement is to point to occurrences of a subject in a data set and to maintain documentation about those subjects, separate and apart from that data set? The very concept of merging may not be relevant for a documentation use case.

What if their requirement is the modeling of relationships (associations) with multiple inputs of the same role and a pipeline of operations? The focus there is on modeling with associations. That topic maps have other characteristics may be interesting but not terribly important.

What if their requirements are the auditing of the mapping of multiple data sources that are combined together for a data pipeline? There we get into merging and what basis for merging exists, etc. and perhaps not so much into associations.

And there are any number of variations on those use cases, each one of which would require a different explanation and emphasis on topic maps.

To say nothing of having different merging models, some of which might ignore IRIs as a basis for merging.

To approach semantic diversity with an attempt at uniformity seems deeply ironic.

What was I thinking?

PS: To be sure, interchange in a community of use requires the use of standards but those should exist only in domain specific cases. Trying to lasso the universe of subjects in a single representation isn’t a viable enterprise.

June 13, 2015

Business Linkage Analysis: An Overview

Filed under: Business Intelligence,Topic Maps — Patrick Durusau @ 8:18 pm

Business Linkage Analysis: An Overview by Bob Hayes.

From the post:

Customer feedback professionals are asked to demonstrate the value of their customer feedback programs. They are asked: Does the customer feedback program measure attitudes that are related to real customer behavior? How do we set operational goals to ensure we maximize customer satisfaction? Are the customer feedback metrics predictive of our future financial performance and business growth? Do customers who report higher loyalty spend more than customers who report lower levels of loyalty? To answer these questions, companies look to a process called business linkage analysis.

Business Linkage Analysis is the process of combining different sources of data (e.g., customer, employee, partner, financial, and operational) to uncover important relationships among important variables (e.g., call handle time and customer satisfaction). For our context, linkage analysis will refer to the linking of other data sources to customer feedback metrics (e.g., customer satisfaction, customer loyalty).

Business Case for Linkage Analyses

Based on a recent study on customer feedback programs best practices (Hayes, 2009), I found that companies who regularly conduct operational linkages analyses with their customer feedback data had higher customer loyalty (72nd percentile) compared to companies who do not conduct linkage analyses (50th percentile). Furthermore, customer feedback executives were substantially more satisfied with their customer feedback program in helping them manage customer relationships when linkage analyses (e.g., operational, financial, constituency) were a part of the program (~90% satisfied) compared to their peers in companies who did not use linkage analyses (~55% satisfied). Figure 1 presents the effect size for VOC operational linkage analyses.

Linkage analyses appears to have a positive impact on customer loyalty by providing executives the insights they need to manage customer relationships. These insights give loyalty leaders an advantage over loyalty laggards. Loyalty leaders apply linkage analyses results in a variety of ways to build a more customer-centric company: Determine the ROI of different improvement effort, create customer-centric operational metrics (important to customers) and set employee training standards to ensure customer loyalty, to name a few. In upcoming posts, I will present specific examples of linkage analyses using customer feedback data.

Discovering linkages between factors hidden in different sources of data?

Or as Bob summarizes:

Business linkage analysis is the process of combining different sources of data to uncover important insights about the causes and consequence of customer satisfaction and loyalty. For VOC programs, linkage analyses fall into three general types: financial, operational, and constituency. Each of these types of linkage analyses provide useful insight that can help senior executives better manage customer relationships and improve business growth. I will provide examples of each type of linkage analyses in following posts.

More posts in this series:

Linking Financial and VoC Metrics

Linking Operational and VoC Metrics

Linking Constituency and VoC Metrics

BTW, VoC = voice of customer.

A large and important investment, in data collection, linking and analysis.
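
To make the linking step concrete, here is a minimal sketch (the column names and numbers are invented, not from Bob’s posts) of joining customer feedback metrics to operational data and checking how they move together:

```python
# Sketch: linking VoC (customer feedback) metrics to operational data.
# All column names and numbers are hypothetical.
import pandas as pd

voc = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "satisfaction": [9, 6, 8, 4],     # survey score
    "loyalty": [10, 5, 8, 3],         # likelihood to recommend
})

ops = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "call_handle_time_min": [3.2, 7.5, 4.1, 9.8],
})

linked = voc.merge(ops, on="customer_id")   # the "linkage" step
print(linked.drop(columns="customer_id").corr()["satisfaction"])
# A strong negative correlation with call handle time would suggest an
# operational lever for improving satisfaction.
```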

Of course, you do have documentation for all the subjects that occur in your business linkage analysis? So that when that twenty-something who crunches all the numbers leaves, you won’t have to start from scratch? Yes?

Given the state of cybersecurity, I thought it better to ask than to guess.

Topic maps can save you from awkward questions about why the business linkage analysis reports are late. Or perhaps not coming until you can replace personnel and have them reconstruct the workflow.

Topic map based documentation is like insurance. You may not need it every day but after a mission critical facility burns to the ground, do you want to be the one to report that your insurance had lapsed?

June 10, 2015

How Entity-Resolved Data Dramatically Improves Analytics

Filed under: Entity Resolution,Merging,Topic Maps — Patrick Durusau @ 8:08 pm

How Entity-Resolved Data Dramatically Improves Analytics by Marc Shichman.

From the post:

In my last two blog posts, I’ve written about how Novetta Entity Analytics resolves entity data from multiple sources and formats, and why its speed and scalability are so important when analyzing large volumes of data. Today I’m going to discuss how analysts can achieve much better results than ever before by utilizing entity-resolved data in analytics applications.

When data from all available sources is combined and entities are resolved, individual records about a real-world entity’s transactions, actions, behaviors, etc. are aggregated and assigned to that person, organization, location, automobile, ship or any other entity type. When an application performs analytics on this entity-resolved data, the results offer much greater context than analytics on the unlinked, unresolved data most applications use today.

Analytics that present a complete view of all actions of an individual entity are difficult to deliver today as they can require many time-consuming and expensive manual processes. With entity-resolved data, complete information about each entity’s specific actions and behaviors is automatically linked so applications can perform analytics quickly and easily. Below are some examples of how applications, such as enterprise search, data warehouse and link analysis visualization, can employ entity-resolved data from Novetta Entity Analytics to provide more powerful analytics.

Marc isn’t long on specifics of how Novetta Entity Analytics works in his prior posts but I think we can all agree on his recitation of the benefits of entity resolution in this post.

Once we know the resolution of an entity, or subject identity as we would say in topic maps, the payoffs are immediate and worthwhile. Search results are more relevant, aggregated (merged) data speeds up queries, and multiple links are simplified as they are merged.

How we would get there varies but Marc does a good job of describing the benefits!

June 9, 2015

An identifier is not a string…

Filed under: Topic Maps — Patrick Durusau @ 6:49 pm

Deborah A. Lapeyre tweets:

An identifier is not a string, it is an association between a string and a thing #dataverse2015

I assume she is here: Dataverse Community Meeting 2015.

I’ll be generous and assume that Deborah was just reporting something said by a speaker. 😉

What else would an identifier (or symbol) be if it wasn’t just a string?

What happens if we use a symbol (read word) in a conversation and the other person says: “I don’t understand.”

Do we:

  1. Skip the misunderstood part of the conversation?
  2. Repeat the misunderstood part of the conversation but louder?
  3. Expand or use different words for the misunderstood part of the conversation?

Are you betting on #3?

If you have ever played 20 questions then you know that discovering what a symbol means involves listing other symbols and their values, while you try to puzzle out the original symbol.

Think of topic maps as being the reverse of twenty questions. We start with the answer and we want to make sure everyone gets the same answer. So, how do you do that? You list questions and their answers (key/value pairs) for the answer.

It is a way of making the communication more reliable because if you don’t immediately recognize the answer, then you can consult the questions and their answers to make sure you understand.

Additional people can add their questions and answers to the answer so someone working in another language, for instance, can know you are talking about an answer they recognize.
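
Here is a minimal sketch of that reverse twenty questions idea (the keys, values, and agreement threshold are my own illustration, not any topic map standard): each answer carries a list of question/answer pairs, and two answers are treated as the same when enough of their pairs agree.

```python
# Sketch: subjects ("answers") as bundles of question/answer (key/value) pairs.
# Keys, values, and the agreement threshold are illustrative only.

subject_a = {
    "common name": "Johann Sebastian Bach",
    "occupation": "composer",
    "born": "1685",
    "birthplace": "Eisenach",
}

subject_b = {
    "name (German)": "Johann Sebastian Bach",
    "occupation": "composer",
    "born": "1685",
}

def same_answer(a: dict, b: dict, required_matches: int = 2) -> bool:
    """Treat two subjects as the same when enough key/value pairs agree."""
    shared_keys = set(a) & set(b)
    matches = sum(1 for k in shared_keys if a[k] == b[k])
    return matches >= required_matches

print(same_answer(subject_a, subject_b))  # True: occupation and born agree
```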

True enough, you can claim an association between a string and “a thing” but then you are into all sorts of dubious and questionable metaphysics and epistemology. You are certainly free to list such an association between a string and a thing but that is only one question/answer among many.

You do realize of course that all the keys and values are answers in their own right and could also be described with a list of key/value pairs.

I think I like the reverse of twenty-questions better than my earlier identifier explanation. You can play as short or as long a game as you choose.

Does that work for you?

June 3, 2015

Experiment proves Reality does not exist until it is Measured [Nor Do Topics]

Filed under: Physics,Topic Maps — Patrick Durusau @ 3:35 pm

Experiment proves Reality does not exist until it is Measured

From the post:

The bizarre nature of reality as laid out by quantum theory has survived another test, with scientists performing a famous experiment and proving that reality does not exist until it is measured.

Physicists at The Australian National University (ANU) have conducted John Wheeler’s delayed-choice thought experiment, which involves a moving object that is given the choice to act like a particle or a wave. Wheeler’s experiment then asks — at which point does the object decide?

Common sense says the object is either wave-like or particle-like, independent of how we measure it. But quantum physics predicts that whether you observe wave like behavior (interference) or particle behavior (no interference) depends only on how it is actually measured at the end of its journey. This is exactly what the research team found.

“It proves that measurement is everything. At the quantum level, reality does not exist if you are not looking at it,” said Associate Professor Andrew Truscott from the ANU Research School of Physics and Engineering.

The results are more of an indictment of “common sense” than startling proof that “reality does not exist if you are not looking at it.”

In what sense would “reality” exist if you weren’t looking at it?

It is well known that what we perceive as motion, distance, and sensation are all constructs being assembled by our brains based upon input from our senses. Change those senses or fool them and the “displayed” results are quite different.

If you doubt either of those statements, try your hand at the National Geographic BrainGames site.

Topics, as you recall, represent all the information we know about a particular subject.

So, in what sense does a topic not exist until we look at it?

Assuming that you have created your topic map in a digital computer, where would you point to show me your topic map? The whole collection of topics? Or a single topic for that matter?

In order to point to a topic, you have to query the topic map. That is, you have to ask to “see” the topic in question.

When displayed, that topic may have information that you don’t remember entering. In fact, you may be able to prove you never entered some of the information displayed. Yet, the information is now being displayed before you.

Part of the problem arises because, for convenience’s sake, we often think of computers as storing information as we would write it down on a piece of paper. But the act of displaying information by a computer is a transformation of its stored information into a format that is easier for us to perceive.

A transformation process underlies the display of a topic, well, depending upon the merging rules for your topic map. It is always possible to ask a topic map to return a set of topics that match a merging criterion, but that again is your “looking at” a requested set of topics and not in any way “the way the topics are in reality.”

One of the long standing problems in semantic interoperability is the insistence of every solution that it has the answer if everyone else would just listen and abandon their own solutions.

Yes, yes, that would work, but thus far, after over 6,000 years of recorded history with different systems for semantics (languages, both natural and artificial), that has never happened. I take that as some evidence that a universal solution isn’t going to happen.

What I am proposing is that topics, in a topic map, have the shape and content designed by an author and/or as requested by a user. That is, the result of a topic map is always a question of “what did you ask” and not some preordained answer.

As I said, that isn’t likely to come up early in your use of topic maps but it could be invaluable for maintenance and processing of a topic map.

I am working on several examples to illustrate this idea and hope to have one or more of them posted tomorrow.

June 2, 2015

Identifiers as Shorthand for Identifications

Filed under: Identification,Identifiers,Topic Maps — Patrick Durusau @ 9:33 am

I closed Identifiers vs. Identifications? saying:

Many questions remain, such as how to provide for collections of sets “of properties which provide clues for establishing identity?,” how to make those collections extensible?, how to provide for constraints on such sets?, where to record “matching” (read “merging”) rules?, what other advantages can be offered?

In answering those questions, I think we need to keep in mind that identifiers and identifications lie along a continuum that runs from where we “know” what is meant by an identifier to where we ourselves need a full identification to know what is being discussed. A useful answer won’t be one or the other, but a pairing that suits a particular circumstance and use case.

You can also think of identifiers as a form of shorthand for an identification. If we were working together in a fairly small office, you would probably ask, “Is Patrick in?” rather than listing all the properties that would serve as an identification for me. So all the properties that make up an identification are unspoken but invoked by the use of the identifier.

That works quite well in a small office because, to some varying degree, we would all share the identifications that are represented by the identifiers we use in everyday conversation.

That sharing of identifications behind identifiers doesn’t happen in information systems, unless we have explicitly added identifications behind those identifiers.

One problem we need to solve is how to associate an identification with an identifier or identifiers. Looking only slightly ahead, we could use an explicit mechanism like a TMDM association, if we wanted to be able to talk about the subject of the relationship between an identifier and the identification that lies behind it.

But we are not compelled to talk about such a subject and could declare by rule that within a container, an identifier is a shorthand for properties of an identification in the same container. That assumes the identifier is distinguished from the properties that make up the identification. I don’t think we need to reinvent the notions of essential vs. accidental properties but merging rules should call out what properties are required for merging.

The wary reader will have suspected before now that many (if not all) of the terms in such a container could be considered as identifiers in and of themselves. Suddenly they are trying to struggle uphill from a swamp of subject recursion. It is “elephants all the way down.”

Have no fear! Just as we can avoid using TMDM associations to mark the relationship between an identifier and the properties making up an identification, we need use containers for identifiers and identifications only when and where we choose.

In some circumstances we may use bare identifiers, sans any identifications and yet add identifications when circumstances warrant it.

No level (identifiers, an identification, an identification that explodes other identifiers, etc.) is right for every purpose. Each may be appropriate for some particular purpose.

We need to allow for downward expansion in the form of additional containers alongside the containers we author, as well as extension of containers to add sub-containers for identifiers and identifications we did not or chose not to author.

I do have an underlying assumption that may reassure you about the notion of downward expansion of identifier/identification containers:

Processing of one or more containers of identifiers and identifications can choose the level of identifiers + identifications to be processed.

For some purposes I may only want to choose “top level” identifiers and identifications or even just parts of identifications. For example, think of the simple mapping of identifiers that happens in some search systems. You may examine the identifications for identifiers and then produce a bare mapping of identifiers for processing purposes. Or you may have rules for identifications that produce a mapping of identifiers.

Let’s assume that I want to create a set of the identifiers for Pentane and so I query for the identifiers that have the molecular property C5H12. Some of the identifiers (with their scopes) returned will be: Beilstein Reference 969132, CAS Registry Number 109-66-0, ChEBI CHEBI:37830, ChEMBL ChEMBL16102, ChemSpider 7712, DrugBank DB03119.

Each one of those identifiers may have other properties in their associated identifications, but there is no requirement that I produce them.

I mentioned that identifiers have scope. If you perform a search on “109-66-0” (CAS Registry Number) or 7712 (ChemSpider) you will quickly find garbage. Some identifiers are useful only with particular data sources or in circumstances where the data source is identified. (The idea of “universal” identifiers is a recurrent human fiction. See The Search for the Perfect Language, Eco.)

Which means, of course, we will need to capture the scope of identifiers.
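
One way to sketch capturing that, using the pentane identifiers above (the container structure is my own illustration, not a defined topic map syntax): pair the identification properties with scoped identifiers in a single container and query by property.

```python
# Sketch: a container pairing identification properties with scoped
# identifiers, using the pentane example above. The structure is
# illustrative, not a standard topic map syntax.

pentane = {
    "identification": {
        "molecular formula": "C5H12",
        "common name": "pentane",
    },
    "identifiers": [
        {"value": "969132",      "scope": "Beilstein Reference"},
        {"value": "109-66-0",    "scope": "CAS Registry Number"},
        {"value": "CHEBI:37830", "scope": "ChEBI"},
        {"value": "ChEMBL16102", "scope": "ChEMBL"},
        {"value": "7712",        "scope": "ChemSpider"},
        {"value": "DB03119",     "scope": "DrugBank"},
    ],
}

def identifiers_for(container, prop, value):
    """Return (scope, identifier) pairs for a container matching a property."""
    if container["identification"].get(prop) == value:
        return [(i["scope"], i["value"]) for i in container["identifiers"]]
    return []

print(identifiers_for(pentane, "molecular formula", "C5H12"))
```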

June 1, 2015

Identifiers vs. Identifications?

Filed under: Duke,Topic Maps,XML — Patrick Durusau @ 3:50 pm

One problem with topic map rhetoric has been its focus on identifiers (the flat ones):

[Image: identifier2]

rather than saying topic maps are managing subject identifications, that is, making explicit what is represented by an expectant identifier:

[Image: identifier-pregnant]

For processing purposes it is handy to map between identifiers, to query by identifiers, and to access by identifiers, to mention only a few tasks, all of them machine-facing.

However efficient it may be to use flat identifiers (even by humans), having access to a bundle of properties thought to identify a subject is useful as well.

Topic maps already capture identifiers but their syntaxes need to be extended to support the capturing of subject identifications along with identifiers.

Years of reading has gone into the realization about identifiers and their relationship to identifications, but I would be remiss if I didn’t call out the work of Lars Marius Garshol on Duke.

From the GitHub page:

Duke is a fast and flexible deduplication (or entity resolution, or record linkage) engine written in Java on top of Lucene. The latest version is 1.2 (see ReleaseNotes).

Duke can find duplicate customer records, or other kinds of records in your database. Or you can use it to connect records in one data set with other records representing the same thing in another data set. Duke has sophisticated comparators that can handle spelling differences, numbers, geopositions, and more. Using a probabilistic model Duke can handle noisy data with good accuracy.

In an early post on Duke Lars observes:


The basic idea is almost ridiculously simple: you pick a set of properties which provide clues for establishing identity. To compare two records you compare the properties pairwise and for each you conclude from the evidence in that property alone the probability that the two records represent the same real-world thing. Bayesian inference is then used to turn the set of probabilities from all the properties into a single probability for that pair of records. If the result is above a threshold you define, then you consider them duplicates.

Bayesian identity resolution

Only two quibbles with Lars on that passage:

I would delete “same real-world thing” and substitute, “any subject you want to talk about.”

I would point out that Bayesian inference is only one means of determining if two or more sets of properties represent the same subject. Defining sets of matching properties comes to mind. Inferencing based on relationships (associations). “Ask Steve,” is another.

But, I have never heard a truer statement from Lars than:

The basic idea is almost ridiculously simple: you pick a set of properties which provide clues for establishing identity.
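
Lars’s description translates almost directly into code. A minimal sketch of the Bayesian combination he describes (not Duke itself; the per-property probabilities and threshold are invented): compare properties pairwise, assign each comparison a probability, combine them, and apply a threshold.

```python
# Sketch of the Bayesian combination Lars describes (not Duke's actual code).
# Per-property probabilities and the threshold are illustrative.

def combine(probabilities):
    """Naive Bayesian combination of per-property match probabilities."""
    p_same, p_diff = 1.0, 1.0
    for p in probabilities:
        p_same *= p
        p_diff *= (1.0 - p)
    return p_same / (p_same + p_diff)

# Probability that two records are the same, judged from each property alone:
evidence = [
    0.9,   # names are very similar
    0.6,   # birth dates agree on year only
    0.75,  # addresses share street and city
]

score = combine(evidence)
print(round(score, 3))   # 0.976, well above an example threshold of 0.8
print(score > 0.8)       # True: treat the pair as duplicates
```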

Many questions remain, such as how to provide for collections of sets “of properties which provide clues for establishing identity?,” how to make those collections extensible?, how to provide for constraints on such sets?, where to record “matching” (read “merging”) rules?, what other advantages can be offered?

In answering those questions, I think we need to keep in mind that identifiers and identifications lie along a continuum that runs from where we “know” what is meant by an identifier to where we ourselves need a full identification to know what is being discussed. A useful answer won’t be one or the other, but a pairing that suits a particular circumstance and use case.

The Silence of Attributes

Filed under: Topic Maps,XML — Patrick Durusau @ 3:22 pm

I was reading Relax NG by Eric van der Vlist, when I ran into this wonderful explanation of the silence of attributes:

Attributes are generally difficult to extend. When choosing from among elements and attributes, people often base their choice on the relative ease of processing, styling, or transforming. Instead, you should probably focus on their extensibility.

Independent of any XML schema language, when you have an attribute in an instance document, you are pretty much stuck with it. Unless you replace it with an element, there is no way to extend it. You can’t add any child elements or attributes to it because it is designed to be a leaf node and to remain a leaf node. Furthermore, you can’t extend the parent element to include a second instance of an attribute with the same name. (Attributes with duplicate names are forbidden by XML 1.0.) You are thus making an impact not only on the extensibility of the attribute but also on the extensibility of the parent element.

Because attributes can’t be annotated with new attributes and because they can’t be duplicated, they can’t be localized like elements through duplication with different values of xml:lang attributes. Because attributes are more difficult to localize, you should avoid storing any text targeted at human consumers within attributes. You never know whether your application will become international. These attributes would make it more difficult to localize. (At page 200)

Let’s think of “localization” as “use a local identifier” and re-read that last paragraph again (with apologies to Eric):

Because attributes can’t be annotated with new attributes and because they can’t be duplicated, they can’t use local identifiers like elements through duplication with different values of xml:lang attributes. Because attributes are more difficult to localize, you should avoid storing any identifiers targeted at human consumers within attributes. You never know whether your application will become international. These attributes would make it more difficult to use local identifiers.

As a design principle, the use of attributes prevents us from “localizing” to an identifier that a user might recognize.
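
A small sketch of the point (element and attribute names are invented; this is just standard-library XML handling, not Relax NG): child elements can be repeated with different xml:lang values, while setting an attribute of the same name twice simply overwrites it.

```python
# Sketch: elements can be duplicated with different xml:lang values,
# attributes cannot. Element and attribute names are invented.
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

term = ET.Element("term")

# Elements: two labels, one per language; both survive.
for lang, text in [("en", "pentane"), ("de", "Pentan")]:
    label = ET.SubElement(term, "label", {XML_LANG: lang})
    label.text = text

# Attribute: setting "name" twice silently keeps only the last value.
term.set("name", "pentane")
term.set("name", "Pentan")

print(ET.tostring(term, encoding="unicode"))
# <term name="Pentan"><label xml:lang="en">pentane</label><label xml:lang="de">Pentan</label></term>
```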

What is more, identifiers stand in the place of or evoke, the properties that we would list as being “how” we identified a subject, even though we happily use an identifier as a shorthand for that set of properties.

While we should be able to use identifiers for subjects, we should also be able to provide the properties we see those identifiers as representing.

May 31, 2015

External Metadata Management? A Riff for Topic Maps?

Filed under: Topic Maps — Patrick Durusau @ 1:08 pm

[Image: m-files]

I was reading a white paper on M-Files when I encountered the following passage:


And to get to the “what” that’s behind metadata, many are turning to a best practice approach of separate metadata management. This approach takes into account the entire scope of enterprise content, including addressing the idea of metadata associated with information for which no file exists. For instance, an audit or a deviation is not a file, but an object for which metadata exists, so by definition, to support this, the system must manage metadata separately from the file itself.

And when metadata is not embedded in files, but managed separately, IT administrators gain more flexibility to:

  • Manage metadata structure using centralized tools.
  • Support adding metadata to all documents regardless of file format.
  • Add metadata to documents (or objects) that do not contain files (or that contain multiple files). This is useful when a document is actually a single paper copy that needs to be incorporated into the ECM system. Some ECM providers often refer to records management when discussing this capability, and others simply provide it as another way to manage a document.
  • Export files from the ECM platform without metadata tags.

Separate metadata management in ECM helps to ensure that all enterprise information is searchable, available and exportable – regardless of file type, format or object type — underscoring again the idea that data is not valuable to an organization unless it can be found.

Topic maps offer other advantages as well, but external metadata management may be a key riff for introducing topic maps to big data.
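
As a minimal sketch of what “managing metadata separately” can look like (object types, fields, and IDs below are invented, not any vendor’s API): the metadata lives in its own store, keyed by object, an object need not have a file at all, and files can be exported without tags.

```python
# Sketch: metadata managed separately from files. Object types, fields,
# and identifiers are hypothetical, not any vendor's API.

metadata_store = {
    "obj-001": {"type": "document", "file": "contract_2015.pdf",
                "customer": "ACME", "status": "signed"},
    "obj-002": {"type": "audit", "file": None,          # no file at all
                "customer": "ACME", "finding": "deviation in process X"},
}

def export_file_without_tags(object_id):
    """Export just the file reference; the metadata stays in the store."""
    return metadata_store[object_id]["file"]

def search(field, value):
    """All objects, with or without files, are searchable by metadata."""
    return [oid for oid, md in metadata_store.items() if md.get(field) == value]

print(search("customer", "ACME"))            # ['obj-001', 'obj-002']
print(export_file_without_tags("obj-001"))   # 'contract_2015.pdf'
```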

I have uncovered some of the research and other literature on external metadata management. More to follow!

May 30, 2015

TMXSL

Filed under: Topic Maps,XSLT — Patrick Durusau @ 7:43 pm

TMXSL

From the readme file:

XSL(T) stylesheets to translate non topic map sources and Topic Maps syntaxes

Currently supported:

* TM/XML -> CTM 1.0, XTM 1.0, XTM 2.0, XTM 2.1

* XTM 1.0 -> CTM 1.0, XTM 2.0, XTM 2.1

* XTM 2.x -> CTM 1.0, XTM 2.1, JTM 1.0, JTM 1.1, XTM 1.0

* Atom 1.0 -> XTM 2.1

* RSS -> XTM 2.1

* OpenDocument Metadata -> TM/XML (experimental)

License: BSD

Lars Heuer has updated TMXSL!

The need for robust annotation of data grows daily, and every new solution that I have seen is “Another Do It My Way (ADIMW),” which involves loss of data “Done The Old Way (DTOW)” and changing software. And the cycle repeats itself, in small and large ways, with every new generation.

Topic maps could change that; even topic maps with the syntactic cruft from early designs could do better. Reconsidered, topic maps can do far better.

More on that topic anon!

May 28, 2015

I Is For Identifier

Filed under: Identifiers,Topic Maps — Patrick Durusau @ 1:20 pm

As you saw yesterday, Sam Hunting and I have a presentation at Balisage 2015 (Wednesday, August 12, 2015, 9:00 AM, if you are buying a one-day ticket), “Spreadsheets – 90+ million end user programmers with no comment tracking or version control.”

If you suspect the presentation has something to do with topic maps, take one mark for your house!

You will have to attend the conference to get the full monty but there are some ideas and motifs that I will be testing here before incorporating them into the paper and possibly the presentation.

The first one is a short riff on identifiers.

Omitting the hyperlinks, the Wikipedia article on identifiers says in part:

An identifier is a name that identifies (that is, labels the identity of) either a unique object or a unique class of objects, where the “object” or class may be an idea, physical [countable] object (or class thereof), or physical [noncountable] substance (or class thereof). The abbreviation ID often refers to identity, identification (the process of identifying), or an identifier (that is, an instance of identification). An identifier may be a word, number, letter, symbol, or any combination of those.

(emphasis in original)

It goes on to say:


In computer science, identifiers (IDs) are lexical tokens that name entities. Identifiers are used extensively in virtually all information processing systems. Identifying entities makes it possible to refer to them, which is essential for any kind of symbolic processing.

There is an interesting shift in that last quote. Did you catch it?

The first two sentences are talking about identifiers but the third shifts to “[I]dentifying entities makes it possible to refer to them….” But single token identifiers aren’t the only means to identify an entity.

For example, a police record may identify someone by their Social Security Number and permit searching by that number, but it can also identify an individual by height, weight, eye/hair color, age, tattoos, etc.

But we have been taught from a very young age that I stands for Identifier, a single token that identifies an entity. Thus:

[Image: identifier2]

Single identifiers are found in “virtually all information systems,” not to mention writing from all ages and speech as well. They save us a great deal of time by allowing us to say “President Obama” without having to enumerate all the other qualities that collectively identify that subject.

Of course, the problem with single token identifiers is that we don’t all use the same ones and sometimes use the same ones for different things.

So long as we remain fixated on bare identifiers:

[Image: identifier2]

we will continue to see efforts to create new “persistent” identifiers. Not a bad idea for some purposes, but a rather limited one.

Instead of bare identifiers, what if we understood that identifiers stand in the place of all the qualities of the entities we wish to identify?

That is our identifiers were seen as being pregnant with the qualities of the entities they represent:

[Image: identifier-pregnant]

For some purposes, like unique keys in a database, our identifiers can be seen as opaque identifiers; that’s all there is to see.

For other purposes, such as indexing across different identifiers, then our identifiers are pregnant with the qualities that identify the entities they represent.

If we look at the qualities of the entities represented by two or more identifiers, we may discover that the same identifier represents two different entities, or we may discover that two (or more) identifiers represent the same entities.

I think we need to acknowledge the allure of bare identifiers (the ones we think we understand) and their usefulness in many circumstances. We should also observe that identifiers are in fact pregnant with the qualities of the entities they represent, enabling us to distinguish the same identifier but different entity case and to match different identifiers for the same entity.
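
A small sketch of those two cases (names and properties are invented): once each identifier carries its bundle of qualities, both the same identifier, different entity case and the different identifiers, same entity case become detectable.

```python
# Sketch: identifiers mapped to bundles of identifying properties.
# Names and properties are invented examples.

records = {
    "J. Smith":   [{"born": "1970", "city": "Leeds"},
                   {"born": "1983", "city": "Perth"}],   # same identifier,
                                                         # two different entities
    "John Smith": [{"born": "1970", "city": "Leeds"}],
}

def collisions(recs):
    """Identifiers whose property bundles describe more than one entity."""
    return [ident for ident, bundles in recs.items()
            if len({tuple(sorted(b.items())) for b in bundles}) > 1]

def aliases(recs):
    """Pairs of identifiers whose bundles describe the same entity."""
    pairs = []
    idents = list(recs)
    for i, a in enumerate(idents):
        for b in idents[i + 1:]:
            if any(x == y for x in recs[a] for y in recs[b]):
                pairs.append((a, b))
    return pairs

print(collisions(records))  # ['J. Smith']
print(aliases(records))     # [('J. Smith', 'John Smith')]
```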

Which type of identifier you need, bare or pregnant, depends upon your use case and requirements. Neither one is wholly suited for all purposes.

(Comments and suggestions are always welcome but especially on these snippets of material that will become part of a larger whole. On the artwork as well. I am trying to teach myself Gimp.)

May 27, 2015

Balisage 2015 Program Is Out!

Filed under: Conferences,Topic Maps,XML — Patrick Durusau @ 4:24 pm

Balisage 2015 Program

Tommie Usdin posted this message announcing the Balisage 2015 program:

I think this is an especially strong Balisage program with a good mix of theoretical and practical. The 2015 program includes case studies from journal publishing, regulatory compliance systems, and large-scale document systems; formatting XML for print and browser-based print formatting; visualizing XML structures and documents. Technical papers cover such topics as: MathML; XSLT; use of XML in government and the humanities; XQuery; design of authoring systems; uses of markup that vary from poetry to spreadsheets to cyber justice; and hyperdocument link management.

Good as far as it goes but a synopsis (omitting blurbs and debauchery events) of the program works better for me:

  • The art of the elevator pitch B. Tommie Usdin, Mulberry Technologies
  • Markup as index interface: Thinking like a search engine Mary Holstege, MarkLogic
  • Markup and meter: Using XML tools to teach a computer to think about versification David J. Birnbaum, Elise Thorsen, University of Pittsburgh
  • XML (almost) all the way: Experiences with a small-scale journal publishing system Peter Flynn, University College Cork
  • The state of MathML in K-12 educational publishing Autumn Cuellar, Design Science Jean Kaplansky, Safari Books Online
  • Diagramming XML: Exploring concepts, constraints and affordances Liam R. E. Quin, W3C
  • Spreadsheets – 90+ million end user programmers with no comment tracking or version control Patrick Durusau Sam Hunting
  • State chart XML as a modeling technique in web engineering Anne Brüggemann-Klein, Marouane Sayih, Zlatina Cheva, Technische Universität München
  • Implementing a system at US Patent and Trademark Office to fully automate the conversion of filing documents to XML Terrel Morris, US Patent and Trademark Office Mark Gross, Data Conversion Laboratory Amit Khare, CGI Federal
  • XML solutions for Swedish farmers: A case study Ari Nordström, Creative Words
  • XSDGuide — Automated generation of web interfaces from XML schemas: A case study for suspicious activity reporting Fabrizio Gotti, Université de Montréal Kevin Heffner, Pegasus Research & Technologies Guy Lapalme, Université de Montréal
  • Tricolor automata C. M. Sperberg-McQueen, Black Mesa Technologies; Technische Universität Darmstadt
  • Two from three (in XSLT) John Lumley, jωL Research / Saxonica
  • XQuery as a data integration language Hans-Jürgen Rennau, Traveltainment Christian Grün, BaseX
  • Smart content for high-value communications David White, Quark Software
  • Vivliostyle: An open-source, web-browser based, CSS typesetting engine Shinyu Murakami, Johannes Wilm, Vivliostyle
  • Panel discussion: Quality assurance in XML transformation
  • Comparing and diffing XML schemas Priscilla Walmsley, Datypic
  • Applying intertextual semantics to cyberjustice: Many reality checks for the price of one Yves Marcoux, Université de Montréal
  • UnderDok: XML structured attributes, change tracking, and the metaphysics of documents Claus Huitfeldt, University of Bergen, Norway
  • Hyperdocument authoring link management using Git and XQuery in service of an abstract hyperdocument management model applied to DITA hyperdocuments Eliot Kimber, Contrext
  • Extending the cybersecurity digital thread with XForms Joshua Lubell, National Institute of Standards and Technology
  • Calling things by their true names: Descriptive markup and the search for a perfect language C. M. Sperberg-McQueen, Black Mesa Technologies; Technische Universität Darmstadt

Now are you ready to register and make your travel arrangements?

Disclaimer: I have no idea why the presentation: Spreadsheets – 90+ million end user programmers with no comment tracking or version control is highlighted in your browser. Have you checked your router for injection attacks by the NSA? 😉

PS: If you are doing a one-day registration, the Spreadsheets presentation is Wednesday, August 12, 2015, 9:00 AM. Just saying.

May 14, 2015

Dynamical Systems on Networks: A Tutorial

Filed under: Dynamic Graphs,Dynamic Updating,Networks,Topic Maps — Patrick Durusau @ 2:55 pm

Dynamical Systems on Networks: A Tutorial by Mason A. Porter and James P. Gleeson.

Abstract:

We give a tutorial for the study of dynamical systems on networks. We focus especially on “simple” situations that are tractable analytically, because they can be very insightful and provide useful springboards for the study of more complicated scenarios. We briefly motivate why examining dynamical systems on networks is interesting and important, and we then give several fascinating examples and discuss some theoretical results. We also briefly discuss dynamical systems on dynamical (i.e., time-dependent) networks, overview software implementations, and give an outlook on the field.

At thirty-nine (39) pages and two hundred and sixty-three references, the authors leave the reader with an overview of the field and the tools to go further.

I am intrigued by the authors’ closing remarks:


Finally, many networks are multiplex (i.e., include multiple types of edges) or have other multilayer features [16, 136]. The existence of multiple layers over which dynamics can occur and the possibility of both structural and dynamical correlations between layers offers another rich set of opportunities to study dynamical systems on networks. The investigation of dynamical systems on multilayer networks is only in its infancy, and this area is also loaded with a rich set of problems [16, 136, 144, 205].

Topic maps can have multiple types of edges and multiple layers.
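
A tiny sketch of that claim (node and layer names are invented): several edge types, each its own layer, over the same set of nodes, with queries per layer or across all layers.

```python
# Sketch: a tiny multilayer network -- several edge types (layers) over the
# same set of nodes. Node and layer names are invented examples.

layers = {
    "co-author": [("Alice", "Bob")],
    "same-lab":  [("Alice", "Carol"), ("Bob", "Carol")],
    "cites":     [("Bob", "Alice")],
}

def neighbors(node, layer=None):
    """Neighbors of a node, in one layer or across all layers."""
    chosen = [layers[layer]] if layer else layers.values()
    return {b for edges in chosen for a, b in edges if a == node} | \
           {a for edges in chosen for a, b in edges if b == node}

print(neighbors("Alice"))               # {'Bob', 'Carol'} (order may vary)
print(neighbors("Alice", "co-author"))  # {'Bob'}
```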

For further reading on those topics see:

The structure and dynamics of multilayer networks by S. Boccaletti, G. Bianconi, R. Criado, C.I. del Genio, J. Gómez-Gardeñes, M. Romance, I. Sendiña-Nadal, Z. Wang, M. Zanin.

Abstract:

In the past years, network theory has successfully characterized the interaction among the constituents of a variety of complex systems, ranging from biological to technological, and social systems. However, up until recently, attention was almost exclusively given to networks in which all components were treated on equivalent footing, while neglecting all the extra information about the temporal- or context-related properties of the interactions under study. Only in the last years, taking advantage of the enhanced resolution in real data sets, network scientists have directed their interest to the multiplex character of real-world systems, and explicitly considered the time-varying and multilayer nature of networks. We offer here a comprehensive review on both structural and dynamical organization of graphs made of diverse relationships (layers) between its constituents, and cover several relevant issues, from a full redefinition of the basic structural measures, to understanding how the multilayer nature of the network affects processes and dynamics.

Multilayer Networks by Mikko Kivelä, Alexandre Arenas, Marc Barthelemy, James P. Gleeson, Yamir Moreno, Mason A. Porter.

Abstract:

In most natural and engineered systems, a set of entities interact with each other in complicated patterns that can encompass multiple types of relationships, change in time, and include other types of complications. Such systems include multiple subsystems and layers of connectivity, and it is important to take such “multilayer” features into account to try to improve our understanding of complex systems. Consequently, it is necessary to generalize “traditional” network theory by developing (and validating) a framework and associated tools to study multilayer systems in a comprehensive fashion. The origins of such efforts date back several decades and arose in multiple disciplines, and now the study of multilayer networks has become one of the most important directions in network science. In this paper, we discuss the history of multilayer networks (and related concepts) and review the exploding body of work on such networks. To unify the disparate terminology in the large body of recent work, we discuss a general framework for multilayer networks, construct a dictionary of terminology to relate the numerous existing concepts to each other, and provide a thorough discussion that compares, contrasts, and translates between related notions such as multilayer networks, multiplex networks, interdependent networks, networks of networks, and many others. We also survey and discuss existing data sets that can be represented as multilayer networks. We review attempts to generalize single-layer-network diagnostics to multilayer networks. We also discuss the rapidly expanding research on multilayer-network models and notions like community structure, connected components, tensor decompositions, and various types of dynamical processes on multilayer networks. We conclude with a summary and an outlook.

This may be where we collectively went wrong in marketing topic maps. Yes, topic maps can represent multilayer networks, but network theory has made $billions with an overly simplistic model that bears little resemblance to reality.

As computational resources improve and models at least somewhat closer to reality become popular, something between simplistic networks and the full generality of topic maps could be successful.

May 6, 2015

Topic Extraction and Bundling of Related Scientific Articles

Topic Extraction and Bundling of Related Scientific Articles by Shameem A Puthiya Parambath.

Abstract:

Automatic classification of scientific articles based on common characteristics is an interesting problem with many applications in digital library and information retrieval systems. Properly organized articles can be useful for automatic generation of taxonomies in scientific writings, textual summarization, efficient information retrieval etc. Generating article bundles from a large number of input articles, based on the associated features of the articles is tedious and computationally expensive task. In this report we propose an automatic two-step approach for topic extraction and bundling of related articles from a set of scientific articles in real-time. For topic extraction, we make use of Latent Dirichlet Allocation (LDA) topic modeling techniques and for bundling, we make use of hierarchical agglomerative clustering techniques.

We run experiments to validate our bundling semantics and compare it with existing models in use. We make use of an online crowdsourcing marketplace provided by Amazon called Amazon Mechanical Turk to carry out experiments. We explain our experimental setup and empirical results in detail and show that our method is advantageous over existing ones.

On “bundling” from the introduction:

Effective grouping of data requires a precise definition of closeness between a pair of data items and the notion of closeness always depend on the data and the problem context. Closeness is defined in terms of similarity of the data pairs which in turn is measured in terms of dissimilarity or distance between pair of items. In this report we use the term similarity,dissimilarity and distance to denote the measure of closeness between data items. Most of the bundling scheme start with identifying the common attributes(metadata) of the data set, here scientific articles, and create bundling semantics based on the combination of these attributes. Here we suggest a two step algorithm to bundle scientific articles. In the first step we group articles based on the latent topics in the documents and in the second step we carry out agglomerative hierarchical clustering based on the inter-textual distance and co-authorship similarity between articles. We run experiments to validate the bundling semantics and to compare it with content only based similarity. We used 19937 articles related to Computer Science from arviv [htt12a] for our experiments.

Is a “bundle” the same thing as a topic that represents “all articles on subject X?”

I have seen a number of topic map examples that use the equivalent of a proper noun, a proper subject, that is, a singular and unique subject.

But there is no reason why I could not have a topic that represents all the articles on deep learning written in 2014, for example. Methods such as the bundling techniques described here could prove to be quite useful in such cases.
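
For the curious, here is a toy reconstruction of the two-step idea with scikit-learn (my sketch, not the author's code; the real method also folds in co-authorship similarity, which I omit, and the corpus below is invented):

# Toy two-step bundling: LDA topic extraction, then agglomerative clustering
# over the topic mixtures. Corpus and cluster counts are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import AgglomerativeClustering

docs = [
    "deep learning for image recognition",
    "convolutional networks and image classification",
    "topic models for document clustering",
    "latent dirichlet allocation for text mining",
]

counts = CountVectorizer().fit_transform(docs)                  # term counts per article
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_mix = lda.fit_transform(counts)                           # step 1: latent topics
bundles = AgglomerativeClustering(n_clusters=2).fit_predict(topic_mix)  # step 2: bundles
print(bundles)   # e.g. [0 0 1 1], one bundle label per article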

May 5, 2015

One Subject, Three Locators

Filed under: Identifiers,Library,Topic Maps — Patrick Durusau @ 2:01 pm

As you may know, the Library of Congress actively maintains its subject headings. Not surprising to anyone other than purveyors of fixed ontologies. New subjects appear, terminology changes, old subjects have new names, etc.

The Subject Authority Cooperative Program (SACO) has a mailing list:

About the SACO Listserv (sacolist@loc.gov)

The SACO Program welcomes all interested parties to subscribe to the SACO listserv. This listserv was established first and foremost to facilitate communication with SACO contributors throughout the world. The Summaries of the Weekly Subject Editorial Review Meeting are posted to enable SACO contributors to keep abreast of changes and know if proposed headings have been approved or not. The listserv may also be used as a vehicle to foster discussions on the construction, use, and application of subject headings. Questions posted may be answered by any list member and not necessarily by staff in the Cooperative Programs Section (Coop) or PSD. Furthermore, participants are encouraged to provide comments, share examples, experiences, etc.

On the list this week was the question:

Does anyone know how these three sites differ as sources for consulting approved subject lists?

http://www.loc.gov/aba/cataloging/subject/weeklylists/

http://www.loc.gov/aba/cataloging/subject/

http://classificationweb.net/approved-subjects/

Janis L. Young, Policy and Standards Division, Library of Congress replied:

Just to clarify: all of the links that you and Paul listed take you to the same Approved Lists. We provide multiple access points to the information in order to accommodate users who approach our web site in different ways.

Depending upon your goals, the Approved Lists could be treated as a subject that has three locators.
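
A minimal sketch of that treatment, using a plain Python dict rather than any particular topic map API (only the three URLs come from the thread; everything else is mine):

# One subject, the LCSH Approved Lists, carrying three subject locators.
# Plain dict for illustration; not tied to any topic map toolkit.
approved_lists = {
    "name": "LCSH Approved Lists",
    "subject_locators": [
        "http://www.loc.gov/aba/cataloging/subject/weeklylists/",
        "http://www.loc.gov/aba/cataloging/subject/",
        "http://classificationweb.net/approved-subjects/",
    ],
}

def points_to(locator, topic):
    """Any one of the locators identifies the same subject."""
    return locator in topic["subject_locators"]

print(points_to("http://www.loc.gov/aba/cataloging/subject/", approved_lists))  # True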

April 16, 2015

Wandora – Heads Up! New release 2015-04-20

Filed under: Topic Map Software,Topic Maps,Wandora — Patrick Durusau @ 12:42 pm

No details, just saw a tweet about the upcoming release set for next Monday.

That is also the latest date for DARPA's new web search application to drop.

Could be the start of a busy week!

April 14, 2015

Attribute-Based Access Control with a graph database [Topic Maps at NIST?]

Filed under: Cybersecurity,Graphs,Neo4j,NIST,Security,Subject Identity,Topic Maps — Patrick Durusau @ 3:25 pm

Attribute-Based Access Control with a graph database by Robin Bramley.

From the post:

Traditional access control relies on the identity of a user, their role or their group memberships. This can become awkward to manage, particularly when other factors such as time of day, or network location come into play. These additional factors, or attributes, require a different approach, the US National Institute of Standards and Technology (NIST) have published a draft special paper (NIST 800-162) on Attribute-Based Access Control (ABAC).

This post, and the accompanying Graph Gist, explore the suitability of using a graph database to support policy decisions.

Before we dive into the detail, it’s probably worth mentioning that I saw the recent GraphGist on Entitlements and Access Control Management and that reminded me to publish my Attribute-Based Access Control GraphGist that I’d written some time ago, originally in a local instance having followed Stefan Armbruster’s post about using Docker for that very purpose.

Using a Property Graph, we can model attributes using relationships and/or properties. Fine-grained relationships without qualifier properties make patterns easier to spot in visualisations and are more performant. For the example provided in the gist, the attributes are defined using solely fine-grained relationships.

Graph visualization (and querying) of attribute-based access control.

I found this portion of the NIST draft particularly interesting:


There are characteristics or attributes of a subject such as name, date of birth, home address, training record, and job function that may, either individually or when combined, comprise a unique identity that distinguishes that person from all others. These characteristics are often called subject attributes. The term subject attributes is used consistently throughout this document.

In the course of a person’s life, he or she may work for different organizations, may act in different roles, and may inherit different privileges tied to those roles. The person may establish different personas for each organization or role and amass different attributes related to each persona. For example, an individual may work for Company A as a gate guard during the week and may work for Company B as a shift manager on the weekend. The subject attributes are different for each persona. Although trained and qualified as a Gate Guard for Company A, while operating in her Company B persona as a shift manager on the weekend she does not have the authority to perform as a Gate Guard for Company B.
…(emphasis in the original)

Clearly NIST recognizes that subjects, at least in the sense of people, are identified by a set of “subject attributes” that uniquely identify that subject. It doesn’t seem like much of a leap to extend that recognition to other subjects, including the attributes used to identify subjects.
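
To make the persona example concrete, here is a small sketch of an attribute-based decision in plain Python (the attributes and the policy rule are invented; the gist itself uses Neo4j and Cypher):

# Small ABAC sketch: the decision hinges on subject attributes plus context,
# not on identity alone. Attributes and policy are invented for illustration.
subject = {
    "name": "Dana",
    "employer": "Company A",
    "role": "gate guard",
    "training": {"gate-guard"},
}
resource = {"name": "main gate", "owner": "Company A"}

def permit(subject, resource, context):
    return (subject["employer"] == resource["owner"]
            and subject["role"] == "gate guard"
            and "gate-guard" in subject["training"]
            and context["day"] not in ("Saturday", "Sunday"))

print(permit(subject, resource, {"day": "Tuesday"}))   # True
print(permit(subject, resource, {"day": "Saturday"}))  # False: the weekend persona differs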

I don’t know what other US government agencies have similar language but it sounds like a starting point for a robust discussion of topic maps and their advantages.

Yes?

April 9, 2015

Almost a Topic Map? Or Just a Mashup?

Filed under: Digital Library,Library,Mashups,Topic Maps — Patrick Durusau @ 4:34 pm

WikipeDPLA by Eric Phetteplace.

From the webpage:

See relevant results from the Digital Public Library of America on any Wikipedia article. This extension queries the DPLA each time you visit a Wikipedia article, using the article’s title, redirects, and categories to find relevant items. If you click a link at the top of the article, it loads in a series of links to the items. The original code behind WikiDPLA was written at LibHack, a hackathon at the American Library Association’s 2014 Midwinter Meeting in Philadelphia: http://www.libhack.org/.

Google Chrome App Home Page

GitHub page

Wikipedia:The Wikipedia Library/WikipeDPLA

How you resolve the topic map versus mashup question depends on how much precision you expect from a topic map. While knowing additional places to search is useful, I never have a problem with assembling more materials than can be read in the time allowed. On the other hand, some people may need more prompting than others, so I can’t say that general references are out of bounds.

Assuming you maintain data sets with locally unique identifiers, a modification of this script could query an index of all local scripts (say Pig scripts) to discover other scripts using the same data. That could be quite useful.
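
A rough sketch of that idea (the directory layout and the LOAD regex are assumptions for illustration):

# Rough sketch: index which local Pig scripts LOAD which data sets, so that
# looking up one data set surfaces every script that touches it.
import re
from pathlib import Path
from collections import defaultdict

LOAD = re.compile(r"LOAD\s+'([^']+)'", re.IGNORECASE)

def index_scripts(script_dir="pig-scripts"):
    index = defaultdict(list)          # data set path -> scripts that load it
    for script in Path(script_dir).glob("*.pig"):
        for dataset in LOAD.findall(script.read_text()):
            index[dataset].append(script.name)
    return index

for dataset, scripts in index_scripts().items():
    print(dataset, "->", scripts)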

BTW, you need to have a Wikipedia account and be logged in for the extension to work. Or at least that was my experience.

Enjoy!

March 23, 2015

Unstructured Topic Map-Like Data Powering AI

Filed under: Annotation,Artificial Intelligence,Authoring Topic Maps,Topic Maps — Patrick Durusau @ 2:55 pm

Artificial Intelligence Is Almost Ready for Business by Brad Power.

From the post:

Such mining of digitized information has become more effective and powerful as more info is “tagged” and as analytics engines have gotten smarter. As Dario Gil, Director of Symbiotic Cognitive Systems at IBM Research, told me:

“Data is increasingly tagged and categorized on the Web – as people upload and use data they are also contributing to annotation through their comments and digital footprints. This annotated data is greatly facilitating the training of machine learning algorithms without demanding that the machine-learning experts manually catalogue and index the world. Thanks to computers with massive parallelism, we can use the equivalent of crowdsourcing to learn which algorithms create better answers. For example, when IBM’s Watson computer played ‘Jeopardy!,’ the system used hundreds of scoring engines, and all the hypotheses were fed through the different engines and scored in parallel. It then weighted the algorithms that did a better job to provide a final answer with precision and confidence.”

Granted, the tagging and annotation is unstructured, unlike a topic map, but it is also unconstrained by first order logic and the other crippling features of RDF and OWL. Out of that mass of annotations, algorithms can construct useful answers.

Imagine what non-experts (Stanford logic refugees need not apply) could author about your domain, to be fed into an AI algorithm. That would take more effort than relying upon users chancing upon subjects of interest but it would also give you greater precision in the results.

Perhaps, just perhaps, one of the errors in the early topic maps days was the insistence on high editorial quality at the outset, as opposed to allowing editorial quality to emerge out of data.

As an editor I’m far more in favor of the former than the latter, but seeing the latter work makes me doubt that stringent editorial control is the only path to an acceptable degree of editorial quality.

What would a rough-cut topic map authoring interface look like?
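
One possible answer, sketched very roughly: capture whatever identifications users volunteer and defer the editorial polish. The field names below are mine, not any standard:

# Very rough sketch of what a rough-cut authoring interface might capture:
# free-form user annotations, cleaned up (or learned from) later.
from dataclasses import dataclass, field

@dataclass
class RoughAnnotation:
    subject_label: str                               # whatever the user called it
    identifiers: list = field(default_factory=list)  # URLs, IDs, nicknames
    context: str = ""                                # where the user was working
    comment: str = ""                                # free text, no vocabulary enforced

note = RoughAnnotation(
    subject_label="Watson (Jeopardy!)",
    identifiers=["https://en.wikipedia.org/wiki/Watson_(computer)"],
    context="reading an HBR piece on AI",
    comment="IBM's question answering system",
)
print(note.subject_label, note.identifiers)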

Suggestions?

March 18, 2015

UK Bioinformatics Landscape

Filed under: Bioinformatics,Topic Maps — Patrick Durusau @ 4:16 pm

UK Bioinformatics Landscape

Two of the four known challenges in the UK bioinformatics landscape could be addressed by topic maps:

  • Data integration and management of omics data, enabling the use of “big data” across thematic areas and to facilitate data sharing;
  • Open innovation, or pre-competitive approaches in particular to data sharing and data standardisation

I say could be addressed by topic maps because I’m not sure what else you would use to address data integration issues, at least not robustly. If you don’t mind paying to migrate data when terminology changes enough to impair your effectiveness, and continuing to pay for every future migration, I suppose that is one solution.

Given the choice, I suspect many people would like to exit the wheel of ETL.

Wandora tutorial – OCR extractor and Alchemy API Entity extractor

Filed under: Entity Resolution,OCR,Topic Map Software,Topic Maps,Wandora — Patrick Durusau @ 1:47 pm

From the description:

Video reviews the OCR (Optical Character Recognition) extractor and the Alchemy API Entity extractor of Wandora application. First, the OCR extractor is used to recognize text out of PNG images. Next the Alchemy API Entity extractor is used to recognize entities out of the text. Wandora is an open source tool for people who collect and process information, especially networked knowledge and knowledge about WWW resources. For more information see http://wandora.org.

A great demo of some of the many options of Wandora! (Wandora has more options than a Swiss army knife.)

It is an impressive demonstration.

If you aren’t familiar with Wandora, take a close look at it: http://wandora.org.

Full rules for protecting net neutrality released by FCC

Filed under: Government,Politics,Topic Maps — Patrick Durusau @ 1:08 pm

Full rules for protecting net neutrality released by FCC by Lisa Vaas

From the post:

The US Federal Communications Commission (FCC) on Thursday lay down 400 pages worth of details on how it plans to regulate broadband providers as a public utility.

These are the rules – and their legal justifications – meant to protect net neutrality.

Hardly the first word on net neutrality but it is a good centering point for much of the discussion that will follow. Think of using the document as a gateway into the larger discussion. A gateway that can lead you to interesting financial interests and relationships.

In response to provider claims about slow development of faster access and services, I would remind providers that the government built the Internet; it could certainly build another one. It could even contract out to Google to build one for it.

A WPA-type project managed for quality purposes by Google. Then the government could lease equal access to its TB pipe. That changes the dynamics: instead of providers holding consumers hostage, a large competitor is pushing against large providers.

PS: To anyone who thinks government competing with private business is “unfair,” given the conduct of private business, I wonder what you are using as a basis for comparison?

March 16, 2015

Flock: Hybrid Crowd-Machine Learning Classifiers

Filed under: Authoring Topic Maps,Classifier,Crowd Sourcing,Machine Learning,Topic Maps — Patrick Durusau @ 3:09 pm

Flock: Hybrid Crowd-Machine Learning Classifiers by Justin Cheng and Michael S. Bernstein.

Abstract:

We present hybrid crowd-machine learning classifiers: classification models that start with a written description of a learning goal, use the crowd to suggest predictive features and label data, and then weigh these features using machine learning to produce models that are accurate and use human-understandable features. These hybrid classifiers enable fast prototyping of machine learning models that can improve on both algorithm performance and human judgment, and accomplish tasks where automated feature extraction is not yet feasible. Flock, an interactive machine learning platform, instantiates this approach. To generate informative features, Flock asks the crowd to compare paired examples, an approach inspired by analogical encoding. The crowd’s efforts can be focused on specific subsets of the input space where machine-extracted features are not predictive, or instead used to partition the input space and improve algorithm performance in subregions of the space. An evaluation on six prediction tasks, ranging from detecting deception to differentiating impressionist artists, demonstrated that aggregating crowd features improves upon both asking the crowd for a direct prediction and off-the-shelf machine learning features by over 10%. Further, hybrid systems that use both crowd-nominated and machine-extracted features can outperform those that use either in isolation.

Let’s see: suggest predictive features (subject identifiers in the non-topic map technical sense) and label data (identify instances of a subject). That sounds a lot easier than some of the tedium I have seen for authoring a topic map.

I particularly like the “inducing” of features versus relying on a crowd to suggest identifying features. I suspect that would work well in a topic map authoring context, sans the machine learning aspects.
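
To see the hybrid pattern in miniature, here is a toy sketch (mine, not the Flock system): crowd-nominated yes/no features and crowd labels, weighted by an off-the-shelf learner:

# Toy sketch of the hybrid pattern: the crowd nominates human-readable
# yes/no features and labels a few examples; a learner weighs the features.
# Features and data are invented for illustration.
from sklearn.linear_model import LogisticRegression

# Crowd-nominated features for "is this review deceptive?":
#   [mentions price?, uses superlatives?, first-person plural?]
X = [[1, 1, 0],
     [0, 1, 1],
     [0, 0, 0],
     [1, 0, 0]]
y = [1, 1, 0, 0]          # crowd-supplied labels

model = LogisticRegression().fit(X, y)
print(model.coef_)                  # per-feature weights, still human-readable
print(model.predict([[1, 1, 1]]))   # prediction for a new example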

This paper is being presented this week, CSCW 2015, so you aren’t too far behind. 😉

How would you structure an inducement mechanism for authoring a topic map?

March 14, 2015

KDE and The Semantic Desktop

Filed under: Linux OS,Merging,RDF,Semantics,Topic Maps — Patrick Durusau @ 2:30 pm

KDE and The Semantic Desktop by Vishesh Handa.

From the post:

During the KDE4 years the Semantic Desktop was one of the main pillars of KDE. Nepomuk was a massive, all encompassing, and integrated with many different part of KDE. However few people know what The Semantic Desktop was all about, and where KDE is heading.

History

The Semantic Desktop as it was originally envisioned comprised of both the technology and the philosophy behind The Semantic Web.

The Semantic Web is built on top of RDF and Graphs. This is a special way of storing data which focuses more on understanding what the data represents. This was primarily done by carefully annotating what everything means, starting with the definition of a resource, a property, a class, a thing, etc.

This process of all data being stored as RDF, having a central store, with applications respecting the store and following the ontologies was central to the idea of the Semantic Desktop.

The Semantic Desktop cannot exist without RDF. It is, for all intents and purposes, what the term “semantic” implies.

A brief post-mortem on the KDE Semantic Desktop which relied upon NEPOMUK (Networked Environment for Personal, Ontology-based Management of Unified Knowledge) for RDF-based features. (NEPOMUK was an EU project.)

The post mentions complexity more than once. A friend recently observed that RDF was all about supporting AI and not capturing arbitrary statements by a user.

Such as providing alternative identifiers for subjects. With enough alternative identifications (including context, which “scope” partially captures in topic maps), I suspect a deep learning application could do pretty well at subject recognition, including appropriate relationships (associations).

But that would not be by trying to guess or formulate formal rules (a la RDF/OWL) but by capturing the activities of users as they provide alternative identifications of and relationships for subjects.

Hmmm, merging then would be a learned behavior by our applications. Will have to give that some serious thought!

I first saw this in a tweet by Stefano Bertolo.

March 10, 2015

MIT Group Cites “Data Prep” as a Data Science Bottleneck

Filed under: Data Science,ETL,Topic Maps — Patrick Durusau @ 7:38 pm

MIT Group Cites “Data Prep” as a Data Science Bottleneck

The bottleneck is varying data semantics. No stranger to anyone interested in topic maps. The traditional means of solving that problem is to clean the data for one purpose, which, unless the basis for cleaning is recorded, leaves the data dirty for the next round of integration.

What do you think is being described in this text?

Much of Veeramachaneni’s recent research has focused on how to automate this lengthy data prep process. “Data scientists go to all these boot camps in Silicon Valley to learn open source big data software like Hadoop, and they come back, and say ‘Great, but we’re still stuck with the problem of getting the raw data to a place where we can use all these tools,’” Veeramachaneni says.

The proliferation of data sources and the time it takes to prepare these massive reserves of data are the core problems Tamr is attacking. The knee-jerk reaction to this next-gen integration and preparation problem tends to be “Machine Learning” — a cure for all ills. But as Veeramachaneni points out, machine learning can’t resolve all data inconsistencies:

Veeramachaneni and his team are also exploring how to efficiently integrate the expertise of domain experts, “so it won’t take up too much of their time,” he says. “Our biggest challenge is how to use human input efficiently, and how to make the interactions seamless and efficient. What sort of collaborative frameworks and mechanisms can we build to increase the pool of people who participate?”

Tamr has built the very sort of collaborative framework Veeramachaneni mentions, drawing from the best of machine and human learning to connect hundreds or thousands of data sources.

Top-down, deterministic data unification approaches (such as ETL, ELT and MDM) were not designed to scale to the variety of hundreds or thousands or even tens of thousands of data silos (perpetual and proliferating). Traditional deterministic systems depend on a highly trained architect developing a “master” schema — “the one schema to rule them all” — which we believe is a red herring. Embracing the fundamental diversity and ever-changing nature of enterprise data and semantics leads you towards a bottom up, probabalistic approach to connecting data sources from various enterprise silos.

You also have to engage the source owners collaboratively to curate the variety of data at scale, which is Tamr’s core design pattern. Advanced algorithms automatically connect the vast majority of the sources while resolving duplications, errors and inconsistencies among source data of sources, attributes and records — a bottom-up, probabilistic solution that is reminiscent of Google’s full-scale approach to web search and connection. When the Tamr system can’t resolve connections automatically, it calls for human expert guidance, using people in the organization familiar with the data to weigh in on the mapping and improve its quality and integrity.

Off hand I would say it is a topic map authoring solution that features algorithms to assist the authors where authoring has been crowd-sourced.

What I don’t know is whether the insight of experts is captured as dark data (A matches B) or if their identifications are preserved so they can be re-used in the future (The properties of A that result in a match with the properties of B).
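
The difference is easy to see in miniature; a sketch of my own, with invented records, of the two ways an expert judgment might be captured:

# Two ways to capture an expert's judgment that records A and B match.
# Records and property names are invented for illustration.

# 1. Dark data: only the verdict survives; nothing says *why* they match.
dark = ("A", "B")

# 2. Preserved identification: the properties behind the match are recorded,
#    so the same judgment can be re-applied to tomorrow's data silos.
preserved = {
    "match": ("A", "B"),
    "basis": {"email": "j.smith@example.com", "tax_id": "123-45-6789"},
}

def rematch(record, judgment):
    """Re-use the preserved basis against a new record."""
    return all(record.get(k) == v for k, v in judgment["basis"].items())

new_record = {"email": "j.smith@example.com", "tax_id": "123-45-6789", "name": "Jon Smith"}
print(rematch(new_record, preserved))   # True: the earlier insight carries forward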

I didn’t register so I can’t see the “white paper.” Let me know how close I came if you decide to get it. Scientists are donating research data in the name of open science but startups are still farming registration data.
