Archive for the ‘Topic Maps’ Category

Searching for Subjects: Which Method is Right for You?

Wednesday, April 20th, 2016

Leaving to one side how to avoid re-evaluating the repetitive glut of materials from any search, there is the more fundamental problem: how do you search for a subject?

This is a back-of-the-envelope sketch that I will be expanding, but here goes:

Basic Search

At its most basic, a search consists of a <term>, and the search engine seeks strings that match that <term>.

Even allowing for Boolean operators, the matches against <term> are only and forever string matches.

Basic Search + Synonyms

Of course, as skilled searchers you will try not only one <term>, but several <synonym>s for the term as well.

A good example of that strategy is used at PubMed:

If you enter an entry term for a MeSH term the translation will also include an all fields search for the MeSH term associated with the entry term. For example, a search for odontalgia will translate to: “toothache”[MeSH Terms] OR “toothache”[All Fields] OR “odontalgia”[All Fields] because Odontalgia is an entry term for the MeSH term toothache. [PubMed Help]
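That translation can be sketched as a lookup in a hand-maintained synonym table. The table contents below are an illustrative subset, not PubMed’s actual data:

```python
# Sketch of a PubMed-style term translation: an entry term expands to
# an OR over its MeSH term and the original string. The SYNONYMS table
# is an illustrative subset, not PubMed's actual data.
SYNONYMS = {
    "odontalgia": "toothache",  # entry term -> MeSH term
}

def expand_query(term):
    """Translate a user term into an OR of field-qualified searches."""
    mesh = SYNONYMS.get(term.lower())
    if mesh is None:
        return f'"{term}"[All Fields]'
    return (f'"{mesh}"[MeSH Terms] OR "{mesh}"[All Fields] '
            f'OR "{term}"[All Fields]')

print(expand_query("odontalgia"))
# "toothache"[MeSH Terms] OR "toothache"[All Fields] OR "odontalgia"[All Fields]
```

Note that the table is the whole ballgame: the expansion is only as good as the maintained mapping behind it.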

The expansion from the entry term Odontalgia to the MeSH term toothache is useful, but how do you maintain it?

A reader can see “toothache” and “Odontalgia” are treated as synonyms, but why remains elusive.

This is the area of owl:sameAs, the mapping of multiple subject identifiers/locators to a single topic, etc. You know that “sameness” exists, but why isn’t clear.

Subject Identity Properties

In order to maintain a PubMed-style mapping, you need people who either “know” the basis for the mappings, or you need the mappings documented. That is, you can say on what basis the mapping happened and what properties were present.

For example:


Key                 Value
symptom             pain
general-location    mouth
specific-location   tooth

So if we are mapping terms to other terms and the specific location value reads “tongue,” then we know that isn’t a mapping to “toothache.”
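As a sketch, that property-based test is simple to implement. The first two terms below carry the properties from the table above; “glossalgia” is a hypothetical third term added for contrast:

```python
# Sketch: treat two terms as the same subject only when their documented
# identity properties agree. "glossalgia" is a hypothetical term added
# for contrast with the table above.
toothache = {"symptom": "pain", "general-location": "mouth",
             "specific-location": "tooth"}
odontalgia = {"symptom": "pain", "general-location": "mouth",
              "specific-location": "tooth"}
glossalgia = {"symptom": "pain", "general-location": "mouth",
              "specific-location": "tongue"}

def same_subject(a, b):
    """Terms map to one subject when every shared key has the same value."""
    shared = a.keys() & b.keys()
    return bool(shared) and all(a[k] == b[k] for k in shared)

print(same_subject(toothache, odontalgia))  # True: documented synonyms
print(same_subject(toothache, glossalgia))  # False: specific-location differs
```

The point isn’t the code, it’s that the basis for each mapping is recorded and testable, rather than living in someone’s head.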

How Far Do You Need To Go?

Of course for every term that we use as a key or value, there can be an expansion into key/value pairs, such as for tooth:


Key                 Value
general-location    mouth
composition         enamel-coated bone
use                 biting, chewing


Each step towards more precise gathering of information increases your pre-search costs but decreases your post-search cost of casting out irrelevant material.

Moreover, precise gathering of information will help you avoid missing data simply due to data glut returns.

If maintenance of your mapping across generations is a concern, you may need more than a mapping of synonyms whose reasons go unrecorded.

The point being that your current retrieval of correct information, or your failure to retrieve it, has a cost. As does improving your current retrieval.

The question of improved retrieval isn’t an ideological one but an ROI-driven one.

  • If you have better mappings will that give you an advantage over N department/agency?
  • Will better retrieval reduce (it will never eliminate) the time staff waste on voluminous search results?
  • Will more precision focus your limited resources (always limited) on highly relevant materials?

Formulate your own ROI questions and means of measuring them. Then reach out to topic maps to see how they improve (or not) your ROI.

Properly used, I think you are in for a pleasant surprise with topic maps.

Dictionary of Fantastic Vocabulary [Increasing the Need for Topic Maps]

Monday, April 18th, 2016

Dictionary of Fantastic Vocabulary by Greg Borenstein.

Alexis Lloyd tweeted this link along with:

This is utterly fantastic.

Well, it certainly increases the need for topic maps!

From the bot description on Twitter:

Generating new words with new meanings out of the atoms of English.

Ahem, are you sure about that?

Is a bot generating meaning?

Or are readers conferring meaning on the new words as they are read?

If, as I contend, readers confer meaning, the utterance of every “new” word opens up as many new meanings as there are readers of the “new” word.

Example of people conferring different meanings on a term?

Ask a dozen people what is meant by “shot” in:

It’s just a shot away

When Lisa Fischer breaks into her solo in:

(Best played loud.)

Differences in meanings make for funny moments, awkward pauses, blushes, in casual conversation.

What if the stakes are higher?

What if you need to produce (or destroy) all the emails by “bobby1”?

Is it enough to find some of them?

What have you looked for lately? Did you find all of it? Or only some of it?

New words appear every day.

You are already behind. You will get further behind using search.

Visualizing Data Loss From Search

Thursday, April 14th, 2016

I used searches for “duplicate detection” (3,854) and “coreference resolution” (3,290) in “Ironically, Entity Resolution has many duplicate names” [Data Loss] to illustrate potential data loss in searches.

Here is a rough visualization of the information loss if you use only one of those terms:


If you search for “duplicate detection,” you miss all the articles shaded in blue.

If you search for “coreference resolution,” you miss all the articles shaded in yellow.

Suggestions for improving this visualization?

It is a visualization that could be performed on a client’s data, using their search engine/database, in order to identify the data loss they are suffering now from search across departments.

With the caveat that not all data loss is bad and/or worth avoiding.

Imaginary example (so far): What if you could demonstrate that there is no overlap in terminology between two vendors, one serving the United States Army and one the Air Force? That is, no query terms for one return useful results for the other.

That is a starting point for evaluating the use of topic maps.

While the divergence in terminologies is a given, the next question is: What is the downside to that divergence? What capability is lost due to that divergence?

Assuming you can identify such a capability, the next step is to evaluate the cost of reducing and/or eliminating that divergence versus the claimed benefit.

I assume the most relevant terms are going to be those internal to customers and/or potential customers.

Interest in working this up into a client prospecting/topic map marketing tool?

Separately I want to note my discovery (you probably already knew about it) of VennDIS: a JavaFX-based Venn and Euler diagram software to generate publication quality figures. Download here. (Apologies, the publication itself is firewalled.)

The export defaults to 800 x 800 resolution. If you need something smaller, edit the resulting image in Gimp.

It’s a testimony to the software that I was able to produce a useful image in less than a day. Kudos to the software!

“Ironically, Entity Resolution has many duplicate names” [Data Loss]

Wednesday, April 13th, 2016

Nancy Baym tweeted:

“Ironically, Entity Resolution has many duplicate names” – Lise Getoor


I can’t think of any subject that doesn’t have duplicate names.

Can you?

In a “search driven” environment, not knowing the “duplicate” names for a subject means data loss.

Data loss that could include “smoking gun” data.

Topic mappers have been making that pitch for decades, but it has never really caught fire.

I don’t think anyone doubts that data loss occurs, but the gravity of that data loss remains elusive.

For example, let’s take three duplicate names for entity resolution from the slide: duplicate detection, reference reconciliation, and coreference resolution.

Supplying all three as quoted strings to CiteSeerX, any guesses on the number of “hits” returned?

As of April 13, 2016:

  • duplicate detection – 3,854
  • reference reconciliation – 253
  • coreference resolution – 3,290

When running the query "duplicate detection" "coreference resolution", only 76 “hits” are returned, meaning that only 76 articles overlap out of the 7,144 total hits for those two terms taken separately.

That’s assuming CiteSeerX isn’t shorting me on results due to server load, etc. I would have to cross-check the data itself before I would swear to those figures.

But consider just the raw numbers I report today: duplicate detection – 3,854, coreference resolution – 3,290, with 76 overlapping cases.
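As a quick check on those raw numbers:

```python
# The CiteSeerX counts reported above, as of April 13, 2016.
duplicate_detection = 3854
coreference_resolution = 3290
overlap = 76  # hits matching both quoted phrases

# Articles reachable only by searching both terms.
union = duplicate_detection + coreference_resolution - overlap
print(union)  # 7068

# Searching one term alone misses everything unique to the other.
missed_by_dd = coreference_resolution - overlap
missed_by_cr = duplicate_detection - overlap
print(missed_by_dd)  # 3214 articles missed by "duplicate detection" alone
print(missed_by_cr)  # 3778 articles missed by "coreference resolution" alone
```

Either single-term search silently drops roughly 45 percent of the combined literature.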

That’s two distinct lines of research on the same problem, for the most part, ignoring the other.

What do you think the odds are of duplication of techniques, experiences, etc., spread out over those 7,144 articles?

Instead of you or your client duplicating a known-to-somebody solution, you could be building an enhanced solution.

Well, except for the data loss due to “duplicate names” in a search environment.

And that you would have to re-read all the articles in order to find which technique or advancement was made in each article.

Multiply that by everyone who is interested in a subject and it’s a non-trivial amount of effort.

How would you like to avoid data loss and duplication of effort?

Coeffects: Context-aware programming languages – Subject Identity As Type Checking?

Tuesday, April 12th, 2016

Coeffects: Context-aware programming languages by Tomas Petricek.

From the webpage:

Coeffects are Tomas Petricek‘s PhD research project. They are a programming language abstraction for understanding how programs access the context or environment in which they execute.

The context may be resources on your mobile phone (battery, GPS location or a network printer), IoT devices in a physical neighborhood or historical stock prices. By understanding the neighborhood or history, a context-aware programming language can catch bugs earlier and run more efficiently.

This page is an interactive tutorial that shows a prototype implementation of coeffects in a browser. You can play with two simple context-aware languages, see how the type checking works and how context-aware programs run.

This page is also an experiment in presenting programming language research. It is a live environment where you can play with the theory using the power of new media, rather than staring at a dead pieces of wood (although we have those too).

(break from summary)

Programming languages evolve to reflect the changes in the computing ecosystem. The next big challenge for programming language designers is building languages that understand the context in which programs run.

This challenge is not easy to see. We are so used to working with context using the current cumbersome methods that we do not even see that there is an issue. We also do not realize that many programming features related to context can be captured by a simple unified abstraction. This is what coeffects do!

What if we extend the idea of context to include the context within which words appear?

For example, suppose the following sentence appeared in a police report:

There were 20 or more <proxy string="black" pos="noun" synonym="African" type="race"/>s in the group.

For display purposes, the string value “black” appears in the sentence:

There were 20 or more blacks in the group.

But a search for the color “black” would not return that report because the type = color does not match type = race.

On the other hand, if I searched for African-American, that report would show up because “black” with type = race is recognized as a synonym for people of African extraction.
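A sketch of that type-aware matching, using the (illustrative) attribute names from the proxy above:

```python
import xml.etree.ElementTree as ET

# The inline proxy from the report above, with the same illustrative
# attribute names.
proxy = ET.fromstring(
    '<proxy string="black" pos="noun" synonym="African" type="race"/>')

def matches(proxy, term, term_type):
    """Match on the display string or a synonym, but only within one type."""
    if proxy.get("type") != term_type:
        return False  # e.g. type = color never matches type = race
    return term in (proxy.get("string"), proxy.get("synonym"))

print(matches(proxy, "black", "color"))   # False: the color search skips it
print(matches(proxy, "African", "race"))  # True: synonym match, same type
```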

Inline proxies are the easiest to illustrate but that is only one way to serialize such a result.

If done in an authoring interface, such an approach would have the distinct advantage of offering the original author the choice of subject properties.

The advantage of involving the original author is that they have an interest in and awareness of the document in question. Quite unlike automated processes that later attempt annotation by rote.

No Perception Without Cartography [Failure To Communicate As Cartographic Failure]

Saturday, April 9th, 2016

Dan Klyn tweeted:

No perception without cartography

with an image of this text (from Self comes to mind: constructing the conscious mind by Antonio R Damasio):

The nonverbal kinds of images are those that help you display mentally the concepts that correspond to words. The feelings that make up the background of each mental instant and that largely signify aspects of the body state are images as well. Perception, in whatever sensory modality, is the result of the brain’s cartographic skill.

Images represent physical properties of entities and their spatial and temporal relationships, as well as their actions. Some images, which probably result from the brain’s making maps of itself making maps, are actually quite abstract. They describe patterns of occurrence of objects in time and space, the spatial relationships and movement of objects in terms of velocity and trajectory, and so forth.

Dan’s tweet spurred me to think that our failures to communicate to others could be described as cartographic failures.

If we use a term that is unknown to the average reader, say “daat,” the reader lacks a mental mapping that enables interpretation of that term.

Even if you know the term, it doesn’t stand in isolation in your mind. It fits into a number of maps, some of which you may be able to articulate and very possibly into other maps, which remain beyond your (and our) ken.

Not that this is a light-bulb moment for you or me, but perhaps the cartographic imagery may be helpful in illustrating both the value and the risks of topic maps.

The value of topic maps is spoken of often but the risks of topic maps rarely get equal press.

How would topic maps be risky?

Well, consider the average spreadsheet used in a business setting.

Felienne Hermans in Spreadsheets: The Ununderstood Dark Matter of IT makes a persuasive case that spreadsheets are, on average, five years old with little or no documentation.

If those spreadsheets remain undocumented, both users and auditors are equally stymied by their ignorance, a cartographic failure that leaves both wondering what must have been meant by columns and operations in the spreadsheet.

To the extent that a topic map or other disclosure mechanism preserves and/or restores the cartography that enables interpretation of the spreadsheet, suddenly staff are no longer plausibly ignorant of the purpose or consequences of using the spreadsheet.

Facile explanations that change from audit to audit are no longer possible. Auditors are chargeable with consistent auditing from one audit to another.

Does it sound like there is going to be a rush to use topic maps or other mechanisms to make spreadsheets transparent?

Still, transparency that befalls one could well benefit another.

Or to paraphrase King David (2 Samuel 11:25):

Don’t concern yourself about this. In business, transparency falls on first one and then another.

Ready to inflict transparency on others?

“No One Willingly Gives Away Power”

Friday, April 8th, 2016

Matthew Schofield in European anti-terror efforts hobbled by lack of trust, shared intelligence hits upon the primary reason for resistance to topic maps and other knowledge integration technologies.

“Especially in intelligence, knowledge is power. No one willingly gives away power.” (Magnus Ranstorp, Swedish National Defense University)

From clerks who sort mail to accountants who cook the books to lawyers that defend patents and everyone else in between, everyone in an enterprise has knowledge, knowledge that gives them power others don’t have.

Topic maps have been pitched on a “greater good for the whole” basis but as Magnus points out, who the hell really wants that?

When confronted with a new technique, technology, methodology, the first and foremost question on everyone’s mind is:

Do I have more/less power/status with X?


An approach that is seen to cost them power loses.


An approach that is seen to gain them power wins.

Relevant lyrics:

Oh, there ain’t no rest for the wicked
Money don’t grow on trees
I got bills to pay
I got mouths to feed
And ain’t nothing in this world for free
No I can’t slow down
I can’t hold back
Though you know I wish I could
No there ain’t no rest for the wicked
Until we close our eyes for good

Sell topic maps to increase/gain power.

PS: Keep the line, “No one willingly gives away power” in discussions of why the ICIJ refuses to share the Panama Papers with the public.

Pentagon Confirms Crowdsourcing of Map Data

Tuesday, April 5th, 2016

I have mentioned before, in Tracking NSA/CIA/FBI Agents Just Got Easier and The DEA is Stalking You!, how citizens can invite federal agents to join the goldfish bowl being prepared for the average citizen.

Of course, that’s just me saying it, unless and until the Pentagon confirms the crowdsourcing of map data!

Aliya Sternstein writes in Soldiers to Help Crowdsource Spy Maps:

“What a great idea if we can get our soldiers adding fidelity to the maps and operational picture that we already have” in Defense systems, Gordon told Nextgov. “All it requires is pushing out our product in a manner that they can add data to it against a common framework.”

Comparing mapping parties to combat support activities, she said, soldiers are deployed in some pretty remote areas where U.S. forces are not always familiar with the roads and the land, partly because they tend to change.

If troops have a base layer, “they can do basically the same things that that social party does and just drop pins and add data,” Gordon said from a meeting room at the annual Esri conference. “Think about some of the places in Africa and some of the less advantaged countries that just don’t have addresses in the way we do” in the United States.

Of course, you already realize the value of crowd-sourcing surveillance of government agents but for the c-suite crowd, confirmation from a respected source (the Pentagon) may help push your citizen surveillance proposal forward.

BTW, while looking at Army GeoData research plans (pages 228-232), I ran across this passage:

This effort integrates behavior and population dynamics research and analysis to depict the operational environment including culture, demographics, terrain, climate, and infrastructure, into geospatial frameworks. Research exploits existing open source text, leverages multi-media and cartographic materials, and investigates data collection methods to ingest geospatial data directly from the tactical edge to characterize parameters of social, cultural, and economic geography. Results of this research augment existing conventional geospatial datasets by providing the rich context of the human aspects of the operational environment, which offers a holistic understanding of the operational environment for the Warfighter. This item continues efforts from Imagery and GeoData Sciences, and Geospatial and Temporal Information Structure and Framework and complements the work in PE 0602784A/Project T41.

Doesn’t that just reek of subjects that would be identified differently in intersecting information systems?

One solution would be to fashion top-down mapping systems that are months if not years behind demands in an operational environment. Sort of like tanks that overheat in jungle warfare.

Or you could do something a bit more dynamic that provides a “good enough” mapping for operational needs and yet also has the information necessary to integrate it with other temporary solutions.

Pardon the Intermission

Friday, March 18th, 2016

Apologies for the absence of posts starting on March 15, 2016 until this one today.

I made an unplanned trip to the local hospital via ambulance around 8:00 AM on the 15th and managed to escape on the afternoon of March 17, 2016.

On the downside, I didn’t have any way to explain my sudden absence from the Net.

On the upside I had a lot of non-computer assisted time to think about topic maps, etc., while being poked, prodded, waiting for lab results, etc.

Not to mention I re-read the first two Harry Potter books. 😉

I have one interesting item for today and will be posting about my non-computer assisted thinking about topic maps in the near future.

Your interest in this blog and comments are always appreciated!

Technology Adoption – Nearly A Vertical Line (To A Diminished IQ)

Thursday, March 10th, 2016


From: There’s a major long-term trend in the economy that isn’t getting enough attention by Rick Rieder.

From the post:

As the chart above shows, people in the U.S. today are adopting new technologies, including tablets and smartphones, at the swiftest pace we’ve seen since the advent of the television. However, while television arguably detracted from U.S. productivity, today’s advances in technology are generally geared toward greater efficiency at lower costs. Indeed, when you take into account technology’s downward influence on price, U.S. consumption and productivity figures look much better than headline numbers would suggest.

Hmmm, did you catch that?

…while television arguably detracted from U.S. productivity, today’s advances in technology are generally geared toward greater efficiency at lower costs.

Really? Rick must have missed the memo on how multitasking (one aspect of smart phones, tablets, etc.) lowers your IQ by 15 points. About what you would expect from smoking a joint.

If technology encourages multitasking, making us dumber, then we are becoming less efficient. Yes?

Imagine if, instead of scrolling past tweets with images of cats, food, and irrelevant messages every time you look at your Twitter timeline, you got the two or three tweets relevant to your job function.

Each of those not-on-task tweets chips away at the amount of attention span you have to spend on the two or three truly important tweets.

Apps that consolidate, filter and diminish information flow are the path to greater productivity.

Topic maps anyone?

Growthverse
Thursday, March 10th, 2016


From the webpage:

Growthverse was built for marketers, by marketers, with input from more than 100 CMOs.

Explore 800 marketing technology companies (and growing).

I originally arrived at this site here.

Interesting visualization that may result in suspects (they’re not prospects until you have serious discussions) for topic map based tools.

The site says the last update was in September 2015, so take heed that the data is about six months stale.

That said, it’s easier than hunting down the 800+ companies on your own.

Good hunting!

Wandora – New Release – 2016-03-08

Wednesday, March 9th, 2016

Wandora – New Release – 2016-03-08

The homepage reports:

New Wandora version has been released today (2016-03-08). The release adds Wandora support to MariaDB and PostgreSQL database topic maps. Wandora has now more stylish look, especially in Traditional topic map. The release fixes many known bugs.

I’m no style or UI expert but I’m not sure where I should be looking for the “…more stylish look….” 😉

From the main window:


If you select Tools and then Tools Manager (or Ctrl-t [lower case, contrary to the drop-down menu]), you will see a list of all tools (300+) with the message:

All known tools are listed here. The list contains also unfinished, buggy and depracated tools. Running such tool may cause exceptions and unpredictable behavior. We suggest you don’t run the tools listed here unless you really know what you are doing.

It is a very impressive set of tools!

There is no lack of places to explore in Wandora, and to explore with Wandora.


Overlay Journal – Discrete Analysis

Saturday, March 5th, 2016

The arXiv overlay journal Discrete Analysis has launched by Christian Lawson-Perfect.

From the post:

Discrete Analysis, a new open-access journal for articles which are “analytical in flavour but that also have an impact on the study of discrete structures”, launched this week. What’s interesting about it is that it’s an arXiv overlay journal founded by, among others, Timothy Gowers.

What that means is that you don’t get articles from Discrete Analysis – it just arranges peer review of papers held on the arXiv, cutting out almost all of the expensive parts of traditional journal publishing. I wasn’t really prepared for how shallow that makes the journal’s website – there’s a front page, and when you click on an article you’re shown a brief editorial comment with a link to the corresponding arXiv page, and that’s it.

But that’s all it needs to do – the opinion of Gowers and co. is that the only real value that journals add to the papers they publish is the seal of approval gained by peer review, so that’s the only thing they’re doing. Maths papers tend not to benefit from the typesetting services traditional publishers provide (or, more often than you’d like, are actively hampered by it).

One way the journal is adding value beyond a “yes, this is worth adding to the list of papers we approve of” is by providing an “editorial introduction” to accompany each article. These are brief notes, written by members of the editorial board, which introduce the topics discussed in the paper and provide some context, to help you decide if you want to read the paper. That’s a good idea, and it makes browsing through the articles – and this is something unheard of on the internet – quite pleasurable.

It’s not difficult to imagine “editorial introductions” with underlying mini-topic maps that could be explored on their own or that as you reach the “edge” of a particular topic map, it “unfolds” to reveal more associations/topics.

Not unlike a traditional street map of New York, which you can unfold to find general areas and then fold up to focus more tightly on a particular area.

I hesitate to say “zoom” because in the applications I have seen (important qualification), “zoom” uniformly reduces your field of view.

A more nuanced notion of “zoom,” for a topic map and perhaps for other maps as well, would be to hold portions of the current view stationary, say a starting point on an interstate highway and to “zoom” only a portion of the current view to show a detailed street map. That would enable the user to see a particular location while maintaining its larger context.

Pointers to applications that “zoom” but also maintain different levels of “zoom” in the same view? Given the fascination with “hairy” presentations of graphs, that would have to be a real winner.

Facts Before Policy? – Digital Security – Contacts – Volunteer Opportunity

Friday, March 4th, 2016

Rep. McCaul, Michael T. [R-TX-10] has introduced H.R.4651 – Digital Security Commission Act of 2016, full text here, a proposal to form the National Commission on Security and Technology Challenges.

From the proposal:

(2) To submit to Congress a report, which shall include, at a minimum, each of the following:

(A) An assessment of the issue of multiple security interests in the digital world, including public safety, privacy, national security, and communications and data protection, both now and throughout the next 10 years.

(B) A qualitative and quantitative assessment of—

(i) the economic and commercial value of cryptography and digital security and communications technology to the economy of the United States;

(ii) the benefits of cryptography and digital security and communications technology to national security and crime prevention;

(iii) the role of cryptography and digital security and communications technology in protecting the privacy and civil liberties of the people of the United States;

(iv) the effects of the use of cryptography and other digital security and communications technology on Federal, State, and local criminal investigations and counterterrorism enterprises;

(v) the costs of weakening cryptography and digital security and communications technology standards; and

(vi) international laws, standards, and practices regarding legal access to communications and data protected by cryptography and digital security and communications technology, and the potential effect the development of disparate, and potentially conflicting, laws, standards, and practices might have.

(C) Recommendations for policy and practice, including, if the Commission determines appropriate, recommendations for legislative changes, regarding—

(i) methods to be used to allow the United States Government and civil society to take advantage of the benefits of digital security and communications technology while at the same time ensuring that the danger posed by the abuse of digital security and communications technology by terrorists and criminals is sufficiently mitigated;

(ii) the tools, training, and resources that could be used by law enforcement and national security agencies to adapt to the new realities of the digital landscape;

(iii) approaches to cooperation between the Government and the private sector to make it difficult for terrorists to use digital security and communications technology to mobilize, facilitate, and operationalize attacks;

(iv) any revisions to the law applicable to wiretaps and warrants for digital data content necessary to better correspond with present and future innovations in communications and data security, while preserving privacy and market competitiveness;

(v) proposed changes to the procedures for obtaining and executing warrants to make such procedures more efficient and cost-effective for the Government, technology companies, and telecommunications and broadband service providers; and

(vi) any steps the United States could take to lead the development of international standards for requesting and obtaining digital evidence for criminal investigations and prosecutions from a foreign, sovereign State, including reforming the mutual legal assistance treaty process, while protecting civil liberties and due process.

Excuse the legalese, but this is clearly an effort that could provide a factual, as opposed to fantasy, basis for further action on digital security. No one can guarantee a sensible result, but without a factual basis, any legislation is certainly going to be wrong.

For your convenience and possible employment/volunteering, here are the co-sponsors of this bill, with hyperlinks to their congressional homepages:

Now would be a good time to pitch yourself for involvement in this possible commission.

Pay attention to Section 8 of the bill:

SEC. 8. Staff.

(a)Appointment.—The chairman and vice chairman shall jointly appoint and fix the compensation of an executive director and of and such other personnel as may be necessary to enable the Commission to carry out its functions under this Act.

(b)Security clearances.—The appropriate Federal agencies or departments shall cooperate with the Commission in expeditiously providing appropriate security clearances to Commission staff, as may be requested, to the extent possible pursuant to existing procedures and requirements, except that no person shall be provided with access to classified information without the appropriate security clearances.

(c)Detailees.—Any Federal Government employee may be detailed to the Commission on a reimbursable basis, and such detailee shall retain without interruption the rights, status, and privileges of his or her regular employment.

(d)Expert and consultant services.—The Commission is authorized to procure the services of experts and consultants in accordance with section 3109 of title 5, United States Code, but at rates not to exceed the daily rate paid a person occupying a position level IV of the Executive Schedule under section 5315 of title 5, United States Code.

(e)Volunteer services.—Notwithstanding section 1342 of title 31, United States Code, the Commission may accept and use voluntary and uncompensated services as the Commission determines necessary.

I can sense you wondering what:

the daily rate paid a person occupying a position level IV of the Executive Schedule under section 5315 of title 5, United States Code

means, in practical terms.

I was tempted to point you to: 5 U.S. Code § 5315 – Positions at level IV, but that would be cruel and uninformative. 😉

I did track down the Executive Schedule, which lists position level IV:

Annual: $160,300, or about $80.15/hour (assuming a 2,000-hour work year), which comes to $641.20 for an 8-hour day.

If you do volunteer or get a paying gig, please remember that omission and/or manipulation of subject identity properties can render otherwise open data opaque.

I don’t know of any primes (prime contractors) making money off of topic maps currently, so it isn’t likely to gain traction with current primes. On the other hand, new primes can and do appear. Not often, but it happens.

PS: If the extra links to contacts, content, etc. are helpful, please let me know. I started off reading the link-poor RSA 2016: McCaul calls backdoors ineffective, pushes for tech panel to solve security issues, which is little more than debris on your information horizon.

11 Million Pages of CIA Files [+ Allen Dulles, war criminal]

Thursday, March 3rd, 2016

11 Million Pages of CIA Files May Soon Be Shared By This Kickstarter by Joseph Cox.

From the post:

Millions of pages of CIA documents are stored in Room 3000. The CIA Records Search Tool (CREST), the agency’s database of declassified intelligence files, is only accessible via four computers in the National Archives Building in College Park, MD, and contains everything from Cold War intelligence, research and development files, to images.

Now one activist is aiming to get those documents more readily available to anyone who is interested in them, by methodically printing, scanning, and then archiving them on the internet.

“It boils down to freeing information and getting as much of it as possible into the hands of the public, not to mention journalists, researchers and historians,” Michael Best, analyst and freedom of information activist told Motherboard in an online chat.

Best is trying to raise $10,000 on Kickstarter in order to purchase the high speed scanner necessary for such a project, a laptop, office supplies, and to cover some other costs. If he raises more than the main goal, he might be able to take on the archiving task full-time, as well as pay for FOIAs to remove redactions from some of the files in the database. As a reward, backers will help to choose what gets archived first, according to the Kickstarter page.

“Once those “priority” documents are done, I’ll start going through the digital folders more linearly and upload files by section,” Best said. The files will be hosted on the Internet Archive, which converts documents into other formats too, such as for Kindle devices, and sometimes text-to-speech for e-books. The whole thing has echoes of Cryptome—the freedom of information duo John Young and Deborah Natsios, who started off scanning documents for the infamous cypherpunk mailing list in the 1990s.

Good news! Kickstarter has announced this project funded!

Additional funding will help make this archive of documents available sooner rather than later.

As opposed to an attempt to boil the ocean of 11 million pages of CIA files, what about smaller topic mapping/indexing projects that focus on bounded sub-sets of documents of interest to particular communities?

I don’t have any interest in the STAR GATE project (clairvoyance, precognition, or telepathy, continued now by the DHS at airport screening facilities) but would be very interested in the records of Allen Dulles, a war criminal of some renown.

Just so you know, Michael has already uploaded documents on Allen Dulles from the CIA Records Search Tool (CREST) tool:

History of Allen Welsh Dulles as CIA Director – Volume I: The Man

History of Allen Welsh Dulles as CIA Director – Volume II: Coordination of Intelligence

History of Allen Welsh Dulles as CIA Director – Volume III: Covert Activities

History of Allen Welsh Dulles as CIA Director – Volume IV: Congressional Oversight and Internal Administration

History of Allen Welsh Dulles as CIA Director – Volume V: Intelligence Support of Policy

To describe Allen Dulles as a war criminal is no hyperbole. The overthrow of President Jacobo Arbenz Guzman of Guatemala (think United Fruit Company) and the removal of Mohammad Mossadeq, prime minister of Iran (think Shah of Iran), are only two of his crimes, the full extent of which will probably never be known.

Files are being uploaded to That 1 Archive.

Graph Encryption: Going Beyond Encrypted Keyword Search [Subject Identity Based Encryption]

Wednesday, March 2nd, 2016

Graph Encryption: Going Beyond Encrypted Keyword Search by Xianrui Meng.

From the post:

Encrypted search has attracted a lot of attention from practitioners and researchers in academia and industry. In previous posts, Seny already described different ways one can search on encrypted data. Here, I would like to discuss search on encrypted graph databases which are gaining a lot of popularity.

1. Graph Databases and Graph Privacy

As today’s data is getting bigger and bigger, traditional relational database management systems (RDBMS) cannot scale to the massive amounts of data generated by end users and organizations. In addition, RDBMSs cannot effectively capture certain data relationships; for example in object-oriented data structures which are used in many applications. Today, NoSQL (Not Only SQL) has emerged as a good alternative to RDBMSs. One of the many advantages of NoSQL systems is that they are capable of storing, processing, and managing large volumes of structured, semi-structured, and even unstructured data. NoSQL databases (e.g., document stores, wide-column stores, key-value (tuple) store, object databases, and graph databases) can provide the scale and availability needed in cloud environments.

In an Internet-connected world, graph database have become an increasingly significant data model among NoSQL technologies. Social networks (e.g., Facebook, Twitter, Snapchat), protein networks, electrical grid, Web, XML documents, networked systems can all be modeled as graphs. One nice thing about graph databases is that they store the relations between entities (objects) in addition to the entities themselves and their properties. This allows the search engine to navigate both the data and their relationships extremely efficiently. Graph databases rely on the node-link-node relationship, where a node can be a profile or an object and the edge can be any relation defined by the application. Usually, we are interested in the structural characteristics of such a graph databases.

What do we mean by the confidentiality of a graph? And how to do we protect it? The problem has been studied by both the security and database communities. For example, in the database and data mining community, many solutions have been proposed based on graph anonymization. The core idea here is to anonymize the nodes and edges in the graph so that re-identification is hard. Although this approach may be efficient, from a security point view it is hard to tell what is achieved. Also, by leveraging auxiliary information, researchers have studied how to attack this kind of approach. On the other hand, cryptographers have some really compelling and provably-secure tools such as ORAM and FHE (mentioned in Seny’s previous posts) that can protect all the information in a graph database. The problem, however, is their performance, which is crucial for databases. In today’s world, efficiency is more than running in polynomial time; we need solutions that run and scale to massive volumes of data. Many real world graph datasets, such as biological networks and social networks, have millions of nodes, some even have billions of nodes and edges. Therefore, besides security, scalability is one of main aspects we have to consider.

2. Graph Encryption

Previous work in encrypted search has focused on how to search encrypted documents, e.g., doing keyword search, conjunctive queries, etc. Graph encryption, on the other hand, focuses on performing graph queries on encrypted graphs rather than keyword search on encrypted documents. In some cases, this makes the problem harder since some graph queries can be extremely complex. Another technical challenge is that the privacy of nodes and edges needs to be protected but also the structure of the graph, which can lead to many interesting research directions.

Graph encryption was introduced by Melissa Chase and Seny in [CK10]. That paper shows how to encrypt graphs so that certain graph queries (e.g., neighborhood, adjacency and focused subgraphs) can be performed (though the paper is more general as it describes structured encryption). Seny and I, together with Kobbi Nissim and George Kollios, followed this up with a paper last year [MKNK15] that showed how to handle more complex graphs queries.

Apologies for the long quote but I thought this topic might be new to some readers. Xianrui goes on to describe a solution for efficient queries over encrypted graphs.

Chase and Kamara remark in Structured Encryption and Controlled Disclosure, CK10:

To address this problem we introduce the notion of structured encryption. A structured encryption scheme encrypts structured data in such a way that it can be queried through the use of a query-specific token that can only be generated with knowledge of the secret key. In addition, the query process reveals no useful information about either the query or the data. An important consideration in this context is the efficiency of the query operation on the server side. In fact, in the context of cloud storage, where one often works with massive datasets, even linear time operations can be infeasible. (emphasis in original)

With just a little nudging, their:

A structured encryption scheme encrypts structured data in such a way that it can be queried through the use of a query-specific token that can only be generated with knowledge of the secret key.

could be re-stated as:

A subject identity encryption scheme leaves out merging data in such a way that the resulting topic map can only be queried with knowledge of the subject identity merging key.

You may have topics that represent diagnoses such as cancer, AIDS, sexual contacts, but if none of those can be associated with individuals who are also topics in the map, there is no more disclosure than census results for a metropolitan area and a list of the citizens therein.

That is, you are missing the critical merging data that would link up (associate) any diagnosis with a given individual.

Multi-property subject identities would make the problem even harder, to say nothing of conferring properties on the basis of supplied properties as part of the merging process.

One major benefit of a subject identity based approach is that without the merging key, any data set, however sensitive the information, is just a data set, until you have the basis for solving its subject identity riddle.
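As a purely illustrative sketch (this is my toy construction, not Chase and Kamara's scheme), a merging key could be derived by keying a hash over a topic's subject identity properties. Topics with the same properties then produce the same key, so they merge, but only for holders of the secret:

```python
import hmac
import hashlib

def merging_key(secret: bytes, properties: dict) -> str:
    # Canonicalize the properties so property order doesn't matter,
    # then derive the merging key with an HMAC under the secret.
    canonical = "|".join(f"{k}={v}" for k, v in sorted(properties.items()))
    return hmac.new(secret, canonical.encode(), hashlib.sha256).hexdigest()

secret = b"held only by parties allowed to merge"
a = merging_key(secret, {"symptom": "pain", "specific-location": "tooth"})
b = merging_key(secret, {"specific-location": "tooth", "symptom": "pain"})
c = merging_key(secret, {"symptom": "pain", "specific-location": "tongue"})
# a == b: the same subject merges regardless of property order
# a != c: a different subject ("tongue") does not merge
```

Without the secret, the map holds topics but no basis for associating them, which is the point.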

PS: With the usual caveats of not using social security numbers, birth dates and the like as your subject identity properties. At least not in the map proper. I can think of several ways to generate keys for merging that would be resistant to even brute force attacks.

Ping me if you are interested in pursuing that on a data set.

Earthdata Search – Smells Like A Topic Map?*

Sunday, February 28th, 2016

Earthdata Search

From the webpage:

Search NASA Earth Science data by keyword and filter by time or space.

After choosing tour:

Keyword Search

Here you can enter search terms to find relevant data. Search terms can be science terms, instrument names, or even collection IDs. Let’s start by searching for Snow Cover NRT to find near real-time snow cover data. Type Snow Cover NRT in the keywords box and press Enter.

Which returns a screen in three sections, left to right: Browse Collections, 21 Matching Collections (Add collections to your project to compare and retrieve their data), and the third section displays a world map (navigate by grabbing the view)

Under Browse Collections:

In addition to searching for keywords, you can narrow your search through this list of terms. Click Platform to expand the list of platforms (still in a tour box)

Next step:

Now click Terra to select the Terra satellite.

Comment: Wondering how I will know which “platform” or “instrument” to select? There may be more/better documentation but I haven’t seen it yet.

The data follows the Unified Metadata Model (UMM):

NASA’s Common Metadata Repository (CMR) is a high-performance, high-quality repository for earth science metadata records that is designed to handle metadata at the Concept level. Collections and Granules are common metadata concepts in the Earth Observation (EO) world, but this can be extended out to Visualizations, Parameters, Documentation, Services, and more. The CMR metadata records are supplied by a diverse array of data providers, using a variety of supported metadata standards, including:


Initially, designers of the CMR considered standardizing all CMR metadata to a single, interoperable metadata format – ISO 19115. However, NASA decided to continue supporting multiple metadata standards in the CMR — in response to concerns expressed by the data provider community over the expense involved in converting existing metadata systems to systems capable of generating ISO 19115. In order to continue supporting multiple metadata standards, NASA designed a method to easily translate from one supported standard to another and constructed a model to support the process. Thus, the Unified Metadata Model (UMM) for EOSDIS metadata was born as part of the EOSDIS Metadata Architecture Studies (MAS I and II) conducted between 2012 and 2013.

What is the UMM?

The UMM is an extensible metadata model which provides a ‘Rosetta stone’ or cross-walk for mapping between CMR-supported metadata standards. Rather than create mappings from each CMR-supported metadata standard to each other, each standard is mapped centrally to the UMM model, thus reducing the number of translations required from n x (n-1) to 2n.
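The quoted reduction is just counting: with n standards, direct translation needs one mapping for each ordered pair of distinct standards, while the hub model needs only one mapping into and one out of the UMM per standard. A quick sketch:

```python
def pairwise_mappings(n):
    """Direct translations: one per ordered pair of distinct standards."""
    return n * (n - 1)

def hub_mappings(n):
    """Hub translations: one into and one out of the UMM per standard."""
    return 2 * n

for n in (3, 5, 10):
    print(n, pairwise_mappings(n), hub_mappings(n))
# 3 6 6    -- break-even at n = 3
# 5 20 10
# 10 90 20
```

The hub only pays off once you support more than three standards, which the CMR does.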

Here is the mapping graphic:


Granted, the profiles don’t make the basis for the mappings explicit, but the mappings have the same impact post-mapping as a topic map would post-merging.

The site could use better documentation for the interface and data, at least in the view of this non-expert in the area.

Thoughts on documentation for the interface or making the mapping more robust via use of a topic map?

I first saw this in a tweet by Kirk Borne.

*Smells Like A Topic Map – Sorry, culture bound reference to a routine on the first Cheech & Chong album. No explanation would do it justice.

Challenges of Electronic Dictionary Publication

Wednesday, February 17th, 2016

Challenges of Electronic Dictionary Publication

From the webpage:

April 8-9th, 2016

Venue: University of Leipzig, GWZ, Beethovenstr. 15; H1.5.16

This April we will be hosting our first Dictionary Journal workshop. At this workshop we will give an introduction to our vision of „Dictionaria“, introduce our data model and current workflow and will discuss (among others) the following topics:

  • Methodology and concept: How are dictionaries of „small“ languages different from those of „big“ languages and what does this mean for our endeavour? (documentary dictionaries vs. standard dictionaries)
  • Reviewing process and guidelines: How to review and evaluate a dictionary database of minor languages?
  • User-friendliness: What are the different audiences and their needs?
  • Submission process and guidelines: reports from us and our first authors on how to submit and what to expect
  • Citation: How to cite dictionaries?

If you are interested in attending this event, please send an e-mail to dictionary.journal[AT]

Workshop program

Our workshop program can now be downloaded here.

See the webpage for a list of confirmed participants, some with submitted abstracts.

Any number of topic map related questions arise in a discussion of dictionaries.

  • How to represent dictionary models?
  • What properties should be used to identify the subjects that represent dictionary models?
  • On what basis, if any, should dictionary models be considered the same or different? And for what purposes?
  • What data should be captured by dictionaries and how should it be identified?
  • etc.

Those are only a few of the questions that could be refined into dozens, if not hundreds, more when you reach the details of constructing a dictionary.

I won’t be attending but await the output from this workshop with great anticipation!

Topic Maps: On the Cusp of Success (Curate in Place/Death of ETL?)

Tuesday, February 9th, 2016

The Bright Future of Semantic Graphs and Big Connected Data by Alex Woodie.

From the post:

Semantic graph technology is shaping up to play a key role in how organizations access the growing stores of public data. This is particularly true in the healthcare space, where organizations are beginning to store their data using so-called triple stores, often defined by the Resource Description Framework (RDF), which is a model for storing metadata created by the World Wide Web Consortium (W3C).

One person who’s bullish on the prospects for semantic data lakes is Shawn Dolley, Cloudera’s big data expert for the health and life sciences market. Dolley says semantic technology is on the cusp of breaking out and being heavily adopted, particularly among healthcare providers and pharmaceutical companies.

“I have yet to speak with a large pharmaceutical company where there’s not a small group of IT folks who are working on the open Web and are evaluating different technologies to do that,” Dolley says. “These are visionaries who are looking five years out, and saying we’re entering a world where the only way for us to scale….is to not store it internally. Even with Hadoop, the data sizes are going to be too massive, so we need to learn and think about how to federate queries.”

By storing healthcare and pharmaceutical data as semantic triples using graph databases such as Franz’s AllegroGraph, it can dramatically lower the hurdles to accessing huge stores of data stored externally. “Usually the primary use case that I see for AllegroGraph is creating a data fabric or a data ecosystem where they don’t have to pull the data internally,” Dolley tells Datanami. “They can do seamless queries out to data and curate it as it sits, and that’s quite appealing.”


This is leading-edge stuff, and there are few mission-critical deployments of semantic graph technologies being used in the real world. However, there are a few of them, and the one that keeps popping up is the one at Montefiore Health System in New York City.

Montefiore is turning heads in the healthcare IT space because it was the first hospital to construct a “longitudinally integrated, semantically enriched” big data analytic infrastructure in support of “next-generation learning healthcare systems and precision medicine,” according to Franz, which supplied the graph database at the heart of the health data lake. Cloudera’s free version of Hadoop provided the distributed architecture for Montefiore’s semantic data lake (SDL), while other components and services were provided by tech big wigs Intel (NASDAQ: INTC) and Cisco Systems (NASDAQ: CSCO).

This approach to building an SDL will bring about big improvements in healthcare, says Dr. Parsa Mirhaji, MD, PhD, the director of clinical research informatics at Einstein College of Medicine and Montefiore Health System.

“Our ability to conduct real-time analysis over new combinations of data, to compare results across multiple analyses, and to engage patients, practitioners and researchers as equal partners in big-data analytics and decision support will fuel discoveries, significantly improve efficiencies, personalize care, and ultimately save lives,” Dr. Mirhaji says in a press release. (emphasis added)

If I hadn’t known better, reading passages like:

the only way for us to scale….is to not store it internally

learn and think about how to federate queries

seamless queries out to data and curate it as it sits

I would have sworn I was reading a promotion piece for topic maps!

Of course, it doesn’t mention how to discover valuable data not written in your terminology, but you have to hold something back for the first presentation to the CIO.

The growth of data sets too large for ETL is icing on the cake for topic maps.

Why ETL when the data “appears” as I choose to view it? My topic map may be quite small, at least in relationship to the data set proper.


OK, truth-in-advertising moment, it won’t be quite that easy!

And I don’t take small bills. 😉 Diamonds, other valuable commodities, foreign deposit arrangements can be had.

People are starting to think in a “topic mappish” sort of way. Or at least a way where topic maps deliver what they are looking for.

That’s the key: What do they want?

Then use a topic map to deliver it.

Sunlight launches Hall of Justice… [ Topic Map “like” features?]

Tuesday, February 2nd, 2016

Sunlight launches Hall of Justice, a massive data inventory on criminal justice across the U.S. by Josh Stewart.

From the post:

Today, Sunlight is launching Hall of Justice, a robust, searchable data inventory of nearly 10,000 datasets and research documents from across all 50 states, the District of Columbia and the federal government. Hall of Justice is the culmination of 18 months of work gathering data and refining technology.

The process was no easy task: Building Hall of Justice required manual entry of publicly available data sources from a multitude of locations across the country.

Sunlight’s team went from state to state, meeting and calling local officials to inquire about and find data related to criminal justice. Some states like California have created a data portal dedicated to making criminal justice data easily accessible to the public; others had their data buried within hard to find websites. We also found data collected by state departments of justice, police forces, court systems, universities and everything in between.

“Data is shaping the future of how we address some of our most pressing problems,” said John Wonderlich, executive director of the Sunlight Foundation. “This new resource is an experiment in how a robust snapshot of data can inform policy and research decisions.”

In addition to being a great data collection, the Hall of Justice attempts to deliver topic map like capability for searches:

The resource attempts to consolidate different terminology across multiple states, which is far from uniform or standardized. For example, if you search solitary confinement you will return results for data around solitary confinement, but also for the terms “segregated housing unit,” “SHU,” “administrative segregation” and “restrictive housing.” This smart search functionality makes finding datasets much easier and accessible.
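That kind of synonym expansion can be sketched in a few lines (the terms come from the quote above; the grouping and code are illustrative, not Sunlight's implementation):

```python
# Each group holds terms treated as the same subject for search purposes.
SYNONYM_GROUPS = [
    {"solitary confinement", "segregated housing unit", "shu",
     "administrative segregation", "restrictive housing"},
]

def expand(query):
    """Return the query term plus all synonyms from its group, if any."""
    q = query.lower()
    for group in SYNONYM_GROUPS:
        if q in group:
            return group
    return {q}

def search(query, documents):
    """A document matches if it contains any term in the expanded query."""
    terms = expand(query)
    return [d for d in documents if any(t in d.lower() for t in terms)]

docs = ["Report on administrative segregation in state prisons",
        "Annual budget overview"]
print(search("Solitary Confinement", docs))
# matches the first document via a synonym, not the literal query
```

Note that, as with PubMed's MeSH expansion, the mapping works but records nothing about *why* the terms are treated as the same subject.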


Looking at all thirteen results for a search on “solitary confinement,” I don’t see the mapping in question. Or certainly no mapping based on characteristics of the subject, “solitary confinement.”

The closest match is Georgia’s 2013 Juvenile Justice Reform, which uses the word “restrictive,” as in:

Create a two-class system within the Designated Felony Act. Designated felony offenses are divided into two classes, based on severity—Class A and Class B—that continue to allow restrictive custody while also adjusting available sanctions to account for both offense severity and risk level.

Restrictive custody is what jail systems are about so that doesn’t trip the wire for “solitary confinement.”

Of course, the links are to entire reports/documents/data sets so each researcher will have to extract and collate content individually. When that happens, a means to contribute that collation/mapping to the Hall of Justice would be a boon for other researchers. (Can you say “topic map?”)

As I write this, you will need to use Firefox rather than Chrome, at least on Ubuntu.

Trigger Warning: If you are sensitive to traumatic events and/or reports of traumatic events, you may want to ask someone less sensitive to review these data sources.

The only difference between a concentration camp and American prisons is the lack of mass gas chambers. Every horror and abuse that you can imagine and some you probably can’t, are visited on people in U.S. prisons everyday.

As Joan Baez sings in Prison Trilogy:

And we’re gonna raze, raze the prisons

To the ground

Help us raze, raze the prisons

To the ground

Sunlight’s Hall of Justice is a great step forward in documenting the chambers of horror we call American prisons.

Are you ready?

3 Decades of High Quality News! (code reuse)

Monday, February 1st, 2016

‘NewsHour’ archives to be digitized and available online by Dru Sefton.

From the post:

More than three decades of NewsHour are heading to an online home, the American Archive of Public Broadcasting.

Nearly 10,000 episodes that aired from 1975 to 2007 will be archived through a collaboration among AAPB; WGBH in Boston; WETA in Arlington, Va.; and the Library of Congress. The organizations jointly announced the project Thursday.

“The project will take place over the next year and a half,” WGBH spokesperson Emily Balk said. “The collection should be up in its entirety by mid-2018, but AAPB will be adding content from the collection to its website on an ongoing, monthly basis.”

Looking forward to that collection!

Useful on its own, but even more so if you had an indexical object that could point to a subject in a PBS news episode and at the same time, point to episodes on the same subject from other TV and radio news archives, not to mention the same subject in newspapers and magazines.

Oh, sorry, that would be a topic in ISO 13250-2 parlance and the more general concept of a proxy in ISO 13250-5. Thought I should mention that before someone at IBM runs off to patent another pre-existing idea.

I don’t suppose padding patent statistics hurts all that much, considering that the Supremes are poised to invalidate process and software patents in one fell swoop.

Hopefully economists will be ready to measure the amount of increased productivity (legal worries about and enforcement of process/software patents aren’t productive activities) from foreclosing even the potential of process or software patents.

Copyright is more than sufficient to protect source code, as though any programmer would use another programmer’s code. They say a scientist would rather use another scientist’s toothbrush than his vocabulary.

Copying another programmer’s code (code re-use) is more akin to sharing a condom. It’s just not done.

Writing Clickbait TopicMaps?

Wednesday, January 20th, 2016

‘Shocking Celebrity Nip Slips’: Secrets I Learned Writing Clickbait Journalism by Kate Lloyd.

I’m sat at a desk in a glossy London publishing house. On the floors around me, writers are working on tough investigations and hard news. I, meanwhile, am updating a feature called “Shocking celebrity nip-slips: boobs on the loose.” My computer screen is packed with images of tanned reality star flesh as I write captions in the voice of a strip club announcer: “Snooki’s nunga-nungas just popped out to say hello!” I type. “Whoops! Looks like Kim Kardashian forgot to wear a bra today!”

Back in 2013, I worked for a women’s celebrity news website. I stumbled into the industry at a time when online editors were panicking: Their sites were funded by advertisers who demanded that as many people as possible viewed stories. This meant writing things readers loved and shared, but also resorting to shadier tactics. With views dwindling, publications like mine often turned to the gospel of search engine optimisation, also known as SEO, for guidance.

Like making a deal with a highly-optimized devil, relying heavily on SEO to push readers to websites has a high moral price for publishers. When it comes to female pop stars and actors, people are often more likely to search for the celebrity’s name with the words “naked,” “boobs,” “butt,” “weight,” and “bikini” than with the names of their albums or movies. Since 2008, “Miley Cyrus naked” has been consistently Googled more than “Miley Cyrus music,” “Miley Cyrus album,” “Miley Cyrus show,” and “Miley Cyrus Instagram.” Plus, “Emma Watson naked” has been Googled more than “Emma Watson movie” since she was 15. In fact, “Emma Watson feet” gets more search traffic than “Emma Watson style,” which might explain why one women’s site has a fashion feature called “Emma Watson is an excellent foot fetish candidate.”

If you don’t know what other people are searching for, try these two resources on Google Trends:

Hacking the Google Trends API (2014)

PyTrends – Pseudo API for Google Trends (Updated six days ago)

Depending on your sensibilities, you could collect content on celebrities into a topic map and, when their searches spike, release links to the new material, saving readers the time of locating older content.

That might even be a viable business model.


Intuitionism and Constructive Mathematics 80-518/818 — Spring 2016

Saturday, January 9th, 2016

Intuitionism and Constructive Mathematics 80-518/818 — Spring 2016

From the course description:

In this seminar we shall read primary and secondary sources on the origins and developments of intuitionism and constructive mathematics from Brouwer and the Russian constructivists, Bishop, Martin-Löf, up to and including modern developments such as homotopy type theory. We shall focus both on philosophical and metamathematical aspects. Topics could include the Brouwer-Heyting-Kolmogorov (BHK) interpretation, Kripke semantics, topological semantics, the Curry-Howard correspondence with constructive type theories, constructive set theory, realizability, relations to topos theory, formal topology, meaning explanations, homotopy type theory, and/or additional topics according to the interests of participants.


  • Jean van Heijenoort (1967), From Frege to Gödel: A Source Book in Mathematical Logic 1879–1931, Cambridge, MA: Harvard University Press.
  • Michael Dummett (1977/2000), Elements of Intuitionism (Oxford Logic Guides, 39), Oxford: Clarendon Press, 1977; 2nd edition, 2000.
  • Michael Beeson (1985), Foundations of Constructive Mathematics, Heidelberg: Springer Verlag.
  • Anne Sjerp Troelstra and Dirk van Dalen (1988), Constructivism in Mathematics: An Introduction (two volumes), Amsterdam: North Holland.

Additional resources

Not online but a Spring course at Carnegie Mellon with a reading list that should exercise your mental engines!

Any subject with a two volume “introduction” (Anne Sjerp Troelstra and Dirk van Dalen), is likely to be heavy sledding. 😉

But the immediate relevance to topic maps is evident by this statement from Rosalie Iemhoff:

Intuitionism is a philosophy of mathematics that was introduced by the Dutch mathematician L.E.J. Brouwer (1881–1966). Intuitionism is based on the idea that mathematics is a creation of the mind. The truth of a mathematical statement can only be conceived via a mental construction that proves it to be true, and the communication between mathematicians only serves as a means to create the same mental process in different minds.

I would recast that to say:

Language is a creation of the mind. The truth of a language statement can only be conceived via a mental construction that proves it to be true, and the communication between people only serves as a means to create the same mental process in different minds.

There are those who claim there is some correspondence between language and something they call “reality.” Since no one has experienced “reality” in the absence of language, I prefer to ask: Is X useful for purpose Y? rather than the doubtful metaphysics of “Is X true?”

Think of it as helping get down to what’s really important, what’s in this for you?

BTW, don’t be troubled by anyone who suggests this position removes all limits on discussion. What motivations do you think caused people to adopt the varying positions they have now?

It certainly wasn’t a detached and disinterested search for the truth, whatever people may pretend once they have found the “truth” they are presently defending. The same constraints will persist even if we are truthful with ourselves.

Math Translator Wanted/Topic Map Needed: Mochizuki and the ABC Conjecture

Monday, January 4th, 2016

What if you Discovered the Answer to a Famous Math Problem, but No One was able to Understand It? by Kevin Knudson.

From the post:

The conjecture is fairly easy to state. Suppose we have three positive integers a,b,c satisfying a+b=c and having no prime factors in common. Let d denote the product of the distinct prime factors of the product abc. Then the conjecture asserts roughly there are only finitely many such triples with c > d. Or, put another way, if a and b are built up from small prime factors then c is usually divisible only by large primes.

Here’s a simple example. Take a=16, b=21, and c=37. In this case, d = 2x3x7x37 = 1554, which is greater than c. The ABC conjecture says that this happens almost all the time. There is plenty of numerical evidence to support the conjecture, and most experts in the field believe it to be true. But it hasn’t been mathematically proven — yet.
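The arithmetic in Knudson’s example is easy to check by machine. A minimal sketch, assuming nothing beyond the quoted definitions (d is the product of the distinct prime factors of abc):

```python
from math import gcd

def radical(n):
    """Product of the distinct prime factors of n."""
    d, p = 1, 2
    while p * p <= n:
        if n % p == 0:
            d *= p
            while n % p == 0:
                n //= p
        p += 1
    if n > 1:          # leftover prime factor
        d *= n
    return d

a, b, c = 16, 21, 37
assert a + b == c and gcd(a, b) == 1   # conditions from the conjecture
d = radical(a * b * c)
print(d)   # 1554, which is greater than c = 37
```

Triples with c > d are the rare ones the conjecture says occur only finitely often.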

Enter Mochizuki. His papers develop a subject he calls Inter-Universal Teichmüller Theory, and in this setting he proves a vast collection of results that culminate in a putative proof of the ABC conjecture. Full of definitions and new terminology invented by Mochizuki (there’s something called a Frobenioid, for example), almost everyone who has attempted to read and understand it has given up in despair. Add to that Mochizuki’s odd refusal to speak to the press or to travel to discuss his work and you would think the mathematical community would have given up on the papers by now, dismissing them as unlikely to be correct. And yet, his previous work is so careful and clever that the experts aren’t quite ready to give up.

It’s not clear what the future holds for Mochizuki’s proof. A small handful of mathematicians claim to have read, understood and verified the argument; a much larger group remains completely baffled. The December workshop reinforced the community’s desperate need for a translator, someone who can explain Mochizuki’s strange new universe of ideas and provide concrete examples to illustrate the concepts. Until that happens, the status of the ABC conjecture will remain unclear.

It’s hard to imagine a more classic topic map problem.

At some point, Shinichi Mochizuki shared a common vocabulary with his colleagues in number theory and arithmetic geometry but no longer.

As Kevin points out:

The December workshop reinforced the community’s desperate need for a translator, someone who can explain Mochizuki’s strange new universe of ideas and provide concrete examples to illustrate the concepts.

Taking Mochizuki’s present vocabulary and working backwards to where he shared a common vocabulary with his colleagues is simple enough to say.

The crux of the problem is that the discussions will be fragmented, distributed across a variety of formal and informal venues.

Combining those discussions to construct a path back to where most number theorists reside today would require something with as few starting assumptions as possible.

It would have to let you describe as much or as little about new subjects, and their relations to other subjects, as an expert audience needs to fill in any gaps.

I’m not qualified to venture an opinion on the conjecture or Mochizuki’s proof but the problem of mapping from new terminology that has its own context back to “standard” terminology is a problem uniquely suited to topic maps.
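As a sketch of what such a backwards path might look like as data (the terms and glosses below are invented placeholders, not actual mathematics), each new term maps to an earlier, more widely shared term, with the basis for the mapping recorded:

```python
# Hypothetical term map: invented terms, for shape only.
term_map = {
    "new_term_A": {"maps_to": "standard_term_X", "basis": "defn 1.2 of paper I"},
    "new_term_B": {"maps_to": "new_term_A", "basis": "remark after thm 3"},
}

def trace(term, term_map):
    """Follow mappings back until a term with no further mapping is reached."""
    path = [term]
    while term in term_map:
        term = term_map[term]["maps_to"]
        path.append(term)
    return path

print(trace("new_term_B", term_map))
# ['new_term_B', 'new_term_A', 'standard_term_X']
```

The point is not the data structure but that the basis for each step survives, so others can audit or extend the path.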

Going Viral in 2016

Tuesday, December 29th, 2015

How To Go Viral: Lessons From The Most Shared Content of 2015 by Steve Rayson.

I offer this as at least as amusing as it may be useful.

The topic element of a viral post is said to include:

Trending topic (e.g. Zombies), Health & fitness, Cats & Dogs, Babies, Long Life, Love

Hard to get any of those into a technical blog, but I could try:

TM’s produce healthy and fit ED-free 90 year-old bi-sexuals with dogs & cats as pets who love all non-Zombies.

That’s 110 characters if you are counting.

Produce random variations on that until I find one that goes viral. 😉

But, I have never cared for click-bait or false advertising. Personally I find it insulting when marketers falsify research.

I may have to document some of those cases in 2016. There is no shortage of it.

None of my tweets may go viral in 2016 but Steve’s post will make it more likely they will be re-tweeted.

Feel free to re-use my suggested tweet as I am fairly certain that “…healthy and fit ED-free 90 year-old bi-sexuals…” is in the public domain.

Apache Ignite – In-Memory Data Fabric – With No Semantics

Friday, December 25th, 2015

I saw a tweet from the Apache Ignite project pointing to its contributors page: Start Contributing.

The documentation describes Apache Ignite™ as:

Apache Ignite™ In-Memory Data Fabric is a high-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real-time, orders of magnitude faster than possible with traditional disk-based or flash-based technologies.

If you think that is impressive, here’s a block representation of Ignite:


Or a more textual view:

You can view Ignite as a collection of independent, well-integrated, in-memory components geared to improve performance and scalability of your application. Some of these components include:

Imagine my surprise when a search on “semantics” said:

“No Results Found.”

Even before there is data whose semantics could be documented, there should be hooks for documenting the semantics of future data.

I’m not advocating that Apache Ignite jury-rig some means of documenting the semantics of data and Ignite processes.

The need for semantic documentation varies: what is sufficient for one case will be wholly inadequate for another. Not to mention that documentation and semantics often require different skills than those possessed by most developers.

What semantics do you need documented with your Apache Ignite installation?
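For what it’s worth, a documentation hook need not be elaborate. A hypothetical sketch (this is not an Ignite API, just a sidecar registry keyed by cache and field name):

```python
# Hypothetical "sidecar" registry documenting the semantics of fields
# stored in a cache, since the platform itself offers no such hook.
semantics = {}

def document(cache, field, meaning, units=None):
    """Record what a field in a given cache actually means."""
    semantics[(cache, field)] = {"meaning": meaning, "units": units}

document("trades", "px", "execution price", units="USD")
document("trades", "qty", "signed quantity; negative = sell")

print(semantics[("trades", "px")]["units"])   # USD
```

Even something this small preserves knowledge that otherwise lives only in a developer’s head.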

What’s New for 2016 MeSH

Thursday, December 17th, 2015

What’s New for 2016 MeSH by Jacque-Lynne Schulman.

From the post:

MeSH is the National Library of Medicine controlled vocabulary thesaurus which is updated annually. NLM uses the MeSH thesaurus to index articles from thousands of biomedical journals for the MEDLINE/PubMed database and for the cataloging of books, documents, and audiovisuals acquired by the Library.

MeSH experts/users will need to absorb the details but some of the changes include:

Overview of Vocabulary Development and Changes for 2016 MeSH

  • 438 Descriptors added
  • 17 Descriptor terms replaced with more up-to-date terminology
  • 9 Descriptors deleted
  • 1 Qualifier (Subheading) deleted


MeSH Tree Changes: Uncle vs. Nephew Project

In the past, MeSH headings were loosely organized in trees and could appear in multiple locations depending upon the importance and specificity. In some cases the heading would appear two or more times in the same tree at higher and lower levels. This arrangement led to some headings appearing as a sibling (uncle) next to the heading under which they were treed as a nephew. In other cases a heading was included at a top level so it could be seen more readily in printed material. We reviewed these headings in MeSH and removed either the Uncle or Nephew depending upon the judgement of our Internal and External reviewers. There were over 1,000 tree changes resulting from this work, many of which will affect search retrieval in MEDLINE/PubMed and the NLM Catalog.
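The cleanup described above is mechanical enough to sketch. A hypothetical example (the headings below are invented, not actual MeSH terms) that finds a heading appearing both as a sibling of X (“uncle”) and as a child of X (“nephew”):

```python
# Toy tree: parent -> list of child headings. Invented headings only.
tree = {
    "Diseases": ["Toothache", "Mouth Diseases"],
    "Mouth Diseases": ["Toothache"],   # Toothache is both uncle and nephew
}

def uncle_nephew_pairs(tree):
    """Find (grandparent, parent, heading) where heading is both a
    sibling of parent and a child of parent."""
    pairs = []
    for grandparent, children in tree.items():
        for heading in children:
            for sibling in children:
                if sibling != heading and heading in tree.get(sibling, []):
                    pairs.append((grandparent, sibling, heading))
    return pairs

print(uncle_nephew_pairs(tree))
# [('Diseases', 'Mouth Diseases', 'Toothache')]
```

Each hit is a candidate for review: keep the uncle or the nephew, but not both.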


MeSH Scope Notes

MeSH had a policy that each descriptor should have a scope note regardless of how obvious its meaning. There were many legacy headings that were created without scope notes before this rule came into effect. This year we initiated a project to write scope notes for all existing headings. Thus far 481 scope notes to MeSH were added and the project continues for 2017 MeSH.

Echoes of Heraclitus:

It is not possible to step twice into the same river according to Heraclitus, or to come into contact twice with a mortal being in the same state. (Plutarch) (Heraclitus)

Semantics and the words we use to invoke them are always in a state of flux. Sometimes more, sometimes less.

The lesson here is that anyone who says you can have a fixed and stable vocabulary is not only selling something, they are selling you a broken something. If not broken on the day you start to use it, then fairly soon thereafter.

It took time for me to come to the realization that the same is true about information systems that attempt to capture changing semantics at any given point.

Topic maps in the sense of ISO 13250-2, for example, can capture and map changing semantics, but if and only if you are willing to accept its data model.

Which is good as far as it goes, but what if I want a different data model? That is, to still capture changing semantics and map between them, but with a different data model.

We may have a use case to map back to ISO 13250-2 or to some other data model. The point being that we should not privilege any data model or syntax in advance, at least not absolutely.

Not only do communities change but their preferences for technologies change as well. It seems just a bit odd to be selling an approach on the basis of capturing change only to build a dike to prevent change in your implementation.


Kidnapping Caitlynn (47 AKAs – Is There a Topic Map in the House?)

Thursday, December 10th, 2015

Kidnapping Caitlynn is 10 minutes long, but has accumulated forty-seven (47) AKAs.

Imagine the search difficulty in finding reviews under all forty-eight (48) titles.

Even better, imagine your search request was for something that really mattered.

Like known terrorists crossing national borders using their real names and passports.

Intelligence services aren’t doing all that well even with string-to-string matches.

Perhaps that explains their inability to consider more sophisticated doctrines of identity.

If you can’t do string-to-string, more complex notions will grind your system to a halt.
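To make the contrast concrete, here is a minimal sketch of property-based identity, with invented records: a string match on names fails, while agreement on documented identity properties succeeds.

```python
# Invented records, for illustration only.
def same_subject(a, b, required_keys):
    """Treat two records as the same subject only if they agree
    on every required identity property."""
    return all(a.get(k) == b.get(k) for k in required_keys)

passport = {"name": "J. Smith", "dob": "1980-01-02", "passport_no": "X123"}
watchlist = {"name": "John Smith", "dob": "1980-01-02", "passport_no": "X123"}

# A pure string match on "name" fails...
print(passport["name"] == watchlist["name"])                       # False
# ...but matching on agreed identity properties succeeds.
print(same_subject(passport, watchlist, ["dob", "passport_no"]))   # True
```

Which properties count as identity properties is exactly the decision that should be documented, as in the toothache example above.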

Maybe intelligence agencies need new contractors. You think?

Learning from Distributed Data:… [Beating the Bounds]

Sunday, December 6th, 2015

Learning from Distributed Data: Mathematical and Computational Methods to Analyze De-centralized Information.

From the post:

Scientific advances typically produce massive amounts of data, which is, of course, a good thing. But when many of these datasets are at multiple locations, instead of all in one place, it becomes difficult and costly for researchers to extract meaningful information from them.

So, the question becomes: “How do we learn from these datasets if they cannot be shared or placed in a central location?” says Trilce Estrada-Piedra.

Estrada-Piedra, an assistant professor of computer sciences at the University of New Mexico (UNM) is working to find the solution. She designs software that will enable researchers to collaborate with one another, using decentralized data, without jeopardizing privacy or raising infrastructure concerns.

“Our contributions will help speed research in a variety of sciences like health informatics, astronomy, high energy physics, climate simulations and drug design,” Estrada-Piedra says. “It will be relevant for problems where data is spread out in many different locations.”

The aim of the National Science Foundation (NSF)-funded scientist’s project is to build mathematical models from each of the “local” data banks — those at each distributed site. These models will capture data patterns, rather than specific data points.

“Researchers then can share only the models, instead of sharing the actual data,” she says, citing a medical database as an example. “The original data, for example, would have the patient’s name, age, gender and particular metrics like blood pressure, heart rate, etcetera, and that one patient would be a data point. But the models will project his or her information and extract knowledge from the data. It would just be math. The idea is to build these local models that don’t have personal information, and then share the models without compromising privacy.”

Estrada-Piedra is designing algorithms for data projections and middleware: software that acts as a bridge between an operating system or database and applications, especially on a network. This will allow distributed data to be analyzed effectively.

I’m looking forward to hearing more about Estrada-Piedra’s work, although we all know there are more than data projection and middleware issues involved. Those are very real and very large problems, but as with all human endeavors, the last mile is defined by local semantics.

Efficiently managing local semantics, that is enabling others to seamlessly navigate your local semantics and to in turn navigate the local semantics of others, isn’t a technical task, or at least not primarily.

The primary obstacle to such a task is captured by John D. Cook in Medieval software project management.

The post isn’t long so I will quote it here:

Centuries ago, English communities would walk the little boys around the perimeter of their parish as a way of preserving land records. This was called “beating the bounds.” The idea was that by teaching the boundaries to someone young, the knowledge would be preserved for the lifespan of that person. Of course modern geological survey techniques make beating the bounds unnecessary.

Software development hasn’t reached the sophistication of geographic survey. Many software shops use a knowledge management system remarkably similar to beating the bounds. They hire a new developer to work on a new project. That developer will remain tied to that project for the rest of his or her career, like a serf tied to the land. The knowledge essential to maintaining that project resides only in the brain of its developer. There are no useful written records or reliable maps, just like medieval property boundaries.

Does that sound familiar? That only you or another person “knows” the semantics of your datastores? Are you still “beating the bounds” to document your data semantics?

Or as John puts it:

There are no useful written records or reliable maps, just like medieval property boundaries.

It doesn’t have to be that way. You could have reliable maps, maps that are updated each time your data is mapped for yet another project (yet another ETL, to use the acronym).
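Such a map can be as simple as recording the ETL field mappings themselves as data, with the basis for each mapping attached. A hypothetical sketch (field names and provenance notes are invented):

```python
from dataclasses import dataclass

@dataclass
class FieldMapping:
    source: str    # column in the source system
    target: str    # column in the destination
    meaning: str   # human-readable semantics
    basis: str     # why the mapping was made (provenance)

mappings = [
    FieldMapping("cust_no", "customer_id", "internal customer key",
                 "confirmed with billing team, 2015-12-01"),
    FieldMapping("dob", "birth_date", "date of birth, ISO 8601",
                 "source schema documentation"),
]

for m in mappings:
    print(f"{m.source} -> {m.target}: {m.meaning} ({m.basis})")
```

The next ETL job updates the mappings rather than re-discovering them, which is the whole point of a reliable map.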

You can, as a manager, of course, simply allow data knowledge to evaporate from your projects but that seems like a very poor business practice.

Johanna Rothman responded to John’s post in Breaking Free of Legacy Projects with the suggestion that every project should have several young boys and girls “beating the bounds” for every major project.

The equivalent of avoiding a single point of failure in medieval software project management.

That is better than relying on a single programmer, but using more modern information management/retention techniques would be better still.

I guess the question is do you like using medieval project management techniques for your data or not?

If you do, you won’t be any worse off than any of your competitors with a similar policy.

On the other hand, should one of your competitors break ranks, start using topic maps for example for mission critical data, well, you have been warned.

Connecting News Stories and Topic Maps

Monday, November 16th, 2015

New WordPress plug-in Catamount aims to connect data sets and stories by Mădălina Ciobanu.

From the post:

Non-profit news organisation VT Digger, based in the United States, is building an open-source WordPress plug-in that can automatically link news stories to relevant information collected in data sets.

The tool, called Catamount, is being developed with a $35,000 (£22,900) grant from Knight Foundation Prototype Fund, and aims to give news organisations a better way of linking existing data to their daily news coverage.

Rather than hyperlinking a person’s name in a story and sending readers to a different website, publishers can use the open-source plug-in to build a small window that pops up when readers hover over a selected section of the text.

“We have this great data set, but if people don’t know it exists, they’re not going to be racing to it every single day.

“The news cycle, however, provides a hook into data,” Diane Zeigler, publisher at VT Digger, told

If a person is mentioned in a news story and they are also a donor, candidate or representative of an organisation involved in campaign finance, for example, an editor would be able to check the two names coincide, and give Catamount permission to link the individual to all relevant information that exists in the database.

A brief overview of this information will then be available in a pop-up box, which readers can click in order to access the full data in a separate browser window or tab.

“It’s about being able to take large data sets and make them relevant to a daily news story, so thinking about ‘why does it matter that this data has been collected for years and years’?

“In theory, it might just sit there if people don’t have a reason to draw a connection,” said Zeigler.

While Catamount only works with WordPress, the code will be made available for publishers to customise and integrate with their own content management systems. reports on the grant and other winners in Knight Foundation awards $35,000 grant to VTDigger.

Assuming that the plugin will be agnostic as to the data source, this looks like an excellent opportunity to bind topic map managed content to news stories.
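A minimal sketch of the editor-confirmed linking the post describes, with an invented donor table: a name flagged in a story resolves to database records, which an editor would approve before any pop-up is published.

```python
# Invented campaign-finance data, for illustration only.
donors = {
    "jane doe": {"donations": 12, "total": 4800},
}

def candidate_links(story_names):
    """Return names in a story that resolve to donor records.
    These are candidates for an editor to confirm, not automatic links."""
    return {n: donors[n.lower()] for n in story_names if n.lower() in donors}

links = candidate_links(["Jane Doe", "John Roe"])
print(links)
# {'Jane Doe': {'donations': 12, 'total': 4800}}
```

With topic map managed content behind the lookup, the pop-up could merge records for the same person across datasets instead of relying on a single name string.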

You could, I suppose, return one of those dreary listings of all the prior related stories from a news source.

But that is always a lot of repetitive text to wade through for very little gain.

If you curated content with a topic map, excerpting paragraphs from prior stories when necessary for quotes, that would be a high value return for a user following your link.

Since the award was made only days ago I assume there isn’t much to be reported on the Catamount tool, as of yet. I will be following the project and will report back when something testable surfaces.

I first saw this story in an alert from If you aren’t already following them you should be.