Archive for the ‘Topic Maps’ Category

Reboot Your $100+ Million F-35 Stealth Jet Every 10 Hours Instead of 4 (TM Fusion)

Wednesday, April 27th, 2016

Pentagon identifies cause of F-35 radar software issue

From the post:

The Pentagon has found the root cause of stability issues with the radar software being tested for the F-35 stealth fighter jet made by Lockheed Martin Corp, U.S. Defense Acquisition Chief Frank Kendall told a congressional hearing on Tuesday.

Last month the Pentagon said the software instability issue meant the sensors had to be restarted once every four hours of flying.

Kendall and Air Force Lieutenant General Christopher Bogdan, the program executive officer for the F-35, told a Senate Armed Service Committee hearing in written testimony that the cause of the problem was the timing of “software messages from the sensors to the main F-35” computer. They added that stability issues had improved to where the sensors only needed to be restarted after more than 10 hours.

“We are cautiously optimistic that these fixes will resolve the current stability problems, but are waiting to see how the software performs in an operational test environment,” the officials said in a written statement.
… (emphasis added)

At $100+ Million plane that requires rebooting every ten hours? I’m not a pilot but that sounds like a real weakness.

The precise nature of the software glitch isn’t described but you can guess one of the problems from Lockheed Martin’s, Software You Wish You Had: Inside the F-35 Supercomputer:

The human brain relies on five senses—sight, smell, taste, touch and hearing—to provide the information it needs to analyze and understand the surrounding environment.

Similarly, the F-35 relies on five types of sensors: Electronic Warfare (EW), Radar, Communication, Navigation and Identification (CNI), Electro-Optical Targeting System (EOTS) and the Distributed Aperture System (DAS). The F-35 “brain”—the process that combines this stellar amount of information into an integrated picture of the environment—is known as sensor fusion.

At any given moment, fusion processes large amounts of data from sensors around the aircraft—plus additional information from datalinks with other in-air F-35s—and combines them into a centralized view of activity in the jet’s environment, displayed to the pilot.

In everyday life, you can imagine how useful this software might be—like going out for a jog in your neighborhood and picking up on real-time information about obstacles that lie ahead, changes in traffic patterns that may affect your route, and whether or not you are likely to pass by a friend near the local park.

F-35 fusion not only combines data, but figures out what additional information is needed and automatically tasks sensors to gather it—without the pilot ever having to ask.
… (emphasis added)

The fusion of data from other in-air F-35s is a classic topic map merging of data problem.

You have one subject, say an anti-aircraft missile site, seen from up to four (in the F-35 specs) F-35s. As is the habit of most physical objects, it has only one geographic location but the fusion computer for the F-35 doesn’t come up with than answer.

Kris Osborn writes in Software Glitch Causes F-35 to Incorrectly Detect Targets in Formation:

“When you have two, three or four F-35s looking at the same threat, they don’t all see it exactly the same because of the angles that they are looking at and what their sensors pick up,” Bogdan told reporters Tuesday. “When there is a slight difference in what those four airplanes might be seeing, the fusion model can’t decide if it’s one threat or more than one threat. If two airplanes are looking at the same thing, they see it slightly differently because of the physics of it.”

For example, if a group of F-35s detect a single ground threat such as anti-aircraft weaponry, the sensors on the planes may have trouble distinguishing whether it was an isolated threat or several objects, Bogdan explained.

As a result, F-35 engineers are working with Navy experts and academics from John’s Hopkins Applied Physics Laboratory to adjust the sensitivity of the fusion algorithms for the JSF’s 2B software package so that groups of planes can correctly identify or discern threats.

“What we want to have happen is no matter which airplane is picking up the threat – whatever the angles or the sensors – they correctly identify a single threat and then pass that information to all four airplanes so that all four airplanes are looking at the same threat at the same place,” Bogdan said.

Unless Bogdan is using “sensitivity” in a very unusual sense, that doesn’t sound like the issue with the fusion computer of the F-35.

Rather the problem is the fusion computer has no explicit doctrine of subject identity to use when it is merging data from different F-35s, whether it be two, three, four or even more F-35s. The display of tactical information should be seamless to the pilot and without human intervention.

I’m sure members of Congress were impressed with General Bogdan using words like “angles” and “physics,” but the underlying subject identity issue isn’t hard to address.

At issue is the location of a potential target on the ground. Within some pre-defined metric, anything located within a given area is the “same target.”

The Air Force has already paid for this type of analysis and the mathematics of what is called Circular Error Probability (CEP) has been published in Use of Circular Error Probability in Target Detection by William Nelson (1988).

You need to use the “current” location of the detecting aircraft, allowances for inaccuracy in estimating the location of the target, etc., but once you call out the subject identity as an issue, its a matter of making choices of how accurate you want the subject identification to be.

Before you forward this to Gen. Bogdan as a way forward on the fusion computer, realize that CEP is only one aspect of target identification. But, calling the subject identity of targets out explicitly, enables reliable presentation of single/multiple targets to pilots.

Your call, confusing displays or a reliable, useful display.

PS: I assume military subject identity systems would not be running XTM software. Same principles apply even if the syntax is different.

Seriously, Who’s Gonna Find It?

Monday, April 25th, 2016


Graphic whimsy via Bruce Sterling,

Are your information requirements met by finding something or by finding the right thing?

Similar Pages for Wikipedia – Lateral – Indexing Practices

Saturday, April 23rd, 2016

Similar Pages for Wikipedia (Chrome extension)

I started looking at this software with a mis-impression that I hope you can avoid.

I installed the extension and as advertised, if I am on a Wikipedia page, it recommends “similar” Wikipedia pages.

Unless I’m billing time, plowing through page after page of tangentially related material isn’t my idea of a good time.

Ah, but I confused “document” with “page.”

I discovered that error while reading Adding Documents at Lateral, which gives the following example:


Ah! So “document” means as much or as little text as I choose to use when I add the document.

Which means if I were creating a document store of graph papers, I would capture only the new material and not the inevitable a “graph consists of nodes and edges….”

There are pre-populatd data sets, News 350,000+ news and blog articles, updated every 15 mins; arXiv 1M+ papers (all), updated daily; PubMed 6M+ medical journals from before July 2014; SEC 6,000+ yearly financial reports / 10-K filings from 2014; Wikipedia 463,000 pages which had 20+ page views in 2013.

I suspect the granularity on the pre-populated data sets is “document” in the usual sense size.

Glad to see the option to define a “document” to be an arbitrary span of text.

I don’t need to find more “documents” (in the usual sense) but more relevant snippets that are directly on point.

Hmmm, perhaps indexing at the level of paragraphs instead of documents (usual sense)?

Which makes me wonder why we index at the level of documents (usual sense) anyway? Is it simply tradition from when indexes were prepared by human indexers? And indexes were limited by physical constraints?

Corporate Bribery/Corruption – Poland/U.S./Russia – A Trio

Friday, April 22nd, 2016

GIJN (Global Investigation Journalism Network) tweeted a link to Corporate misconduct – individual consequences, 14th Global Fraud Survey this morning.

From the foreword by David L. Stulb:

In the aftermath of recent major terrorist attacks and the revelations regarding widespread possible misuse of offshore jurisdictions, and in an environment where geopolitical tensions have reached levels not seen since the Cold War, governments around the world are under increased pressure to face up to the immense global challenges of terrorist financing, migration and corruption. At the same time, certain positive events, such as the agreement by the P5+1 group (China, France, Russia, the United Kingdom, the United States, plus Germany) with Iran to limit Iran’s sensitive nuclear activities are grounds for cautious optimism.

These issues contribute to volatility in financial markets. The banking sector remains under significant regulatory focus, with serious stress points remaining. Governments, meanwhile, are increasingly coordinated in their approaches to investigating misconduct, including recovering the proceeds of corruption. The reason for this is clear. Bribery and corruption continue to represent a substantial threat to sluggish global growth and fragile financial markets.

Law enforcement agencies, including the United States Department of Justice and the United States Securities and Exchange Commission, are increasingly focusing on individual misconduct when investigating impropriety. In this context, boards and executives need to be confident that their businesses comply with rapidly changing laws and regulations wherever they operate.

For this, our 14th Global Fraud Survey, EY interviewed senior executives with responsibility for tackling fraud, bribery and corruption. These individuals included chief financial officers, chief compliance officers, heads of internal audit and heads of legal departments. They are ideally placed to provide insight into the impact that fraud and corruption is having on business globally.

Despite increased regulatory activity, our research finds that boards could do significantly more to protect both themselves and their companies.

Many businesses have failed to execute anti-corruption programs to proactively mitigate their risk of corruption. Similarly, many businesses are not yet taking advantage of rich seams of information that would help them identify and mitigate fraud, bribery and corruption issues earlier.

Between October 2015 and January 2016, we interviewed 2,825 individuals from 62 countries and territories. The interviews identified trends, apparent contradictions and issues about which boards of directors should be aware.

Partners from our Fraud Investigation & Dispute Services practice subsequently supplemented the Ipsos MORI research with in-depth discussions with senior executives of multinational companies. In these interviews, we explored the executives’ experiences of operating in certain key business environments that are perceived to expose companies to higher fraud and corruption risks. Our conversations provided us with additional insights into the impact that changing legislation, levels of enforcement and cultural behaviors are having on their businesses. Our discussions also gave us the opportunity to explore pragmatic steps that leading companies have been taking to address these risks.

The executives to whom we spoke highlighted many matters that businesses must confront when operating across borders: how to adapt market-entry strategies in countries where cultural expectations of acceptable behaviors can differ; how to get behind a corporate structure to understand a third party’s true ownership; the potential negative impact that highly variable pay can have on incentives to commit fraud and how to encourage whistleblowers to speak up despite local social norms to the contrary, to highlight a few.

Our survey finds that many respondents still maintain the view that fraud, bribery and corruption are other people’s problems despite recognizing the prevalence of the issue in their own countries. There remains a worryingly high tolerance or misunderstanding of conduct that can be considered inappropriate — particularly among respondents from finance functions. While companies are typically aware of the historic risks, they are generally lagging behind on the emerging ones, for instance the potential impact of cybercrime on corporate reputation and value, while now well publicized, remains a matter of varying priority for our respondents. In this context, companies need to bolster their defenses. They should apply anti-corruption compliance programs, undertake appropriate due diligence on third parties with which they do business and encourage and support whistleblowers to come forward with confidence. Above all, with an increasing focus on the accountability of the individual, company leadership needs to set the right tone from the top. It is only by taking such steps that boards will be able to mitigate the impact should the worst happen.

This survey is intended to raise challenging questions for boards. It will, we hope, drive better conversations and ongoing dialogue with stakeholders on what are truly global issues of major importance.

We acknowledge and thank all those executives and business leaders who participated in our survey, either as respondents to Ipsos MORI or through meeting us in person, for their contributions and insights. (emphasis in original)

Apologies for the long quote but it was necessary to set the stage of the significance of:

…increasingly focusing on individual misconduct when investigating impropriety.

That policy grants a “bye” to corporations who benefit from individual mis-coduct, in favor of punishing individual actors within a corporation.

While granting the legitimacy of punishing individuals, corporations cannot act except by their agents, failing to punish corporations enables their shareholders to continue to benefit from illegal behavior.

Another point of significance, listing of countries on page 44, gives the percentage of respondents that agree “…bribery/corrupt practices happen widely…” as follows (in part):

Rank Country % Agree
30 Poland 34
31 Russia 34
32 U.S. 34

When the Justice Department gets hoity-toity about law and corruption, keep those figures in mind.

If the Justice Department representative you are talking to isn’t corrupt, it happens, there’s one on either side of them that probably is.

Topic maps can help ferret out or manage “corruption,” depending upon your point of view. Even structural corruption, take the U.S. political campaign donation process.

Scope Rules!

Thursday, April 21st, 2016

I was reminded of the power of scope (in the topic map sense) when I saw John D. Cook’s Quaternions in Paradise Lost.


See John’s post for the details but in summary, Kuiper’s Quaternions and Rotation Sequences quoted a passage from Milton that used the term quarterion.

Your search appliance and most if not all of the public search engines will happily return all uses of quarterion without distinction. (Yes, I am implying there is more than one meaning for quarterion. See John’s post for the details.)

In addition to distinguishing between usages in Milton and Kuiper, scope can cleanly separate terms by agency, activity, government or other distinctions.

Or you can simply wade through search glut.

Your call.

Searching for Subjects: Which Method is Right for You?

Wednesday, April 20th, 2016

Leaving to one side how to avoid re-evaluating the repetitive glut of materials from any search, there is the more fundamental problem of how to you search for a subject?

This is a back-of-the-envelope sketch that I will be expanding, but here goes:

Basic Search

At its most basic, a search consists of a <term> and the search seeks to match strings that match that <term>.

Even allowing for Boolean operators, the matches against <term> are only and forever string matches.

Basic Search + Synonyms

Of course, as skilled searchers you will try not only one <term>, but several <synonym>s for the term as well.

A good example of that strategy is used at PubMed:

If you enter an entry term for a MeSH term the translation will also include an all fields search for the MeSH term associated with the entry term. For example, a search for odontalgia will translate to: “toothache”[MeSH Terms] OR “toothache”[All Fields] OR “odontalgia”[All Fields] because Odontalgia is an entry term for the MeSH term toothache. [PubMed Help]

The expansion to include the MeSH term Odontalgia is useful, but how do you maintain it?

A reader can see “toothache” and “Odontalgia” are treated as synonyms, but why remains elusive.

This is the area of owl:sameAs, the mapping of multiple subject identifiers/locators to a single topic, etc. You know that “sameness” exists, but why isn’t clear.

Subject Identity Properties

In order to maintain a PubMed or similar mapping, you need people who either “know” the basis for the mappings or you can have the mappings documented. That is you can say on what basis the mapping happened and what properties were present.

For example:


Key Value
symptom pain
general-location mouth
specific-location tooth

So if we are mapping terms to other terms and the specific location value reads “tongue,” then we know that isn’t a mapping to “toothache.”

How Far Do You Need To Go?

Of course for every term that we use as a key or value, there can be an expansion into key/value pairs, such as for tooth:


Key Value
general-location mouth
composition enamel coated bone
use biting, chewing


Each step towards more precise gathering of information increases your pre-search costs but decreases your post-search cost of casting out irrelevant material.

Moreover, precise gathering of information will help you avoid missing data simply due to data glut returns.

If maintenance of your mapping across generations is a concern, doing more than mapping of synonyms for reason or reasons unknown may be in order.

The point being that your current retrieval or lack thereof of current and correct information has a cost. As does improving your current retrieval.

The question of improved retrieval isn’t ideological but an ROI driven one.

  • If you have better mappings will that give you an advantage over N department/agency?
  • Will better retrieval slow down (never stop) the time wasted by staff on voluminous search results?
  • Will more precision focus your limited resources (always limited) on highly relevant materials?

Formulate your own ROI questions and means of measuring them. Then reach out to topic maps to see how they improve (or not) your ROI.

Properly used, I think you are in for a pleasant surprise with topic maps.

Dictionary of Fantastic Vocabulary [Increasing the Need for Topic Maps]

Monday, April 18th, 2016

Dictionary of Fantastic Vocabulary by Greg Borenstein.

Alexis Lloyd tweeted this link along with:

This is utterly fantastic.

Well, it certainly increases the need for topic maps!

From the bot description on Twitter:

Generating new words with new meanings out of the atoms of English.

Ahem, are you sure about that?

Is a bot is generating meaning?

Or are readers conferring meaning on the new words as they are read?

If, as I contend, readers confer meaning, the utterance of every “new” word, opens up as many new meanings as there are readers of the “new” word.

Example of people conferring different meanings on a term?

Ask a dozen people what is meant by “shot” in:

It’s just a shot away

When Lisa Fischer breaks into her solo in:

(Best played loud.)

Differences in meanings make for funny moments, awkward pauses, blushes, in casual conversation.

What if the stakes are higher?

What if you need to produce (or destroy) all the emails by “bobby1.”

Is it enough to find some of them?

What have you looked for lately? Did you find all of it? Or only some of it?

New words appear everyday.

You are already behind. You will get further behind using search.

Visualizing Data Loss From Search

Thursday, April 14th, 2016

I used searches for “duplicate detection” (3,854) and “coreference resolution” (3290) in “Ironically, Entity Resolution has many duplicate names” [Data Loss] to illustrate potential data loss in searches.

Here is a rough visualization of the information loss if you use only one of those terms:


If you search for “duplicate detection,” you miss all the articles shaded in blue.

If you search for “coreference resolution,” you miss all the articles shaded in yellow.

Suggestions for improving this visualization?

It is a visualization that could be performed on client’s data, using their search engine/database.

In order to identify the data loss they are suffering now from search across departments.

With the caveat that not all data loss is bad and/or worth avoiding.

Imaginary example (so far): What if you could demonstrate no overlapping of terminology for two vendors for the United States Army and the Air Force. That is no query terms for one returned useful results for the other.

That is a starting point for evaluating the use of topic maps.

While the divergence in terminologies is a given, the next question is: What is the downside to that divergence? What capability is lost due to that divergence?

Assuming you can identify such a capacity, the next question is to evaluate the cost of reducing and/or eliminating that divergence versus the claimed benefit.

I assume the most relevant terms are going to be those internal to customers and/or potential customers.

Interest in working this up into a client prospecting/topic map marketing tool?

Separately I want to note my discovery (you probably already knew about it) of VennDIS: a JavaFX-based Venn and Euler diagram software to generate publication quality figures. Download here. (Apologies, the publication itself if firewalled.)

The export defaults to 800 x 800 resolution. If you need something smaller, edit the resulting image in Gimp.

It’s a testimony to the software that I was able to produce a useful image in less than a day. Kudos to the software!

“Ironically, Entity Resolution has many duplicate names” [Data Loss]

Wednesday, April 13th, 2016

Nancy Baym tweeted:

“Ironically, Entity Resolution has many duplicate names” – Lise Getoor


I can’t think of any subject that doesn’t have duplicate names.

Can you?

In a “search driven” environment, not knowing the “duplicate” names for a subject means data loss.

Data loss that could include “smoking gun” data.

Topic mappers have been making that pitch for decades but it never has really caught fire.

I don’t think anyone doubts that data loss occurs, but the gravity of that data loss remains elusive.

For example, let’s take three duplicate names for entity resolution from the slide, duplicate detection, reference reconciliation, coreference resolution.

Supplying all three as quoted strings to CiteSeerX, any guesses on the number of “hits” returned?

As of April 13, 2016:

  • duplicate detection – 3,854
  • reference reconciliation – 253
  • coreference resolution – 3,290

When running the query "duplicate detection" "coreference resolution", only 76 “hits” are returned, meaning that there are only 76 overlapping cases reported in the total of 7,144 for both of those terms separately.

That’s assuming CiteSeerX isn’t shorting me on results due to server load, etc. I would have to cross-check the data itself before I would swear to those figures.

But consider just the raw numbers I report today: duplicate detection – 3,854, coreference resolution – 3,290, with 76 overlapping cases.

That’s two distinct lines of research on the same problem, for the most part, ignoring the other.

What do you think the odds are of duplication of techniques, experiences, etc., spread out over those 7,144 articles?

Instead of you or your client duplicating a known-to-somebody solution, you could be building an enhanced solution.

Well, except for the data loss due to “duplicate names” in a search environment.

And that you would have to re-read all the articles in order to find which technique or advancement was made in each article.

Multiply that by everyone who is interested in a subject and its a non-trivial amount of effort.

How would you like to avoid data loss and duplication of effort?

Coeffects: Context-aware programming languages – Subject Identity As Type Checking?

Tuesday, April 12th, 2016

Coeffects: Context-aware programming languages by Tomas Petricek.

From the webpage:

Coeffects are Tomas Petricek‘s PhD research project. They are a programming language abstraction for understanding how programs access the context or environment in which they execute.

The context may be resources on your mobile phone (battery, GPS location or a network printer), IoT devices in a physical neighborhood or historical stock prices. By understanding the neighborhood or history, a context-aware programming language can catch bugs earlier and run more efficiently.

This page is an interactive tutorial that shows a prototype implementation of coeffects in a browser. You can play with two simple context-aware languages, see how the type checking works and how context-aware programs run.

This page is also an experiment in presenting programming language research. It is a live environment where you can play with the theory using the power of new media, rather than staring at a dead pieces of wood (although we have those too).

(break from summary)

Programming languages evolve to reflect the changes in the computing ecosystem. The next big challenge for programming language designers is building languages that understand the context in which programs run.

This challenge is not easy to see. We are so used to working with context using the current cumbersome methods that we do not even see that there is an issue. We also do not realize that many programming features related to context can be captured by a simple unified abstraction. This is what coeffects do!

What if we extend the idea of context to include the context within which words appear?

For example, writing a police report, the following sentence appeared:

There were 20 or more <proxy string=”black” pos=”noun” synonym=”African” type=”race”/>s in the group.

For display purposes, the string value “black” appears in the sentence:

There were 20 or more blacks in the group.

But a search for the color “black” would not return that report because the type = color does not match type = race.

On the other hand, if I searched for African-American, that report would show up because “black” with type = race is recognized as a synonym for people of African extraction.

Inline proxies are the easiest to illustrate but that is only one way to serialize such a result.

If done in an authoring interface, such an approach would have the distinct advantage of offering the original author the choice of subject properties.

The advantage of involving the original author is that they have an interest in and awareness of the document in question. Quite unlike automated processes that later attempt annotation by rote.

No Perception Without Cartography [Failure To Communicate As Cartographic Failure]

Saturday, April 9th, 2016

Dan Klyn tweeted:

No perception without cartography

with an image of this text (from Self comes to mind: constructing the conscious mind by Antonio R Damasio):

The nonverbal kinds of images are those that help you display mentally the concepts that correspond to words. The feelings that make up the background of each mental instant and that largely signify aspects of the body state are images as well. Perception, in whatever sensory modality, is the result of the brain’s cartographic skill.

Images represent physical properties of entities and their spatial and temporal relationships, as well as their actions. Some images, which probably result from the brain’s making maps of itself making maps, are actually quite abstract. They describe patterns of occurrence of objects in time and space, the spatial relationships and movement of objects in terms of velocity and trajectory, and so forth.

Dan’s tweet spurred me to think that our failures to communicate to others could be described as cartographic failures.

If we use a term that is unknown to the average reader, say “daat,” the reader lacks a mental mapping that enables interpretation of that term.

Even if you know the term, it doesn’t stand in isolation in your mind. It fits into a number of maps, some of which you may be able to articulate and very possibly into other maps, which remain beyond your (and our) ken.

Not that this is a light going off experience for you or me but perhaps the cartographic imagery may be helpful in illustrating both the value and the risks of topic maps.

The value of topic maps is spoken of often but the risks of topic maps rarely get equal press.

How would topic maps be risky?

Well, consider the average spreadsheet using in a business setting.

Felienne Hermans in Spreadsheets: The Ununderstood Dark Matter of IT makes a persuasive case that spreadsheets are on an average five years old with little or no documentation.

If those spreadsheets remain undocumented, both users and auditors are equally stymied by their ignorance, a cartographic failure that leaves both wondering what must have been meant by columns and operations in the spreadsheet.

To the extent that a topic map or other disclosure mechanism preserves and/or restores the cartography that enables interpretation of the spreadsheet, suddenly staff are no longer plausibly ignorant of the purpose or consequences of using the spreadsheet.

Facile explanations that change from audit to audit are no longer possible. Auditors are chargeable with consistent auditing from one audit to another.

Does it sound like there is going to be a rush to use topic maps or other mechanisms to make spreadsheets transparent?

Still, transparency that befalls one could well benefit another.

Or to paraphrase King David (2 Samuel 11:25):

Don’t concern yourself about this. In business, transparency falls on first one and then another.

Ready to inflict transparency on others?

“No One Willingly Gives Away Power”

Friday, April 8th, 2016

Matthew Schofield in European anti-terror efforts hobbled by lack of trust, shared intelligence hits upon the primary reason for resistance to topic maps and other knowledge integration technologies.

Especially in intelligence, knowledge is power. No one willingly gives away power.” (Magnus Ranstorp, Swedish National Defense University)

From clerks who sort mail to accountants who cook the books to lawyers that defend patents and everyone else in between, everyone in an enterprise has knowledge, knowledge that gives them power others don’t have.

Topic maps have been pitched on a “greater good for the whole” basis but as Magnus points out, who the hell really wants that?

When confronted with a new technique, technology, methodology, the first and foremost question on everyone’s mind is:

Do I have more/less power/status with X?


approach loses power.


approach gains power.

Relevant lyrics:

Oh, there ain’t no rest for the wicked
Money don’t grow on trees
I got bills to pay
I got mouths to feed
And ain’t nothing in this world for free
No I can’t slow down
I can’t hold back
Though you know I wish I could
No there ain’t no rest for the wicked
Until we close our eyes for good

Sell topic maps to increase/gain power.

PS: Keep the line, “No one willingly gives away power” in discussions of why the ICIJ refuses to share the Panama Papers with the public.

Pentagon Confirms Crowdsourcing of Map Data

Tuesday, April 5th, 2016

I have mentioned before, Tracking NSA/CIA/FBI Agents Just Got Easier, The DEA is Stalking You!, how citizens can invite federal agents to join the gold fish bowl being prepared for the average citizen.

Of course, that’s just me saying it, unless and until the Pentagon confirms the crowdsourcing of map data!

Aliya Sternstein writes
in Soldiers to Help Crowdsource Spy Maps:

“What a great idea if we can get our soldiers adding fidelity to the maps and operational picture that we already have” in Defense systems, Gordon told Nextgov. “All it requires is pushing out our product in a manner that they can add data to it against a common framework.”

Comparing mapping parties to combat support activities, she said, soldiers are deployed in some pretty remote areas where U.S. forces are not always familiar with the roads and the land, partly because they tend to change.

If troops have a base layer, “they can do basically the same things that that social party does and just drop pins and add data,” Gordon said from a meeting room at the annual Esri conference. “Think about some of the places in Africa and some of the less advantaged countries that just don’t have addresses in the way we do” in the United States.

Of course, you already realize the value of crowd-sourcing surveillance of government agents but for the c-suite crowd, confirmation from a respected source (the Pentagon) may help push your citizen surveillance proposal forward.

BTW, while looking at Army GeoData research plans (pages 228-232), I ran across this passage:

This effort integrates behavior and population dynamics research and analysis to depict the operational environment including culture, demographics, terrain, climate, and infrastructure, into geospatial frameworks. Research exploits existing open source text, leverages multi-media and cartographic materials, and investigates data collection methods to ingest geospatial data directly from the tactical edge to characterize parameters of social, cultural, and economic geography. Results of this research augment existing conventional geospatial datasets by providing the rich context of the human aspects of the operational environment, which offers a holistic understanding of the operational environment for the Warfighter. This item continues efforts from Imagery and GeoData Sciences, and Geospatial and Temporal Information Structure and Framework and complements the work in PE 0602784A/Project T41.

Doesn’t that just reek with subjects that would be identified differently in intersecting information systems?

One solution would be to fashion top down mapping systems that are months if not years behind demands in an operational environment. Sort of like tanks that overheat in jungle warfare.

Or you could do something a bit more dynamic that provides a “good enough” mapping for operational needs and yet also has the information necessary to integrate it with other temporary solutions.

Pardon the Intermission

Friday, March 18th, 2016

Apologies for the absence of posts starting on March 15, 2016 until this one today.

I made an unplanned trip to the local hospital via ambulance around 8:00 AM on the 15th and managed to escape on the afternoon of March 17, 2016.

On the downside I didn’t have anyway to explain my sudden absence from the Net.

On the upside I had a lot of non-computer assisted time to think about topic maps, etc., while being poked, prodded, waiting for lab results, etc.

Not to mention I re-read the first two Harry Potter books. 😉

I have one interesting item for today and will be posting about my non-computer assisted thinking about topic maps in the near future.

Your interest in this blog and comments are always appreciated!

Technology Adoption – Nearly A Vertical Line (To A Diminished IQ)

Thursday, March 10th, 2016


From: There’s a major long-term trend in the economy that isn’t getting enough attention by Rick Rieder.

From the post:

As the chart above shows, people in the U.S. today are adopting new technologies, including tablets and smartphones, at the swiftest pace we’ve seen since the advent of the television. However, while television arguably detracted from U.S. productivity, today’s advances in technology are generally geared toward greater efficiency at lower costs. Indeed, when you take into account technology’s downward influence on price, U.S. consumption and productivity figures look much better than headline numbers would suggest.

Hmmm, did you catch that?

…while television arguably detracted from U.S. productivity, today’s advances in technology are generally geared toward greater efficiency at lower costs.

Really? Rick must have missed the memo on how multitasking (one aspect of smart phones, tablets, etc.) lowers your IQ by 15 points. About what you would expect from smoking a joint.

If technology encourages multitasking, making us dumber, then we are becoming less efficient. Yes?

Imagine if instead of scrolling past tweets with images of cats, food, irrelevant messages, every time you look at your Twitter time line, you got the two or three tweets relevant to your job function.

Each of those not-on-task tweets chips away at the amount of attention span you have to spend on the two or three truly important tweets.

Apps that consolidate, filter and diminish information flow are the path to greater productivity.

Topic maps anyone?


Thursday, March 10th, 2016


From the webpage:

Growthverse was built for marketers, by marketers, with input from more than 100 CMOs.

Explore 800 marketing technology companies (and growing).

I originally arrived at this site here.

Interesting visualization that may result in suspects (their not prospects until you have serious discussions) for topic map based tools.

The site says the last update was in September 2015 so take heed that the data is stale by about six months.

That said, its easier than hunting down the 800+ companies on your own.

Good hunting!

Wandora – New Release – 2016-03-08

Wednesday, March 9th, 2016

Wandora – New Release – 2016-03-08

The homepage reports:

New Wandora version has been released today (2016-03-08). The release adds Wandora support to MariaDB and PostgreSQL database topic maps. Wandora has now more stylish look, especially in Traditional topic map. The release fixes many known bugs.

I’m no style or UI expert but I’m not sure where I should be looking for the “…more stylish look….” 😉

From the main window:


If you select Tools and then Tools Manager (or Cntrl-t [lower case, contrary to the drop down menu]), you will see a list of all tools (300+) with the message:

All known tools are listed here. The list contains also unfinished, buggy and depracated tools. Running such tool may cause exceptions and unpredictable behavior. We suggest you don’t run the tools listed here unless you really know what you are doing.

It is a very impressive set of tools!

There is no lack of place to explore in Wandora and to explore with Wandora.


Overlay Journal – Discrete Analysis

Saturday, March 5th, 2016

The arXiv overlay journal Discrete Analysis has launched by Christian Lawson-Perfect.

From the post:

Discrete Analysis, a new open-access journal for articles which are “analytical in flavour but that also have an impact on the study of discrete structures”, launched this week. What’s interesting about it is that it’s an arXiv overlay journal founded by, among others, Timothy Gowers.

What that means is that you don’t get articles from Discrete Analysis – it just arranges peer review of papers held on the arXiv, cutting out almost all of the expensive parts of traditional journal publishing. I wasn’t really prepared for how shallow that makes the journal’s website – there’s a front page, and when you click on an article you’re shown a brief editorial comment with a link to the corresponding arXiv page, and that’s it.

But that’s all it needs to do – the opinion of Gowers and co. is that the only real value that journals add to the papers they publish is the seal of approval gained by peer review, so that’s the only thing they’re doing. Maths papers tend not to benefit from the typesetting services traditional publishers provide (or, more often than you’d like, are actively hampered by it).

One way the journal is adding value beyond a “yes, this is worth adding to the list of papers we approve of” is by providing an “editorial introduction” to accompany each article. These are brief notes, written by members of the editorial board, which introduce the topics discussed in the paper and provide some context, to help you decide if you want to read the paper. That’s a good idea, and it makes browsing through the articles – and this is something unheard of on the internet – quite pleasurable.

It’s not difficult to imagine “editorial introductions” with underlying mini-topic maps that could be explored on their own or that as you reach the “edge” of a particular topic map, it “unfolds” to reveal more associations/topics.

Not unlike a traditional street map for New York which you can unfold to find general areas but can then fold it up to focus more tightly on a particular area.

I hesitate to say “zoom” because in the application I have seen (important qualification), “zoom” uniformly reduces your field of view.

A more nuanced notion of “zoom,” for a topic map and perhaps for other maps as well, would be to hold portions of the current view stationary, say a starting point on an interstate highway and to “zoom” only a portion of the current view to show a detailed street map. That would enable the user to see a particular location while maintaining its larger context.

Pointers to applications that “zoom” but also maintain different levels of “zoom” in the same view? Given the fascination with “hairy” presentations of graphs that would have to be real winner.

Facts Before Policy? – Digital Security – Contacts – Volunteer Opportunity

Friday, March 4th, 2016

Rep. McCaul, Michael T. [R-TX-10] has introduced H.R.4651 – Digital Security Commission Act of 2016, full text here, a proposal to form the National Commission on Security and Technology Challenges.

From the proposal:

(2) To submit to Congress a report, which shall include, at a minimum, each of the following:

(A) An assessment of the issue of multiple security interests in the digital world, including public safety, privacy, national security, and communications and data protection, both now and throughout the next 10 years.

(B) A qualitative and quantitative assessment of—

(i) the economic and commercial value of cryptography and digital security and communications technology to the economy of the United States;

(ii) the benefits of cryptography and digital security and communications technology to national security and crime prevention;

(iii) the role of cryptography and digital security and communications technology in protecting the privacy and civil liberties of the people of the United States;

(iv) the effects of the use of cryptography and other digital security and communications technology on Federal, State, and local criminal investigations and counterterrorism enterprises;

(v) the costs of weakening cryptography and digital security and communications technology standards; and

(vi) international laws, standards, and practices regarding legal access to communications and data protected by cryptography and digital security and communications technology, and the potential effect the development of disparate, and potentially conflicting, laws, standards, and practices might have.

(C) Recommendations for policy and practice, including, if the Commission determines appropriate, recommendations for legislative changes, regarding—

(i) methods to be used to allow the United States Government and civil society to take advantage of the benefits of digital security and communications technology while at the same time ensuring that the danger posed by the abuse of digital security and communications technology by terrorists and criminals is sufficiently mitigated;

(ii) the tools, training, and resources that could be used by law enforcement and national security agencies to adapt to the new realities of the digital landscape;

(iii) approaches to cooperation between the Government and the private sector to make it difficult for terrorists to use digital security and communications technology to mobilize, facilitate, and operationalize attacks;

(iv) any revisions to the law applicable to wiretaps and warrants for digital data content necessary to better correspond with present and future innovations in communications and data security, while preserving privacy and market competitiveness;

(v) proposed changes to the procedures for obtaining and executing warrants to make such procedures more efficient and cost-effective for the Government, technology companies, and telecommunications and broadband service providers; and

(vi) any steps the United States could take to lead the development of international standards for requesting and obtaining digital evidence for criminal investigations and prosecutions from a foreign, sovereign State, including reforming the mutual legal assistance treaty process, while protecting civil liberties and due process.

Excuse the legalese but clearly an effort that could provide a factual as opposed to fantasy basis for further on digital security. No one can guarantee a sensible result but without a factual basis, any legislation is certainly going to be wrong.

For your convenience and possible employment/volunteering, here are the co-sponsors of this bill, with hyperlinks to their congressional homepages:

Now would be a good time to pitch yourself for involvement in this possible commission.

Pay attention to Section 8 of the bill:

SEC. 8. Staff.

(a)Appointment.—The chairman and vice chairman shall jointly appoint and fix the compensation of an executive director and of and such other personnel as may be necessary to enable the Commission to carry out its functions under this Act.

(b)Security clearances.—The appropriate Federal agencies or departments shall cooperate with the Commission in expeditiously providing appropriate security clearances to Commission staff, as may be requested, to the extent possible pursuant to existing procedures and requirements, except that no person shall be provided with access to classified information without the appropriate security clearances.

(c)Detailees.—Any Federal Government employee may be detailed to the Commission on a reimbursable basis, and such detailee shall retain without interruption the rights, status, and privileges of his or her regular employment.

(d)Expert and consultant services.—The Commission is authorized to procure the services of experts and consultants in accordance with section 3109 of title 5, United States Code, but at rates not to exceed the daily rate paid a person occupying a position level IV of the Executive Schedule under section 5315 of title 5, United States Code.

(e)Volunteer services.—Notwithstanding section 1342 of title 31, United States Code, the Commission may accept and use voluntary and uncompensated services as the Commission determines necessary.

I can sense you wondering what:

the daily rate paid a person occupying a position level IV of the Executive Schedule under section 5315 of title 5, United States Code

means, in practical terms.

I was tempted to point you to: 5 U.S. Code § 5315 – Positions at level IV, but that would be cruel and uninformative. 😉

I did track down the Executive Schedule, which lists position level IV:

Annual: $160,300 or about $80.15/hr. and for an 8-hour day, $641.20.

If you do volunteer or get a paying gig, please remember that omission and/or manipulation of subject identity properties can render otherwise open data opaque.

I don’t know of any primes are making money off of topic maps currently so it isn’t likely to gain traction with current primes. On the other hand, new primes can and do occur. Not often but it happens.

PS: If the extra links to contacts, content, etc. are helpful, please let me know. I started off reading the link poor RSA 2016: McCaul calls backdoors ineffective, pushes for tech panel to solve security issues. Little more than debris on your information horizon.

11 Million Pages of CIA Files [+ Allen Dulles, war criminal]

Thursday, March 3rd, 2016

11 Million Pages of CIA Files May Soon Be Shared By This Kickstarter by Joseph Cox.

From the post:

Millions of pages of CIA documents are stored in Room 3000. The CIA Records Search Tool (CREST), the agency’s database of declassified intelligence files, is only accessible via four computers in the National Archives Building in College Park, MD, and contains everything from Cold War intelligence, research and development files, to images.

Now one activist is aiming to get those documents more readily available to anyone who is interested in them, by methodically printing, scanning, and then archiving them on the internet.

“It boils down to freeing information and getting as much of it as possible into the hands of the public, not to mention journalists, researchers and historians,” Michael Best, analyst and freedom of information activist told Motherboard in an online chat.

Best is trying to raise $10,000 on Kickstarter in order to purchase the high speed scanner necessary for such a project, a laptop, office supplies, and to cover some other costs. If he raises more than the main goal, he might be able to take on the archiving task full-time, as well as pay for FOIAs to remove redactions from some of the files in the database. As a reward, backers will help to choose what gets archived first, according to the Kickstarter page.

“Once those “priority” documents are done, I’ll start going through the digital folders more linearly and upload files by section,” Best said. The files will be hosted on the Internet Archive, which converts documents into other formats too, such as for Kindle devices, and sometimes text-to-speech for e-books. The whole thing has echoes of Cryptome—the freedom of information duo John Young and Deborah Natsios, who started off scanning documents for the infamous cypherpunk mailing list in the 1990s.

Good news! Kickstarter has announced this project funded!

Additional funding will help make this archive of documents available sooner rather than later.

As opposed to an attempt to boil the ocean of 11 million pages of CIA files, what about smaller topic mapping/indexing projects that focus on bounded sub-sets of documents of interest to particular communities?

I don’t have any interest in the STAR GATE project (clairvoyance, precognition, or telepathy, continued now by the DHS at airport screening facilities) but would be very interested in the records of Allen Dulles, a war criminal of some renown.

Just so you know, Michael has already uploaded documents on Allen Dulles from the CIA Records Search Tool (CREST) tool:

History of Allen Welsh Dulles as CIA Director – Volume I: The Man

History of Allen Welsh Dulles as CIA Director – Volume II: Coordination of Intelligence

History of Allen Welsh Dulles as CIA Director – Volume III: Covert Activities

History of Allen Welsh Dulles as CIA Director – Volume IV: Congressional Oversight and Internal Administration

History of Allen Welsh Dulles as CIA Director – Volume V: Intelligence Support of Policy

To describe Allen Dulles as a war criminal is no hyperbole. Among his other crimes, overthrow of President Jacobo Arbenz Guzman of Guatemala (think United Fruit Company), removal of Mohammad Mossadeq, prime minister of Iran (think Shah of Iran), are only two of his crimes, the full extent of which will probably never be known.

Files are being uploaded to That 1 Archive.

Graph Encryption: Going Beyond Encrypted Keyword Search [Subject Identity Based Encryption]

Wednesday, March 2nd, 2016

Graph Encryption: Going Beyond Encrypted Keyword Search by Xiarui Meng.

From the post:

Encrypted search has attracted a lot of attention from practitioners and researchers in academia and industry. In previous posts, Seny already described different ways one can search on encrypted data. Here, I would like to discuss search on encrypted graph databases which are gaining a lot of popularity.

1. Graph Databases and Graph Privacy

As today’s data is getting bigger and bigger, traditional relational database management systems (RDBMS) cannot scale to the massive amounts of data generated by end users and organizations. In addition, RDBMSs cannot effectively capture certain data relationships; for example in object-oriented data structures which are used in many applications. Today, NoSQL (Not Only SQL) has emerged as a good alternative to RDBMSs. One of the many advantages of NoSQL systems is that they are capable of storing, processing, and managing large volumes of structured, semi-structured, and even unstructured data. NoSQL databases (e.g., document stores, wide-column stores, key-value (tuple) store, object databases, and graph databases) can provide the scale and availability needed in cloud environments.

In an Internet-connected world, graph database have become an increasingly significant data model among NoSQL technologies. Social networks (e.g., Facebook, Twitter, Snapchat), protein networks, electrical grid, Web, XML documents, networked systems can all be modeled as graphs. One nice thing about graph databases is that they store the relations between entities (objects) in addition to the entities themselves and their properties. This allows the search engine to navigate both the data and their relationships extremely efficiently. Graph databases rely on the node-link-node relationship, where a node can be a profile or an object and the edge can be any relation defined by the application. Usually, we are interested in the structural characteristics of such a graph databases.

What do we mean by the confidentiality of a graph? And how to do we protect it? The problem has been studied by both the security and database communities. For example, in the database and data mining community, many solutions have been proposed based on graph anonymization. The core idea here is to anonymize the nodes and edges in the graph so that re-identification is hard. Although this approach may be efficient, from a security point view it is hard to tell what is achieved. Also, by leveraging auxiliary information, researchers have studied how to attack this kind of approach. On the other hand, cryptographers have some really compelling and provably-secure tools such as ORAM and FHE (mentioned in Seny’s previous posts) that can protect all the information in a graph database. The problem, however, is their performance, which is crucial for databases. In today’s world, efficiency is more than running in polynomial time; we need solutions that run and scale to massive volumes of data. Many real world graph datasets, such as biological networks and social networks, have millions of nodes, some even have billions of nodes and edges. Therefore, besides security, scalability is one of main aspects we have to consider.

2. Graph Encryption

Previous work in encrypted search has focused on how to search encrypted documents, e.g., doing keyword search, conjunctive queries, etc. Graph encryption, on the other hand, focuses on performing graph queries on encrypted graphs rather than keyword search on encrypted documents. In some cases, this makes the problem harder since some graph queries can be extremely complex. Another technical challenge is that the privacy of nodes and edges needs to be protected but also the structure of the graph, which can lead to many interesting research directions.

Graph encryption was introduced by Melissa Chase and Seny in [CK10]. That paper shows how to encrypt graphs so that certain graph queries (e.g., neighborhood, adjacency and focused subgraphs) can be performed (though the paper is more general as it describes structured encryption). Seny and I, together with Kobbi Nissim and George Kollios, followed this up with a paper last year [MKNK15] that showed how to handle more complex graphs queries.

Apologies for the long quote but I thought this topic might be new to some readers. Xianrui goes on to describe a solution for efficient queries over encrypted graphs.

Chase and Kamara remark in Structured Encryption and Controlled Disclosure, CK10:

To address this problem we introduce the notion of structured encryption. A structured encryption scheme encrypts structured data in such a way that it can be queried through the use of a query-specific token that can only be generated with knowledge of the secret key. In addition, the query process reveals no useful information about either the query or the data. An important consideration in this context is the efficiency of the query operation on the server side. In fact, in the context of cloud storage, where one often works with massive datasets, even linear time operations can be infeasible. (emphasis in original)

With just a little nudging, their:

A structured encryption scheme encrypts structured data in such a way that it can be queried through the use of a query-specific token that can only be generated with knowledge of the secret key.

could be re-stated as:

A subject identity encryption scheme leaves out merging data in such a way that the resulting topic map can only be queried with knowledge of the subject identity merging key.

You may have topics that represent diagnoses such as cancer, AIDS, sexual contacts, but if none of those can be associated with individuals who are also topics in the map, there is no more disclosure than census results for a metropolitan area and a list of the citizens therein.

That is you are missing the critical merging data that would link up (associate) any diagnosis with a given individual.

Multi-property subject identities would make the problem even harder, so say nothing of conferring properties on the basis of supplied properties as part of the merging process.

One major benefit of a subject identity based approach is that without the merging key, any data set, however sensitive the information, is just a data set, until you have the basis for solving its subject identity riddle.

PS: With the usual caveats of not using social security numbers, birth dates and the like as your subject identity properties. At least not in the map proper. I can think of several ways to generate keys for merging that would be resistant to even brute force attacks.

Ping me if you are interested in pursuing that on a data set.

Earthdata Search – Smells Like A Topic Map?*

Sunday, February 28th, 2016

Earthdata Search

From the webpage:

Search NASA Earth Science data by keyword and filter by time or space.

After choosing tour:

Keyword Search

Here you can enter search terms to find relevant data. Search terms can be science terms, instrument names, or even collection IDs. Let’s start by searching for Snow Cover NRT to find near real-time snow cover data. Type Snow Cover NRT in the keywords box and press Enter.

Which returns a screen in three sections, left to right: Browse Collections, 21 Matching Collections (Add collections to your project to compare and retrieve their data), and the third section displays a world map (navigate by grabbing the view)

Under Browse Collections:

In addition to searching for keywords, you can narrow your search through this list of terms. Click Platform to expand the list of platforms (still in a tour box)

Next step:

Now click Terra to select the Terra satellite.

Comment: Wondering how I will know which “platform” or “instrument” to select? There may be more/better documentation but I haven’t seen it yet.

The data follows the Unified Metadata Model (UMM):

NASA’s Common Metadata Repository (CMR) is a high-performance, high-quality repository for earth science metadata records that is designed to handle metadata at the Concept level. Collections and Granules are common metadata concepts in the Earth Observation (EO) world, but this can be extended out to Visualizations, Parameters, Documentation, Services, and more. The CMR metadata records are supplied by a diverse array of data providers, using a variety of supported metadata standards, including:


Initially, designers of the CMR considered standardizing all CMR metadata to a single, interoperable metadata format – ISO 19115. However, NASA decided to continue supporting multiple metadata standards in the CMR — in response to concerns expressed by the data provider community over the expense involved in converting existing metadata systems to systems capable of generating ISO 19115. In order to continue supporting multiple metadata standards, NASA designed a method to easily translate from one supported standard to another and constructed a model to support the process. Thus, the Unified Metadata Model (UMM) for EOSDIS metadata was born as part of the EOSDIS Metadata Architecture Studies (MAS I and II) conducted between 2012 and 2013.

What is the UMM?

The UMM is an extensible metadata model which provides a ‘Rosetta stone’ or cross-walk for mapping between CMR-supported metadata standards. Rather than create mappings from each CMR-supported metadata standard to each other, each standard is mapped centrally to the UMM model, thus reducing the number of translations required from n x (n-1) to 2n.

Here the mapping graphic:


Granting profiles don’t make the basis for mappings explicit, but the mappings have the same impact post mapping as a topic map would post merging.

The site could use better documentation for the interface and data, at least in the view of this non-expert in the area.

Thoughts on documentation for the interface or making the mapping more robust via use of a topic map?

I first saw this in a tweet by Kirk Borne.

*Smells Like A Topic Map – Sorry, culture bound reference to a routine on the first Cheech & Chong album. No explanation would do it justice.

Challenges of Electronic Dictionary Publication

Wednesday, February 17th, 2016

Challenges of Electronic Dictionary Publication

From the webpage:

April 8-9th, 2016

Venue: University of Leipzig, GWZ, Beethovenstr. 15; H1.5.16

This April we will be hosting our first Dictionary Journal workshop. At this workshop we will give an introduction to our vision of „Dictionaria“, introduce our data model and current workflow and will discuss (among others) the following topics:

  • Methodology and concept: How are dictionaries of „small“ languages different from those of „big“ languages and what does this mean for our endeavour? (documentary dictionaries vs. standard dictionaries)
  • Reviewing process and guidelines: How to review and evaluate a dictionary database of minor languages?
  • User-friendliness: What are the different audiences and their needs?
  • Submission process and guidelines: reports from us and our first authors on how to submit and what to expect
  • Citation: How to cite dictionaries?

If you are interested in attending this event, please send an e-mail to dictionary.journal[AT]

Workshop program

Our workshop program can now be downloaded here.

See the webpage for a list of confirmed participants, some with submitted abstracts.

Any number of topic map related questions arise in a discussion of dictionaries.

  • How to represent dictionary models?
  • What properties should be used to identify the subjects that represent dictionary models?
  • On what basis, if any, should dictionary models be considered the same or different? And for what purposes?
  • What data should be captured by dictionaries and how should it be identified?
  • etc.

Those are only a few of the questions that could be refined into dozens, if not hundreds of more, when you reach the details of constructing a dictionary.

I won’t be attending but wait with great anticipation the output from this workshop!

Topic Maps: On the Cusp of Success (Curate in Place/Death of ETL?)

Tuesday, February 9th, 2016

The Bright Future of Semantic Graphs and Big Connected Data by Alex Woodie.

From the post:

Semantic graph technology is shaping up to play a key role in how organizations access the growing stores of public data. This is particularly true in the healthcare space, where organizations are beginning to store their data using so-called triple stores, often defined by the Resource Description Framework (RDF), which is a model for storing metadata created by the World Wide Web Consortium (W3C).

One person who’s bullish on the prospects for semantic data lakes is Shawn Dolley, Cloudera’s big data expert for the health and life sciences market. Dolley says semantic technology is on the cusp of breaking out and being heavily adopted, particularly among healthcare providers and pharmaceutical companies.

“I have yet to speak with a large pharmaceutical company where there’s not a small group of IT folks who are working on the open Web and are evaluating different technologies to do that,” Dolley says. “These are visionaries who are looking five years out, and saying we’re entering a world where the only way for us to scale….is to not store it internally. Even with Hadoop, the data sizes are going to be too massive, so we need to learn and think about how to federate queries.”

By storing healthcare and pharmaceutical data as semantic triples using graph databases such as Franz’s AllegroGraph, it can dramatically lower the hurdles to accessing huge stores of data stored externally. “Usually the primary use case that I see for AllegroGraph is creating a data fabric or a data ecosystem where they don’t have to pull the data internally,” Dolley tells Datanami. “They can do seamless queries out to data and curate it as it sits, and that’s quite appealing.


This is leading-edge stuff, and there are few mission-critical deployments of semantic graph technologies being used in the real world. However, there are a few of them, and the one that keeps popping up is the one at Montefiore Health System in New York City.

Montefiore is turning heads in the healthcare IT space because it was the first hospital to construct a “longitudinally integrated, semantically enriched” big data analytic infrastructure in support of “next-generation learning healthcare systems and precision medicine,” according to Franz, which supplied the graph database at the heart of the health data lake. Cloudera’s free version of Hadoop provided the distributed architecture for Montefiore’s semantic data lake (SDL), while other components and services were provided by tech big wigs Intel (NASDAQ: INTC) and Cisco Systems (NASDAQ: CSCO).

This approach to building an SDL will bring about big improvements in healthcare, says Dr. Parsa Mirhaji MD. PhD., the director of clinical research informatics at Einstein College of Medicine and Montefiore Health System.

“Our ability to conduct real-time analysis over new combinations of data, to compare results across multiple analyses, and to engage patients, practitioners and researchers as equal partners in big-data analytics and decision support will fuel discoveries, significantly improve efficiencies, personalize care, and ultimately save lives,” Dr. Mirhaji says in a press release. (emphasis added)

If I hadn’t known better, reading passages like:

the only way for us to scale….is to not store it internally

learn and think about how to federate queries

seamless queries out to data and curate it as it sits

I would have sworn I was reading a promotion piece for topic maps!

Of course, it doesn’t mention how to discover valuable data not written in your terminology, but you have to hold something back for the first presentation to the CIO.

The growth of data sets too large for ETL are icing on the cake for topic maps.

Why ETL when the data “appears” as I choose to view it? My topic map may be quite small, at least in relationship to the data set proper.


OK, truth-in-advertising moment, it won’t be quite that easy!

And I don’t take small bills. 😉 Diamonds, other valuable commodities, foreign deposit arrangements can be had.

People are starting to think in a “topic mappish” sort of way. Or at least a way where topic maps deliver what they are looking for.

That’s the key: What do they want?

Then use a topic map to deliver it.

Sunlight launches Hall of Justice… [ Topic Map “like” features?]

Tuesday, February 2nd, 2016

Sunlight launches Hall of Justice, a massive data inventory on criminal justice across the U.S. by Josh Stewart.

From the post:

Today, Sunlight is launching Hall of Justice, a robust, searchable data inventory of nearly 10,000 datasets and research documents from across all 50 states, the District of Columbia and the federal government. Hall of Justice is the culmination of 18 months of work gathering data and refining technology.

The process was no easy task: Building Hall of Justice required manual entry of publicly available data sources from a multitude of locations across the country.

Sunlight’s team went from state to state, meeting and calling local officials to inquire about and find data related to criminal justice. Some states like California have created a data portal dedicated to making criminal justice data easily accessible to the public; others had their data buried within hard to find websites. We also found data collected by state departments of justice, police forces, court systems, universities and everything in between.

“Data is shaping the future of how we address some of our most pressing problems,” said John Wonderlich, executive director of the Sunlight Foundation. “This new resource is an experiment in how a robust snapshot of data can inform policy and research decisions.”

In addition to being a great data collection, the Hall of Justice attempts to deliver topic map like capability for searches:

The resource attempts to consolidate different terminology across multiple states, which is far from uniform or standardized. For example, if you search solitary confinement you will return results for data around solitary confinement, but also for the terms “segregated housing unit,” “SHU,” “administrative segregation” and “restrictive housing.” This smart search functionality makes finding datasets much easier and accessible.


Looking at all thirteen results for a search on “solitary confinement,” I don’t see the mapping in question. Or certainly no mapping based on characteristics of the subject, “solitary confinement.”

As close as Georgia’s 2013 Juvenile Justice Reform is using the word “restrictive” as in:

Create a two-class system within the Designated Felony Act. Designated felony offenses are divided into two classes, based on severity—Class A and Class B—that continue to allow restrictive custody while also adjusting available sanctions to account for both offense severity and risk level.

Restrictive custody is what jail systems are about so that doesn’t trip the wire for “solitary confinement.”

Of course, the links are to entire reports/documents/data sets so each researcher will have to extract and collate content individually. When that happens, a means to contribute that collation/mapping to the Hall of Justice would be a boon for other researchers. (Can you say “topic map?”)

As I write this, you will need to prefer Mozilla over Chrome, at least on Ubuntu.

Trigger Warning: If you are sensitive to traumatic events and/or reports of traumatic events, you may want to ask someone less sensitive to review these data sources.

The only difference between a concentration camp and American prisons is the lack of mass gas chambers. Every horror and abuse that you can imagine and some you probably can’t, are visited on people in U.S. prisons everyday.

As Joan Baez says in Prison Triology:

Sunlight’s Hall of Justice is a great step forward in documenting the chambers of horror we call American prisons.

And we’re gonna raze, raze the prisons

To the ground

Help us raze, raze the prisons

To the ground

Are you ready?

3 Decades of High Quality News! (code reuse)

Monday, February 1st, 2016

‘NewsHour’ archives to be digitized and available online by Dru Sefton.

From the post:

More than three decades of NewsHour are heading to an online home, the American Archive of Public Broadcasting.

Nearly 10,000 episodes that aired from 1975 to 2007 will be archived through a collaboration among AAPB; WGBH in Boston; WETA in Arlington, Va.; and the Library of Congress. The organizations jointly announced the project Thursday.

“The project will take place over the next year and a half,” WGBH spokesperson Emily Balk said. “The collection should be up in its entirety by mid-2018, but AAPB will be adding content from the collection to its website on an ongoing, monthly basis.”

Looking forward to that collection!

Useful on its own, but even more so if you had an indexical object that could point to a subject in a PBS news episode and at the same time, point to episodes on the same subject from other TV and radio news archives, not to mention the same subject in newspapers and magazines.

Oh, sorry, that would be a topic in ISO 13250-2 parlance and the more general concept of a proxy in ISO 13250-5. Thought I should mention that before someone at IBM runs off to patent another pre-existing idea.

I don’t suppose padding patent statistics hurts all that much, considering that the Supremes are poised to invalidate process and software patents in one fell swoop.

Hopefully economists will be ready to measure the amount of increased productivity (legal worries about and enforcement of process/software patents aren’t productive activities) from foreclosing even the potential of process or software patents.

Copyright is more than sufficient to protect source code, as is any programmer is going to use another programmers code. They say that scientists would rather use another scientist’s toothbrush that his vocabulary.

Copying another programmer’s code (code re-use) is more akin to sharing a condom. It’s just not done.

Writing Clickbait TopicMaps?

Wednesday, January 20th, 2016

‘Shocking Celebrity Nip Slips’: Secrets I Learned Writing Clickbait Journalism by Kate Lloyd.

I’m sat at a desk in a glossy London publishing house. On the floors around me, writers are working on tough investigations and hard news. I, meanwhile, am updating a feature called “Shocking celebrity nip-slips: boobs on the loose.” My computer screen is packed with images of tanned reality star flesh as I write captions in the voice of a strip club announcer: “Snooki’s nunga-nungas just popped out to say hello!” I type. “Whoops! Looks like Kim Kardashian forgot to wear a bra today!”

Back in 2013, I worked for a women’s celebrity news website. I stumbled into the industry at a time when online editors were panicking: Their sites were funded by advertisers who demanded that as many people as possible viewed stories. This meant writing things readers loved and shared, but also resorting to shadier tactics. With views dwindling, publications like mine often turned to the gospel of search engine optimisation, also known as SEO, for guidance.

Like making a deal with a highly-optimized devil, relying heavily on SEO to push readers to websites has a high moral price for publishers. When it comes to female pop stars and actors, people are often more likely to search for the celebrity’s name with the words “naked,” “boobs,” “butt,” “weight,” and “bikini” than with the names of their albums or movies. Since 2008, “Miley Cyrus naked” has been consistently Googled more than “Miley Cyrus music,” “Miley Cyrus album,” “Miley Cyrus show,” and “Miley Cyrus Instagram.” Plus, “Emma Watson naked” has been Googled more than “Emma Watson movie” since she was 15. In fact, “Emma Watson feet” gets more search traffic than “Emma Watson style,” which might explain why one women’s site has a fashion feature called “Emma Watson is an excellent foot fetish candidate.”

If you don’t know what other people are be searching for, try these two resources on Google Trends:

Hacking the Google Trends API (2014)

PyTrends – Pseudo API for Google Trends (Updated six days ago)

Depending on your sensibilities, you could collect content on celebrities into a topic map and when their searches spike, you can release links to the new material plus save readers the time of locating older content.

That might even be a viable business model.


Intuitionism and Constructive Mathematics 80-518/818 — Spring 2016

Saturday, January 9th, 2016

Intuitionism and Constructive Mathematics 80-518/818 — Spring 2016

From the course description:

In this seminar we shall read primary and secondary sources on the origins and developments of intuitionism and constructive mathematics from Brouwer and the Russian constructivists, Bishop, Martin-Löf, up to and including modern developments such as homotopy type theory. We shall focus both on philosophical and metamathematical aspects. Topics could include the Brouwer-Heyting-Kolmogorov (BHK) interpretation, Kripke semantics, topological semantics, the Curry-Howard correspondence with constructive type theories, constructive set theory, realizability, relations to topos theory, formal topology, meaning explanations, homotopy type theory, and/or additional topics according to the interests of participants.


  • Jean van Heijenoort (1967), From Frege to Gödel: A Source Book in Mathematical Logic 1879–1931, Cambridge, MA: Harvard University Press.
  • Michael Dummett (1977/2000), Elements of Intuitionism (Oxford Logic Guides, 39), Oxford: Clarendon Press, 1977; 2nd edition, 2000.
  • Michael Beeson (1985), Foundations of Constructive Mathematics, Heidelberg: Springer Verlag.
  • Anne Sjerp Troelstra and Dirk van Dalen (1988), Constructivism in Mathematics: An Introduction (two volumes), Amsterdam: North Holland.

Additional resources

Not online but a Spring course at Carnegie Mellon with a reading list that should exercise your mental engines!

Any subject with a two volume “introduction” (Anne Sjerp Troelstra and Dirk van Dalen), is likely to be heavy sledding. 😉

But the immediate relevance to topic maps is evident by this statement from Rosalie Iemhoff:

Intuitionism is a philosophy of mathematics that was introduced by the Dutch mathematician L.E.J. Brouwer (1881–1966). Intuitionism is based on the idea that mathematics is a creation of the mind. The truth of a mathematical statement can only be conceived via a mental construction that proves it to be true, and the communication between mathematicians only serves as a means to create the same mental process in different minds.

I would recast that to say:

Language is a creation of the mind. The truth of a language statement can only be conceived via a mental construction that proves it to be true, and the communication between people only serves as a means to create the same mental process in different minds.

There are those who claim there is some correspondence between language and something they call “reality.” Since no one has experienced “reality” in the absence of language, I prefer to ask: Is X useful for purpose Y? rather than the doubtful metaphysics of “Is X true?”

Think of it as helping get down to what’s really important, what’s in this for you?

BTW, don’t be troubled by anyone who suggests this position removes all limits on discussion. What motivations do you think caused people to adopt the varying positions they have now?

It certainly wasn’t a detached and disinterested search for the truth, whatever people may pretend once they have found the “truth” they are presently defending. The same constraints will persist even if we are truthful with ourselves.

Math Translator Wanted/Topic Map Needed: Mochizuki and the ABC Conjecture

Monday, January 4th, 2016

What if you Discovered the Answer to a Famous Math Problem, but No One was able to Understand It? by Kevin Knudson.

From the post:

The conjecture is fairly easy to state. Suppose we have three positive integers a,b,c satisfying a+b=c and having no prime factors in common. Let d denote the product of the distinct prime factors of the product abc. Then the conjecture asserts roughly there are only finitely many such triples with c > d. Or, put another way, if a and b are built up from small prime factors then c is usually divisible only by large primes.

Here’s a simple example. Take a=16, b=21, and c=37. In this case, d = 2x3x7x37 = 1554, which is greater than c. The ABC conjecture says that this happens almost all the time. There is plenty of numerical evidence to support the conjecture, and most experts in the field believe it to be true. But it hasn’t been mathematically proven — yet.

Enter Mochizuki. His papers develop a subject he calls Inter-Universal Teichmüller Theory, and in this setting he proves a vast collection of results that culminate in a putative proof of the ABC conjecture. Full of definitions and new terminology invented by Mochizuki (there’s something called a Frobenioid, for example), almost everyone who has attempted to read and understand it has given up in despair. Add to that Mochizuki’s odd refusal to speak to the press or to travel to discuss his work and you would think the mathematical community would have given up on the papers by now, dismissing them as unlikely to be correct. And yet, his previous work is so careful and clever that the experts aren’t quite ready to give up.

It’s not clear what the future holds for Mochizuki’s proof. A small handful of mathematicians claim to have read, understood and verified the argument; a much larger group remains completely baffled. The December workshop reinforced the community’s desperate need for a translator, someone who can explain Mochizuki’s strange new universe of ideas and provide concrete examples to illustrate the concepts. Until that happens, the status of the ABC conjecture will remain unclear.

It’s hard to imagine a more classic topic map problem.

At some point, Shinichi Mochizuki shared a common vocabulary with his colleagues in number theory and arithmetic geometry but no longer.

As Kevin points out:

The December workshop reinforced the community’s desperate need for a translator, someone who can explain Mochizuki’s strange new universe of ideas and provide concrete examples to illustrate the concepts.

Taking Mochizuki’s present vocabulary and working backwards to where he shared a common vocabulary with colleagues is simple enough to say.

The crux of the problem being that discussions are going to be fragmented, distributed in a variety of formal and informal venues.

Combining those discussions to construct a path back to where most number theorists reside today would require something with as few starting assumptions as is possible.

Where you could describe as much or as little about new subjects and their relations to other subjects as is necessary for an expert audience to continue to fill in any gaps.

I’m not qualified to venture an opinion on the conjecture or Mochizuki’s proof but the problem of mapping from new terminology that has its own context back to “standard” terminology is a problem uniquely suited to topic maps.

Going Viral in 2016

Tuesday, December 29th, 2015

How To Go Viral: Lessons From The Most Shared Content of 2015 by Steve Rayson.

I offer this as at least as amusing as it may be useful.

The topic element of a viral post is said to include:

Trending topic (e.g. Zombies), Health & fitness, Cats & Dogs, Babies, Long Life, Love

Hard to get any of those in with technical blog but I could try:

TM’s produce healthy and fit ED-free 90 year-old bi-sexuals with dogs & cats as pets who love all non-Zombies.

That’s 115 characters if you are counting.

Produce random variations on that until I find one that goes viral. 😉

But, I have never cared for click-bait or false advertising. Personally I find it insulting when marketers falsify research.

I may have to document some of those cases in 2016. There is no shortage of it.

None of my tweets may go viral in 2016 but Steve’s post will make it more likely they will be re-tweeted.

Feel free to re-use my suggested tweet as I am fairly certain that “…healthy and fit ED-free 90 year-old bi-sexuals…” is in the public domain.