Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

May 13, 2016

Flawed Input Validation = Flawed Subject Recognition

Filed under: Cybersecurity,Security,Software,Subject Identity,Topic Maps — Patrick Durusau @ 7:47 pm

In Vulnerable 7-Zip As Poster Child For Open Source, I covered some of the details of two vulnerabilities in 7-Zip.

Both of those vulnerabilities were summarized by the discoverers:

Sadly, many security vulnerabilities arise from applications which fail to properly validate their input data. Both of these 7-Zip vulnerabilities resulted from flawed input validation. Because data can come from a potentially untrusted source, data input validation is of critical importance to all applications’ security.

The first vulnerability is described as:

TALOS-CAN-0094, OUT-OF-BOUNDS READ VULNERABILITY, [CVE-2016-2335]

An out-of-bounds read vulnerability exists in the way 7-Zip handles Universal Disk Format (UDF) files. The UDF file system was meant to replace the ISO-9660 file format, and was eventually adopted as the official file system for DVD-Video and DVD-Audio.

Central to 7-Zip’s processing of UDF files is the CInArchive::ReadFileItem method. Because volumes can have more than one partition map, their objects are kept in an object vector. To start looking for an item, this method tries to reference the proper object using the partition map’s object vector and the “PartitionRef” field from the Long Allocation Descriptor. Lack of checking whether the “PartitionRef” field is bigger than the available amount of partition map objects causes a read out-of-bounds and can lead, in some circumstances, to arbitrary code execution.

(code in original post omitted)

This vulnerability can be triggered by any entry that contains a malformed Long Allocation Descriptor. As you can see in lines 898-905 from the code above, the program searches for elements on a particular volume, and the file-set starts based on the RootDirICB Long Allocation Descriptor. That record can be purposely malformed for malicious purpose. The vulnerability appears in line 392, when the PartitionRef field exceeds the number of elements in PartitionMaps vector.
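The 7-Zip code referenced above is omitted here, but a minimal sketch of the kind of bounds check the advisory describes as missing might look like this (Python rather than 7-Zip’s C++, and the identifiers are illustrative, not 7-Zip’s actual names):

```python
def read_file_item(long_alloc_descriptor, partition_maps):
    """Resolve a Long Allocation Descriptor against the volume's partition maps."""
    partition_ref = long_alloc_descriptor["partition_ref"]

    # The missing validation: reject a PartitionRef that points outside the
    # vector of partition map objects instead of blindly indexing with it.
    if partition_ref >= len(partition_maps):
        raise ValueError(f"PartitionRef {partition_ref} exceeds "
                         f"{len(partition_maps)} partition map objects")

    return partition_maps[partition_ref]
```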

In topic maps terms, I would describe the lack of a check on the “PartitionRef” field as allowing a subject, here a string, of indeterminate size. That is, there is no constraint on the size of that subject.

That may seem like an obtuse way of putting it, but consider that a subject, here a string longer than the “available amount of partition map objects,” can be in an association with other subjects, such as the user (subject) who has invoked the application (association) containing the 7-Zip vulnerability (subject).

Err, you don’t allow users with shell access to run suid root binaries, do you?

If you don’t, then at least the vulnerable program isn’t running as root, which may help you dodge that bullet.

Or in topic maps terms, knowing the associations between applications and users may be a window on the severity of vulnerabilities.

Lest you think logging suid is an answer, remember they were logging Edward Snowden’s logins as well.

Suid logs may help for next time, but aren’t preventative in nature.

BTW, if you are interested in the details on buffer overflows, Smashing The Stack For Fun And Profit looks like a fun read.

May 6, 2016

Deep Learning: Image Similarity and Beyond (Webinar, May 10, 2016)

Filed under: Authoring Topic Maps,Deep Learning,Machine Learning,Similarity,Topic Maps — Patrick Durusau @ 4:15 pm

Deep Learning: Image Similarity and Beyond (Webinar, May 10, 2016)

From the registration page:

Deep Learning is a powerful machine learning method for image tagging, object recognition, speech recognition, and text analysis. In this demo, we’ll cover the basic concept of deep learning and walk you through the steps to build an application that finds similar images using an already-trained deep learning model.

Recommended for:

  • Data scientists and engineers
  • Developers and technical team managers
  • Technical product managers

What you’ll learn:

  • How to leverage existing deep learning models
  • How to extract deep features and use them using GraphLab Create
  • How to build and deploy an image similarity service using Dato Predictive Services

What we’ll cover:

  • Using an already-trained deep learning model
  • Extracting deep features
  • Building and deploying an image similarity service for pictures 
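The webinar presumably does this with GraphLab Create; as a library-neutral sketch, the “similar images” step boils down to nearest-neighbour search over the extracted deep feature vectors (the file name and array shape below are assumptions, not anything from the webinar):

```python
import numpy as np

def most_similar(query_vec, features, top_k=5):
    """Indices of the top_k images whose deep features are closest to
    query_vec by cosine similarity."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    query = query_vec / np.linalg.norm(query_vec)
    return np.argsort(normed @ query)[::-1][:top_k]

# features = np.load("deep_features.npy")  # hypothetical: (n_images, n_dims)
# print(most_similar(features[0], features))
```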

Deep learning has difficulty justifying its choices, just like human judges of similarity, but could it play a role in assisting topic map authors in constructing explicit decisions for merging?

Once trained, could deep learning suggest properties and/or values to consider for merging it has not yet experienced?

I haven’t seen any webinars recently so I am ready to gamble on this being an interesting one.

Enjoy!

May 4, 2016

No Label (read “name”) for Medical Error – Fear of Terror

Filed under: Names,Subject Identity,Topic Maps — Patrick Durusau @ 2:06 pm

Medical error is third biggest cause of death in the US, experts say by Amanda Holpuch.

From the post:

Medical error is the third leading cause of death in the US, accounting for 250,000 deaths every year, according to an analysis released on Tuesday.

There is no US system for coding these deaths, but Martin Makary and Michael Daniel, researchers at Johns Hopkins University’s school of medicine, used studies from 1999 onward to find that medical errors account for more than 9.5% of all fatalities in the US.

Only heart disease and cancer are more deadly, according to the Centers for Disease Control and Prevention (CDC).

The analysis, which was published in the British Medical Journal, said that the science behind medical errors would improve if data was shared internationally and nationally “in the same way as clinicians share research and innovation about coronary artery disease, melanoma, and influenza”.

But death by medical error is not captured by government reports because the US system for assigning a code to cause of death, the international classification of disease (ICD), does not have a label for medical error.

In contrast to topic maps, where you can talk about any subject you want, the international classification of disease (ICD) has no label for medical error.

Impact? Not having a label conceals approximately 250,000 deaths per year in the United States.

What if Fear of Terror press releases were broadcast along with “deaths due to medical error to date this year” as contextual information?

Medical errors result in approximately 685 deaths per day.

If the report of the shootings in San Bernardino on December 2, 2015, which killed 14 people, had also pointed out that approximately 230,160 people had died of medical errors to date that year, which would you judge to be the more serious problem?
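The arithmetic behind those two figures is easy to check from the 250,000 deaths-per-year estimate:

```python
deaths_per_year = 250000
deaths_per_day = round(deaths_per_year / 365)                    # ~685

# December 2 is the 336th day of 2015 (not a leap year).
days_through_dec_2 = 31 + 28 + 31 + 30 + 31 + 30 + 31 + 31 + 30 + 31 + 30 + 2
print(deaths_per_day, deaths_per_day * days_through_dec_2)       # 685 230160
```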

Lacking a label for medical error as a cause of death prevents public discussion of the third leading cause of death in the United States.

Contrast that with the public discussion over the largely non-existent problem of terrorism in the United States.

April 27, 2016

Topic Map Foodie Alert!

Filed under: Food,Topic Maps — Patrick Durusau @ 4:02 pm

Our Tagged Ingredients Data is Now on GitHub by Erica Greene and Adam McKaig.

From the post:

Since publishing our post about “Extracting Structured Data From Recipes Using Conditional Random Fields,” we’ve received a tremendous number of requests to release the data and our code. Today, we’re excited to release the roughly 180,000 labeled ingredient phrases that we used to train our machine learning model.

You can find the data and code in the ingredient-phrase-tagger GitHub repo. Instructions are in the README and the raw data is in nyt-ingredients-snapshot-2015.csv.

Reaching critical mass in a domain is a stumbling block for any topic map. Erica and Adam kick-start your foodie topic map adventures with ~180,000 labeled ingredient phrases.
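A minimal sketch of getting started with the released data, assuming pandas is installed and using the snapshot file named in the post (inspect the columns rather than trusting my guesses about the label layout):

```python
import pandas as pd

# File name from the ingredient-phrase-tagger repo, as given in the post.
df = pd.read_csv("nyt-ingredients-snapshot-2015.csv")

print(len(df))       # roughly 180,000 labeled ingredient phrases
print(df.columns)    # the label columns the repo actually provides
print(df.head())
```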

You are looking at the end result of six years of data mining and some clever programming so be sure to:

  1. Always acknowledge this project along with Erica and Adam in your work.
  2. Contribute back improved data.
  3. Contribute back improvements on the conditional random fields (CRF).
  4. Have a great time extending this data set!

Possible extensions include automatic translation (with mapping of “equivalent” terms), melding in the USDA food database (formally, the USDA National Nutrient Database for Standard Reference), which has nutrient content information on ~8,800 foods, and, of course, the “correct” way to make a roux as reflected in your mother’s cookbook.

It is, unfortunately, true that you can buy a mix for roux in a cardboard box. That requires a food processor to chop up the cardboard to enjoy with the roux that came in it. I’m originally from Louisiana and the thought of a roux mix is depressing, if not heretical.

Reboot Your $100+ Million F-35 Stealth Jet Every 10 Hours Instead of 4 (TM Fusion)

Filed under: Cybersecurity,Data Fusion,Data Integration,Data Science,Topic Maps — Patrick Durusau @ 10:07 am

Pentagon identifies cause of F-35 radar software issue

From the post:

The Pentagon has found the root cause of stability issues with the radar software being tested for the F-35 stealth fighter jet made by Lockheed Martin Corp, U.S. Defense Acquisition Chief Frank Kendall told a congressional hearing on Tuesday.

Last month the Pentagon said the software instability issue meant the sensors had to be restarted once every four hours of flying.

Kendall and Air Force Lieutenant General Christopher Bogdan, the program executive officer for the F-35, told a Senate Armed Service Committee hearing in written testimony that the cause of the problem was the timing of “software messages from the sensors to the main F-35” computer. They added that stability issues had improved to where the sensors only needed to be restarted after more than 10 hours.

“We are cautiously optimistic that these fixes will resolve the current stability problems, but are waiting to see how the software performs in an operational test environment,” the officials said in a written statement.
… (emphasis added)

A $100+ million plane that requires rebooting every ten hours? I’m not a pilot, but that sounds like a real weakness.

The precise nature of the software glitch isn’t described but you can guess one of the problems from Lockheed Martin’s Software You Wish You Had: Inside the F-35 Supercomputer:


The human brain relies on five senses—sight, smell, taste, touch and hearing—to provide the information it needs to analyze and understand the surrounding environment.

Similarly, the F-35 relies on five types of sensors: Electronic Warfare (EW), Radar, Communication, Navigation and Identification (CNI), Electro-Optical Targeting System (EOTS) and the Distributed Aperture System (DAS). The F-35 “brain”—the process that combines this stellar amount of information into an integrated picture of the environment—is known as sensor fusion.

At any given moment, fusion processes large amounts of data from sensors around the aircraft—plus additional information from datalinks with other in-air F-35s—and combines them into a centralized view of activity in the jet’s environment, displayed to the pilot.

In everyday life, you can imagine how useful this software might be—like going out for a jog in your neighborhood and picking up on real-time information about obstacles that lie ahead, changes in traffic patterns that may affect your route, and whether or not you are likely to pass by a friend near the local park.

F-35 fusion not only combines data, but figures out what additional information is needed and automatically tasks sensors to gather it—without the pilot ever having to ask.
… (emphasis added)

The fusion of data from other in-air F-35s is a classic topic map merging of data problem.

You have one subject, say an anti-aircraft missile site, seen from up to four (in the F-35 specs) F-35s. As is the habit of most physical objects, it has only one geographic location, but the F-35’s fusion computer doesn’t come up with that answer.

Kris Osborn writes in Software Glitch Causes F-35 to Incorrectly Detect Targets in Formation:


“When you have two, three or four F-35s looking at the same threat, they don’t all see it exactly the same because of the angles that they are looking at and what their sensors pick up,” Bogdan told reporters Tuesday. “When there is a slight difference in what those four airplanes might be seeing, the fusion model can’t decide if it’s one threat or more than one threat. If two airplanes are looking at the same thing, they see it slightly differently because of the physics of it.”

For example, if a group of F-35s detect a single ground threat such as anti-aircraft weaponry, the sensors on the planes may have trouble distinguishing whether it was an isolated threat or several objects, Bogdan explained.

As a result, F-35 engineers are working with Navy experts and academics from Johns Hopkins Applied Physics Laboratory to adjust the sensitivity of the fusion algorithms for the JSF’s 2B software package so that groups of planes can correctly identify or discern threats.

“What we want to have happen is no matter which airplane is picking up the threat – whatever the angles or the sensors – they correctly identify a single threat and then pass that information to all four airplanes so that all four airplanes are looking at the same threat at the same place,” Bogdan said.

Unless Bogdan is using “sensitivity” in a very unusual sense, that doesn’t sound like the issue with the fusion computer of the F-35.

Rather the problem is the fusion computer has no explicit doctrine of subject identity to use when it is merging data from different F-35s, whether it be two, three, four or even more F-35s. The display of tactical information should be seamless to the pilot and without human intervention.

I’m sure members of Congress were impressed with General Bogdan using words like “angles” and “physics,” but the underlying subject identity issue isn’t hard to address.

At issue is the location of a potential target on the ground. Within some pre-defined metric, anything located within a given area is the “same target.”

The Air Force has already paid for this type of analysis and the mathematics of what is called Circular Error Probability (CEP) has been published in Use of Circular Error Probability in Target Detection by William Nelson (1988).

You need to use the “current” location of the detecting aircraft, allowances for inaccuracy in estimating the location of the target, etc., but once you call out subject identity as an issue, it’s a matter of choosing how accurate you want the subject identification to be.
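A minimal sketch of that choice, treating “same target” as “reported positions within a chosen radius of one another.” This is a stand-in for a full CEP analysis, with illustrative field names and a flat local coordinate frame for brevity:

```python
import math

def same_target(a, b, radius_m=50.0):
    """Two reports identify the same subject if their estimated positions
    fall within radius_m of each other."""
    return math.hypot(a["x_m"] - b["x_m"], a["y_m"] - b["y_m"]) <= radius_m

def merge_reports(reports, radius_m=50.0):
    """Greedy merge: fold each report into the first merged target it matches."""
    merged = []
    for report in reports:
        for target in merged:
            if same_target(report, target["reports"][0], radius_m):
                target["reports"].append(report)
                break
        else:
            merged.append({"reports": [report]})
    return merged

# One ground threat seen from four aircraft at slightly different angles:
sightings = [{"x_m": 0.0, "y_m": 0.0}, {"x_m": 12.0, "y_m": -8.0},
             {"x_m": -5.0, "y_m": 15.0}, {"x_m": 20.0, "y_m": 5.0}]
print(len(merge_reports(sightings)))   # 1 merged target, not four
```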

Before you forward this to Gen. Bogdan as a way forward on the fusion computer, realize that CEP is only one aspect of target identification. But calling out the subject identity of targets explicitly enables reliable presentation of single/multiple targets to pilots.

Your call: confusing displays or a reliable, useful display.

PS: I assume military subject identity systems would not be running XTM software. Same principles apply even if the syntax is different.

April 25, 2016

Seriously, Who’s Gonna Find It?

Filed under: Humor,Librarian/Expert Searchers,Library,Topic Maps — Patrick Durusau @ 7:54 pm

[image: infrastructure-online-humor]

Graphic whimsy via Bruce Sterling, bruces@well.com.

Are your information requirements met by finding something or by finding the right thing?

April 23, 2016

Similar Pages for Wikipedia – Lateral – Indexing Practices

Filed under: Similarity,Topic Maps,Wikipedia — Patrick Durusau @ 3:20 pm

Similar Pages for Wikipedia (Chrome extension)

I started looking at this software with a misimpression that I hope you can avoid.

I installed the extension and as advertised, if I am on a Wikipedia page, it recommends “similar” Wikipedia pages.

Unless I’m billing time, plowing through page after page of tangentially related material isn’t my idea of a good time.

Ah, but I confused “document” with “page.”

I discovered that error while reading Adding Documents at Lateral, which gives the following example:

[image: lateral-add-doc-example]

Ah! So “document” means as much or as little text as I choose to use when I add the document.

Which means if I were creating a document store of graph papers, I would capture only the new material and not the inevitable “a graph consists of nodes and edges….”

There are pre-populated data sets:

  • News: 350,000+ news and blog articles, updated every 15 mins
  • arXiv: 1M+ papers (all), updated daily
  • PubMed: 6M+ medical journals from before July 2014
  • SEC: 6,000+ yearly financial reports / 10-K filings from 2014
  • Wikipedia: 463,000 pages which had 20+ page views in 2013

I suspect the granularity of the pre-populated data sets is “document” in the usual sense.

Glad to see the option to define a “document” to be an arbitrary span of text.

I don’t need to find more “documents” (in the usual sense) but more relevant snippets that are directly on point.

Hmmm, perhaps indexing at the level of paragraphs instead of documents (usual sense)?

Which makes me wonder why we index at the level of documents (usual sense) anyway? Is it simply tradition from when indexes were prepared by human indexers? And indexes were limited by physical constraints?
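For what it’s worth, a minimal sketch of that idea: treat each paragraph as its own “document” before handing it to whatever indexing or similarity service you use (the blank-line splitting rule and file name are assumptions):

```python
def paragraph_documents(text, source_id):
    """Split a text into paragraph-sized 'documents', one per blank-line block."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [{"id": f"{source_id}#p{i}", "text": p}
            for i, p in enumerate(paragraphs, start=1)]

# Each snippet can then be indexed and retrieved on its own, instead of
# dragging the whole paper along with it.
# docs = paragraph_documents(open("graph_paper.txt").read(), "graph_paper")
```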

April 22, 2016

Corporate Bribery/Corruption – Poland/U.S./Russia – A Trio

Filed under: Auditing,Business Intelligence,Government,Topic Maps — Patrick Durusau @ 2:22 pm

GIJN (Global Investigation Journalism Network) tweeted a link to Corporate misconduct – individual consequences, 14th Global Fraud Survey this morning.

From the foreword by David L. Stulb:

In the aftermath of recent major terrorist attacks and the revelations regarding widespread possible misuse of offshore jurisdictions, and in an environment where geopolitical tensions have reached levels not seen since the Cold War, governments around the world are under increased pressure to face up to the immense global challenges of terrorist financing, migration and corruption. At the same time, certain positive events, such as the agreement by the P5+1 group (China, France, Russia, the United Kingdom, the United States, plus Germany) with Iran to limit Iran’s sensitive nuclear activities are grounds for cautious optimism.

These issues contribute to volatility in financial markets. The banking sector remains under significant regulatory focus, with serious stress points remaining. Governments, meanwhile, are increasingly coordinated in their approaches to investigating misconduct, including recovering the proceeds of corruption. The reason for this is clear. Bribery and corruption continue to represent a substantial threat to sluggish global growth and fragile financial markets.

Law enforcement agencies, including the United States Department of Justice and the United States Securities and Exchange Commission, are increasingly focusing on individual misconduct when investigating impropriety. In this context, boards and executives need to be confident that their businesses comply with rapidly changing laws and regulations wherever they operate.

For this, our 14th Global Fraud Survey, EY interviewed senior executives with responsibility for tackling fraud, bribery and corruption. These individuals included chief financial officers, chief compliance officers, heads of internal audit and heads of legal departments. They are ideally placed to provide insight into the impact that fraud and corruption is having on business globally.

Despite increased regulatory activity, our research finds that boards could do significantly more to protect both themselves and their companies.

Many businesses have failed to execute anti-corruption programs to proactively mitigate their risk of corruption. Similarly, many businesses are not yet taking advantage of rich seams of information that would help them identify and mitigate fraud, bribery and corruption issues earlier.

Between October 2015 and January 2016, we interviewed 2,825 individuals from 62 countries and territories. The interviews identified trends, apparent contradictions and issues about which boards of directors should be aware.

Partners from our Fraud Investigation & Dispute Services practice subsequently supplemented the Ipsos MORI research with in-depth discussions with senior executives of multinational companies. In these interviews, we explored the executives’ experiences of operating in certain key business environments that are perceived to expose companies to higher fraud and corruption risks. Our conversations provided us with additional insights into the impact that changing legislation, levels of enforcement and cultural behaviors are having on their businesses. Our discussions also gave us the opportunity to explore pragmatic steps that leading companies have been taking to address these risks.

The executives to whom we spoke highlighted many matters that businesses must confront when operating across borders: how to adapt market-entry strategies in countries where cultural expectations of acceptable behaviors can differ; how to get behind a corporate structure to understand a third party’s true ownership; the potential negative impact that highly variable pay can have on incentives to commit fraud and how to encourage whistleblowers to speak up despite local social norms to the contrary, to highlight a few.

Our survey finds that many respondents still maintain the view that fraud, bribery and corruption are other people’s problems despite recognizing the prevalence of the issue in their own countries. There remains a worryingly high tolerance or misunderstanding of conduct that can be considered inappropriate — particularly among respondents from finance functions. While companies are typically aware of the historic risks, they are generally lagging behind on the emerging ones, for instance the potential impact of cybercrime on corporate reputation and value, while now well publicized, remains a matter of varying priority for our respondents. In this context, companies need to bolster their defenses. They should apply anti-corruption compliance programs, undertake appropriate due diligence on third parties with which they do business and encourage and support whistleblowers to come forward with confidence. Above all, with an increasing focus on the accountability of the individual, company leadership needs to set the right tone from the top. It is only by taking such steps that boards will be able to mitigate the impact should the worst happen.

This survey is intended to raise challenging questions for boards. It will, we hope, drive better conversations and ongoing dialogue with stakeholders on what are truly global issues of major importance.

We acknowledge and thank all those executives and business leaders who participated in our survey, either as respondents to Ipsos MORI or through meeting us in person, for their contributions and insights. (emphasis in original)

Apologies for the long quote but it was necessary to set the stage of the significance of:

…increasingly focusing on individual misconduct when investigating impropriety.

That policy grants a “bye” to corporations who benefit from individual mis-coduct, in favor of punishing individual actors within a corporation.

While granting the legitimacy of punishing individuals, corporations cannot act except by their agents, failing to punish corporations enables their shareholders to continue to benefit from illegal behavior.

Another point of significance, listing of countries on page 44, gives the percentage of respondents that agree “…bribery/corrupt practices happen widely…” as follows (in part):

Rank Country % Agree
30 Poland 34
31 Russia 34
32 U.S. 34

When the Justice Department gets hoity-toity about law and corruption, keep those figures in mind.

If the Justice Department representative you are talking to isn’t corrupt, it happens, there’s one on either side of them that probably is.

Topic maps can help ferret out or manage “corruption,” depending upon your point of view. Even structural corruption, take the U.S. political campaign donation process.

April 21, 2016

Scope Rules!

Filed under: Scope,Topic Maps — Patrick Durusau @ 8:33 pm

I was reminded of the power of scope (in the topic map sense) when I saw John D. Cook’s Quaternions in Paradise Lost.

[image: quaternions]

See John’s post for the details but in summary, Kuipers’ Quaternions and Rotation Sequences quoted a passage from Milton that used the term quaternion.

Your search appliance and most if not all of the public search engines will happily return all uses of quaternion without distinction. (Yes, I am implying there is more than one meaning for quaternion. See John’s post for the details.)

In addition to distinguishing between usages in Milton and Kuiper, scope can cleanly separate terms by agency, activity, government or other distinctions.
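A minimal sketch of what scope buys you here, using plain dictionaries rather than any particular topic map engine (the scope labels are illustrative):

```python
# One name, two subjects, distinguished by scope.
occurrences = [
    {"term": "quaternion", "scope": "mathematics",
     "ref": "Quaternions and Rotation Sequences (Kuipers)"},
    {"term": "quaternion", "scope": "Paradise Lost",
     "ref": "Milton, quoted by John D. Cook"},
]

def in_scope(items, term, scope):
    """Return only the occurrences of term that carry the requested scope."""
    return [o for o in items if o["term"] == term and o["scope"] == scope]

print(in_scope(occurrences, "quaternion", "mathematics"))   # Kuipers only
```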

Or you can simply wade through search glut.

Your call.

April 20, 2016

Searching for Subjects: Which Method is Right for You?

Filed under: Subject Identity,Topic Maps — Patrick Durusau @ 3:41 pm

Leaving to one side how to avoid re-evaluating the repetitive glut of materials from any search, there is the more fundamental problem: how do you search for a subject?

This is a back-of-the-envelope sketch that I will be expanding, but here goes:

Basic Search

At its most basic, a search consists of a <term>, and the search seeks strings that match that <term>.

Even allowing for Boolean operators, the matches against <term> are only and forever string matches.

Basic Search + Synonyms

Of course, as skilled searchers you will try not only one <term>, but several <synonym>s for the term as well.

A good example of that strategy is used at PubMed:

If you enter an entry term for a MeSH term the translation will also include an all fields search for the MeSH term associated with the entry term. For example, a search for odontalgia will translate to: “toothache”[MeSH Terms] OR “toothache”[All Fields] OR “odontalgia”[All Fields] because Odontalgia is an entry term for the MeSH term toothache. [PubMed Help]

The expansion to include the MeSH term Odontalgia is useful, but how do you maintain it?

A reader can see “toothache” and “Odontalgia” are treated as synonyms, but why remains elusive.

This is the area of owl:sameAs, the mapping of multiple subject identifiers/locators to a single topic, etc. You know that “sameness” exists, but why isn’t clear.
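A minimal sketch of that kind of expansion, mirroring the PubMed behavior described above (the synonym table is exactly the part somebody has to maintain):

```python
SYNONYMS = {
    "odontalgia": ["toothache"],   # MeSH entry term -> MeSH term
}

def expand(term):
    """Build an OR query over the term and its recorded synonyms."""
    variants = [term] + SYNONYMS.get(term.lower(), [])
    return " OR ".join(f'"{v}"' for v in variants)

print(expand("odontalgia"))   # "odontalgia" OR "toothache"
```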

Subject Identity Properties

In order to maintain a PubMed or similar mapping, you either need people who “know” the basis for the mappings or you need the mappings documented. That is, you record on what basis the mapping happened and what properties were present.

For example:

toothache

  Key                Value
  symptom            pain
  general-location   mouth
  specific-location  tooth

So if we are mapping terms to other terms and the specific location value reads “tongue,” then we know that isn’t a mapping to “toothache.”

How Far Do You Need To Go?

Of course for every term that we use as a key or value, there can be an expansion into key/value pairs, such as for tooth:

tooth

  Key                Value
  general-location   mouth
  composition        enamel coated bone
  use                biting, chewing
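A minimal sketch of using such key/value properties to accept or reject a mapping. The matching rule, every shared key must agree, is one choice among many:

```python
toothache = {"symptom": "pain", "general-location": "mouth",
             "specific-location": "tooth"}
candidate = {"symptom": "pain", "general-location": "mouth",
             "specific-location": "tongue"}

def may_map(a, b):
    """A candidate mapping survives only if every key both subjects share agrees."""
    shared = set(a) & set(b)
    return bool(shared) and all(a[k] == b[k] for k in shared)

print(may_map(toothache, candidate))   # False: specific-location differs
```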

Observations:

Each step towards more precise gathering of information increases your pre-search costs but decreases your post-search cost of casting out irrelevant material.

Moreover, precise gathering of information will help you avoid missing data simply due to data glut returns.

If maintenance of your mapping across generations is a concern, doing more than mapping synonyms “for reason or reasons unknown” may be in order.

The point is that your current retrieval, or lack thereof, of current and correct information has a cost. So does improving that retrieval.

The question of improved retrieval isn’t ideological but an ROI driven one.

  • If you have better mappings, will that give you an advantage over department/agency N?
  • Will better retrieval reduce (it will never eliminate) the time staff waste on voluminous search results?
  • Will more precision focus your limited resources (always limited) on highly relevant materials?

Formulate your own ROI questions and means of measuring them. Then reach out to topic maps to see how they improve (or not) your ROI.

Properly used, I think you are in for a pleasant surprise with topic maps.

April 18, 2016

Dictionary of Fantastic Vocabulary [Increasing the Need for Topic Maps]

Filed under: Dictionary,Topic Maps,Vocabularies — Patrick Durusau @ 4:46 pm

Dictionary of Fantastic Vocabulary by Greg Borenstein.

Alexis Lloyd tweeted this link along with:

This is utterly fantastic.

Well, it certainly increases the need for topic maps!

From the bot description on Twitter:

Generating new words with new meanings out of the atoms of English.

Ahem, are you sure about that?

Is a bot generating meaning?

Or are readers conferring meaning on the new words as they are read?

If, as I contend, readers confer meaning, the utterance of every “new” word opens up as many new meanings as there are readers of that “new” word.

Example of people conferring different meanings on a term?

Ask a dozen people what is meant by “shot” in:

It’s just a shot away

When Lisa Fischer breaks into her solo in:

[video embedded in original post]

(Best played loud.)

Differences in meanings make for funny moments, awkward pauses, blushes, in casual conversation.

What if the stakes are higher?

What if you need to produce (or destroy) all the emails by “bobby1.”

Is it enough to find some of them?

What have you looked for lately? Did you find all of it? Or only some of it?

New words appear everyday.

You are already behind. You will get further behind using search.

April 14, 2016

Visualizing Data Loss From Search

Filed under: Entity Resolution,Marketing,Record Linkage,Searching,Topic Maps,Visualization — Patrick Durusau @ 3:46 pm

I used searches for “duplicate detection” (3,854) and “coreference resolution” (3,290) in “Ironically, Entity Resolution has many duplicate names” [Data Loss] to illustrate potential data loss in searches.

Here is a rough visualization of the information loss if you use only one of those terms:

[image: duplicate-v-coreference-500-clipped]

If you search for “duplicate detection,” you miss all the articles shaded in blue.

If you search for “coreference resolution,” you miss all the articles shaded in yellow.

Suggestions for improving this visualization?
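For what it’s worth, a figure like this can be reproduced directly from the raw counts with matplotlib-venn, assuming it and matplotlib are installed (the exclusive regions are each total minus the 76-article overlap):

```python
import matplotlib.pyplot as plt
from matplotlib_venn import venn2

overlap = 76
dup_only = 3854 - overlap      # "duplicate detection" only
coref_only = 3290 - overlap    # "coreference resolution" only

venn2(subsets=(dup_only, coref_only, overlap),
      set_labels=("duplicate detection", "coreference resolution"))
plt.savefig("duplicate-v-coreference.png", dpi=150)
```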

It is a visualization that could be performed on a client’s data, using their search engine/database, in order to identify the data loss they are suffering now from search across departments.

With the caveat that not all data loss is bad and/or worth avoiding.

Imaginary example (so far): What if you could demonstrate no overlap in terminology between two vendors for the United States Army and the Air Force? That is, no query terms for one returned useful results for the other.

That is a starting point for evaluating the use of topic maps.

While the divergence in terminologies is a given, the next question is: What is the downside to that divergence? What capability is lost due to that divergence?

Assuming you can identify such a capability, the next question is to evaluate the cost of reducing and/or eliminating that divergence versus the claimed benefit.

I assume the most relevant terms are going to be those internal to customers and/or potential customers.

Interest in working this up into a client prospecting/topic map marketing tool?


Separately I want to note my discovery (you probably already knew about it) of VennDIS: a JavaFX-based Venn and Euler diagram software to generate publication quality figures. Download here. (Apologies, the publication itself is firewalled.)

The export defaults to 800 x 800 resolution. If you need something smaller, edit the resulting image in Gimp.

It’s a testament to the software that I was able to produce a useful image in less than a day. Kudos!

April 13, 2016

“Ironically, Entity Resolution has many duplicate names” [Data Loss]

Filed under: Entity Resolution,Topic Maps — Patrick Durusau @ 8:45 pm

Nancy Baym tweeted:

“Ironically, Entity Resolution has many duplicate names” – Lise Getoor

[image: entity-resolution]

I can’t think of any subject that doesn’t have duplicate names.

Can you?

In a “search driven” environment, not knowing the “duplicate” names for a subject means data loss.

Data loss that could include “smoking gun” data.

Topic mappers have been making that pitch for decades but it never has really caught fire.

I don’t think anyone doubts that data loss occurs, but the gravity of that data loss remains elusive.

For example, let’s take three duplicate names for entity resolution from the slide: duplicate detection, reference reconciliation, coreference resolution.

Supplying all three as quoted strings to CiteSeerX, any guesses on the number of “hits” returned?

As of April 13, 2016:

  • duplicate detection – 3,854
  • reference reconciliation – 253
  • coreference resolution – 3,290

When running the query "duplicate detection" "coreference resolution", only 76 “hits” are returned, meaning that there are only 76 overlapping cases among the 7,144 total hits for those two terms taken separately.

That’s assuming CiteSeerX isn’t shorting me on results due to server load, etc. I would have to cross-check the data itself before I would swear to those figures.

But consider just the raw numbers I report today: duplicate detection – 3,854, coreference resolution – 3,290, with 76 overlapping cases.
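To put a number on the loss, a quick calculation from those counts (simple inclusion-exclusion over the two result sets):

```python
dup, coref, both = 3854, 3290, 76

union = dup + coref - both         # distinct articles across the two terms
missed_by_dup_only = coref - both  # articles only "coreference resolution" finds
print(union, missed_by_dup_only, f"{missed_by_dup_only / union:.0%}")
# 7068 3214 45%
```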

That’s two distinct lines of research on the same problem, each, for the most part, ignoring the other.

What do you think the odds are of duplication of techniques, experiences, etc., spread out over those 7,144 articles?

Instead of you or your client duplicating a known-to-somebody solution, you could be building an enhanced solution.

Well, except for the data loss due to “duplicate names” in a search environment.

And that you would have to re-read all the articles in order to find which technique or advancement was made in each article.

Multiply that by everyone who is interested in a subject and it’s a non-trivial amount of effort.

How would you like to avoid data loss and duplication of effort?

April 12, 2016

Coeffects: Context-aware programming languages – Subject Identity As Type Checking?

Filed under: Subject Identity,TMRM,Topic Maps,Types — Patrick Durusau @ 3:14 pm

Coeffects: Context-aware programming languages by Tomas Petricek.

From the webpage:

Coeffects are Tomas Petricek‘s PhD research project. They are a programming language abstraction for understanding how programs access the context or environment in which they execute.

The context may be resources on your mobile phone (battery, GPS location or a network printer), IoT devices in a physical neighborhood or historical stock prices. By understanding the neighborhood or history, a context-aware programming language can catch bugs earlier and run more efficiently.

This page is an interactive tutorial that shows a prototype implementation of coeffects in a browser. You can play with two simple context-aware languages, see how the type checking works and how context-aware programs run.

This page is also an experiment in presenting programming language research. It is a live environment where you can play with the theory using the power of new media, rather than staring at dead pieces of wood (although we have those too).

(break from summary)

Programming languages evolve to reflect the changes in the computing ecosystem. The next big challenge for programming language designers is building languages that understand the context in which programs run.

This challenge is not easy to see. We are so used to working with context using the current cumbersome methods that we do not even see that there is an issue. We also do not realize that many programming features related to context can be captured by a simple unified abstraction. This is what coeffects do!

What if we extend the idea of context to include the context within which words appear?

For example, writing a police report, the following sentence appeared:

There were 20 or more <proxy string=”black” pos=”noun” synonym=”African” type=”race”/>s in the group.

For display purposes, the string value “black” appears in the sentence:

There were 20 or more blacks in the group.

But a search for the color “black” would not return that report because the type = color does not match type = race.

On the other hand, if I searched for African-American, that report would show up because “black” with type = race is recognized as a synonym for people of African extraction.
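A minimal sketch of how a search layer might use such a proxy, keyed off the type and synonym attributes in the example above (the element and attribute names follow my illustration, not any established standard):

```python
import xml.etree.ElementTree as ET

report = '<proxy string="black" pos="noun" synonym="African" type="race"/>'

def matches(proxy_xml, query_term, query_type):
    """Match a query against a proxy by type plus display string or synonym."""
    proxy = ET.fromstring(proxy_xml)
    if proxy.get("type") != query_type:
        return False
    return query_term.lower() in {proxy.get("string", "").lower(),
                                  proxy.get("synonym", "").lower()}

print(matches(report, "black", "color"))    # False: type is race, not color
print(matches(report, "African", "race"))   # True: synonym with matching type
```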

Inline proxies are the easiest to illustrate but that is only one way to serialize such a result.

If done in an authoring interface, such an approach would have the distinct advantage of offering the original author the choice of subject properties.

The advantage of involving the original author is that they have an interest in and awareness of the document in question. Quite unlike automated processes that later attempt annotation by rote.

April 9, 2016

No Perception Without Cartography [Failure To Communicate As Cartographic Failure]

Filed under: Perception,Subject Identity,Topic Maps — Patrick Durusau @ 8:12 pm

Dan Klyn tweeted:

No perception without cartography

with an image of this text (from Self comes to mind: constructing the conscious mind by Antonio R Damasio):


The nonverbal kinds of images are those that help you display mentally the concepts that correspond to words. The feelings that make up the background of each mental instant and that largely signify aspects of the body state are images as well. Perception, in whatever sensory modality, is the result of the brain’s cartographic skill.

Images represent physical properties of entities and their spatial and temporal relationships, as well as their actions. Some images, which probably result from the brain’s making maps of itself making maps, are actually quite abstract. They describe patterns of occurrence of objects in time and space, the spatial relationships and movement of objects in terms of velocity and trajectory, and so forth.

Dan’s tweet spurred me to think that our failures to communicate to others could be described as cartographic failures.

If we use a term that is unknown to the average reader, say “daat,” the reader lacks a mental mapping that enables interpretation of that term.

Even if you know the term, it doesn’t stand in isolation in your mind. It fits into a number of maps, some of which you may be able to articulate and very possibly into other maps, which remain beyond your (and our) ken.

Not that this is a light-bulb moment for you or me, but perhaps the cartographic imagery may be helpful in illustrating both the value and the risks of topic maps.

The value of topic maps is spoken of often but the risks of topic maps rarely get equal press.

How would topic maps be risky?

Well, consider the average spreadsheet used in a business setting.

Felienne Hermans in Spreadsheets: The Ununderstood Dark Matter of IT makes a persuasive case that spreadsheets are on average five years old with little or no documentation.

If those spreadsheets remain undocumented, both users and auditors are equally stymied by their ignorance, a cartographic failure that leaves both wondering what must have been meant by columns and operations in the spreadsheet.

To the extent that a topic map or other disclosure mechanism preserves and/or restores the cartography that enables interpretation of the spreadsheet, suddenly staff are no longer plausibly ignorant of the purpose or consequences of using the spreadsheet.

Facile explanations that change from audit to audit are no longer possible. Auditors are chargeable with consistent auditing from one audit to another.

Does it sound like there is going to be a rush to use topic maps or other mechanisms to make spreadsheets transparent?

Still, transparency that befalls one could well benefit another.

Or to paraphrase King David (2 Samuel 11:25):

Don’t concern yourself about this. In business, transparency falls on first one and then another.

Ready to inflict transparency on others?

April 8, 2016

“No One Willingly Gives Away Power”

Filed under: Government,Intelligence,Topic Maps — Patrick Durusau @ 3:33 pm

Matthew Schofield in European anti-terror efforts hobbled by lack of trust, shared intelligence hits upon the primary reason for resistance to topic maps and other knowledge integration technologies.

Especially in intelligence, knowledge is power. No one willingly gives away power.” (Magnus Ranstorp, Swedish National Defense University)

From clerks who sort mail to accountants who cook the books to lawyers that defend patents and everyone else in between, everyone in an enterprise has knowledge, knowledge that gives them power others don’t have.

Topic maps have been pitched on a “greater good for the whole” basis but as Magnus points out, who the hell really wants that?

When confronted with a new technique, technology, methodology, the first and foremost question on everyone’s mind is:

Do I have more/less power/status with X?

A [image in original post] approach loses power.

A [image in original post] approach gains power.

Relevant lyrics:

Oh, there ain’t no rest for the wicked
Money don’t grow on trees
I got bills to pay
I got mouths to feed
And ain’t nothing in this world for free
No I can’t slow down
I can’t hold back
Though you know I wish I could
No there ain’t no rest for the wicked
Until we close our eyes for good

Sell topic maps to increase/gain power.

PS: Keep the line “No one willingly gives away power” in mind during discussions of why the ICIJ refuses to share the Panama Papers with the public.

April 5, 2016

Pentagon Confirms Crowdsourcing of Map Data

Filed under: Crowd Sourcing,Mapping,Maps,Military,Topic Maps — Patrick Durusau @ 10:01 am

I have mentioned before, in Tracking NSA/CIA/FBI Agents Just Got Easier and The DEA is Stalking You!, how citizens can invite federal agents to join the goldfish bowl being prepared for the average citizen.

Of course, that’s just me saying it, unless and until the Pentagon confirms the crowdsourcing of map data!

Aliya Sternstein writes in Soldiers to Help Crowdsource Spy Maps:


“What a great idea if we can get our soldiers adding fidelity to the maps and operational picture that we already have” in Defense systems, Gordon told Nextgov. “All it requires is pushing out our product in a manner that they can add data to it against a common framework.”

Comparing mapping parties to combat support activities, she said, soldiers are deployed in some pretty remote areas where U.S. forces are not always familiar with the roads and the land, partly because they tend to change.

If troops have a base layer, “they can do basically the same things that that social party does and just drop pins and add data,” Gordon said from a meeting room at the annual Esri conference. “Think about some of the places in Africa and some of the less advantaged countries that just don’t have addresses in the way we do” in the United States.

Of course, you already realize the value of crowd-sourcing surveillance of government agents but for the c-suite crowd, confirmation from a respected source (the Pentagon) may help push your citizen surveillance proposal forward.

BTW, while looking at Army GeoData research plans (pages 228-232), I ran across this passage:

This effort integrates behavior and population dynamics research and analysis to depict the operational environment including culture, demographics, terrain, climate, and infrastructure, into geospatial frameworks. Research exploits existing open source text, leverages multi-media and cartographic materials, and investigates data collection methods to ingest geospatial data directly from the tactical edge to characterize parameters of social, cultural, and economic geography. Results of this research augment existing conventional geospatial datasets by providing the rich context of the human aspects of the operational environment, which offers a holistic understanding of the operational environment for the Warfighter. This item continues efforts from Imagery and GeoData Sciences, and Geospatial and Temporal Information Structure and Framework and complements the work in PE 0602784A/Project T41.

Doesn’t that just reek with subjects that would be identified differently in intersecting information systems?

One solution would be to fashion top down mapping systems that are months if not years behind demands in an operational environment. Sort of like tanks that overheat in jungle warfare.

Or you could do something a bit more dynamic that provides a “good enough” mapping for operational needs and yet also has the information necessary to integrate it with other temporary solutions.

March 18, 2016

Pardon the Intermission

Filed under: Topic Maps — Patrick Durusau @ 8:18 pm

Apologies for the absence of posts from March 15, 2016 until this one today.

I made an unplanned trip to the local hospital via ambulance around 8:00 AM on the 15th and managed to escape on the afternoon of March 17, 2016.

On the downside, I didn’t have any way to explain my sudden absence from the Net.

On the upside I had a lot of non-computer assisted time to think about topic maps, etc., while being poked, prodded, waiting for lab results, etc.

Not to mention I re-read the first two Harry Potter books. 😉

I have one interesting item for today and will be posting about my non-computer assisted thinking about topic maps in the near future.

Your interest in this blog and comments are always appreciated!

March 10, 2016

Technology Adoption – Nearly A Vertical Line (To A Diminished IQ)

Filed under: Filters,Topic Maps — Patrick Durusau @ 9:11 pm

[image: tech-adoption]

From: There’s a major long-term trend in the economy that isn’t getting enough attention by Rick Rieder.

From the post:

As the chart above shows, people in the U.S. today are adopting new technologies, including tablets and smartphones, at the swiftest pace we’ve seen since the advent of the television. However, while television arguably detracted from U.S. productivity, today’s advances in technology are generally geared toward greater efficiency at lower costs. Indeed, when you take into account technology’s downward influence on price, U.S. consumption and productivity figures look much better than headline numbers would suggest.

Hmmm, did you catch that?

…while television arguably detracted from U.S. productivity, today’s advances in technology are generally geared toward greater efficiency at lower costs.

Really? Rick must have missed the memo on how multitasking (one aspect of smart phones, tablets, etc.) lowers your IQ by 15 points. About what you would expect from smoking a joint.

If technology encourages multitasking, making us dumber, then we are becoming less efficient. Yes?

Imagine if, instead of scrolling past tweets with images of cats, food and irrelevant messages every time you look at your Twitter timeline, you got the two or three tweets relevant to your job function.

Each of those not-on-task tweets chips away at the amount of attention span you have to spend on the two or three truly important tweets.

Apps that consolidate, filter and diminish information flow are the path to greater productivity.

Topic maps anyone?

Growthverse

Filed under: Marketing,Topic Maps — Patrick Durusau @ 8:41 pm

Growthverse

From the webpage:

Growthverse was built for marketers, by marketers, with input from more than 100 CMOs.

Explore 800 marketing technology companies (and growing).

I originally arrived at this site here.

Interesting visualization that may result in suspects (they’re not prospects until you have serious discussions) for topic map based tools.

The site says the last update was in September 2015 so take heed that the data is stale by about six months.

That said, it’s easier than hunting down the 800+ companies on your own.

Good hunting!

March 9, 2016

Wandora – New Release – 2016-03-08

Filed under: Topic Map Software,Topic Maps,Wandora — Patrick Durusau @ 11:21 am

Wandora – New Release – 2016-03-08

The homepage reports:

New Wandora version has been released today (2016-03-08). The release adds Wandora support to MariaDB and PostgreSQL database topic maps. Wandora has now more stylish look, especially in Traditional topic map. The release fixes many known bugs.

I’m no style or UI expert but I’m not sure where I should be looking for the “…more stylish look….” 😉

From the main window:

[image: wandora-main]

If you select Tools and then Tools Manager (or Ctrl-t [lower case, contrary to the drop down menu]), you will see a list of all tools (300+) with the message:

All known tools are listed here. The list contains also unfinished, buggy and depracated tools. Running such tool may cause exceptions and unpredictable behavior. We suggest you don’t run the tools listed here unless you really know what you are doing.

It is a very impressive set of tools!

There is no lack of place to explore in Wandora and to explore with Wandora.

Enjoy!

March 5, 2016

Overlay Journal – Discrete Analysis

Filed under: Discrete Structures,Mathematics,Publishing,Topic Maps,Visualization — Patrick Durusau @ 10:45 am

The arXiv overlay journal Discrete Analysis has launched by Christian Lawson-Perfect.

From the post:

Discrete Analysis, a new open-access journal for articles which are “analytical in flavour but that also have an impact on the study of discrete structures”, launched this week. What’s interesting about it is that it’s an arXiv overlay journal founded by, among others, Timothy Gowers.

What that means is that you don’t get articles from Discrete Analysis – it just arranges peer review of papers held on the arXiv, cutting out almost all of the expensive parts of traditional journal publishing. I wasn’t really prepared for how shallow that makes the journal’s website – there’s a front page, and when you click on an article you’re shown a brief editorial comment with a link to the corresponding arXiv page, and that’s it.

But that’s all it needs to do – the opinion of Gowers and co. is that the only real value that journals add to the papers they publish is the seal of approval gained by peer review, so that’s the only thing they’re doing. Maths papers tend not to benefit from the typesetting services traditional publishers provide (or, more often than you’d like, are actively hampered by it).

One way the journal is adding value beyond a “yes, this is worth adding to the list of papers we approve of” is by providing an “editorial introduction” to accompany each article. These are brief notes, written by members of the editorial board, which introduce the topics discussed in the paper and provide some context, to help you decide if you want to read the paper. That’s a good idea, and it makes browsing through the articles – and this is something unheard of on the internet – quite pleasurable.

It’s not difficult to imagine “editorial introductions” with underlying mini-topic maps that could be explored on their own, or that, as you reach the “edge” of a particular topic map, “unfold” to reveal more associations/topics.

Not unlike a traditional street map for New York, which you can unfold to find general areas and then fold up to focus more tightly on a particular area.

I hesitate to say “zoom” because in the application I have seen (important qualification), “zoom” uniformly reduces your field of view.

A more nuanced notion of “zoom,” for a topic map and perhaps for other maps as well, would be to hold portions of the current view stationary, say a starting point on an interstate highway and to “zoom” only a portion of the current view to show a detailed street map. That would enable the user to see a particular location while maintaining its larger context.

Pointers to applications that “zoom” but also maintain different levels of “zoom” in the same view? Given the fascination with “hairy” presentations of graphs, that would have to be a real winner.

March 4, 2016

Facts Before Policy? – Digital Security – Contacts – Volunteer Opportunity

Filed under: Cybersecurity,Government,Topic Maps — Patrick Durusau @ 12:53 pm

Rep. McCaul, Michael T. [R-TX-10] has introduced H.R.4651 – Digital Security Commission Act of 2016, full text here, a proposal to form the National Commission on Security and Technology Challenges.

From the proposal:

(2) To submit to Congress a report, which shall include, at a minimum, each of the following:

(A) An assessment of the issue of multiple security interests in the digital world, including public safety, privacy, national security, and communications and data protection, both now and throughout the next 10 years.

(B) A qualitative and quantitative assessment of—

(i) the economic and commercial value of cryptography and digital security and communications technology to the economy of the United States;

(ii) the benefits of cryptography and digital security and communications technology to national security and crime prevention;

(iii) the role of cryptography and digital security and communications technology in protecting the privacy and civil liberties of the people of the United States;

(iv) the effects of the use of cryptography and other digital security and communications technology on Federal, State, and local criminal investigations and counterterrorism enterprises;

(v) the costs of weakening cryptography and digital security and communications technology standards; and

(vi) international laws, standards, and practices regarding legal access to communications and data protected by cryptography and digital security and communications technology, and the potential effect the development of disparate, and potentially conflicting, laws, standards, and practices might have.

(C) Recommendations for policy and practice, including, if the Commission determines appropriate, recommendations for legislative changes, regarding—

(i) methods to be used to allow the United States Government and civil society to take advantage of the benefits of digital security and communications technology while at the same time ensuring that the danger posed by the abuse of digital security and communications technology by terrorists and criminals is sufficiently mitigated;

(ii) the tools, training, and resources that could be used by law enforcement and national security agencies to adapt to the new realities of the digital landscape;

(iii) approaches to cooperation between the Government and the private sector to make it difficult for terrorists to use digital security and communications technology to mobilize, facilitate, and operationalize attacks;

(iv) any revisions to the law applicable to wiretaps and warrants for digital data content necessary to better correspond with present and future innovations in communications and data security, while preserving privacy and market competitiveness;

(v) proposed changes to the procedures for obtaining and executing warrants to make such procedures more efficient and cost-effective for the Government, technology companies, and telecommunications and broadband service providers; and

(vi) any steps the United States could take to lead the development of international standards for requesting and obtaining digital evidence for criminal investigations and prosecutions from a foreign, sovereign State, including reforming the mutual legal assistance treaty process, while protecting civil liberties and due process.

Excuse the legalese, but this is clearly an effort that could provide a factual, as opposed to fantasy, basis for further action on digital security. No one can guarantee a sensible result, but without a factual basis, any legislation is certainly going to be wrong.

For your convenience and possible employment/volunteering, here are the co-sponsors of this bill, with hyperlinks to their congressional homepages:

[co-sponsor list in original post]

Now would be a good time to pitch yourself for involvement in this possible commission.

Pay attention to Section 8 of the bill:

SEC. 8. Staff.

(a)Appointment.—The chairman and vice chairman shall jointly appoint and fix the compensation of an executive director and of and such other personnel as may be necessary to enable the Commission to carry out its functions under this Act.

(b)Security clearances.—The appropriate Federal agencies or departments shall cooperate with the Commission in expeditiously providing appropriate security clearances to Commission staff, as may be requested, to the extent possible pursuant to existing procedures and requirements, except that no person shall be provided with access to classified information without the appropriate security clearances.

(c)Detailees.—Any Federal Government employee may be detailed to the Commission on a reimbursable basis, and such detailee shall retain without interruption the rights, status, and privileges of his or her regular employment.

(d)Expert and consultant services.—The Commission is authorized to procure the services of experts and consultants in accordance with section 3109 of title 5, United States Code, but at rates not to exceed the daily rate paid a person occupying a position level IV of the Executive Schedule under section 5315 of title 5, United States Code.

(e)Volunteer services.—Notwithstanding section 1342 of title 31, United States Code, the Commission may accept and use voluntary and uncompensated services as the Commission determines necessary.

I can sense you wondering what:

the daily rate paid a person occupying a position level IV of the Executive Schedule under section 5315 of title 5, United States Code

means, in practical terms.

I was tempted to point you to: 5 U.S. Code § 5315 – Positions at level IV, but that would be cruel and uninformative. 😉

I did track down the Executive Schedule, which lists position level IV:

Annual: $160,300, or about $80.15/hour assuming a 2,000-hour work year, which works out to $641.20 for an 8-hour day.
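If you want to check my arithmetic, here is the back-of-the-envelope calculation. The 2,000-hour work year is my round-number assumption, nothing official:

    # Back-of-the-envelope daily rate for Executive Schedule level IV.
    # Assumption: a 2,000-hour work year (50 weeks x 40 hours), chosen
    # for round numbers rather than any official payroll divisor.
    annual_salary = 160_300
    hours_per_year = 2_000
    hourly_rate = annual_salary / hours_per_year   # 80.15
    daily_rate = hourly_rate * 8                   # 641.20

    print(f"Hourly: ${hourly_rate:.2f}")
    print(f"Daily:  ${daily_rate:.2f}")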

If you do volunteer or get a paying gig, please remember that omission and/or manipulation of subject identity properties can render otherwise open data opaque.

I don’t know of any primes that are making money off of topic maps currently, so it isn’t likely to gain traction with current primes. On the other hand, new primes can and do occur. Not often, but it happens.

PS: If the extra links to contacts, content, etc. are helpful, please let me know. I started off reading the link-poor RSA 2016: McCaul calls backdoors ineffective, pushes for tech panel to solve security issues. Little more than debris on your information horizon.

March 3, 2016

11 Million Pages of CIA Files [+ Allen Dulles, war criminal]

Filed under: Government,Government Data,Topic Maps — Patrick Durusau @ 6:54 pm

11 Million Pages of CIA Files May Soon Be Shared By This Kickstarter by Joseph Cox.

From the post:

Millions of pages of CIA documents are stored in Room 3000. The CIA Records Search Tool (CREST), the agency’s database of declassified intelligence files, is only accessible via four computers in the National Archives Building in College Park, MD, and contains everything from Cold War intelligence, research and development files, to images.

Now one activist is aiming to get those documents more readily available to anyone who is interested in them, by methodically printing, scanning, and then archiving them on the internet.

“It boils down to freeing information and getting as much of it as possible into the hands of the public, not to mention journalists, researchers and historians,” Michael Best, analyst and freedom of information activist told Motherboard in an online chat.

Best is trying to raise $10,000 on Kickstarter in order to purchase the high speed scanner necessary for such a project, a laptop, office supplies, and to cover some other costs. If he raises more than the main goal, he might be able to take on the archiving task full-time, as well as pay for FOIAs to remove redactions from some of the files in the database. As a reward, backers will help to choose what gets archived first, according to the Kickstarter page.

“Once those “priority” documents are done, I’ll start going through the digital folders more linearly and upload files by section,” Best said. The files will be hosted on the Internet Archive, which converts documents into other formats too, such as for Kindle devices, and sometimes text-to-speech for e-books. The whole thing has echoes of Cryptome—the freedom of information duo John Young and Deborah Natsios, who started off scanning documents for the infamous cypherpunk mailing list in the 1990s.

Good news! Kickstarter has announced this project funded!

Additional funding will help make this archive of documents available sooner rather than later.

As opposed to an attempt to boil the ocean of 11 million pages of CIA files, what about smaller topic mapping/indexing projects that focus on bounded sub-sets of documents of interest to particular communities?

I don’t have any interest in the STAR GATE project (clairvoyance, precognition, or telepathy, continued now by the DHS at airport screening facilities) but would be very interested in the records of Allen Dulles, a war criminal of some renown.

Just so you know, Michael has already uploaded documents on Allen Dulles from the CIA Records Search Tool (CREST):

History of Allen Welsh Dulles as CIA Director – Volume I: The Man

History of Allen Welsh Dulles as CIA Director – Volume II: Coordination of Intelligence

History of Allen Welsh Dulles as CIA Director – Volume III: Covert Activities

History of Allen Welsh Dulles as CIA Director – Volume IV: Congressional Oversight and Internal Administration

History of Allen Welsh Dulles as CIA Director – Volume V: Intelligence Support of Policy

To describe Allen Dulles as a war criminal is no hyperbole. The overthrow of President Jacobo Arbenz Guzman of Guatemala (think United Fruit Company) and the removal of Mohammad Mossadeq, prime minister of Iran (think Shah of Iran), are only two of his crimes, the full extent of which will probably never be known.

Files are being uploaded to That 1 Archive.

March 2, 2016

Graph Encryption: Going Beyond Encrypted Keyword Search [Subject Identity Based Encryption]

Filed under: Cryptography,Cybersecurity,Graphs,Subject Identity,Topic Maps — Patrick Durusau @ 4:49 pm

Graph Encryption: Going Beyond Encrypted Keyword Search by Xianrui Meng.

From the post:

Encrypted search has attracted a lot of attention from practitioners and researchers in academia and industry. In previous posts, Seny already described different ways one can search on encrypted data. Here, I would like to discuss search on encrypted graph databases which are gaining a lot of popularity.

1. Graph Databases and Graph Privacy

As today’s data is getting bigger and bigger, traditional relational database management systems (RDBMS) cannot scale to the massive amounts of data generated by end users and organizations. In addition, RDBMSs cannot effectively capture certain data relationships; for example in object-oriented data structures which are used in many applications. Today, NoSQL (Not Only SQL) has emerged as a good alternative to RDBMSs. One of the many advantages of NoSQL systems is that they are capable of storing, processing, and managing large volumes of structured, semi-structured, and even unstructured data. NoSQL databases (e.g., document stores, wide-column stores, key-value (tuple) store, object databases, and graph databases) can provide the scale and availability needed in cloud environments.

In an Internet-connected world, graph database have become an increasingly significant data model among NoSQL technologies. Social networks (e.g., Facebook, Twitter, Snapchat), protein networks, electrical grid, Web, XML documents, networked systems can all be modeled as graphs. One nice thing about graph databases is that they store the relations between entities (objects) in addition to the entities themselves and their properties. This allows the search engine to navigate both the data and their relationships extremely efficiently. Graph databases rely on the node-link-node relationship, where a node can be a profile or an object and the edge can be any relation defined by the application. Usually, we are interested in the structural characteristics of such a graph databases.

What do we mean by the confidentiality of a graph? And how to do we protect it? The problem has been studied by both the security and database communities. For example, in the database and data mining community, many solutions have been proposed based on graph anonymization. The core idea here is to anonymize the nodes and edges in the graph so that re-identification is hard. Although this approach may be efficient, from a security point view it is hard to tell what is achieved. Also, by leveraging auxiliary information, researchers have studied how to attack this kind of approach. On the other hand, cryptographers have some really compelling and provably-secure tools such as ORAM and FHE (mentioned in Seny’s previous posts) that can protect all the information in a graph database. The problem, however, is their performance, which is crucial for databases. In today’s world, efficiency is more than running in polynomial time; we need solutions that run and scale to massive volumes of data. Many real world graph datasets, such as biological networks and social networks, have millions of nodes, some even have billions of nodes and edges. Therefore, besides security, scalability is one of main aspects we have to consider.

2. Graph Encryption

Previous work in encrypted search has focused on how to search encrypted documents, e.g., doing keyword search, conjunctive queries, etc. Graph encryption, on the other hand, focuses on performing graph queries on encrypted graphs rather than keyword search on encrypted documents. In some cases, this makes the problem harder since some graph queries can be extremely complex. Another technical challenge is that the privacy of nodes and edges needs to be protected but also the structure of the graph, which can lead to many interesting research directions.

Graph encryption was introduced by Melissa Chase and Seny in [CK10]. That paper shows how to encrypt graphs so that certain graph queries (e.g., neighborhood, adjacency and focused subgraphs) can be performed (though the paper is more general as it describes structured encryption). Seny and I, together with Kobbi Nissim and George Kollios, followed this up with a paper last year [MKNK15] that showed how to handle more complex graphs queries.

Apologies for the long quote but I thought this topic might be new to some readers. Xianrui goes on to describe a solution for efficient queries over encrypted graphs.

Chase and Kamara remark in Structured Encryption and Controlled Disclosure, CK10:


To address this problem we introduce the notion of structured encryption. A structured encryption scheme encrypts structured data in such a way that it can be queried through the use of a query-specific token that can only be generated with knowledge of the secret key. In addition, the query process reveals no useful information about either the query or the data. An important consideration in this context is the efficiency of the query operation on the server side. In fact, in the context of cloud storage, where one often works with massive datasets, even linear time operations can be infeasible. (emphasis in original)

With just a little nudging, their:

A structured encryption scheme encrypts structured data in such a way that it can be queried through the use of a query-specific token that can only be generated with knowledge of the secret key.

could be re-stated as:

A subject identity encryption scheme leaves out merging data in such a way that the resulting topic map can only be queried with knowledge of the subject identity merging key.

You may have topics that represent diagnoses such as cancer, AIDS, sexual contacts, but if none of those can be associated with individuals who are also topics in the map, there is no more disclosure than census results for a metropolitan area and a list of the citizens therein.

That is you are missing the critical merging data that would link up (associate) any diagnosis with a given individual.

Multi-property subject identities would make the problem even harder, to say nothing of conferring properties on the basis of supplied properties as part of the merging process.

One major benefit of a subject identity based approach is that without the merging key, any data set, however sensitive the information, is just a data set, until you have the basis for solving its subject identity riddle.

PS: With the usual caveats of not using social security numbers, birth dates and the like as your subject identity properties. At least not in the map proper. I can think of several ways to generate keys for merging that would be resistant to even brute force attacks.
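To make that less hand-wavy, here is a minimal sketch of the key holder’s side of one such scheme, using a keyed HMAC over a topic’s identity properties. The function names, property names and sample topics are all mine, invented for illustration; this is not a vetted protocol:

    import hmac, hashlib
    from collections import defaultdict

    def merge_key(identity_props: dict, secret: bytes) -> str:
        """Keyed digest over a topic's subject identity properties.

        The published map can carry this digest instead of the raw
        properties (no SSNs or birth dates "in the map proper").
        Without `secret`, a guessed property value cannot be turned
        into the digest, so brute forcing the merge gets you nowhere.
        """
        canonical = "|".join(f"{k}={identity_props[k]}" for k in sorted(identity_props))
        return hmac.new(secret, canonical.encode("utf-8"), hashlib.sha256).hexdigest()

    def merge(topics, secret: bytes):
        """Group topics by recomputed merge key; only a key holder can do this."""
        merged = defaultdict(list)
        for t in topics:
            merged[merge_key(t["identity"], secret)].append(t)
        return merged

    secret = b"held-by-the-map-owner"
    topics = [
        {"identity": {"mrn": "A-1001"}, "type": "patient"},
        {"identity": {"mrn": "A-1001"}, "type": "diagnosis", "value": "redacted"},
    ]
    for key, group in merge(topics, secret).items():
        print(key[:12], [t["type"] for t in group])

Published without the digests, or with digests but without the secret, the patient topic and the diagnosis topic are just two unrelated records; the association only appears for whoever holds the merging key.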

Ping me if you are interested in pursuing that on a data set.

February 28, 2016

Earthdata Search – Smells Like A Topic Map?*

Filed under: Ecoinformatics,Government Data,Topic Maps — Patrick Durusau @ 3:06 pm

Earthdata Search

From the webpage:

Search NASA Earth Science data by keyword and filter by time or space.

After choosing tour:

Keyword Search

Here you can enter search terms to find relevant data. Search terms can be science terms, instrument names, or even collection IDs. Let’s start by searching for Snow Cover NRT to find near real-time snow cover data. Type Snow Cover NRT in the keywords box and press Enter.

That returns a screen with three sections, left to right: Browse Collections; 21 Matching Collections (“Add collections to your project to compare and retrieve their data”); and a world map (navigate by grabbing the view).

Under Browse Collections:

In addition to searching for keywords, you can narrow your search through this list of terms. Click Platform to expand the list of platforms (still in a tour box)

Next step:

Now click Terra to select the Terra satellite.

Comment: Wondering how I will know which “platform” or “instrument” to select? There may be more/better documentation but I haven’t seen it yet.

The data follows the Unified Metadata Model (UMM):

NASA’s Common Metadata Repository (CMR) is a high-performance, high-quality repository for earth science metadata records that is designed to handle metadata at the Concept level. Collections and Granules are common metadata concepts in the Earth Observation (EO) world, but this can be extended out to Visualizations, Parameters, Documentation, Services, and more. The CMR metadata records are supplied by a diverse array of data providers, using a variety of supported metadata standards, including:

[Table: CMR-supported metadata standards]

Initially, designers of the CMR considered standardizing all CMR metadata to a single, interoperable metadata format – ISO 19115. However, NASA decided to continue supporting multiple metadata standards in the CMR — in response to concerns expressed by the data provider community over the expense involved in converting existing metadata systems to systems capable of generating ISO 19115. In order to continue supporting multiple metadata standards, NASA designed a method to easily translate from one supported standard to another and constructed a model to support the process. Thus, the Unified Metadata Model (UMM) for EOSDIS metadata was born as part of the EOSDIS Metadata Architecture Studies (MAS I and II) conducted between 2012 and 2013.

What is the UMM?

The UMM is an extensible metadata model which provides a ‘Rosetta stone’ or cross-walk for mapping between CMR-supported metadata standards. Rather than create mappings from each CMR-supported metadata standard to each other, each standard is mapped centrally to the UMM model, thus reducing the number of translations required from n x (n-1) to 2n.

Here is the mapping graphic:

[Diagram: the UMM as the central crosswalk between CMR-supported metadata standards]

Granted, the profiles don’t make the basis for the mappings explicit, but the mappings have the same impact post-mapping as a topic map would post-merging.
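To see why the hub matters, here is a toy crosswalk in the same spirit: each standard registers one converter into a central model and one out of it, so adding a standard costs two converters rather than one per existing standard. The standard names echo real CMR-supported formats, but every field name below is invented; this is not the actual UMM mapping:

    # Toy hub-and-spoke crosswalk: each standard supplies one to_hub and
    # one from_hub converter (2n total) instead of n*(n-1) pairwise
    # translators. Field names are invented for illustration.
    HUB_CONVERTERS = {
        "dif10": {
            "to_hub":   lambda r: {"title": r["Entry_Title"], "start": r["Start_Date"]},
            "from_hub": lambda u: {"Entry_Title": u["title"], "Start_Date": u["start"]},
        },
        "echo10": {
            "to_hub":   lambda r: {"title": r["DataSetId"], "start": r["BeginDateTime"]},
            "from_hub": lambda u: {"DataSetId": u["title"], "BeginDateTime": u["start"]},
        },
        "iso19115": {
            "to_hub":   lambda r: {"title": r["citation_title"], "start": r["temporal_begin"]},
            "from_hub": lambda u: {"citation_title": u["title"], "temporal_begin": u["start"]},
        },
    }

    def translate(record: dict, source: str, target: str) -> dict:
        """Translate a metadata record via the central (UMM-like) model."""
        hub_record = HUB_CONVERTERS[source]["to_hub"](record)
        return HUB_CONVERTERS[target]["from_hub"](hub_record)

    dif_record = {"Entry_Title": "Snow Cover NRT", "Start_Date": "2016-02-01"}
    print(translate(dif_record, "dif10", "echo10"))

    for n in (3, 6, 10):
        print(f"{n} standards -> pairwise: {n * (n - 1)}, hub: {2 * n}")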

The site could use better documentation for the interface and data, at least in the view of this non-expert in the area.

Thoughts on documentation for the interface or making the mapping more robust via use of a topic map?

I first saw this in a tweet by Kirk Borne.


*Smells Like A Topic Map – Sorry, culture-bound reference to a routine on the first Cheech & Chong album. No explanation would do it justice.

February 17, 2016

Challenges of Electronic Dictionary Publication

Filed under: Dictionary,Digital Research,Subject Identity,Topic Maps — Patrick Durusau @ 4:20 pm

Challenges of Electronic Dictionary Publication

From the webpage:

April 8-9th, 2016

Venue: University of Leipzig, GWZ, Beethovenstr. 15; H1.5.16

This April we will be hosting our first Dictionary Journal workshop. At this workshop we will give an introduction to our vision of „Dictionaria“, introduce our data model and current workflow and will discuss (among others) the following topics:

  • Methodology and concept: How are dictionaries of „small“ languages different from those of „big“ languages and what does this mean for our endeavour? (documentary dictionaries vs. standard dictionaries)
  • Reviewing process and guidelines: How to review and evaluate a dictionary database of minor languages?
  • User-friendliness: What are the different audiences and their needs?
  • Submission process and guidelines: reports from us and our first authors on how to submit and what to expect
  • Citation: How to cite dictionaries?

If you are interested in attending this event, please send an e-mail to dictionary.journal[AT]uni-leipzig.de.

Workshop program

Our workshop program can now be downloaded here.

See the webpage for a list of confirmed participants, some with submitted abstracts.

Any number of topic map related questions arise in a discussion of dictionaries.

  • How to represent dictionary models?
  • What properties should be used to identify the subjects that represent dictionary models?
  • On what basis, if any, should dictionary models be considered the same or different? And for what purposes?
  • What data should be captured by dictionaries and how should it be identified?
  • etc.

Those are only a few of the questions that could be refined into dozens, if not hundreds of more, when you reach the details of constructing a dictionary.
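For a taste of where those questions lead, here is one way a single dictionary entry could be carried as a topic with its identity properties made explicit. The shape and field names are my own sketch, not the Dictionaria data model:

    # A dictionary entry as a topic: the properties used for subject
    # identity are kept separate from everything else. Field names are
    # illustrative only.
    from dataclasses import dataclass, field

    @dataclass
    class EntryTopic:
        # Identity properties: the basis on which two entries would merge.
        language: str      # e.g., an ISO 639-3 code for the documented language
        headword: str
        sense_gloss: str
        # Non-identity data: may differ between sources without blocking a merge.
        examples: list = field(default_factory=list)
        source: str = ""

        def identity(self) -> tuple:
            return (self.language, self.headword, self.sense_gloss)

    a = EntryTopic("abc", "kawa", "water", examples=["kawa tupu"], source="fieldwork-2014")
    b = EntryTopic("abc", "kawa", "water", source="legacy-lexicon")
    print(a.identity() == b.identity())   # True: same subject, candidates for merging

Whether language plus headword plus gloss is the right identity basis is exactly the kind of question the workshop raises; the point of the sketch is only that the choice has to be made explicit somewhere.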

I won’t be attending, but I await the output from this workshop with great anticipation!

February 9, 2016

Topic Maps: On the Cusp of Success (Curate in Place/Death of ETL?)

Filed under: Data Integration,Semantics,Topic Maps — Patrick Durusau @ 7:10 pm

The Bright Future of Semantic Graphs and Big Connected Data by Alex Woodie.

From the post:

Semantic graph technology is shaping up to play a key role in how organizations access the growing stores of public data. This is particularly true in the healthcare space, where organizations are beginning to store their data using so-called triple stores, often defined by the Resource Description Framework (RDF), which is a model for storing metadata created by the World Wide Web Consortium (W3C).

One person who’s bullish on the prospects for semantic data lakes is Shawn Dolley, Cloudera’s big data expert for the health and life sciences market. Dolley says semantic technology is on the cusp of breaking out and being heavily adopted, particularly among healthcare providers and pharmaceutical companies.

“I have yet to speak with a large pharmaceutical company where there’s not a small group of IT folks who are working on the open Web and are evaluating different technologies to do that,” Dolley says. “These are visionaries who are looking five years out, and saying we’re entering a world where the only way for us to scale….is to not store it internally. Even with Hadoop, the data sizes are going to be too massive, so we need to learn and think about how to federate queries.”

By storing healthcare and pharmaceutical data as semantic triples using graph databases such as Franz’s AllegroGraph, it can dramatically lower the hurdles to accessing huge stores of data stored externally. “Usually the primary use case that I see for AllegroGraph is creating a data fabric or a data ecosystem where they don’t have to pull the data internally,” Dolley tells Datanami. “They can do seamless queries out to data and curate it as it sits, and that’s quite appealing.

….

This is leading-edge stuff, and there are few mission-critical deployments of semantic graph technologies being used in the real world. However, there are a few of them, and the one that keeps popping up is the one at Montefiore Health System in New York City.

Montefiore is turning heads in the healthcare IT space because it was the first hospital to construct a “longitudinally integrated, semantically enriched” big data analytic infrastructure in support of “next-generation learning healthcare systems and precision medicine,” according to Franz, which supplied the graph database at the heart of the health data lake. Cloudera’s free version of Hadoop provided the distributed architecture for Montefiore’s semantic data lake (SDL), while other components and services were provided by tech big wigs Intel (NASDAQ: INTC) and Cisco Systems (NASDAQ: CSCO).

This approach to building an SDL will bring about big improvements in healthcare, says Dr. Parsa Mirhaji MD. PhD., the director of clinical research informatics at Einstein College of Medicine and Montefiore Health System.

“Our ability to conduct real-time analysis over new combinations of data, to compare results across multiple analyses, and to engage patients, practitioners and researchers as equal partners in big-data analytics and decision support will fuel discoveries, significantly improve efficiencies, personalize care, and ultimately save lives,” Dr. Mirhaji says in a press release. (emphasis added)

If I hadn’t known better, reading passages like:

the only way for us to scale….is to not store it internally

learn and think about how to federate queries

seamless queries out to data and curate it as it sits

I would have sworn I was reading a promotion piece for topic maps!

Of course, it doesn’t mention how to discover valuable data not written in your terminology, but you have to hold something back for the first presentation to the CIO.

The growth of data sets too large for ETL is icing on the cake for topic maps.

Why ETL when the data “appears” as I choose to view it? My topic map may be quite small, at least in relationship to the data set proper.
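Here is roughly what I mean, as a minimal sketch: the “map” is a small local table from my terms to each source’s vocabulary, applied at query time while the data stays put. The sources, field names and the fetch stub are all hypothetical:

    # Query-time terminology mapping: no ETL, the remote data is queried
    # and relabeled as it sits. Sources, fields and fetch() are invented
    # placeholders, not real endpoints.
    TERM_MAP = {
        "patient_id": {"hospital_a": "mrn", "registry_b": "subject_number"},
        "diagnosis":  {"hospital_a": "icd10_code", "registry_b": "dx"},
    }

    def fetch(source: str, fields: list) -> list:
        """Stand-in for a federated query against a remote source."""
        canned = {
            "hospital_a": [{"mrn": "A-1001", "icd10_code": "E11.9"}],
            "registry_b": [{"subject_number": "A-1002", "dx": "E11.9"}],
        }
        return canned[source]

    def query(my_terms: list) -> list:
        """Ask each source in its own vocabulary, return rows in mine."""
        rows = []
        for source in ("hospital_a", "registry_b"):
            remote_fields = [TERM_MAP[t][source] for t in my_terms]
            for record in fetch(source, remote_fields):
                rows.append({t: record[TERM_MAP[t][source]] for t in my_terms})
        return rows

    print(query(["patient_id", "diagnosis"]))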


OK, truth-in-advertising moment, it won’t be quite that easy!

And I don’t take small bills. 😉 Diamonds, other valuable commodities, foreign deposit arrangements can be had.

People are starting to think in a “topic mappish” sort of way. Or at least a way where topic maps deliver what they are looking for.

That’s the key: What do they want?

Then use a topic map to deliver it.

February 2, 2016

Sunlight launches Hall of Justice… [ Topic Map “like” features?]

Filed under: Government,Government Data,Topic Maps — Patrick Durusau @ 6:53 pm

Sunlight launches Hall of Justice, a massive data inventory on criminal justice across the U.S. by Josh Stewart.

From the post:

Today, Sunlight is launching Hall of Justice, a robust, searchable data inventory of nearly 10,000 datasets and research documents from across all 50 states, the District of Columbia and the federal government. Hall of Justice is the culmination of 18 months of work gathering data and refining technology.

The process was no easy task: Building Hall of Justice required manual entry of publicly available data sources from a multitude of locations across the country.

Sunlight’s team went from state to state, meeting and calling local officials to inquire about and find data related to criminal justice. Some states like California have created a data portal dedicated to making criminal justice data easily accessible to the public; others had their data buried within hard to find websites. We also found data collected by state departments of justice, police forces, court systems, universities and everything in between.

“Data is shaping the future of how we address some of our most pressing problems,” said John Wonderlich, executive director of the Sunlight Foundation. “This new resource is an experiment in how a robust snapshot of data can inform policy and research decisions.”

In addition to being a great data collection, the Hall of Justice attempts to deliver topic map like capability for searches:

The resource attempts to consolidate different terminology across multiple states, which is far from uniform or standardized. For example, if you search solitary confinement you will return results for data around solitary confinement, but also for the terms “segregated housing unit,” “SHU,” “administrative segregation” and “restrictive housing.” This smart search functionality makes finding datasets much easier and accessible.


Looking at all thirteen results for a search on “solitary confinement,” I don’t see the mapping in question. Or certainly no mapping based on characteristics of the subject, “solitary confinement.”

The closest match is Georgia’s 2013 Juvenile Justice Reform, which uses the word “restrictive,” as in:

Create a two-class system within the Designated Felony Act. Designated felony offenses are divided into two classes, based on severity—Class A and Class B—that continue to allow restrictive custody while also adjusting available sanctions to account for both offense severity and risk level.

Restrictive custody is what jail systems are about so that doesn’t trip the wire for “solitary confinement.”

Of course, the links are to entire reports/documents/data sets so each researcher will have to extract and collate content individually. When that happens, a means to contribute that collation/mapping to the Hall of Justice would be a boon for other researchers. (Can you say “topic map?”)
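Something like the following is what a contributed mapping could look like: one subject with the labels that should trip the same wire, used to expand a search. The label list comes straight from the Sunlight example above; the tiny catalog and the code are mine, not anything the Hall of Justice actually runs:

    # A contributed subject mapping used for query expansion. The labels
    # are those cited in the Sunlight post; the catalog is an invented
    # stand-in for their dataset inventory.
    SUBJECT_LABELS = {
        "solitary confinement": [
            "solitary confinement",
            "segregated housing unit",
            "SHU",
            "administrative segregation",
            "restrictive housing",
        ],
    }

    CATALOG = [
        {"title": "GA 2013 Juvenile Justice Reform", "text": "allow restrictive custody while adjusting sanctions"},
        {"title": "State DOC segregation report", "text": "length of administrative segregation stays"},
    ]

    def expand(query: str) -> list:
        """Every label mapped to the same subject as the query term."""
        return SUBJECT_LABELS.get(query.lower(), [query])

    def search(query: str) -> list:
        labels = [label.lower() for label in expand(query)]
        return [d["title"] for d in CATALOG
                if any(label in d["text"].lower() for label in labels)]

    print(search("solitary confinement"))
    # Finds the segregation report but not "restrictive custody", which is
    # the point: the mapping works on subject labels, not loose string matches.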

As I write this, you will need to prefer Mozilla over Chrome, at least on Ubuntu.

Trigger Warning: If you are sensitive to traumatic events and/or reports of traumatic events, you may want to ask someone less sensitive to review these data sources.

The only difference between a concentration camp and American prisons is the lack of mass gas chambers. Every horror and abuse that you can imagine, and some you probably can’t, are visited on people in U.S. prisons every day.

As Joan Baez says in Prison Trilogy:

And we’re gonna raze, raze the prisons
To the ground
Help us raze, raze the prisons
To the ground

Sunlight’s Hall of Justice is a great step forward in documenting the chambers of horror we call American prisons.

Are you ready?

February 1, 2016

3 Decades of High Quality News! (code reuse)

Filed under: Archives,Journalism,News,Reporting,Topic Maps — Patrick Durusau @ 4:06 pm

‘NewsHour’ archives to be digitized and available online by Dru Sefton.

From the post:

More than three decades of NewsHour are heading to an online home, the American Archive of Public Broadcasting.

Nearly 10,000 episodes that aired from 1975 to 2007 will be archived through a collaboration among AAPB; WGBH in Boston; WETA in Arlington, Va.; and the Library of Congress. The organizations jointly announced the project Thursday.

“The project will take place over the next year and a half,” WGBH spokesperson Emily Balk said. “The collection should be up in its entirety by mid-2018, but AAPB will be adding content from the collection to its website on an ongoing, monthly basis.”

Looking forward to that collection!

Useful on its own, but even more so if you had an indexical object that could point to a subject in a PBS news episode and at the same time, point to episodes on the same subject from other TV and radio news archives, not to mention the same subject in newspapers and magazines.

Oh, sorry, that would be a topic in ISO 13250-2 parlance and the more general concept of a proxy in ISO 13250-5. Thought I should mention that before someone at IBM runs off to patent another pre-existing idea.
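A bare-bones sketch of such an indexical object, with placeholder identifiers and references rather than real archive records:

    # One subject, many archive occurrences: the "indexical object" is a
    # record that collects pointers. Identifiers and refs are placeholders,
    # not actual AAPB or newspaper archive entries.
    topic = {
        "subject_identifiers": ["http://example.org/subject/1986-challenger-disaster"],
        "names": ["Challenger disaster", "STS-51-L accident"],
        "occurrences": [
            {"archive": "AAPB / NewsHour",  "ref": "episode-1986-01-28", "medium": "tv"},
            {"archive": "NPR archive",      "ref": "atc-1986-01-28",     "medium": "radio"},
            {"archive": "Newspaper morgue", "ref": "nyt-1986-01-29-a1",  "medium": "print"},
        ],
    }

    def occurrences_for(topics: list, identifier: str) -> list:
        """Everything any archive holds on a subject, via its proxy."""
        return [o for t in topics
                if identifier in t["subject_identifiers"]
                for o in t["occurrences"]]

    for occ in occurrences_for([topic], "http://example.org/subject/1986-challenger-disaster"):
        print(occ["archive"], "->", occ["ref"])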

I don’t suppose padding patent statistics hurts all that much, considering that the Supremes are poised to invalidate process and software patents in one fell swoop.

Hopefully economists will be ready to measure the amount of increased productivity (legal worries about and enforcement of process/software patents aren’t productive activities) from foreclosing even the potential of process or software patents.

Copyright is more than sufficient to protect source code, as if any programmer is going to use another programmer’s code. They say that scientists would rather use another scientist’s toothbrush than his vocabulary.

Copying another programmer’s code (code re-use) is more akin to sharing a condom. It’s just not done.
