Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

November 17, 2018

Query Expansion Techniques for Information Retrieval: a Survey

Filed under: Query Expansion,Subject Identity,Subject Recognition,Topic Maps — Patrick Durusau @ 9:12 pm

Query Expansion Techniques for Information Retrieval: a Survey by Hiteshwar Kumar Azad, Akshay Deepak.

With the ever increasing size of web, relevant information extraction on the Internet with a query formed by a few keywords has become a big challenge. To overcome this, query expansion (QE) plays a crucial role in improving the Internet searches, where the user’s initial query is reformulated to a new query by adding new meaningful terms with similar significance. QE — as part of information retrieval (IR) — has long attracted researchers’ attention. It has also become very influential in the field of personalized social document, Question Answering over Linked Data (QALD), and, Text Retrieval Conference (TREC) and REAL sets. This paper surveys QE techniques in IR from 1960 to 2017 with respect to core techniques, data sources used, weighting and ranking methodologies, user participation and applications (of QE techniques) — bringing out similarities and differences.

Another goodie for the upcoming holiday season. At forty-three (43) pages, published in 2017 and already in need of updating, it is a real joy for anyone interested in query expansion.
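To make the core idea concrete, here is a minimal sketch of synonym-based query expansion. It is mine, not the survey's; the synonym table and the sample query are invented for illustration only:

# Minimal sketch of synonym-based query expansion (illustrative only).
# The synonym table is invented; a real system would draw expansion terms
# from a thesaurus, co-occurrence statistics, or user query logs.

SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "cheap": ["inexpensive", "affordable"],
}

def expand_query(query):
    """Reformulate a keyword query by OR-ing in related terms."""
    clauses = []
    for term in query.lower().split():
        related = SYNONYMS.get(term, [])
        if related:
            clauses.append("(" + " OR ".join([term] + related) + ")")
        else:
            clauses.append(term)
    return " AND ".join(clauses)

print(expand_query("cheap car rental"))
# (cheap OR inexpensive OR affordable) AND (car OR automobile OR vehicle) AND rental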

Writing this post, I realized that something is missing in discussions of query expansion: it is assumed that end users query the data set and are called upon to evaluate the results.

What if we change that assumption to an expert user querying the data set and authoring filtered results for end users?

Instead of being presented with a topic map, no matter how clever its merging rules, the end user is presented with a curated information resource.

Granted, an expert may have used a topic map to produce the curated information resource, but of what concern is that to the end user?

July 29, 2018

When Phishing and “Dropped” USB Fails – Precision Issues in Graphic Libraries

Filed under: Cybersecurity,Security,Subject Identity — Patrick Durusau @ 3:10 pm

Drawing Outside the Box: Precision Issues in Graphic Libraries by Mark Brand and Ivan Fratric, Google Project Zero.

From the post:

In this blog post, we are going to write about a seldom seen vulnerability class that typically affects graphic libraries (though it can also occur in other types of software). The root cause of such issues is using limited precision arithmetic in cases where a precision error would invalidate security assumptions made by the application.

While we could also call other classes of bugs precision issues, namely integer overflows, the major difference is: with integer overflows, we are dealing with arithmetic operations where the magnitude of the result is too large to be accurately represented in the given precision. With the issues described in this blog post, we are dealing with arithmetic operations where the magnitude of the result or a part of the result is too small to be accurately represented in the given precision.

These issues can occur when using floating-point arithmetic in operations where the result is security-sensitive, but, as we’ll demonstrate later, can also occur in integer arithmetic in some cases.
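To make the bug class concrete, here is a toy sketch (mine, in Python, not Skia's code): an adjustment term is too small to be represented at the coordinate's magnitude, so it is silently absorbed and the "strictly inside the bound" assumption fails.

# Toy illustration of the bug class described above: a correction term is
# too small, relative to the coordinate's magnitude, to be represented, so
# it is silently absorbed and a security-relevant assumption fails.

bound = 2.0 ** 53          # a boundary where the spacing between doubles is 1.0
epsilon = 0.25             # intended to pull the point strictly inside the bound

point = bound              # the point sits exactly on the boundary
adjusted = point - epsilon # 0.25 is below half the spacing here, so it rounds away

print(adjusted == point)   # True  -> the adjustment was silently absorbed
print(adjusted < bound)    # False -> the "strictly inside" check fails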

With phishing success rates reported at 90% and the commonly cited 50% of users who would insert a “found” USB drive into their computer, use of high-end hacks is always a fallback position.

The techniques discussed here will be useful for such fallback cases, but the more interesting question for me comes in the conclusion:


When it comes to finding such issues, unfortunately, there doesn’t seem to be a great way to do it. When we started looking at Skia, initially we wanted to try using symbolic execution on the drawing algorithms to find input values that would lead to drawing out-of-bounds, as, on the surface, it seemed this is a problem symbolic execution would be well suited for. However, in practice, there were too many issues: most tools don’t support floating point symbolic variables and, even when running against just the integer parts of the simplest line drawing algorithm, we were unsuccessful in completing the run in a reasonable time (we were using KLEE with STP and Z3 backends).

In the end, what we ended up doing was a combination of the more old-school methods: manual source review, fuzzing (especially with values close to image boundaries) and, in some cases, when we already identified potentially problematic areas of code, even bruteforcing the range of all possible values.

Do you know of other instances where precision errors resulted in security issues? Let us know about them in the comments.

What set of subject identity criteria would enable rough identification of these issues?

Thoughts?

February 14, 2018

Evolving a Decompiler

Filed under: C/C++,Compilers,Cybersecurity,Programming,Subject Identity — Patrick Durusau @ 8:36 am

Evolving a Decompiler by Matt Noonan.

From the post:

Back in 2016, Eric Schulte, Jason Ruchti, myself, Alexey Loginov, and David Ciarletta (all of the research arm of GrammaTech) spent some time diving into a new approach to decompilation. We made some progress but were eventually all pulled away to other projects, leaving a very interesting work-in-progress prototype behind.

Being a promising but incomplete research prototype, it was quite difficult to find a venue to publish our research. But I am very excited to announce that I will be presenting this work at the NDSS binary analysis research (BAR) workshop next week in San Diego, CA! BAR is a workshop on the state-of-the-art in binary analysis research, including talks about working systems as well as novel prototypes and works-in-progress; I’m really happy that the program committee decided to include discussion of these prototypes, because there are a lot of cool ideas out there that aren’t production-ready, but may flourish once the community gets a chance to start tinkering with them.

How wickedly cool!

Did I mention all the major components are open-source?


GrammaTech recently open-sourced all of the major components of BED, including:

  • SEL, the Software Evolution Library. This is a Common Lisp library for program synthesis and repair, and is quite nice to work with interactively. All of the C-specific mutations used in BED are available as part of SEL; the only missing component is the big code database; just bring your own!
  • clang-mutate, a command-line tool for performing low-level mutations on C and C++ code. All of the actual edits are performed using clang-mutate; it also includes a REPL-like interface for interactively manipulating C and C++ code to quickly produce variants.

The building of the “big code database” sounds like an exercise in subject identity, doesn’t it?

Topic maps anyone?

December 27, 2017

Where Do We Write Down Subject Identifications?

Filed under: Subject Identifiers,Subject Identity,Topic Maps — Patrick Durusau @ 11:23 am

Modern Data Integration Paradigms by Matthew D. Sarrel, The Bloor Group.

Introduction:

Businesses of all sizes and industries are rapidly transforming to make smarter, data-driven decisions. To accomplish this transformation to digital business , organizations are capturing, storing, and analyzing massive amounts of structured, semi-structured, and unstructured data from a large variety of sources. The rapid explosion in data types and data volume has left many IT and data science/business analyst leaders reeling.

Digital transformation requires a radical shift in how a business marries technology and processes. This isn’t merely improving existing processes, but
rather redesigning them from the ground up and tightly integrating technology. The end result can be a powerful combination of greater efficiency, insight and scale that may even lead to disrupting existing markets. The shift towards reliance on data-driven decisions requires coupling digital information with powerful analytics and business intelligence tools in order to yield well-informed reasoning and business decisions. The greatest value of this data can be realized when it is analyzed rapidly to provide timely business insights. Any process can only be as timely as the underlying technology allows it to be.

Even data produced on a daily basis can exceed the capacity and capabilities of many pre-existing database management systems. This data can be structured or unstructured, static or streaming, and can undergo rapid, often unanticipated, change. It may require real-time or near-real-time transformation to be read into business intelligence (BI) systems. For these reasons, data integration platforms must be flexible and extensible to accommodate business’s types and usage patterns of the data.

There’s the usual homage to the benefits of data integration:


IT leaders should therefore try to integrate data across systems in a way that exposes them using standard and commonly implemented technologies such as SQL and REST. Integrating data, exposing it to applications, analytics and reporting improves productivity, simplifies maintenance, and decreases the amount of time and effort required to make data-driven decisions.

The paper covers, lightly, Operational Data Store (ODS) / Enterprise Data Hub (EDH), Enterprise Data Warehouse (EDW), Logical Data Warehouse (LDW), and Data Lake as data integration options.

Having found existing systems deficient in one or more ways, the report goes on to recommend replacement with Voracity.

To be fair, as described, all four systems plus Voracity are deficient in the same way. The hard part of data integration, the rub that lies at the heart of the task, is passed over as ETL.

Efficient and correct ETL requires knowledge of what column headers identify. For instance, from the Enron spreadsheets, can you specify the transformation of the data in the following columns: “A, B, C, D, E, F…” from andrea_ring_15_IFERCnov.xlsx, or “A, B, C, D, E,…” from andy_zipper__129__Success-TradeLog.xlsx?

With enough effort, no doubt you could go through the spreadsheets of interest and create a mapping sufficient to transform the data, but where are you going to write down the facts you established for each column that underlie your transformation?

In topic maps, we made the mistake of mystifying the facts for each column by claiming to talk about subject identity, which has heavy ontological overtones.

What we should have said was that we wanted to talk about where we write down subject identifications.

Thus:

  1. What do you want to talk about?
  2. Data in column F in andrea_ring_15_IFERCnov.xlsx
  3. Do you want to talk about each entry separately?
  4. What subject is each entry? (date written month/day (no year))
  5. What calendar system was used for the date?
  6. Who created that date entry? (If you want to talk about them as well, create a separate topic and an association to the spreadsheet.)
  7. The date is the date of … ?
  8. Conversion rules for dates in column F, such as supplying year.
  9. Merging rules for #2? (date comparison)
  10. Do you want a relationship between #2 and the other data in each row? (more associations)

With simple questions, we have documented column F of a particular spreadsheet for any present or future ETL operation. No magic, no logical conundrums, no special query language, just asking what an author or ETL specialist knew but didn’t write down.
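A minimal sketch of what writing those answers down might look like in machine-readable form. Every value recorded for column F below is hypothetical, a stand-in for what the spreadsheet's author or an ETL specialist actually knew:

# Hypothetical subject identification for column F; the recorded answers are
# stand-ins for what the author or ETL specialist knew but didn't write down.

column_f = {
    "subject-locator": "andrea_ring_15_IFERCnov.xlsx#column=F",
    "talks-about": "date of trade",                  # question 7 (assumed answer)
    "entry-granularity": "one subject per cell",     # question 3
    "value-form": "month/day, no year",              # question 4
    "calendar": "Gregorian",                         # question 5 (assumed)
    "created-by": "andrea_ring",                     # question 6 (assumed)
    "assumed-year": "2001",                          # question 8 (assumed conversion rule)
    "merging-rule": "compare normalized ISO-8601 dates",        # question 9
    "row-associations": ["counterparty", "volume", "price"],    # question 10 (assumed)
}

def normalize(cell, identification):
    """Apply the documented conversion rule: 'month/day' -> ISO-8601 date."""
    month, day = cell.split("/")
    return f"{identification['assumed-year']}-{int(month):02d}-{int(day):02d}"

print(normalize("11/5", column_f))    # 2001-11-05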

There are subtleties, such as distinguishing between subject identifiers (identifies a subject, like a wiki page) and subject locators (points to the subject we want to talk about, like a particular spreadsheet), but identifying what you want to talk about (subject identifications and where to write them down) is more familiar than our prior obscurities.

Once those identifications are written down, you can search those identifications to discover the same subjects identified differently or with properties in one identification and not another. Think of it as capturing the human knowledge that resides in the brains of your staff and ETL experts.

The ETL assumed by Bloor Group should be written: ETLD – Extract, Transform, Load, Dump (knowledge). That seems remarkably inefficient and costly to me. You?

November 16, 2017

Shape Searching Dictionaries?

Facebook, despite its spying, censorship, and being a shill for the U.S. government, isn’t entirely without value.

For example, a post by Simon St. Laurent drew a response from Peter Cooper pointing to Shapecatcher: Unicode Character Recognition, where you find:

Draw something in the box!

And let shapecatcher help you to find the most similar unicode characters!

Currently, there are 11817 unicode character glyphs in the database. Japanese, Korean and Chinese characters are currently not supported.
(emphasis in original)

I take “Japanese, Korean and Chinese characters are currently not supported” to mean that Anatolian Hieroglyphs; Cuneiform, Cuneiform Numbers and Punctuation, Early Dynastic Cuneiform, Old Persian, Ugaritic; Egyptian Hieroglyphs; Meroitic Cursive; and Meroitic Hieroglyphs are not supported either.

But my first thought wasn’t the discovery of glyphs in the Unicode Code Charts, useful as that is, but shape searching dictionaries, such as Faulkner’s A Concise Dictionary of Middle Egyptian.

A sample from Faulkner’s (1991 edition):

Or, The Student’s English-Sanskrit Dictionary by Vaman Shivram Apte (1893):

Imagine being able to search by shape for either dictionary! Not just for a glyph but for a set of glyphs, within any entry!

I suspect that’s doable based on Benjamin Milde‘s explanation of Shapecatcher:


Under the hood, Shapecatcher uses so called “shape contexts” to find similarities between two shapes. Shape contexts, a robust mathematical way of describing the concept of similarity between shapes, is a feature descriptor first proposed by Serge Belongie and Jitendra Malik.

You can find an in-depth explanation of the shape context matching framework that I used in my Bachelor thesis (“On the Security of reCAPTCHA”). In the end, it is quite a bit different from the matching framework that Belongie and Malik proposed in 2000, but still based on the idea of shape contexts.

The engine that runs this site is a rewrite of what I developed during my bachelor thesis. To make things faster, I used CUDA to accelerate some portions of the framework. This is a fairly new technology that enables me to use my NVIDIA graphics card for general purpose computing. Newer cards are quite powerful devices!

That was written in 2011 and no doubt shape matching has progressed since then.
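I can't reproduce shape contexts here, but as a rough illustration of shape-similarity scoring, here is a sketch using OpenCV's Hu-moment comparison, a simpler technique than the one Milde describes. It assumes OpenCV 4, and the glyph image paths are placeholders:

# Rough sketch of shape-similarity scoring using Hu-moment comparison in
# OpenCV 4 -- a simpler technique than Shapecatcher's shape contexts.
# The image paths are placeholders; glyph images are assumed to exist.

import cv2

def largest_contour(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, thresh = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY_INV)
    contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return max(contours, key=cv2.contourArea)

query = largest_contour("sketch.png")              # the shape you drew
candidates = ["glyph_faulkner_001.png", "glyph_faulkner_002.png"]

scores = {path: cv2.matchShapes(query, largest_contour(path),
                                cv2.CONTOURS_MATCH_I1, 0.0)
          for path in candidates}

for path, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(path, score)                             # lower score = more similar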

No technique will be 100% accurate, but even less than 100% accuracy will unlock generations of scholarly dictionaries, in ways not imagined by their creators.

If you are interested, I’m sure Benjamin Milde would love to hear from you.

July 27, 2017

Dimensions of Subject Identification

Filed under: Subject Identifiers,Subject Identity,Topic Maps — Patrick Durusau @ 2:30 pm

This isn’t a new idea, but it occurred to me that introducing readers to “dimensions of subject identification” might be an easier on-ramp for topic maps. It enables us to dodge the sticky issues of “identity,” in favor of asking: what do you want to talk about? and how many dimensions do you want/need to identify it?

To start with a classic example, if we only have one dimension and the string “Paris,” ambiguity is destined to follow.

If we add a country dimension, giving us two dimensions, “Paris” + “France” can be distinguished from all other uses of “Paris” by the string + country pair.

The string + country dimension fares less well for “Paris” + country = “United States:”

For the United States you need “Paris” + country + state dimensions, at a minimum, but that leaves you with two instances of Paris in Ohio.

One advantage of speaking of “dimensions of subject identification” is that we can order systems of subject identification by the number of dimensions they offer. Not to mention examining the consequences of the choices of dimensions.

One-dimensional systems, that is a solitary string, "Paris," as we said above, leave users with no means to distinguish one use from another. They are useful and common in CSV files or database tables, but risk ambiguity and can be difficult to communicate accurately to others.

Two-dimensional systems, that is city = "Paris," enable users to distinguish usages other than for the city, but as you can see from the Paris example in the U.S., that may not be sufficient.

Moreover, city itself may be a subject identified by multiple dimensions, as different governmental bodies define “city” differently.

Just as some information systems use only one-dimensional strings for headers, other information systems may use a one-dimensional string for the subject city in city = "Paris." But multiple dimensions of identification can be captured for any subject, separate from those systems.
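A small sketch of the idea: an identification as a set of dimension/value pairs, compared dimension by dimension. The records, and the county values in particular, are illustrative only:

# Sketch: subject identification as dimension/value pairs (illustrative data).

paris_fr = {"string": "Paris", "country": "France"}
paris_tx = {"string": "Paris", "country": "United States", "state": "Texas"}
paris_oh_a = {"string": "Paris", "country": "United States", "state": "Ohio",
              "county": "Stark"}       # county dimension added for illustration
paris_oh_b = {"string": "Paris", "country": "United States", "state": "Ohio",
              "county": "Portage"}

def same_subject(a, b):
    """Identifications match only when they share the same dimensions
    and agree on every one of them."""
    return set(a) == set(b) and all(a[k] == b[k] for k in a)

print(same_subject(paris_fr, paris_tx))      # False: country differs
print(same_subject(paris_oh_a, paris_oh_b))  # False: a fourth dimension disambiguates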

Perhaps the most useful aspect of dimensions of identification is enabling users to ask their information architects which dimensions, and which values, serve to identify subjects in their information systems.

Such as the headers in database tables or spreadsheets. 😉

June 29, 2017

Fuzzing To Find Subjects

Filed under: Cybersecurity,Fuzzing,Security,Subject Identity — Patrick Durusau @ 4:51 pm

Guido Vranken‘s post: The OpenVPN post-audit bug bonanza is an important review of bugs discovered in OpenVPN.

Jump to “How I fuzzed OpenVPN” for the details on Vranken fuzzing OpenVPN.

Not for the novice but an inspiration to devote time to the art of fuzzing.

The Open Web Application Security Project (OWASP) defines fuzzing this way:

Fuzz testing or Fuzzing is a Black Box software testing technique, which basically consists in finding implementation bugs using malformed/semi-malformed data injection in an automated fashion.

OWASP’s fuzzing page mentions a number of resources and software, but omits the Basic Fuzzing Framework by CERT. That’s odd, don’t you think?

The CERT Basic Fuzzing Framework (BFF) is current through 2016. Allen Householder has a description of version 2.8 at: Announcing CERT Basic Fuzzing Framework Version 2.8. For details on BFF, see: CERT BFF – Basic Fuzzing Framework.

Caution: One resource in the top ten (#9) for “fuzzing software” is Fuzzing: Brute Force Vulnerability Discovery, by Michael Sutton, Adam Greene, and Pedram Amini. It is a great historical reference, but it was published in 2007, some ten years ago. Look for more recent literature and software.

Fuzzing is obviously an important topic in finding subjects (read vulnerabilities) in software. Whether your intent is to fix those vulnerabilities or use them for your own purposes.

While reading Vranken‘s post, it occurred to me that “fuzzing” is also useful in discovering subjects in unmapped data sets.

Not all nine-digit numbers are Social Security Numbers, but if you find a column of such numbers, along with what you think are street addresses and zip codes, it would not be a bad guess. Of course, if it is a 16-digit number, a criminal opportunity may be knocking at your door (a credit card).

While TMDM topic maps emphasized the use of URIs for subject identifiers, we all know that subject identifications outside of topic maps are more complex than string matching and far messier.

How would you create “fuzzy” searches to detect subjects across different data sets? Are there general principles for classes of subjects?
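One crude way to begin, sketched below: pattern matching over a column's values plus a confidence bump from neighboring columns. The patterns, sample values, and thresholds are my guesses, not general principles:

# Crude sketch of "fuzzing for subjects" in an unmapped data set: guess what
# a column holds from value patterns plus hints from neighboring columns.
# The patterns and the confidence bump are guesses, not general principles.

import re

PATTERNS = {
    "ssn-like":         re.compile(r"^\d{3}-?\d{2}-?\d{4}$"),
    "credit-card-like": re.compile(r"^\d{16}$"),
    "zip-like":         re.compile(r"^\d{5}(-\d{4})?$"),
}

def guess_column_subject(values, neighbor_hints=()):
    """Return (label, score) for the best-matching pattern."""
    best, best_score = "unknown", 0.0
    for label, pattern in PATTERNS.items():
        hits = sum(1 for v in values if pattern.match(v.strip()))
        score = hits / max(len(values), 1)
        if label == "ssn-like" and "zip-like" in neighbor_hints:
            score = min(1.0, score + 0.1)   # SSNs usually travel with addresses
        if score > best_score:
            best, best_score = label, score
    return best, best_score

print(guess_column_subject(["078-05-1120", "219-09-9999"], ("zip-like",)))
# ('ssn-like', 1.0)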

While your results might be presented as a curated topic map, the grist for that map would originate in the messy details of diverse information.

This sounds like an empirical question to me, especially since most search engines offer API access.

Thoughts?

June 8, 2017

Open data quality – Subject Identity By Another Name

Filed under: Open Data,Record Linkage,Subject Identity,Topic Maps,XQuery — Patrick Durusau @ 1:03 pm

Open data quality – the next shift in open data? by Danny Lämmerhirt and Mor Rubinstein.

From the post:

Some years ago, open data was heralded to unlock information to the public that would otherwise remain closed. In the pre-digital age, information was locked away, and an array of mechanisms was necessary to bridge the knowledge gap between institutions and people. So when the open data movement demanded “Openness By Default”, many data publishers followed the call by releasing vast amounts of data in its existing form to bridge that gap.

To date, it seems that opening this data has not reduced but rather shifted and multiplied the barriers to the use of data, as Open Knowledge International’s research around the Global Open Data Index (GODI) 2016/17 shows. Together with data experts and a network of volunteers, our team searched, accessed, and verified more than 1400 government datasets around the world.

We found that data is often stored in many different places on the web, sometimes split across documents, or hidden many pages deep on a website. Often data comes in various access modalities. It can be presented in various forms and file formats, sometimes using uncommon signs or codes that are in the worst case only understandable to their producer.

As the Open Data Handbook states, these emerging open data infrastructures resemble the myth of the ‘Tower of Babel’: more information is produced, but it is encoded in different languages and forms, preventing data publishers and their publics from communicating with one another. What makes data usable under these circumstances? How can we close the information chain loop? The short answer: by providing ‘good quality’ open data.

Congratulations to Open Knowledge International on re-discovering the ‘Tower of Babel’ problem that prevents easy re-use of data.

Contrary to Lämmerhirt and Rubinstein’s claim, barriers have not “…shifted and multiplied….” It is more accurate to say that Lämmerhirt and Rubinstein have experienced what so many other researchers have found for decades:


We found that data is often stored in many different places on the web, sometimes split across documents, or hidden many pages deep on a website. Often data comes in various access modalities. It can be presented in various forms and file formats, sometimes using uncommon signs or codes that are in the worst case only understandable to their producer.

The record linkage community, think medical epidemiology, has been working on aspects of this problem since the 1950’s at least (under that name). It has a rich and deep history, focused in part on mapping diverse data sets to a common representation and then performing analysis upon the resulting set.

A common omission in record linkage is failing to capture, in a discoverable format, the basis for mapping the diverse records to a common format. That is, the subjects represented by the “…uncommon signs or codes that are in the worst case only understandable to their producer” that Lämmerhirt and Rubinstein complain of, although signs and codes need not be “uncommon” to be misunderstood by others.

To their credit, unlike RDF and the topic maps default, record linkage has long recognized that identification consists of multiple parts and not single strings.

Topic maps, at least at their inception, were unaware of record linkage and the vast body of research done under that moniker. Topic maps were bitten by the very problem they were seeking to solve: a subject could be identified in many different ways, and information discovered by others about that subject could be nearby but undiscoverable/unknown.

Rather than building on the experience with record linkage, topic maps, at least in the XML version, defaulted to relying on URLs to identify the location of subjects (resources) and/or to identify subjects (identifiers). Avoiding the Philosophy 101 mistakes of RDF (confusing locators and identifiers, plus refusing to correct the confusion) wasn’t enough for topic maps to become widespread. One suspects in part because topic maps were premised on creating more identifiers for subjects which already had them.

Imagine that your company has 1,000 employees and that in order to use a new system, say topic maps, everyone must get a new name. Can’t use the old one. Do you see a problem? Now multiply that by every subject anyone in your company wants to talk about. We won’t run out of identifiers, but your staff will certainly run out of patience.

Robust solutions to the open data ‘Tower of Babel’ issue will include the use of multi-part identifications extant in data stores, dynamic creation of multi-part identifications when necessary (note, no change to existing data store), discoverable documentation of multi-part identifications and their mappings, where syntax and data models are up to the user of data.
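Whatever language you do it in, the shape of such multi-part identifications and their discoverable mapping might look something like this sketch (Python here only for concreteness; field names, records, and the date-order assumption are invented):

# Sketch of the approach: multi-part identifications computed over two
# existing data stores (neither store is changed), with the mapping itself
# written down in a discoverable form.

STORE_A = [{"surname": "Li", "dob": "1970-01-05", "postcode": "30303"}]
STORE_B = [{"last_name": "LI", "birth_date": "05/01/1970", "zip": "30303"}]

def norm_date(s):
    """Normalize to ISO order; assumes day/month/year for slashed dates."""
    if "/" in s:
        d, m, y = s.split("/")
        return f"{y}-{m}-{d}"
    return s

# discoverable documentation: which field feeds which identification part,
# and how each part is normalized before comparison
MAPPING = {
    "name":  {"a": "surname",  "b": "last_name",  "norm": str.upper},
    "birth": {"a": "dob",      "b": "birth_date", "norm": norm_date},
    "place": {"a": "postcode", "b": "zip",        "norm": str.strip},
}

def identification(record, side):
    return {part: spec["norm"](record[spec[side]])
            for part, spec in MAPPING.items()}

print(identification(STORE_A[0], "a") == identification(STORE_B[0], "b"))  # True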

That sounds like a job for XQuery to me.

You?

May 9, 2017

Patched != Applied / Patches As Vulnerability Patterns

Filed under: Cybersecurity,Microsoft,Security,Subject Identity — Patrick Durusau @ 7:06 pm

Microsoft’s Microsoft Security Advisory 4022344 in response to MsMpEng: Remotely Exploitable Type Confusion in Windows 8, 8.1, 10, Windows Server, SCEP, Microsoft Security Essentials, and more by taviso@google.com, was so timely as to deprive the “responsible disclosure” crowd of a chance to bitch about the notice given to Microsoft.

Two aspects of this vulnerability merit your attention.

Patched != Applied

Under Suggested Actions, the Microsoft bulletin reads:

  • Verify that the update is installed

    Customers should verify that the latest version of the Microsoft Malware Protection Engine and definition updates are being actively downloaded and installed for their Microsoft antimalware products.

    For more information on how to verify the version number for the Microsoft Malware Protection Engine that your software is currently using, see the section, “Verifying Update Installation”, in Microsoft Knowledge Base Article 2510781.

    For affected software, verify that the Microsoft Malware Protection Engine version is 1.1.13704.0 or later.

  • If necessary, install the update

    Administrators of enterprise antimalware deployments should ensure that their update management software is configured to automatically approve and distribute engine updates and new malware definitions. Enterprise administrators should also verify that the latest version of the Microsoft Malware Protection Engine and definition updates are being actively downloaded, approved and deployed in their environment.

    For end-users, the affected software provides built-in mechanisms for the automatic detection and deployment of this update. For these customers, the update will be applied within 48 hours of its availability. The exact time frame depends on the software used, Internet connection, and infrastructure configuration. End users that do not wish to wait can manually update their antimalware software.

    For more information on how to manually update the Microsoft Malware Protection Engine and malware definitions, refer to Microsoft Knowledge Base Article 2510781.

Microsoft knows its customers far better than I do and that suggests unpatched systems can be discovered in the wild. No doubt in diminishing numbers but you won’t know unless you check.

Patches As Vulnerability Patterns

You have to visit CVE-2017-0290 to find links to the details of “MsMpEng: Remotely Exploitable Type Confusion….”

Which raises an interesting use case for the Microsoft/MSRC-Microsoft-Security-Updates-API, which I encountered by way of a PowerShell script for accessing the MSRC Portal API.

Polling the Microsoft/MSRC-Microsoft-Security-Updates-API provides you with notice of vulnerabilities to look for based on unapplied patches.

You can use the CVE links to find deeper descriptions of the underlying vulnerabilities. Those descriptions, assuming you mine them for sips (statistically improbable phrases), can become a powerful search tool for finding closely related postings.

Untested, but searching by patterns for particular programmers (whether named or not) may be more efficient than an abstract search for coding errors.

Reasoning that programmers tend to commit the same errors, reviewers tend to miss the same errors, and so any discovered error, properly patterned, may be the key to a grab bag of other errors.
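A rough sketch of the sip idea: score phrases that are frequent in one advisory but rare in a background corpus, then reuse the top phrases as search terms. Both texts below are toy stand-ins, not real advisories:

# Rough sketch of mining "sips" (statistically improbable phrases): score
# bigrams frequent in one vulnerability write-up but rare in a background
# corpus, then reuse the top phrases as search terms for related postings.

import re
from collections import Counter

def bigrams(text):
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(zip(words, words[1:]))

background = bigrams("the report describes a bug in the parser " * 5)
advisory = bigrams(
    "type confusion in the scan engine allows remote code execution; "
    "the scan engine triggers type confusion on crafted attachments"
)

def improbable_phrases(doc, corpus, top=3):
    scores = {bg: count / (1 + corpus.get(bg, 0)) for bg, count in doc.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top]

for phrase in improbable_phrases(advisory, background):
    print(" ".join(phrase))     # e.g. "type confusion", "scan engine"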

That’s an issue where tunable subject identity would be very useful.

April 27, 2017

Facebook Used To Spread Propaganda (The other use of Facebook would be?)

Filed under: Facebook,Government,Journalism,News,Subject Identity,Topic Maps — Patrick Durusau @ 8:31 pm

Facebook admits: governments exploited us to spread propaganda by Olivia Solon.

From the post:

Facebook has publicly acknowledged that its platform has been exploited by governments seeking to manipulate public opinion in other countries – including during the presidential elections in the US and France – and pledged to clamp down on such “information operations”.

In a white paper authored by the company’s security team and published on Thursday, the company detailed well-funded and subtle techniques used by nations and other organizations to spread misleading information and falsehoods for geopolitical goals. These efforts go well beyond “fake news”, the company said, and include content seeding, targeted data collection and fake accounts that are used to amplify one particular view, sow distrust in political institutions and spread confusion.

“We have had to expand our security focus from traditional abusive behavior, such as account hacking, malware, spam and financial scams, to include more subtle and insidious forms of misuse, including attempts to manipulate civic discourse and deceive people,” said the company.

It’s a good white paper and you can intuit a lot from it, but leaks on the details of Facebook counter-measures have commercial value.

Careful media advisers will start farming Facebook users now for the US mid-term elections in 2018. One of the “tells” (a behavior that discloses, unintentionally, a player’s intent) of a “fake” account is recent establishment with many similar accounts.

Such accounts need to be managed so that their “identity” fits the statistical average for similar accounts. They should not all suddenly like a particular post or account, for example.

The doctrines of subject identity in topic maps can be used to avoid subject recognition as well as to ensure it. Just the other side of the same coin.

April 24, 2017

Metron – A Fist Full of Subjects

Filed under: Cybersecurity,Security,Semantics,Subject Identity — Patrick Durusau @ 8:22 pm

Metron – Apache Incubator

From the description:

Metron integrates a variety of open source big data technologies in order to offer a centralized tool for security monitoring and analysis. Metron provides capabilities for log aggregation, full packet capture indexing, storage, advanced behavioral analytics and data enrichment, while applying the most current threat-intelligence information to security telemetry within a single platform.

Metron can be divided into 4 areas:

  1. A mechanism to capture, store, and normalize any type of security telemetry at extremely high rates. Because security telemetry is constantly being generated, it requires a method for ingesting the data at high speeds and pushing it to various processing units for advanced computation and analytics.
  2. Real time processing and application of enrichments such as threat intelligence, geolocation, and DNS information to telemetry being collected. The immediate application of this information to incoming telemetry provides the context and situational awareness, as well as the “who” and “where” information that is critical for investigation.
  3. Efficient information storage based on how the information will be used:
    1. Logs and telemetry are stored such that they can be efficiently mined and analyzed for concise security visibility
    2. The ability to extract and reconstruct full packets helps an analyst answer questions such as who the true attacker was, what data was leaked, and where that data was sent
    3. Long-term storage not only increases visibility over time, but also enables advanced analytics such as machine learning techniques to be used to create models on the information. Incoming data can then be scored against these stored models for advanced anomaly detection.
  4. An interface that gives a security investigator a centralized view of data and alerts passed through the system. Metron’s interface presents alert summaries with threat intelligence and enrichment data specific to that alert on one single page. Furthermore, advanced search capabilities and full packet extraction tools are presented to the analyst for investigation without the need to pivot into additional tools.

Big data is a natural fit for powerful security analytics. The Metron framework integrates a number of elements from the Hadoop ecosystem to provide a scalable platform for security analytics, incorporating such functionality as full-packet capture, stream processing, batch processing, real-time search, and telemetry aggregation. With Metron, our goal is to tie big data into security analytics and drive towards an extensible centralized platform to effectively enable rapid detection and rapid response for advanced security threats.

Some useful links:

Metron (website)

Metron wiki

Metron Jira

Metron Git

Security threats aren’t going to assign themselves unique and immutable IDs. Which means they will be identified by characteristics and associated with particular acts (think associations), which are composed of other subjects, such as particular malware, dates, etc.

Being able to robustly share such identifications (unlike the “we’ve seen this before at some unknown time, with unknown characteristics,” typical of Russian attribution reports) would be a real plus.

Looks like a great opportunity for topic maps-like thinking.

Yes?

September 8, 2016

No Properties/No Structure – But, Subject Identity

Filed under: Category Theory,Subject Identity,Topic Maps — Patrick Durusau @ 8:08 pm

Jack Park has prodded me into following some category theory and data integration papers. More on that to follow but as part of that, I have been watching Bartosz Milewski’s lectures on category theory, reading his blog, etc.

In Category Theory 1.2, Milewski goes to great lengths to emphasize:

Objects are primitives with no properties/structure – a point

Morphisms are primitives with no properties/structure, but do have a start and end point

Late in that lecture, Milewski says categories are the “ultimate in data hiding” (read abstraction).

Despite their lack of properties and structure, both objects and morphisms have subject identity.

Yes?

I think that is more than clever use of language and here’s why:

If I want to talk about objects in category theory as a group subject, what can I say about them? (assuming a scope of category theory)

  1. Objects have no properties
  2. Objects have no structure
  3. Objects mark the start and end of morphisms (distinguishes them from morphisms)
  4. Every object has an identity morphism
  5. Every pair of objects may have 0, 1, or many morphisms between them
  6. Morphisms may go in both directions between a pair of objects
  7. An object can have multiple morphisms that start and end at it

Incomplete and yet a lot of things to say about something that has no properties and no structure. 😉

Bearing in mind, that’s just objects in general.

I can also talk about a specific object at a particular time point in the lecture and at a particular screen location, which is itself a subject.

Or an object in a paper or monograph.

We can declare primitives, like objects and morphisms, but we should always bear in mind they are declared to be primitives.

For other purposes, we can declare them to be otherwise.

May 13, 2016

Flawed Input Validation = Flawed Subject Recognition

Filed under: Cybersecurity,Security,Software,Subject Identity,Topic Maps — Patrick Durusau @ 7:47 pm

In Vulnerable 7-Zip As Poster Child For Open Source, I covered some of the details of two vulnerabilities in 7-Zip.

Both of those vulnerabilities were summarized by the discoverers:

Sadly, many security vulnerabilities arise from applications which fail to properly validate their input data. Both of these 7-Zip vulnerabilities resulted from flawed input validation. Because data can come from a potentially untrusted source, data input validation is of critical importance to all applications’ security.

The first vulnerability is described as:

TALOS-CAN-0094, OUT-OF-BOUNDS READ VULNERABILITY, [CVE-2016-2335]

An out-of-bounds read vulnerability exists in the way 7-Zip handles Universal Disk Format (UDF) files. The UDF file system was meant to replace the ISO-9660 file format, and was eventually adopted as the official file system for DVD-Video and DVD-Audio.

Central to 7-Zip’s processing of UDF files is the CInArchive::ReadFileItem method. Because volumes can have more than one partition map, their objects are kept in an object vector. To start looking for an item, this method tries to reference the proper object using the partition map’s object vector and the “PartitionRef” field from the Long Allocation Descriptor. Lack of checking whether the “PartitionRef” field is bigger than the available amount of partition map objects causes a read out-of-bounds and can lead, in some circumstances, to arbitrary code execution.

(code in original post omitted)

This vulnerability can be triggered by any entry that contains a malformed Long Allocation Descriptor. As you can see in lines 898-905 from the code above, the program searches for elements on a particular volume, and the file-set starts based on the RootDirICB Long Allocation Descriptor. That record can be purposely malformed for malicious purpose. The vulnerability appears in line 392, when the PartitionRef field exceeds the number of elements in PartitionMaps vector.

I would describe the lack of a check on the “PartitionRef” field, in topic maps terms, as allowing a subject, here a string, of indeterminate size. That is, there is no constraint on the size of the subject.

That may seem like an obtuse way of putting it, but consider that a subject, here a string that is longer than the “available amount of partition map objects,” can be in association with other subjects, such as the user (subject) who has invoked the application (association) containing the 7-Zip vulnerability (subject).
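A simplified sketch, in Python rather than 7-Zip's C++, of the missing validation; the data structures are invented stand-ins for the UDF parsing code described above:

# Simplified sketch of the missing check: an attacker-controlled PartitionRef
# must be validated against the number of partition map objects before being
# used as an index.

partition_maps = ["partition-0", "partition-1"]   # objects parsed from the volume

def read_file_item_unsafe(partition_ref):
    # no check: a malformed Long Allocation Descriptor can pick any index,
    # which in C/C++ becomes an out-of-bounds read
    return partition_maps[partition_ref]

def read_file_item_safe(partition_ref):
    if not 0 <= partition_ref < len(partition_maps):
        raise ValueError("malformed Long Allocation Descriptor: bad PartitionRef")
    return partition_maps[partition_ref]

print(read_file_item_safe(1))      # ok
# read_file_item_safe(7) rejects the input instead of reading out of bounds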

Err, you don’t allow users with shell access to run suid root binaries, do you?

If you don’t, at least not running a vulnerable program as root may help dodge that bullet.

Or in topic maps terms, knowing the associations between applications and users may be a window on the severity of vulnerabilities.

Lest you think logging suid is an answer, remember they were logging Edward Snowden’s logins as well.

Suid logs may help for next time, but aren’t preventative in nature.

BTW, if you are interested in the details on buffer overflows, Smashing The Stack For Fun And Profit looks like a fun read.

May 4, 2016

No Label (read “name”) for Medical Error – Fear of Terror

Filed under: Names,Subject Identity,Topic Maps — Patrick Durusau @ 2:06 pm

Medical error is third biggest cause of death in the US, experts say by Amanda Holpuch.

From the post:

Medical error is the third leading cause of death in the US, accounting for 250,000 deaths every year, according to an analysis released on Tuesday.

There is no US system for coding these deaths, but Martin Makary and Michael Daniel, researchers at Johns Hopkins University’s school of medicine, used studies from 1999 onward to find that medical errors account for more than 9.5% of all fatalities in the US.

Only heart disease and cancer are more deadly, according to the Centers for Disease Control and Prevention (CDC).

The analysis, which was published in the British Medical Journal, said that the science behind medical errors would improve if data was shared internationally and nationally “in the same way as clinicians share research and innovation about coronary artery disease, melanoma, and influenza”.

But death by medical error is not captured by government reports because the US system for assigning a code to cause of death, the international classification of disease (ICD), does not have a label for medical error.

In contrast to topic maps, where you can talk about any subject you want, the international classification of disease (ICD) does not have a label for medical error.

Impact? Not having a label conceals approximately 250,000 deaths per year in the United States.

What if Fear of Terror press releases were broadcast along with “deaths due to medical error to date this year” as contextual information?

Medical errors result in approximately 685 deaths per day.

If you heard the report of the shootings in San Bernardino on December 2, 2015, in which 14 people were killed, and the report pointed out that to date approximately 230,160 people had died that year due to medical errors, which would you judge to be the more serious problem?

Lacking a label for medical error as a cause of death prevents public discussion of the third leading cause of death in the United States.

Contrast that with the public discussion over the largely non-existent problem of terrorism in the United States.

April 20, 2016

Searching for Subjects: Which Method is Right for You?

Filed under: Subject Identity,Topic Maps — Patrick Durusau @ 3:41 pm

Leaving to one side how to avoid re-evaluating the repetitive glut of materials from any search, there is the more fundamental problem: how do you search for a subject?

This is a back-of-the-envelope sketch that I will be expanding, but here goes:

Basic Search

At its most basic, a search consists of a <term>, and the search seeks strings that match that <term>.

Even allowing for Boolean operators, the matches against <term> are only and forever string matches.

Basic Search + Synonyms

Of course, as skilled searchers you will try not only one <term>, but several <synonym>s for the term as well.

A good example of that strategy is used at PubMed:

If you enter an entry term for a MeSH term the translation will also include an all fields search for the MeSH term associated with the entry term. For example, a search for odontalgia will translate to: “toothache”[MeSH Terms] OR “toothache”[All Fields] OR “odontalgia”[All Fields] because Odontalgia is an entry term for the MeSH term toothache. [PubMed Help]

The expansion to include the MeSH term Odontalgia is useful, but how do you maintain it?

A reader can see “toothache” and “Odontalgia” are treated as synonyms, but why remains elusive.

This is the area of owl:sameAs, the mapping of multiple subject identifiers/locators to a single topic, etc. You know that “sameness” exists, but why isn’t clear.

Subject Identity Properties

In order to maintain a PubMed or similar mapping, you need people who either “know” the basis for the mappings or you can have the mappings documented. That is, you can say on what basis the mapping happened and what properties were present.

For example:

toothache

Key                  Value
symptom              pain
general-location     mouth
specific-location    tooth

So if we are mapping terms to other terms and the specific location value reads “tongue,” then we know that isn’t a mapping to “toothache.”
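A sketch of documenting the basis for such mappings: compare the identifying properties rather than recording a bare synonym pair. The property sets echo the toothache table above; glossodynia (tongue pain) is included only as a contrasting illustration:

# Sketch of documenting *why* two terms map to the same subject: compare
# identifying properties instead of recording a bare synonym pair.

toothache = {"symptom": "pain", "general-location": "mouth",
             "specific-location": "tooth"}
odontalgia = {"symptom": "pain", "general-location": "mouth",
              "specific-location": "tooth"}
glossodynia = {"symptom": "pain", "general-location": "mouth",
               "specific-location": "tongue"}

def same_subject(a, b):
    """Map two terms together only when all identifying properties agree."""
    return a == b

print(same_subject(toothache, odontalgia))   # True: the basis is documented
print(same_subject(toothache, glossodynia))  # False: specific-location differs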

How Far Do You Need To Go?

Of course for every term that we use as a key or value, there can be an expansion into key/value pairs, such as for tooth:

tooth

Key                  Value
general-location     mouth
composition          enamel coated bone
use                  biting, chewing

Observations:

Each step towards more precise gathering of information increases your pre-search costs but decreases your post-search cost of casting out irrelevant material.

Moreover, precise gathering of information will help you avoid missing data simply due to data glut returns.

If maintenance of your mapping across generations is a concern, doing more than mapping synonyms for unrecorded reasons may be in order.

The point being that your current retrieval, or lack thereof, of current and correct information has a cost. As does improving your current retrieval.

The question of improved retrieval isn’t an ideological one but an ROI-driven one.

  • If you have better mappings, will that give you an advantage over department/agency N?
  • Will better retrieval reduce (never eliminate) the time staff waste on voluminous search results?
  • Will more precision focus your limited resources (always limited) on highly relevant materials?

Formulate your own ROI questions and means of measuring them. Then reach out to topic maps to see how they improve (or not) your ROI.

Properly used, I think you are in for a pleasant surprise with topic maps.

April 12, 2016

Coeffects: Context-aware programming languages – Subject Identity As Type Checking?

Filed under: Subject Identity,TMRM,Topic Maps,Types — Patrick Durusau @ 3:14 pm

Coeffects: Context-aware programming languages by Tomas Petricek.

From the webpage:

Coeffects are Tomas Petricek‘s PhD research project. They are a programming language abstraction for understanding how programs access the context or environment in which they execute.

The context may be resources on your mobile phone (battery, GPS location or a network printer), IoT devices in a physical neighborhood or historical stock prices. By understanding the neighborhood or history, a context-aware programming language can catch bugs earlier and run more efficiently.

This page is an interactive tutorial that shows a prototype implementation of coeffects in a browser. You can play with two simple context-aware languages, see how the type checking works and how context-aware programs run.

This page is also an experiment in presenting programming language research. It is a live environment where you can play with the theory using the power of new media, rather than staring at a dead pieces of wood (although we have those too).

(break from summary)

Programming languages evolve to reflect the changes in the computing ecosystem. The next big challenge for programming language designers is building languages that understand the context in which programs run.

This challenge is not easy to see. We are so used to working with context using the current cumbersome methods that we do not even see that there is an issue. We also do not realize that many programming features related to context can be captured by a simple unified abstraction. This is what coeffects do!

What if we extend the idea of context to include the context within which words appear?

For example, in a police report the following sentence appeared:

There were 20 or more <proxy string=”black” pos=”noun” synonym=”African” type=”race”/>s in the group.

For display purposes, the string value “black” appears in the sentence:

There were 20 or more blacks in the group.

But a search for the color “black” would not return that report because the type = color does not match type = race.

On the other hand, if I searched for African-American, that report would show up because “black” with type = race is recognized as a synonym for people of African extraction.
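A sketch of a search that respects the proxy's type attribute; the regex parsing and attribute handling are deliberately simplified, and the report text echoes the example above:

# Sketch of searching text carrying inline proxies: a hit requires the
# proxy's type to agree, not just its string, so a search for the color
# "black" skips the race sense.

import re

PROXY = re.compile(
    r'<proxy string="(?P<string>[^"]+)"[^>]*?'
    r'synonym="(?P<synonym>[^"]+)"[^>]*?type="(?P<type>[^"]+)"\s*/>'
)

report = ('There were 20 or more '
          '<proxy string="black" pos="noun" synonym="African" type="race"/>s '
          'in the group.')

def matches(text, term, term_type):
    """Match on string or synonym, but only when the type agrees."""
    for m in PROXY.finditer(text):
        if m.group("type") != term_type:
            continue
        if term.lower() in (m.group("string").lower(), m.group("synonym").lower()):
            return True
    return False

print(matches(report, "black", "color"))    # False: the proxy's type is race
print(matches(report, "African", "race"))   # True: synonym and type agree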

Inline proxies are the easiest to illustrate but that is only one way to serialize such a result.

If done in an authoring interface, such an approach would have the distinct advantage of offering the original author the choice of subject properties.

The advantage of involving the original author is that they have an interest in and awareness of the document in question. Quite unlike automated processes that later attempt annotation by rote.

April 9, 2016

No Perception Without Cartography [Failure To Communicate As Cartographic Failure]

Filed under: Perception,Subject Identity,Topic Maps — Patrick Durusau @ 8:12 pm

Dan Klyn tweeted:

No perception without cartography

with an image of this text (from Self comes to mind: constructing the conscious mind by Antonio R Damasio):


The nonverbal kinds of images are those that help you display mentally the concepts that correspond to words. The feelings that make up the background of each mental instant and that largely signify aspects of the body state are images as well. Perception, in whatever sensory modality, is the result of the brain’s cartographic skill.

Images represent physical properties of entities and their spatial and temporal relationships, as well as their actions. Some images, which probably result from the brain’s making maps of itself making maps, are actually quite abstract. They describe patterns of occurrence of objects in time and space, the spatial relationships and movement of objects in terms of velocity and trajectory, and so forth.

Dan’s tweet spurred me to think that our failures to communicate to others could be described as cartographic failures.

If we use a term that is unknown to the average reader, say “daat,” the reader lacks a mental mapping that enables interpretation of that term.

Even if you know the term, it doesn’t stand in isolation in your mind. It fits into a number of maps, some of which you may be able to articulate and very possibly into other maps, which remain beyond your (and our) ken.

Not that this is a light-going-off experience for you or me, but perhaps the cartographic imagery is helpful in illustrating both the value and the risks of topic maps.

The value of topic maps is spoken of often but the risks of topic maps rarely get equal press.

How would topic maps be risky?

Well, consider the average spreadsheet used in a business setting.

Felienne Hermans, in Spreadsheets: The Ununderstood Dark Matter of IT, makes a persuasive case that spreadsheets are on average five years old with little or no documentation.

If those spreadsheets remain undocumented, both users and auditors are equally stymied by their ignorance, a cartographic failure that leaves both wondering what must have been meant by columns and operations in the spreadsheet.

To the extent that a topic map or other disclosure mechanism preserves and/or restores the cartography that enables interpretation of the spreadsheet, suddenly staff are no longer plausibly ignorant of the purpose or consequences of using the spreadsheet.

Facile explanations that change from audit to audit are no longer possible. Auditors are chargeable with consistent auditing from one audit to another.

Does it sound like there is going to be a rush to use topic maps or other mechanisms to make spreadsheets transparent?

Still, transparency that befalls one could well benefit another.

Or to paraphrase King David (2 Samuel 11:25):

Don’t concern yourself about this. In business, transparency falls on first one and then another.

Ready to inflict transparency on others?

March 2, 2016

Graph Encryption: Going Beyond Encrypted Keyword Search [Subject Identity Based Encryption]

Filed under: Cryptography,Cybersecurity,Graphs,Subject Identity,Topic Maps — Patrick Durusau @ 4:49 pm

Graph Encryption: Going Beyond Encrypted Keyword Search by Xianrui Meng.

From the post:

Encrypted search has attracted a lot of attention from practitioners and researchers in academia and industry. In previous posts, Seny already described different ways one can search on encrypted data. Here, I would like to discuss search on encrypted graph databases which are gaining a lot of popularity.

1. Graph Databases and Graph Privacy

As today’s data is getting bigger and bigger, traditional relational database management systems (RDBMS) cannot scale to the massive amounts of data generated by end users and organizations. In addition, RDBMSs cannot effectively capture certain data relationships; for example in object-oriented data structures which are used in many applications. Today, NoSQL (Not Only SQL) has emerged as a good alternative to RDBMSs. One of the many advantages of NoSQL systems is that they are capable of storing, processing, and managing large volumes of structured, semi-structured, and even unstructured data. NoSQL databases (e.g., document stores, wide-column stores, key-value (tuple) store, object databases, and graph databases) can provide the scale and availability needed in cloud environments.

In an Internet-connected world, graph database have become an increasingly significant data model among NoSQL technologies. Social networks (e.g., Facebook, Twitter, Snapchat), protein networks, electrical grid, Web, XML documents, networked systems can all be modeled as graphs. One nice thing about graph databases is that they store the relations between entities (objects) in addition to the entities themselves and their properties. This allows the search engine to navigate both the data and their relationships extremely efficiently. Graph databases rely on the node-link-node relationship, where a node can be a profile or an object and the edge can be any relation defined by the application. Usually, we are interested in the structural characteristics of such a graph databases.

What do we mean by the confidentiality of a graph? And how to do we protect it? The problem has been studied by both the security and database communities. For example, in the database and data mining community, many solutions have been proposed based on graph anonymization. The core idea here is to anonymize the nodes and edges in the graph so that re-identification is hard. Although this approach may be efficient, from a security point view it is hard to tell what is achieved. Also, by leveraging auxiliary information, researchers have studied how to attack this kind of approach. On the other hand, cryptographers have some really compelling and provably-secure tools such as ORAM and FHE (mentioned in Seny’s previous posts) that can protect all the information in a graph database. The problem, however, is their performance, which is crucial for databases. In today’s world, efficiency is more than running in polynomial time; we need solutions that run and scale to massive volumes of data. Many real world graph datasets, such as biological networks and social networks, have millions of nodes, some even have billions of nodes and edges. Therefore, besides security, scalability is one of main aspects we have to consider.

2. Graph Encryption

Previous work in encrypted search has focused on how to search encrypted documents, e.g., doing keyword search, conjunctive queries, etc. Graph encryption, on the other hand, focuses on performing graph queries on encrypted graphs rather than keyword search on encrypted documents. In some cases, this makes the problem harder since some graph queries can be extremely complex. Another technical challenge is that the privacy of nodes and edges needs to be protected but also the structure of the graph, which can lead to many interesting research directions.

Graph encryption was introduced by Melissa Chase and Seny in [CK10]. That paper shows how to encrypt graphs so that certain graph queries (e.g., neighborhood, adjacency and focused subgraphs) can be performed (though the paper is more general as it describes structured encryption). Seny and I, together with Kobbi Nissim and George Kollios, followed this up with a paper last year [MKNK15] that showed how to handle more complex graphs queries.

Apologies for the long quote but I thought this topic might be new to some readers. Xianrui goes on to describe a solution for efficient queries over encrypted graphs.

Chase and Kamara remark in Structured Encryption and Controlled Disclosure, CK10:


To address this problem we introduce the notion of structured encryption. A structured encryption scheme encrypts structured data in such a way that it can be queried through the use of a query-specific token that can only be generated with knowledge of the secret key. In addition, the query process reveals no useful information about either the query or the data. An important consideration in this context is the efficiency of the query operation on the server side. In fact, in the context of cloud storage, where one often works with massive datasets, even linear time operations can be infeasible. (emphasis in original)

With just a little nudging, their:

A structured encryption scheme encrypts structured data in such a way that it can be queried through the use of a query-specific token that can only be generated with knowledge of the secret key.

could be re-stated as:

A subject identity encryption scheme leaves out merging data in such a way that the resulting topic map can only be queried with knowledge of the subject identity merging key.

You may have topics that represent diagnoses such as cancer or AIDS, or records of sexual contacts, but if none of those can be associated with individuals who are also topics in the map, there is no more disclosure than census results for a metropolitan area and a list of the citizens therein.

That is, you are missing the critical merging data that would link up (associate) any diagnosis with a given individual.

Multi-property subject identities would make the problem even harder, to say nothing of conferring properties on the basis of supplied properties as part of the merging process.

One major benefit of a subject identity based approach is that without the merging key, any data set, however sensitive the information, remains just a data set until you have the basis for solving its subject identity riddle.

PS: With the usual caveats of not using social security numbers, birth dates and the like as your subject identity properties. At least not in the map proper. I can think of several ways to generate keys for merging that would be resistant to even brute force attacks.
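For the curious, here is a minimal sketch of one such approach. It is my construction, not anything from the papers quoted above: derive an opaque merging token from a subject’s identity properties with a keyed hash, so that holding the published map gives no way to test guessed identities against its tokens without the secret key.

```python
# A minimal sketch (my construction, not from the quoted papers): derive an
# opaque merging token from a subject's identity properties using a keyed
# hash. Without the secret key, an attacker holding the map's tokens cannot
# test guessed property values against them.
import hmac
import hashlib
import json

def merging_token(identity_properties, secret_key):
    """Canonicalize the identity properties and HMAC them with the secret key."""
    canonical = json.dumps(identity_properties, sort_keys=True, separators=(",", ":"))
    return hmac.new(secret_key, canonical.encode("utf-8"), hashlib.sha256).hexdigest()

# Hypothetical example: the published map stores only tokens next to the
# sensitive topics; the curator keeps the key offline.
secret = b"replace-with-a-long-random-key-held-by-the-curator"
person = {"name": "Jane Roe", "dob": "1990-01-01"}   # illustrative values only
diagnosis_topic = {"diagnosis": "example-condition",
                   "subject_token": merging_token(person, secret)}

# Only someone holding `secret` can recompute the token from the same
# properties and so merge the diagnosis topic with the person topic.
print(diagnosis_topic["subject_token"])
```

Slower key-derivation functions, per-map salts or splitting the key among curators all build on the same idea.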

Ping me if you are interested in pursuing that on a data set.

February 17, 2016

Challenges of Electronic Dictionary Publication

Filed under: Dictionary,Digital Research,Subject Identity,Topic Maps — Patrick Durusau @ 4:20 pm

Challenges of Electronic Dictionary Publication

From the webpage:

April 8-9th, 2016

Venue: University of Leipzig, GWZ, Beethovenstr. 15; H1.5.16

This April we will be hosting our first Dictionary Journal workshop. At this workshop we will give an introduction to our vision of „Dictionaria“, introduce our data model and current workflow and will discuss (among others) the following topics:

  • Methodology and concept: How are dictionaries of „small“ languages different from those of „big“ languages and what does this mean for our endeavour? (documentary dictionaries vs. standard dictionaries)
  • Reviewing process and guidelines: How to review and evaluate a dictionary database of minor languages?
  • User-friendliness: What are the different audiences and their needs?
  • Submission process and guidelines: reports from us and our first authors on how to submit and what to expect
  • Citation: How to cite dictionaries?

If you are interested in attending this event, please send an e-mail to dictionary.journal[AT]uni-leipzig.de.

Workshop program

Our workshop program can now be downloaded here.

See the webpage for a list of confirmed participants, some with submitted abstracts.

Any number of topic map related questions arise in a discussion of dictionaries.

  • How to represent dictionary models?
  • What properties should be used to identify the subjects that represent dictionary models?
  • On what basis, if any, should dictionary models be considered the same or different? And for what purposes?
  • What data should be captured by dictionaries and how should it be identified?
  • etc.

Those are only a few of the questions that could be refined into dozens, if not hundreds, more when you reach the details of constructing a dictionary.
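One rough way to make the first two questions concrete (my sketch, not the Dictionaria data model) is to give each entry explicit identity properties and make the basis for sameness an argument rather than an assumption:

```python
# A rough sketch (mine, not the Dictionaria data model) of a dictionary entry
# carrying explicit subject identity properties, so entries from different
# dictionaries can be compared on stated properties rather than on headword
# strings alone.
from dataclasses import dataclass, field

@dataclass
class Entry:
    headword: str
    language: str            # e.g. an ISO 639-3 code
    part_of_speech: str
    senses: list = field(default_factory=list)
    identity: dict = field(default_factory=dict)   # extra properties used to decide sameness

def same_subject(a, b, keys=("language", "headword", "part_of_speech")):
    """Treat two entries as the same subject when the chosen identity keys agree."""
    return all(getattr(a, k) == getattr(b, k) for k in keys)

e1 = Entry("thrush", "eng", "noun", senses=["a songbird"], identity={"source": "dict-A"})
e2 = Entry("thrush", "eng", "noun", senses=["a bird of the family Turdidae"], identity={"source": "dict-B"})
print(same_subject(e1, e2))   # True under these keys; a different basis gives a different answer
```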

I won’t be attending but await the output from this workshop with great anticipation!

February 13, 2016

You Can Confirm A Gravity Wave!

Filed under: Physics,Python,Science,Signal Processing,Subject Identity,Subject Recognition — Patrick Durusau @ 5:35 pm

Unless you have been unconscious since last Wednesday, you have heard about the confirmation of Einstein’s 1916 prediction of gravitational waves.

A very incomplete list of popular reports includes:

Einstein, A Hunch And Decades Of Work: How Scientists Found Gravitational Waves (NPR)

Einstein’s gravitational waves ‘seen’ from black holes (BBC)

Gravitational Waves Detected, Confirming Einstein’s Theory (NYT)

Gravitational waves: breakthrough discovery after a century of expectation (Guardian)

For the full monty, see the LIGO Scientific Collaboration itself.

Which brings us to the iPython notebook with the gravitational wave discovery data: Signal Processing with GW150914 Open Data

From the post:

Welcome! This ipython notebook (or associated python script GW150914_tutorial.py) will go through some typical signal processing tasks on strain time-series data associated with the LIGO GW150914 data release from the LIGO Open Science Center (LOSC):

To begin, download the ipython notebook, readligo.py, and the data files listed below, into a directory / folder, then run it. Or you can run the python script GW150914_tutorial.py. You will need the python packages: numpy, scipy, matplotlib, h5py.

On Windows, or if you prefer, you can use a python development environment such as Anaconda (https://www.continuum.io/why-anaconda) or Enthought Canopy (https://www.enthought.com/products/canopy/).

Questions, comments, suggestions, corrections, etc: email losc@ligo.org

v20160208b

Unlike the toadies at the New England Journal of Medicine (see Parasitic Re-use of Data? Institutionalizing Toadyism, Addressing The Concerns Of The Selfish), the scientists who have labored for decades on the gravitational wave question are giving their data away for free!

Not only giving the data away, but striving to help others learn to use it!

Beyond simply “doing the right thing,” and setting an example for other scientists, this is a great opportunity to learn more about signal processing.

Signal processing, when you stop to think about it, is an important method of “subject identification” in a large number of domains.

Detecting a gravity wave is beyond your personal means, but with the data freely available, further analysis is a matter of interest and perseverance.
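As a taste of what that first step looks like, here is a minimal sketch that assumes the LOSC HDF5 layout the tutorial uses, with strain stored under strain/Strain and the sample spacing as an Xspacing attribute; the tutorial’s own readligo.py handles the details more carefully:

```python
# A minimal first step (a sketch assuming the LOSC HDF5 layout, with strain
# under "strain/Strain" and the sample spacing stored as its "Xspacing"
# attribute): load the strain and band-pass it into the region where
# GW150914 is visible.
import h5py
import numpy as np
from scipy.signal import butter, filtfilt

def load_strain(path):
    with h5py.File(path, "r") as f:
        strain = f["strain/Strain"][...]
        dt = f["strain/Strain"].attrs["Xspacing"]
    return strain, 1.0 / dt

def bandpass(data, fs, low=35.0, high=350.0, order=4):
    nyq = 0.5 * fs
    b, a = butter(order, [low / nyq, high / nyq], btype="band")
    return filtfilt(b, a, data)

# Substitute the file you downloaded from the LOSC data release.
strain, fs = load_strain("H-H1_LOSC_4_V1-1126259446-32.hdf5")
filtered = bandpass(strain, fs)
print(len(filtered), "samples at", fs, "Hz")
```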

November 19, 2015

Infinite Dimensional Word Embeddings [Variable Representation, Death to Triples]

Infinite Dimensional Word Embeddings by Eric Nalisnick and Sachin Ravi.

Abstract:

We describe a method for learning word embeddings with stochastic dimensionality. Our Infinite Skip-Gram (iSG) model specifies an energy-based joint distribution over a word vector, a context vector, and their dimensionality, which can be defined over a countably infinite domain by employing the same techniques used to make the Infinite Restricted Boltzmann Machine (Cote & Larochelle, 2015) tractable. We find that the distribution over embedding dimensionality for a given word is highly interpretable and leads to an elegant probabilistic mechanism for word sense induction. We show qualitatively and quantitatively that the iSG produces parameter-efficient representations that are robust to language’s inherent ambiguity.

Even better from the introduction:

To better capture the semantic variability of words, we propose a novel embedding method that produces vectors with stochastic dimensionality. By employing the same mathematical tools that allow the definition of an Infinite Restricted Boltzmann Machine (Côté & Larochelle, 2015), we describe a log-bilinear energy-based model–called the Infinite Skip-Gram (iSG) model–that defines a joint distribution over a word vector, a context vector, and their dimensionality, which has a countably infinite domain. During training, the iSGM allows word representations to grow naturally based on how well they can predict their context. This behavior enables the vectors of specific words to use few dimensions and the vectors of vague words to elongate as needed. Manual and experimental analysis reveals this dynamic representation elegantly captures specificity, polysemy, and homonymy without explicit definition of such concepts within the model. As far as we are aware, this is the first word embedding method that allows representation dimensionality to be variable and exhibit data-dependent growth.

Imagine a topic map model that “allow[ed] representation dimensionality to be variable and exhibit data-dependent growth.”

Simple subjects, say the sort you find at schema.org, can have simple representations.

More complex subjects, say the notion of “person” in U.S. statutory law (no, I won’t attempt to list them here), can extend their dimensional representations as far as is necessary.

Of course in this case, the dimensions are learned from a corpus but I don’t see any barrier to the intentional creation of dimensions for subjects and/or a combined automatic/directed creation of dimensions.

Or as I put it in the title, Death to Triples.

More precisely, not just triples but any pre-determined limit on representation.
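A toy sketch, mine and not the iSG model, of what a representation with no pre-set limit might look like for subjects rather than words:

```python
# A toy sketch (mine, not the iSG model): a subject representation with no
# pre-set number of dimensions. Simple subjects stay small; complex subjects
# grow new named dimensions as evidence for them accumulates.
from collections import defaultdict

class Subject:
    def __init__(self, name):
        self.name = name
        self.dims = defaultdict(float)   # a dimension exists only once it is needed

    def add_evidence(self, dimension, weight=1.0):
        self.dims[dimension] += weight

book = Subject("book")                 # a schema.org-style simple subject
book.add_evidence("schema.org/Book")

person = Subject("person (U.S. statutory law)")
for dim in ["natural person", "corporation", "partnership", "trust"]:
    person.add_evidence(dim)           # keeps elongating as senses are encountered

print(len(book.dims), len(person.dims))   # 1 vs 4: data-dependent growth
```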

Looking forward to taking a slow read on this article and those it cites. Very promising.

November 18, 2015

Knowing the Name of Something vs. Knowing How To Identify Something

Filed under: Identification,Names,Subject Identity — Patrick Durusau @ 10:08 pm

Richard Feynman: The Difference Between Knowing the Name of Something and Knowing Something

From the post:


In this short clip (below), Feynman articulates the difference between knowing the name of something and understanding it.

See that bird? It’s a brown-throated thrush, but in Germany it’s called a halzenfugel, and in Chinese they call it a chung ling and even if you know all those names for it, you still know nothing about the bird. You only know something about people; what they call the bird. Now that thrush sings, and teaches its young to fly, and flies so many miles away during the summer across the country, and nobody knows how it finds its way.

Knowing the name of something doesn’t mean you understand it. We talk in fact-deficient, obfuscating generalities to cover up our lack of understanding.

You won’t get to see the Feynman quote live because it has been blocked by BBC Worldwide on copyright grounds. No doubt they make a bag full of money every week off that 179 second clip of Feynman.

The stronger point for Feynman would be to point out that you can’t recognize anything on the basis of knowing a name.

I may be sitting next to Cindy Lou Who on the bus but knowing her name isn’t going to help me to recognize her.

Knowing the name of someone or something isn’t useful unless you know something about the person or thing you associate with a name.

That is, you know when it is appropriate to use the name you have learned and when to say: “Sorry, I don’t know your name or the name of (indicating in some manner).” At which point you will learn a new name and store a new set of properties to know when to use that name, instead of any other name you know.

Everyone does that exercise, learning new names and the properties that establish when it is appropriate to use a particular name. And we do so seamlessly.

So seamlessly that when we are called upon to make explicit “how” we know which name to use (subject identification, in other words), it takes a lot of effort.

It’s enough effort that it should be done only when necessary and when we can show the user an immediate semantic ROI for their effort.

More on this to follow.

September 27, 2015

What Does Probability Mean in Your Profession? [Divergences in Meaning]

Filed under: Mathematics,Probability,Subject Identity — Patrick Durusau @ 9:39 pm

What Does Probability Mean in Your Profession? by Ben Orlin.

Impressive drawings that illustrate the divergence in meaning of “probability” for various professions.

I’m not sold on the “actual meaning” drawing because if everyone in a discipline understands “probability” to mean something else, on what basis can you argue for the “actual meaning?”

If I am reading a paper by someone who subscribes to a different meaning than your claimed “actual” one, then I am going to reach erroneous conclusions about their paper. Yes?

That is in order to understand a paper I have to understand the words as they are being used by the author. Yes?

If I understand “democracy and freedom” to mean “serves the interest of U.S.-based multinational corporations,” then calls for “democracy and freedom” in other countries aren’t going to impress me all that much.

Enjoy the drawings!

September 15, 2015

Value of Big Data Depends on Identities in Big Data

Filed under: BigData,Subject Identity,Topic Maps — Patrick Durusau @ 9:42 pm

Intel Exec: Extracting Value From Big Data Remains Elusive by George Leopold.

From the post:

Intel Corp. is convinced it can sell a lot of server and storage silicon as big data takes off in the datacenter. Still, the chipmaker finds that major barriers to big data adoption remain, most especially what to do with all those zettabytes of data.

“The dirty little secret about big data is no one actually knows what to do with it,” Jason Waxman, general manager of Intel’s Cloud Platforms Group, asserted during a recent company datacenter event. Early adopters “think they know what to do with it, and they know they have to collect it because you have to have a big data strategy, of course. But when it comes to actually deriving the insight, it’s a little harder to go do.”

Put another way, industry analysts rate the difficulty of determining the value of big data as far outweighing considerations like technological complexity, integration, scaling and other infrastructure issues. Nearly two-thirds of respondents to a Gartner survey last year cited by Intel stressed they are still struggling to determine the value of big data.

“Increased investment has not led to an associated increase in organizations reporting deployed big data projects,” Gartner noted in its September 2014 big data survey. “Much of the work today revolves around strategy development and the creation of pilots and experimental projects.”

[Figure: Gartner survey results on the barriers to big data adoption]

It may just be me, but “determining value,” “risk and governance,” and “integrating multiple data sources,” the top three barriers to use of big data, all depend on knowing the identities represented in big data.

The trivial data integration demos that share “customer-ID” fields don’t inspire a lot of confidence about data integration when “customer-ID” may be identified in as many ways as there are data sources. And that is a minor example.
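A small sketch of what those demos gloss over: two sources that both hold a “customer ID,” but under different field names and formats. The field names and the normalization rule here are illustrative assumptions, not a standard:

```python
# Two sources that both hold a "customer ID" under different names and
# formats (illustrative field names, not a standard). An explicit identity
# rule, not a shared column name, is what lines the records up.
source_a = [{"cust_id": "0001234", "name": "Acme Ltd"}]
source_b = [{"CustomerNumber": 1234, "company": "ACME LTD."}]

def identity_key(record):
    """Reduce whichever identifier a source uses to one comparable form."""
    raw = record.get("cust_id") or record.get("CustomerNumber")
    return str(int(raw))          # strip leading zeros, unify the type

index = {identity_key(r): r for r in source_a}
merged = [{**index[identity_key(r)], **r} for r in source_b if identity_key(r) in index]
print(merged)   # the two views of customer 1234, joined by a stated identity rule
```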

It would be very hard to determine the value you can extract from data when you don’t know what the data represents, its accuracy (risk and governance), and what may be necessary to integrate it with other data sources.

More processing power from Intel is always welcome but churning poorly understood big data faster isn’t going to create value. Quite the contrary, investment in more powerful hardware isn’t going to be favorably reflected on the bottom line.

Investment in capturing the diverse identities in big data will empower easier valuation of big data, evaluation of its risks and uncovering how to integrate diverse data sources.

Capturing diverse identities won’t be easy, cheap or quick. But not capturing them will leave the value of Big Data unknown, its risks uncertain and integration a crap shoot when it is ever attempted.

Your call.

August 1, 2015

Things That Are Clear In Hindsight

Filed under: Social Sciences,Subject Identity,Subject Recognition — Patrick Durusau @ 12:10 pm

Sean Gallagher recently tweeted:

Oh look, the Triumphalism Trilogy is now a boxed set.

[Image: the Triumphalism Trilogy boxed set]

In case you are unfamiliar with the series, The Tipping Point, Blink, Outliers.

Although entertaining reads, particularly The Tipping Point (IMHO), Gladwell does not describe how to recognize a tipping point in advance of it being a tipping point, nor how to make good decisions without thinking (Blink) or how to recognize human potential before success (Outliers).

Tipping points, good decisions and human potential can be recognized only when they are manifested.

As you can tell from Gladwell’s book sales, selling the hope of knowing the unknowable remains a viable market.

July 6, 2015

Which Functor Do You Mean?

Filed under: Homonymous,Names,Subject Identity — Patrick Durusau @ 8:34 pm

Peteris Krumins calls attention to the classic confusion of names that topic maps address in On Functors.

From the post:

It’s interesting how the term “functor” means completely different things in various programming languages. Take C++ for example. Everyone who has mastered C++ knows that you call a class that implements operator() a functor. Now take Standard ML. In ML functors are mappings from structures to structures. Now Haskell. In Haskell functors are just homomorphisms over containers. And in Prolog functor means the atom at the start of a structure. They all are different. Let’s take a closer look at each one.

Peteris has said twice in the first paragraph that each of these “functors” is different. Don’t rush to his 2010 post to point out they are different. That was the point of the post. Yes?

Exercise: All of these uses of functor could be scoped by language. What properties of each “functor” would you use to distinguish them besides their language of origin?
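Here is one way to start on the exercise. The distinguishing properties are my own choices, offered as a sketch rather than a definitive analysis:

```python
# One way to start the exercise (a sketch; the distinguishing properties are
# my own choices): scope each sense of "functor" by language and record the
# properties that set it apart.
functors = [
    {"name": "functor", "scope": "C++",
     "properties": {"kind": "class implementing operator()", "callable object": True}},
    {"name": "functor", "scope": "Standard ML",
     "properties": {"kind": "mapping from structures to structures", "module-level": True}},
    {"name": "functor", "scope": "Haskell",
     "properties": {"kind": "type class with fmap", "laws": ["identity", "composition"]}},
    {"name": "functor", "scope": "Prolog",
     "properties": {"kind": "atom at the start of a compound term"}},
]

def distinguish(a, b):
    """Report the properties that separate two same-named subjects."""
    return {k: (a["properties"].get(k), b["properties"].get(k))
            for k in set(a["properties"]) | set(b["properties"])
            if a["properties"].get(k) != b["properties"].get(k)}

print(distinguish(functors[0], functors[2]))   # C++ "functor" vs Haskell "functor"
```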

April 14, 2015

Attribute-Based Access Control with a graph database [Topic Maps at NIST?]

Filed under: Cybersecurity,Graphs,Neo4j,NIST,Security,Subject Identity,Topic Maps — Patrick Durusau @ 3:25 pm

Attribute-Based Access Control with a graph database by Robin Bramley.

From the post:

Traditional access control relies on the identity of a user, their role or their group memberships. This can become awkward to manage, particularly when other factors such as time of day, or network location come into play. These additional factors, or attributes, require a different approach, the US National Institute of Standards and Technology (NIST) have published a draft special paper (NIST 800-162) on Attribute-Based Access Control (ABAC).

This post, and the accompanying Graph Gist, explore the suitability of using a graph database to support policy decisions.

Before we dive into the detail, it’s probably worth mentioning that I saw the recent GraphGist on Entitlements and Access Control Management and that reminded me to publish my Attribute-Based Access Control GraphGist that I’d written some time ago, originally in a local instance having followed Stefan Armbruster’s post about using Docker for that very purpose.

Using a Property Graph, we can model attributes using relationships and/or properties. Fine-grained relationships without qualifier properties make patterns easier to spot in visualisations and are more performant. For the example provided in the gist, the attributes are defined using solely fine-grained relationships.

Graph visualization (and querying) of attribute-based access control.

I found this portion of the NIST draft particularly interesting:


There are characteristics or attributes of a subject such as name, date of birth, home address, training record, and job function that may, either individually or when combined, comprise a unique identity that distinguishes that person from all others. These characteristics are often called subject attributes. The term subject attributes is used consistently throughout this document.

In the course of a person’s life, he or she may work for different organizations, may act in different roles, and may inherit different privileges tied to those roles. The person may establish different personas for each organization or role and amass different attributes related to each persona. For example, an individual may work for Company A as a gate guard during the week and may work for Company B as a shift manager on the weekend. The subject attributes are different for each persona. Although trained and qualified as a Gate Guard for Company A, while operating in her Company B persona as a shift manager on the weekend she does not have the authority to perform as a Gate Guard for Company B.
…(emphasis in the original)

Clearly NIST recognizes that subjects, at least in the sense of people, are identified by a set of “subject attributes” that uniquely identify that subject. It doesn’t seem like much of a leap to recognize that the same holds for other subjects, including the attributes used to identify subjects.
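To make that concrete, here is a minimal sketch, not the NIST reference model or the GraphGist above, of an attribute-based check built on the gate guard example from the quote (the name is hypothetical):

```python
# A minimal sketch (not the NIST reference model or the linked GraphGist) of
# an attribute-based check. The same person carries different attribute sets
# under different personas, and the decision is made from the persona's
# attributes, not from the bare name.
personas = {
    ("Jane Doe", "Company A"): {"role": "gate guard", "qualified_gate_guard": True},
    ("Jane Doe", "Company B"): {"role": "shift manager", "qualified_gate_guard": False},
}

def may_act_as_gate_guard(person, organization):
    attrs = personas.get((person, organization), {})
    # Authority requires the gate guard role within this persona.
    return attrs.get("role") == "gate guard"

print(may_act_as_gate_guard("Jane Doe", "Company A"))   # True
print(may_act_as_gate_guard("Jane Doe", "Company B"))   # False: different persona, different attributes
```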

I don’t know what other US government agencies have similar language but it sounds like a starting point for a robust discussion of topic maps and their advantages.

Yes?

March 18, 2015

Interactive Intent Modeling: Information Discovery Beyond Search

Interactive Intent Modeling: Information Discovery Beyond Search by Tuukka Ruotsalo, Giulio Jacucci, Petri Myllymäki, Samuel Kaski.

From the post:

Combining intent modeling and visual user interfaces can help users discover novel information and dramatically improve their information-exploration performance.

Current-generation search engines serve billions of requests each day, returning responses to search queries in fractions of a second. They are great tools for checking facts and looking up information for which users can easily create queries (such as “Find the closest restaurants” or “Find reviews of a book”). What search engines are not good at is supporting complex information-exploration and discovery tasks that go beyond simple keyword queries. In information exploration and discovery, often called “exploratory search,” users may have difficulty expressing their information needs, and new search intents may emerge and be discovered only as they learn by reflecting on the acquired information [8,9,18]. This finding roots back to the “vocabulary mismatch problem” [13] that was identified in the 1980s but has remained difficult to tackle in operational information retrieval (IR) systems (see the sidebar “Background”). In essence, the problem refers to human communication behavior in which the humans writing the documents to be retrieved and the humans searching for them are likely to use very different vocabularies to encode and decode their intended meaning [8,21].

Assisting users in the search process is increasingly important, as everyday search behavior ranges from simple look-ups to a spectrum of search tasks [23] in which search behavior is more exploratory and information needs and search intents uncertain and evolving over time.

We introduce interactive intent modeling, an approach promoting resourceful interaction between humans and IR systems to enable information discovery that goes beyond search. It addresses the vocabulary mismatch problem by giving users potential intents to explore, visualizing them as directions in the information space around the user’s present position, and allowing interaction to improve estimates of the user’s search intents.

What!? All those years spent trying to beat users into learning complex search languages were in vain? Say it’s not so!

But, apparently it is so. All of the research on the “vocabulary mismatch problem,” on people using “different vocabularies to encode and decode their meaning,” has come back to bite information systems that offer static, author-driven vocabularies.

Users search best, no surprise, through vocabularies they recognize and understand.
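A toy sketch of the general mechanism, a plain relevance-feedback loop rather than the authors’ intent model, with candidate terms I made up for illustration:

```python
# A toy sketch of the general mechanism (a plain relevance-feedback loop,
# not the authors' intent model): offer candidate intent terms in the user's
# own vocabulary, then re-weight them from whatever the user selects.
candidate_intents = {"thrush": 0.5, "songbird": 0.5, "Turdidae": 0.5, "yeast infection": 0.5}

def update_intents(weights, selected, boost=0.3, decay=0.1):
    """Nudge selected terms up and unselected terms down, keeping weights in [0, 1]."""
    return {term: min(1.0, w + boost) if term in selected else max(0.0, w - decay)
            for term, w in weights.items()}

# The user clicks the bird-related directions; the other sense of "thrush" fades.
candidate_intents = update_intents(candidate_intents, {"songbird", "Turdidae"})
print(sorted(candidate_intents.items(), key=lambda kv: -kv[1]))
```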

I don’t know of any interactive topic maps in the sense used here but that doesn’t mean that someone isn’t working on one.

A shift in this direction could do wonders for the results of searches.

March 16, 2015

Chemical databases: curation or integration by user-defined equivalence?

Filed under: Cheminformatics,Chemistry,InChl,Subject Identity — Patrick Durusau @ 2:52 pm

Chemical databases: curation or integration by user-defined equivalence? by Anne Hersey, Jon Chambers, Louisa Bellis, A. Patrícia Bento, Anna Gaulton, John P. Overington.

Abstract:

There is a wealth of valuable chemical information in publicly available databases for use by scientists undertaking drug discovery. However finite curation resource, limitations of chemical structure software and differences in individual database applications mean that exact chemical structure equivalence between databases is unlikely to ever be a reality. The ability to identify compound equivalence has been made significantly easier by the use of the International Chemical Identifier (InChI), a non-proprietary line-notation for describing a chemical structure. More importantly, advances in methods to identify compounds that are the same at various levels of similarity, such as those containing the same parent component or having the same connectivity, are now enabling related compounds to be linked between databases where the structure matches are not exact.

The authors identify a number of reasons why chemical databases have different structures recorded for the same chemicals. One problem is that there is no authoritative source for chemical structures, so upon publication authors publish those aspects most relevant to their interests, or publish images rather than machine-readable representations of a chemical. That is to say nothing of the usual antics with simple names and their confusions. There are also software limitations, business rules and other sources of a multiplicity of chemical structures.

Suffice it to say that the authors make a strong case for why there are multiple structures for any given chemical now and why that is going to continue.
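To see how the matching side can work in practice, here is a small sketch using RDKit (one common open-source route to InChI, assuming a build with InChI support; the compounds are my examples): exact matches share the full standard InChIKey, while compounds with the same connectivity share its first 14-character block.

```python
# A small sketch of InChI-based matching (my example compounds; assumes an
# RDKit build with InChI support): the full standard InChIKey identifies an
# exact structure, while its first 14-character block hashes only the
# connectivity, so stereoisomers share it.
from rdkit import Chem

def inchikey(smiles):
    return Chem.MolToInchiKey(Chem.MolFromSmiles(smiles))

l_alanine = inchikey("C[C@@H](N)C(=O)O")
d_alanine = inchikey("C[C@H](N)C(=O)O")

print(l_alanine == d_alanine)            # False: the full keys differ (stereochemistry)
print(l_alanine[:14] == d_alanine[:14])  # True: same connectivity block, so the records can be linked
```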

The authors openly ask if it is time to ask users for their assistance in mapping this diversity of structures:

Is it now time to accept that however diligent database providers are, there will always be differences in structure representations and indeed some errors in the structures that cannot be fixed with a realistic level of resource? Should we therefore turn our attention to encouraging the use and development of tools that enable the mapping together of related compounds rather than concentrate our efforts on ever more curation?

You know my answer to that question.

What’s yours?

I first saw this in a tweet by John P. Overington.

March 7, 2015

Fifty Words for Databases

Fifty Words for Databases by Phil Factor

From the post:

Almost every human endeavour seems simple from a distance: even database deployment. Reality always comes as a shock, because the closer you get to any real task, the more you come to appreciate the skills that are necessary to accomplish it.

One of the big surprises I have when I attend developer conferences is to be told by experts how easy it is to take a database from development and turn it into a production system, and then implement the processes that allow it to be upgraded safely. Occasionally, I’ve been so puzzled that I’ve drawn the speakers to one side after the presentation to ask them for the details of how to do it so effortlessly, mentioning a few of the tricky aspects I’ve hit. Invariably, it soon becomes apparent from their answers that their experience, from which they’ve extrapolated, is of databases the size of a spreadsheet with no complicated interdependencies, compliance issues, security complications, high-availability mechanisms, agent tasks, alerting systems, complex partitioning, queuing, replication, downstream analysis dependencies and so on about which you, the readers, know more than I. At the vast international enterprise where I once worked in IT, we had a coded insult for such people: ‘They’ve catalogued their CD collection in a database’. Unfair, unkind, but even a huge well-used ‘Big Data’ database dealing in social media is a tame and docile creature compared with a heavily-used OLTP trading system where any downtime or bug means figures for losses where you have to count the trailing zeros. The former has unique problems, of course, but the two types of database are so different.

I wonder if the problem is one of language. Just as the English have fifty ways of describing rainfall, and the Inuit have many ways of describing pack ice, it is about time that we created the language for a variety of databases from a mild drizzle (‘It is a soft morning to be sure’) to a cloud-burst. Until anyone pontificating about the database lifecycle can give their audience an indication of the type of database they’re referring to, we will continue to suffer the sort of misunderstandings that so frustrate the development process. Though I’m totally convinced that the development culture should cross-pollinate far more with the science of IT operations, it will need more than a DevOps group-hug; it will require a change in the technical language so that it can accurately describe the rich variety of databases in operational use and their widely-varying requirements. The current friction is surely due more to misunderstandings on both sides, because it is so difficult to communicate these requirements. Any suggestions for suitable descriptive words for types of database? (emphasis added)

If you have “descriptive words” to suggest to Phil, comment on his post.

With the realization that your “descriptive words” may be different from my “descriptive words” for the same database or mean a different database altogether or have nothing to do with databases at all (when viewed by others).

Yes, I have been thinking about identifiers, again, and will start off the coming week with a new series of posts on subject identification. I hope to include a proposal for a metric of subject identification.
