Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

February 3, 2013

G2 | Sensemaking – Two Years Old Today

Filed under: Context,G2 Sensemaking,Identity,Subject Identity — Patrick Durusau @ 6:59 pm

G2 | Sensemaking – Two Years Old Today by Jeff Jonas.

From the post:

What is G2?

When I speak about Context Accumulation, Data Finds Data and Relevance Finds You, and Sensemaking I am describing various aspects of G2.

In simple terms G2 software is designed to integrate diverse observations (data) as it arrives, in real-time.  G2 does this incrementally, piece by piece, much in the same way you would put a puzzle together at home.  And just like at home, the more puzzle pieces integrated into the puzzle, the more complete the picture.  The more complete the picture, the better the ability to make sense of what has happened in the past, what is happening now, and what may come next.  Users of G2 technology will be more efficient, deliver high quality outcomes, and ultimately will be more competitive.

Early adopters seem to be especially interested in one specific use case: Using G2 to help organizations better direct the attention of its finite workforce.  With the workforce now focusing on the most important things first, G2 is then used to improve the quality of analysis while at the same time reducing the amount of time such analysis takes.  The bigger the organization, the bigger the observation space, the more essential sensemaking is.

About Sensemaking

One of the things G2 can already do pretty darn well – considering she just turned two years old – is ”Sensemaking.”  Imagine a system capable of paying very close attention to every observation that comes its way.  Each observation incrementally improving upon the picture and using this emerging picture in real-time to make higher quality business decisions; for example, the selection of the perfect ad for a web page (in sub-200 milliseconds as the user navigates to the page) or raising an alarm to a human for inspection (an alarm sufficiently important to be placed top of the queue).  G2, when used this way, enables Enterprise Intelligence.

Of course there is no magic.  Sensemaking engines are limited by their available observation space.  If a sentient being would be unable to make sense of the situation based on the available observation space, neither would G2.  I am not talking about Fantasy Analytics here.

I would say “subject identity” instead of “sensemaking,” and after reading Jeff’s post, I consider the two to be synonyms.
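To make the puzzle-piece analogy concrete, here is a minimal sketch, in my own terms and not IBM's code, of incremental context accumulation: each arriving observation either enriches an entity it shares an identifying feature with or starts a new one. The matching rule and the example records are assumptions for illustration.

```python
from collections import defaultdict

class ContextAccumulator:
    """Toy context accumulator: observations arrive one at a time and are
    matched against accumulated entities on any shared identifying feature."""

    def __init__(self):
        self.entities = []                 # each entity: feature -> set of values
        self.index = defaultdict(set)      # (feature, value) -> entity ids, "data finds data"

    def observe(self, observation):
        """Integrate one observation (a dict of feature -> value) as it arrives."""
        candidates = set()
        for feature, value in observation.items():
            candidates |= self.index[(feature, value)]

        if candidates:                     # enrich a matching entity (toy resolution rule)
            entity_id = min(candidates)
        else:                              # otherwise start a new entity
            entity_id = len(self.entities)
            self.entities.append(defaultdict(set))

        for feature, value in observation.items():
            self.entities[entity_id][feature].add(value)
            self.index[(feature, value)].add(entity_id)
        return entity_id

acc = ContextAccumulator()
acc.observe({"name": "J. Smith", "phone": "555-0100"})
acc.observe({"phone": "555-0100", "email": "js@example.com"})   # same entity: shared phone
print(acc.entities)
```

The more observations integrated, the more complete each entity's picture becomes, which is the whole point of the puzzle analogy.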

Read the section General Purpose Context Accumulation very carefully.

As well as “Privacy by Design (PbD).”

BTW, G2 uses Universal Message Format XML for input/output.

Not to argue from authority but Jeff is one of only 77 active IBM Research Fellows.

Someone to listen to, even if we may disagree on some of the finer points.

Making Sense of Others’ Data Structures

Filed under: Data Mining,Data Structures,Identity,Subject Identity — Patrick Durusau @ 6:58 pm

Making Sense of Others’ Data Structures by Eruditio Loginquitas.

From the post:

Coming in as an outsider to others’ research always requires an investment of time and patience. After all, how others conceptualize their fields, and how they structure their questions and their probes, and how they collect information, and then how they represent their data all reflect their understandings, their theoretical and analytical approaches, their professional training, and their interests. When professionals collaborate, they will approach a confluence of understandings and move together in a semi-united way. Individual researchers—not so much. But either way, for an outsider, there will have to be some adjustment to understand the research and data. Professional researchers strive to control for error and noise at every stage of the research: the hypothesis, literature review, design, execution, publishing, and presentation.

Coming into a project after the data has been collected and stored in Excel spreadsheets means that the learning curve is high in yet another way: data structures. While the spreadsheet itself seems pretty constrained and defined, there is no foregone conclusion that people will necessarily represent their data a particular way.

Data structures as subjects. What a concept! 😉

Data structures, contrary to some, are not self-evident or self-documenting.

Not to mention that, like ourselves, data structures are in a constant state of evolution as our understanding or perception of data changes.

Mine is not the counsel of despair, but of encouragement: consider the costs/benefits of capturing data structure subject identities just as you would more traditional subjects.

It may be that costs or other constraints prevent such capture, but you may also miss benefits if you don’t ask.
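A minimal sketch of what "capturing data structure subject identities" might look like in practice: record, for each spreadsheet column, a stable identifier plus the variant headers that have been used for it, so the next governance effort reuses the mapping instead of rediscovering it. The column names and identifiers below are invented for illustration.

```python
# Toy registry of column-level subject identities for a spreadsheet.
# Each column subject gets a stable identifier plus the variant headers seen so far.
column_subjects = {
    "http://example.com/id/patient-age": {"Age", "age_years", "AGE"},
    "http://example.com/id/systolic-bp": {"SBP", "Systolic", "sys_bp_mmHg"},
}

def resolve_column(header):
    """Map a raw spreadsheet header to its subject identifier, if already captured."""
    for identifier, variants in column_subjects.items():
        if header in variants:
            return identifier
    return None   # a new subject: a human decision, captured once, reused afterwards

print(resolve_column("sys_bp_mmHg"))   # -> http://example.com/id/systolic-bp
print(resolve_column("BMI"))           # -> None, needs curation
```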

How much did each transition in episodic data governance efforts cost, just to re-establish data structure subject identities?

Could be that more money spent now would get an enterprise off the perpetual cycle of data governance.

ToxPi GUI [Data Recycling]

Filed under: Bioinformatics,Biomedical,Integration,Medical Informatics,Subject Identity — Patrick Durusau @ 6:57 pm

ToxPi GUI: an interactive visualization tool for transparent integration of data from diverse sources of evidence by David M. Reif, Myroslav Sypa, Eric F. Lock, Fred A. Wright, Ander Wilson, Tommy Cathey, Richard R. Judson and Ivan Rusyn. (Bioinformatics (2013) 29 (3): 402-403. doi: 10.1093/bioinformatics/bts686)

Abstract:

Motivation: Scientists and regulators are often faced with complex decisions, where use of scarce resources must be prioritized using collections of diverse information. The Toxicological Prioritization Index (ToxPi™) was developed to enable integration of multiple sources of evidence on exposure and/or safety, transformed into transparent visual rankings to facilitate decision making. The rankings and associated graphical profiles can be used to prioritize resources in various decision contexts, such as testing chemical toxicity or assessing similarity of predicted compound bioactivity profiles. The amount and types of information available to decision makers are increasing exponentially, while the complex decisions must rely on specialized domain knowledge across multiple criteria of varying importance. Thus, the ToxPi bridges a gap, combining rigorous aggregation of evidence with ease of communication to stakeholders.

Results: An interactive ToxPi graphical user interface (GUI) application has been implemented to allow straightforward decision support across a variety of decision-making contexts in environmental health. The GUI allows users to easily import and recombine data, then analyze, visualize, highlight, export and communicate ToxPi results. It also provides a statistical metric of stability for both individual ToxPi scores and relative prioritized ranks.

Availability: The ToxPi GUI application, complete user manual and example data files are freely available from http://comptox.unc.edu/toxpi.php.

Contact: reif.david@gmail.com

Very cool!

Although, like having a Ford automobile in any color so long as the color is black, you can integrate any data source, so long as the format is CSV, the values are numbers, and other restrictions are met as well.

That’s an observation, not a criticism.

The application serves a purpose within a domain and does not “integrate” information in the sense of a topic map.

But a topic map could recycle its data to add other identifications and properties. Without having to re-write this application or its data.
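A sketch of that recycling idea, assuming a hypothetical ToxPi-style CSV with a chemical name in the first column and numeric slice scores after it: the original file is left untouched, and a separate mapping layers additional identifiers (here, made-up entries) onto each row. Nothing here is the actual ToxPi format; it only illustrates the pattern.

```python
import csv

# Hypothetical layout: first column is a chemical name, remaining columns are scores.
# The extra identifications live outside the CSV, so the ToxPi input needs no rewriting.
extra_identifiers = {
    "Bisphenol A": {"cas": "80-05-7", "subjectIdentifier": "http://example.com/id/bpa"},
}

def recycle(path):
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            name = row[next(iter(row))]          # first column, whatever it is called
            yield {"source_row": row, **extra_identifiers.get(name, {})}

# for record in recycle("toxpi_input.csv"):
#     print(record)
```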

Once curated, data should be re-used, not re-created/curated.

Topic maps give you more bang for your data buck.

December 11, 2012

Music Network Visualization

Filed under: Graphs,Music,Networks,Similarity,Subject Identity,Visualization — Patrick Durusau @ 7:23 pm

Music Network Visualization by Dimiter Toshkov.

From the post:

My music interests have always been rather, hmm…, eclectic. Somehow IDM, ambient, darkwave, triphop, acid jazz, bossa nova, qawali, Mali blues and other more or less obscure genres have managed to happily co-exist in my music collection. The sheer diversity always invited the question whether there is some structure to the collection, or each genre is an island of its own. Sounds like a job for network visualization!

Now, there are plenty of music network viz applications on the web. But they don’t show my collection, and just seem unsatisfactory for various reasons. So I decided to craft my own visualization using R and igraph.

Interesting for the visualization but also the use of similarity measures.

The test for identity of a subject, particularly collective subjects, artists “similar” to X, is as unlimited as your imagination.
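A sketch of how "similar to X" can itself be the identity test for a collective subject: link artists whenever their tag overlap (Jaccard similarity) clears a threshold, then treat each connected component as one collective subject. The artists, tags, and threshold are invented; the original post does this in R with igraph.

```python
from itertools import combinations
import networkx as nx

artists = {
    "Artist A": {"ambient", "idm", "electronic"},
    "Artist B": {"idm", "electronic", "glitch"},
    "Artist C": {"bossa nova", "jazz"},
}

def jaccard(a, b):
    return len(a & b) / len(a | b)

graph = nx.Graph()
graph.add_nodes_from(artists)
for (x, tags_x), (y, tags_y) in combinations(artists.items(), 2):
    if jaccard(tags_x, tags_y) >= 0.4:            # the similarity threshold is the identity test
        graph.add_edge(x, y)

# Each connected component is one "collective subject": artists similar to each other.
print(list(nx.connected_components(graph)))
```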

November 29, 2012

Detecting Communities in Social Graph [Communities of Representatives?]

Filed under: Graphs,Social Graphs,Social Networks,Subject Identity — Patrick Durusau @ 6:49 pm

Detecting Communities in Social Graph by Ricky Ho.

From the post:

In analyzing social network, one common problem is how to detecting communities, such as groups of people who knows or interacting frequently with each other. Community is a subgraph of a graph where the connectivity are unusually dense.

In this blog, I will enumerate some common algorithms on finding communities.

First of all, community detection can be think of graph partitioning problem. In this case, a single node will belong to no more than one community. In other words, community does not overlap with each other.

When you read:

community detection can be think of graph partitioning problem. In this case, a single node will belong to no more than one community.

What does that remind you of?

Does it stand to reason that representatives of the same subject, some with more, some with less information about a subject, would exhibit the same “connectivity” that Ricky calls “unusually dense?”

The TMDM defines a basis for “unusually dense” connectivity but what if we are exploring other representatives of subjects? And trying to detect likely representatives of the same subject?

How would you use graph partitioning to explore such representatives?
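One possible answer, sketched under assumptions: link representatives (records) whenever they share an identifying property, then run an off-the-shelf community detection algorithm; the unusually dense communities become candidates for "same subject." The records and the shared-property rule are invented for illustration.

```python
from itertools import combinations
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Each node is a representative (a record) carrying whatever properties it happens to have.
records = {
    "r1": {"email": "ada@example.com", "name": "A. Lovelace"},
    "r2": {"email": "ada@example.com", "phone": "555-0100"},
    "r3": {"phone": "555-0100", "city": "London"},
    "r4": {"email": "cb@example.com", "name": "C. Babbage"},
    "r5": {"email": "cb@example.com", "phone": "555-0199"},
}

graph = nx.Graph()
graph.add_nodes_from(records)
for (a, props_a), (b, props_b) in combinations(records.items(), 2):
    shared = set(props_a.items()) & set(props_b.items())
    if shared:                                   # shared identifying property -> edge
        graph.add_edge(a, b, weight=len(shared))

# Unusually dense communities are candidate "same subject" clusters of representatives.
for community in greedy_modularity_communities(graph, weight="weight"):
    print(sorted(community))
```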

That could make a fairly interesting research project for anyone wanting to merge diverse intelligence about some subject or person together.

November 27, 2012

SunPy [Choosing Specific Subject Identity Issues]

Filed under: Astroinformatics,Subject Identity,Topic Maps — Patrick Durusau @ 10:57 am

SunPy: A Community Python Library for Solar Physics

From the homepage:

The SunPy project is an effort to create an open-source software library for solar physics using the Python programming language.

As you have seen in your own experience or read about in my other posting on astronomical data, like elsewhere, subject identity issues abound.

This is another area that may spark someone’s interest in using topic maps to mitigate specific subject identity issues.

“Specific subject identity issues” because the act of mitigation always creates more subjects which could be the sources of subject identity issues. It’s not a problem so long as you choose the issues most important to you.

If and when those other potential subject identity issues become relevant, they can be addressed later. The logic approach pretends such issues don’t exist at all. I prefer the former approach; it’s less fragile.

November 19, 2012

Psychological Studies of Policy Reasoning

Filed under: Psychology,Subject Identity,Users — Patrick Durusau @ 7:47 pm

Psychological Studies of Policy Reasoning by Adam Wyner.

From the post:

The New York Times had an article on the difficulties that the public has to understand complex policy proposals – I’m Right (For Some Reason). The points in the article relate directly to the research I’ve been doing at Liverpool on the IMPACT Project, for we decompose a policy proposal into its constituent parts for examination and improved understanding. See our tool live: Structured Consultation Tool

Policy proposals are often presented in an encapsulated form (a sound bite). And those receiving it presume that they understand it, the illusion of explanatory depth discussed in a recent article by Frank Keil (a psychology professor at Cornell when and where I was a Linguistics PhD student). This is the illusion where people believe they understand a complex phenomena with greater precision, coherence, and depth than they actually do; they overestimate their understanding. To philosophers, this is hardly a new phenomena, but showing it experimentally is a new result.

In research about public policy, the NY Times authors, Sloman and Fernbach, describe experiments where people state a position and then had to justify it. The results showed that participants softened their views as a result, for their efforts to justify it highlighted the limits of their understanding. Rather than statements of policy proposals, they suggest:

An approach to get people to state how they would, or would not, distinguish two subjects?

Would it make a difference if the questions were oral or in writing?

Since a topic map is an effort to capture a domain expert’s knowledge, tools to elicit that knowledge are important.

October 31, 2012

MDM: It’s Not about One Version of the Truth

Filed under: Master Data Management,Subject Identity — Patrick Durusau @ 7:38 am

MDM: It’s Not about One Version of the Truth by Michele Goetz.

From the post:

Here is why I am not a fan of the “single source of truth” mantra. A person is not one-dimensional; they can be a parent, a friend, a colleague and each has different motivations and requirements depending on the environment. A product is as much about the physical aspect as it is the pricing, message, and sales channel it is sold through. Or, it is also faceted by the fact that it is put together from various products and parts from partners. In no way is a master entity unique or has a consistency depending on what is important about the entity in a given situation. What MDM provides are definitions and instructions on the right data to use in the right engagement. Context is a key value of MDM.

When organizations have implemented MDM to create a golden record and single source of truth, domain models are extremely rigid and defined only within a single engagement model for a process or reporting. The challenge is the master entity is global in nature when it should have been localized. This model does not allow enough points of relationship to create the dimensions needed to extend beyond the initial scope. If you want to now extend, you need to rebuild your MDM model. This is essentially starting over or you ignore and build a layer of redundancy and introduce more complexity and management.

The line:

The challenge is the master entity is global in nature when it should have been localized.

stopped me cold.

What if I said:

“The challenge is a subject proxy is global in nature when it should have been localized.”

Would your reaction be the same?

Shouldn’t subject identity always be local?

Or perhaps better, have you ever experienced a subject identification that wasn’t local?

We may talk about a universal notion of subject but even so we are using a localized definition of universal subject.

If a subject proxy is a container for local identifications, thought to be identifications of the same subject, need we be concerned if it doesn’t claim to be a universal representative for some subject? Or is it sufficient that it is a faithful representative of one or more identifications, thought by some collector to identify the same subject?

I am leaning towards the latter because it jettisons the doubtful baggage of universality.

That is, a subject may have more than one collection of local identifications (such collections being subject proxies), none of which is the universal representative for that subject.

Even if we think another collection represents the same subject, merging those collections is a question of your requirements.

You may not want to collect Twitter comments in Hindi about Glee.

Your topic map, your requirements, your call.
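A sketch of that "faithful but not universal" reading: a subject proxy is just a container of local identifications, and whether two proxies merge is decided by a requirements predicate you supply, not by any claim to universality. The class and predicate names are mine, not from the TMDM.

```python
class SubjectProxy:
    """A container of local identifications thought (by some collector) to
    identify the same subject. It claims faithfulness, not universality."""

    def __init__(self, *identifications):
        self.identifications = set(identifications)

    def merge(self, other, requirement):
        """Merge only if *your* requirement says these proxies should merge."""
        if requirement(self, other):
            self.identifications |= other.identifications
        return self

# Two local identifications of (perhaps) the same TV show.
glee_en = SubjectProxy(("wikipedia-en", "Glee (TV series)"))
glee_hi = SubjectProxy(("twitter-hi", "ग्ली"))

# One map's requirement: only merge identifications from English-language sources.
english_only = lambda a, b: all(src.endswith("-en")
                                for src, _ in a.identifications | b.identifications)

glee_en.merge(glee_hi, english_only)      # no merge: your map, your requirements
print(glee_en.identifications)
```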

PS: You need to read Michele’s original post to discover what could entice management to fund an MDM project. Interoperability of data isn’t it.

September 29, 2012

Visual Clues: A Brain “feature,” not a “bug”

You will read in When Your Eyes Tell Your Hands What to Think: You’re Far Less in Control of Your Brain Than You Think that:

You’ve probably never given much thought to the fact that picking up your cup of morning coffee presents your brain with a set of complex decisions. You need to decide how to aim your hand, grasp the handle and raise the cup to your mouth, all without spilling the contents on your lap.

A new Northwestern University study shows that, not only does your brain handle such complex decisions for you, it also hides information from you about how those decisions are made.

“Our study gives a salient example,” said Yangqing ‘Lucie’ Xu, lead author of the study and a doctoral candidate in psychology at Northwestern. “When you pick up an object, your brain automatically decides how to control your muscles based on what your eyes provide about the object’s shape. When you pick up a mug by the handle with your right hand, you need to add a clockwise twist to your grip to compensate for the extra weight that you see on the left side of the mug.

“We showed that the use of this visual information is so powerful and automatic that we cannot turn it off. When people see an object weighted in one direction, they actually can’t help but ‘feel’ the weight in that direction, even when they know that we’re tricking them,” Xu said. (emphasis added)

I never quite trusted my brain and now I have proof that it is untrustworthy. Hiding stuff indeed! 😉

But that’s the trick of subject identification/identity isn’t it?

That our brains “recognize” all manner of subjects without any effort on our part.

Another part of the effortless features of our brains. But it hides the information we need to integrate information stores from ourselves and others.

Or rather, making it more work than we are usually willing to devote to digging it out.

When called upon to be “explicit” about subject identification, or even worse, to imagine how other people identify subjects, we prefer to stay at home consuming passive entertainment.

Two quick points:

First, we need to think about how to incorporate this “feature” into delivery interfaces for users.

Second, what subjects would users pay others to mine/collate/identify for them? (Delivery being a separate issue.)

September 12, 2012

Pushing Parallel Barriers Skyward (Subject Identity at 1EB/year)

Filed under: Astroinformatics,BigData,Subject Identity — Patrick Durusau @ 5:50 pm

Pushing Parallel Barriers Skyward by Ian Armas Foster

From the post:

As much data as there exists on the planet Earth, the stars and the planets that surround them contain astronomically more. As we discussed earlier, Peter Nugent and the Palomar Transient Factory are using a form of parallel processing to identify astronomical phenomena.

Some researchers believe that parallel processing will not be enough to meet the huge data requirements of future massive-scale astronomical surveys. Specifically, several researchers from the Korea Institute of Science and Technology Information including Jaegyoon Hahm along with Yongsei University’s Yong-Ik Byun and the University of Michigan’s Min-Su Shin wrote a paper indicating that the future of astronomical big data research is brighter with cloud computing than parallel processing.

Parallel processing is holding its own at the moment. However, when these sky-mapping and phenomena-chasing projects grow significantly more ambitious by the year 2020, parallel processing will have no hope.

How ambitious are these future projects? According to the paper, the Large Synoptic Survey Telescope (LSST) will generate 75 petabytes of raw plus catalogued data for its ten years of operation, or about 20 terabytes a night. That pales in comparison to the Square Kilometer Array, which is projected to archive in one year 250 times the amount of information that exists on the planet today.

“The total data volume after processing (the LSST) will be several hundred PB, processed using 150 TFlops of computing power. Square Kilometer Array (SKA), which will be the largest in the world radio telescope in 2020, is projected to generate 10-100PB raw data per hour and archive data up to 1EB every year.”

Beyond storage/processing requirements, how do you deal with subject identity at 1EB/year?

Changing subject identity that is.

People are as inconstant with subject identity as they are with marital fidelity. If they do that well.

Now spread that over decades or centuries of research.

Does anyone see a problem here?

September 8, 2012

“how hard can this be?” (Data and Reality)

Filed under: Design,Modeling,Subject Identity — Patrick Durusau @ 2:07 pm

Books that Influenced my Thinking: Kent’s Data and Reality by Thomas Redman.

From the post:

It was the rumor that Steve Hoberman (Technics Publications) planned to reissue Data and Reality by William Kent that led me to use this space to review books that had influenced my thinking about data and data quality. My plan had been to do the review of Data and Reality as soon as it came out. I completely missed the boat – it has been out for some six months.

I first read Data and Reality as we struggled at Bell Labs to develop a definition of data that would prove useful for data quality. While I knew philosophers had debated the merits of various approaches for thousands of years, I still thought “how hard can this be?” About twenty minutes with Kent’s book convinced me. This is really tough.
….

Amazon reports Data and Reality (3rd edition) as 200 pages long.

Looking at a hard copy I see:

  • Prefaces 17-34
  • Chapter 1 Entities 35-54
  • Chapter 2 The Nature of an Information System 55-67
  • Chapter 3 Naming 69-86
  • Chapter 4 Relationships 87-98
  • Chapter 5 Attributes 99-107
  • Chapter 6 Types and Categories and Sets 109-117
  • Chapter 7 Models 119-123
  • Chapter 8 The Record Model 125-137
  • Chapter 9 Philosophy 139-150
  • Bibliography 151-159
  • Index 161-162

Way less than the 200 pages promised by Amazon.

To ask a slightly different question:

“How hard can it be” to teach building data models?

A hard problem with no fixed solution?

Suggestions?

September 6, 2012

A dynamic data structure for counting subgraphs in sparse graphs

Filed under: Graphs,Networks,Subject Identity — Patrick Durusau @ 4:35 pm

A dynamic data structure for counting subgraphs in sparse graphs by Zdenek Dvorak and Vojtech Tuma.

Abstract:

We present a dynamic data structure representing a graph G, which allows addition and removal of edges from G and can determine the number of appearances of a graph of a bounded size as an induced subgraph of G. The queries are answered in constant time. When the data structure is used to represent graphs from a class with bounded expansion (which includes planar graphs and more generally all proper classes closed on topological minors, as well as many other natural classes of graphs with bounded average degree), the amortized time complexity of updates is polylogarithmic.

Work on data structures seems particularly appropriate when discussing graphs.

Subject identity, beyond string equivalence, can be seen as a graph isomorphism or subgraph problem.

Has anyone proposed “bounded” subject identity mechanisms that correspond to the bounds necessary on graphs to make them processable?

We know how to do string equivalence and the “ideal” solution would be unlimited relationships to other subjects, but that is known to be intractable. For one thing we don’t know every relationship for any subject.

Thinking there may be boundary conditions for constructing subject identities that are more complex than string equivalence but that result in tractable identifications.
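A sketch of one possible "bound": instead of unlimited relationships, the identity test examines only a fixed, small set of properties and a capped one-hop neighborhood, which keeps the comparison tractable at the cost of ignoring everything outside the bound. The bound chosen here (two properties, at most five neighbors) and the example records are arbitrary, for illustration only.

```python
def bounded_identity_test(a, b, key_properties=("name", "birth_date"), max_neighbors=5):
    """Tractable identity test: compare a bounded set of properties and a
    bounded one-hop neighborhood instead of the full (intractable) relationship graph."""
    # Bound 1: only the chosen key properties must agree.
    if any(a["props"].get(k) != b["props"].get(k) for k in key_properties):
        return False
    # Bound 2: at least one shared neighbor among the first few relationships.
    neighbors_a = set(a["relations"][:max_neighbors])
    neighbors_b = set(b["relations"][:max_neighbors])
    return bool(neighbors_a & neighbors_b)

a = {"props": {"name": "J. Doe", "birth_date": "1970"}, "relations": ["ACME", "Initech"]}
b = {"props": {"name": "J. Doe", "birth_date": "1970"}, "relations": ["Initech", "Globex"]}
print(bounded_identity_test(a, b))   # True under this bound; a wider bound might disagree
```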

Suggestions?

August 4, 2012

Genetic algorithms: a simple R example

Filed under: Genetic Algorithms,Merging,Subject Identity — Patrick Durusau @ 6:49 pm

Genetic algorithms: a simple R example by Bart Smeets.

From the post:

Genetic algorithm is a search heuristic. GAs can generate a vast number of possible model solutions and use these to evolve towards an approximation of the best solution of the model. Hereby it mimics evolution in nature.

GA generates a population, the individuals in this population (often called chromosomes) have a given state. Once the population is generated, the state of these individuals is evaluated and graded on their value. The best individuals are then taken and crossed-over – in order to hopefully generate ‘better’ offspring – to form the new population. In some cases the best individuals in the population are preserved in order to guarantee ‘good individuals’ in the new generation (this is called elitism).

The GA site by Marek Obitko has a great tutorial for people with no previous knowledge on the subject.

As the size of data stores increases, the cost of personal judgement on each subject identity test will as well. Genetic algorithms may be one way of creating subject identity tests in such situations.
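A sketch of how a genetic algorithm could evolve a subject identity test rather than a human hand-writing one: each chromosome is a set of field weights plus a threshold, fitness is agreement with a small hand-labelled sample of matching/non-matching record pairs, and the usual select/crossover/mutate loop does the rest. Everything here (fields, sample format, GA parameters) is invented for illustration; the linked post works in R.

```python
import random

FIELDS = ("name", "email", "city")

def similarity(pair, weights):
    """Weighted score over simple per-field equality tests."""
    a, b = pair
    return sum(w * (a[f] == b[f]) for f, w in zip(FIELDS, weights))

def fitness(chromosome, labelled_pairs):
    *weights, threshold = chromosome
    return sum((similarity(p, weights) >= threshold) == label
               for p, label in labelled_pairs)

def evolve(labelled_pairs, pop_size=30, generations=50):
    population = [[random.random() for _ in range(len(FIELDS) + 1)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda c: fitness(c, labelled_pairs), reverse=True)
        parents = population[: pop_size // 2]                 # selection (with elitism)
        children = []
        while len(children) < pop_size - len(parents):
            mom, dad = random.sample(parents, 2)
            cut = random.randrange(1, len(mom))
            child = mom[:cut] + dad[cut:]                     # crossover
            child[random.randrange(len(child))] += random.gauss(0, 0.1)  # mutation
            children.append(child)
        population = parents + children
    return population[0]   # best field weights + threshold: an evolved identity test

# labelled_pairs = [((rec1, rec2), True), ((rec1, rec3), False), ...]
# print(evolve(labelled_pairs))
```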

In any event, it won’t harm anyone to be aware of the basic contours of the technique.

I first saw this at R-Bloggers.

July 20, 2012

Optimal simultaneous superpositioning of multiple structures with missing data

Filed under: Alignment,Bioinformatics,Multidimensional,Subject Identity,Superpositioning — Patrick Durusau @ 3:55 pm

Optimal simultaneous superpositioning of multiple structures with missing data by Douglas L. Theobald and Phillip A. Steindel. (Bioinformatics (2012) 28: 1972-1979.)

Abstract:

Motivation: Superpositioning is an essential technique in structural biology that facilitates the comparison and analysis of conformational differences among topologically similar structures. Performing a superposition requires a one-to-one correspondence, or alignment, of the point sets in the different structures. However, in practice, some points are usually ‘missing’ from several structures, for example, when the alignment contains gaps. Current superposition methods deal with missing data simply by superpositioning a subset of points that are shared among all the structures. This practice is inefficient, as it ignores important data, and it fails to satisfy the common least-squares criterion. In the extreme, disregarding missing positions prohibits the calculation of a superposition altogether.

Results: Here, we present a general solution for determining an optimal superposition when some of the data are missing. We use the expectation–maximization algorithm, a classic statistical technique for dealing with incomplete data, to find both maximum-likelihood solutions and the optimal least-squares solution as a special case.

Availability and implementation: The methods presented here are implemented in THESEUS 2.0, a program for superpositioning macromolecular structures. ANSI C source code and selected compiled binaries for various computing platforms are freely available under the GNU open source license from http://www.theseus3d.org.

Contact: dtheobald@brandeis.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

From the introduction:

How should we properly compare and contrast the 3D conformations of similar structures? This fundamental problem in structural biology is commonly addressed by performing a superposition, which removes arbitrary differences in translation and rotation so that a set of structures is oriented in a common reference frame (Flower, 1999). For instance, the conventional solution to the superpositioning problem uses the least-squares optimality criterion, which orients the structures in space so as to minimize the sum of the squared distances between all corresponding points in the different structures. Superpositioning problems, also known as Procrustes problems, arise frequently in many scientific fields, including anthropology, archaeology, astronomy, computer vision, economics, evolutionary biology, geology, image analysis, medicine, morphometrics, paleontology, psychology and molecular biology (Dryden and Mardia, 1998; Gower and Dijksterhuis, 2004; Lele and Richtsmeier, 2001). A particular case we consider here is the superpositioning of multiple 3D macromolecular coordinate sets, where the points to be superpositioned correspond to atoms. Although our analysis specifically concerns the conformations of macromolecules, the methods developed herein are generally applicable to any entity that can be represented as a set of Cartesian points in a multidimensional space, whether the particular structures under study are proteins, skulls, MRI scans or geological strata.

We draw an important distinction here between a structural ‘alignment’ and a ‘superposition.’ An alignment is a discrete mapping between the residues of two or more structures. One of the most common ways to represent an alignment is using the familiar row and column matrix format of sequence alignments using the single letter abbreviations for residues (Fig. 1). An alignment may be based on sequence information or on structural information (or on both). A superposition, on the other hand, is a particular orientation of structures in 3D space. [emphasis added]

I have deep reservations about the representations of semantics using Cartesian metrics but in fact that happens quite frequently. And allegedly, usefully.

Leaving my doubts to one side, this superpositioning technique could prove to be a useful exploration technique.
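For readers who want to see the core of a superposition without the missing-data machinery, here is a minimal least-squares (Kabsch/Procrustes) sketch for two complete point sets; the paper's contribution is handling missing points with expectation-maximization, which this sketch does not attempt.

```python
import numpy as np

def superpose(P, Q):
    """Least-squares superposition of point set P onto Q (both n x 3, complete data).
    Returns the rotated-and-translated copy of P and its RMSD to Q."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)            # remove translation
    H = (P - cP).T @ (Q - cQ)                          # covariance of the centered sets
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))             # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T            # optimal rotation (Kabsch)
    P_fit = (P - cP) @ R.T + cQ
    return P_fit, np.sqrt(((P_fit - Q) ** 2).sum(axis=1).mean())

P = [[0, 0, 0], [1, 0, 0], [0, 1, 0]]
Q = [[1, 1, 0], [1, 2, 0], [0, 1, 0]]                  # P rotated 90 degrees and shifted
print(superpose(P, Q)[1])                              # RMSD ~ 0
```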

If you experiment with this technique, a report of your experiences would be appreciated.

June 23, 2012

Elements of Software Construction [MIT 6.005]

Filed under: Software,Subject Identity,Topic Maps — Patrick Durusau @ 6:59 pm

Elements of Software Construction

Description:

This course introduces fundamental principles and techniques of software development. Students learn how to write software that is safe from bugs, easy to understand, and ready for change.

Topics include specifications and invariants; testing, test-case generation, and coverage; state machines; abstract data types and representation independence; design patterns for object-oriented programming; concurrent programming, including message passing and shared concurrency, and defending against races and deadlock; and functional programming with immutable data and higher-order functions.

From the MIT OpenCourseware site.

Of interest to anyone writing topic map software.

It should also be of interest to anyone evaluating how software shapes what subjects we can talk about and how we can talk about them. Data structures have the same implications.

Not necessary to undertake such investigations in all cases. There are many routine uses for common topic map software.

Being able to see when the edges of a domain don’t quite fit, or where there may be gaps in coverage for an information system, is a necessary skill for non-routine cases.

May 6, 2012

Why Your Brain Isn’t A Computer

Filed under: Artificial Intelligence,Semantics,Subject Identity — Patrick Durusau @ 7:45 pm

Why Your Brain Isn’t A Computer by Alex Knapp.

Alex writes:

“If the human brain were so simple that we could understand it, we would be so simple that we couldn’t.”
– Emerson M. Pugh

Earlier this week, i09 featured a primer, of sorts, by George Dvorsky regarding how an artificial human brain could be built. It’s worth reading, because it provides a nice overview of the philosophy that underlies some artificial intelligence research, while simultaneously – albeit unwittingly – demonstrating the some of the fundamental flaws underlying artificial intelligence research based on the computational theory of mind.

The computational theory of mind, in essence, says that your brain works like a computer. That is, it takes input from the outside world, then performs algorithms to produce output in the form of mental state or action. In other words, it claims that the brain is an information processor where your mind is “software” that runs on the “hardware” of the brain.

Dvorsky explicitly invokes the computational theory of mind by stating “if brain activity is regarded as a function that is physically computed by brains, then it should be possible to compute it on a Turing machine, namely a computer.” He then sets up a false dichotomy by stating that “if you believe that there’s something mystical or vital about human cognition you’re probably not going to put too much credence” into the methods of developing artificial brains that he describes.

I don’t normally read Forbes but I made an exception in this case and am glad I did.

Not that I particularly care about which side of the AI debate you come out on.

I do think that the notion of “emergent” properties is an important one for judging subject identities. Whether those subjects occur in text messages, intercepted phone calls, signal “intell” of any sort.

Properties that identify subjects “emerge” from a person who speaks the language in question, who has social/intellectual/cultural experiences that give them a grasp of the matters under discussion and perhaps the underlying intent of the parties to the conversation.

A computer program can be trained to mindlessly sort through large amounts of data. It can even be trained to acceptable levels of mis-reading, mis-interpretation.

What will our evaluation be when it misses the one conversation prior to another 9/11? Because the context or language was not anticipated? Because the connection would only emerge out of a living understanding of cultural context?

Computers are deeply useful, but not when emergent properties, emergent properties of the sort that identify subjects, targets and the like are at issue.

April 10, 2012

A new framework for innovation in journalism: How a computer scientist would do it

Filed under: Journalism,News,Subject Identity — Patrick Durusau @ 6:40 pm

A new framework for innovation in journalism: How a computer scientist would do it

Andrew Phelps writes:

What if journalism were invented today? How would a computer scientist go about building it, improving it, iterating it?

He might start by mapping out some fundamental questions: What are the project’s values and goals? What consumer needs would it satisfy? How much should be automated, how much human-powered? How could it be designed to be as efficient as possible?

Computer science Ph.D. Nick Diakopoulos has attempted to create a new framework for innovation in journalism. His new white paper, commissioned by CUNY’s Tow-Knight Center for Entrepreneurial Journalism, does not provide answers so much as a different way to come up with questions.

Diakopolous identified 27 computing concepts that could apply to journalism — think natural language processing, machine learning, game engines, virtual reality, information visualization — and pored over thousands of research papers to determine which topics get the most (and least) attention. (There are untapped opportunities in robotics, augmented reality, and motion capture, it turns out.)

He thinks computer science and journalism have a lot in common, actually. They are both fundamentally concerned with information. Acquiring it, storing it, modifying it, presenting it.

Suggest you read his paper in full: Cultivating the Landscape of Innovation in Computational Journalism.

Intrigued by the idea of gauging the opportunities along a continuum of activities. Could be a stunning visual of how subject identity is handled across activities and/or technologies.

Interested?

March 28, 2012

Once Upon A Subject Clearly…

Filed under: Identity,Marketing,Subject Identity — Patrick Durusau @ 4:22 pm

As I was writing up the GWAS Central post, the question occurred to me: does their mapping of identifiers take something away from topic maps?

My answer is no and I would like to say why if you have a couple of minutes. 😉 Seriously! It isn’t going to take that long. However long it has taken me to reach this point.

Every time we talk, write or otherwise communicate about a subject, we at the same time have identified that subject. Makes sense. We want whoever we are talking, writing to or communicating with, to understand what we are talking about. Hard to do if we don’t identify what subject(s) we are talking about.

We do it all day, every day. In public, in private, in semi-public places. 😉 And we use words to do it. To identify the subjects we are talking about.

For the most part, or at least fairly often, we are understood by other people. Not always, but most of the time.

The problem comes in when we start to gather up information from different people who may (or may not) use words differently than we do. So there is a much larger chance that we don’t mean the same thing by the same words. Or we may use different words to mean the same thing.

Words, which were our reliable servants for the most part, become far less reliable.

To counter that unreliability, we can create groups of words, mappings if you like, to keep track of what words go where. But, to do that, we have to use words, again.

Start to see the problem? We always use words, to clear up our difficulties with words. And there isn’t any universal stopping place. The Cyc advocates would have us stop there and the SUMO crowd would have us stop over there and the Semantic Web folks yet somewhere else and of course the topic map mavens, yet one or more places.

For some purposes, any one or more of those mappings may be adequate. A mapping is only as good and for as long as it is useful.

History tells us that every mapping will be replaced with other mappings. We would do well to understand/document the words we are using as part of our mappings, as well as we are able.

But if words are used to map words, where do we stop? My suggestion would be to stop as we always have, wherever looks convenient. So long as the mapping suits your present purposes, what more would you ask of it?

I am quite content to have such stopping places because it means we will always have more starting places for the next round of mapping!

Ironic isn’t it? We create mappings to make sense out of words and our words lay the foundation for others to do the same.

March 15, 2012

Data and Reality

Data and Reality: A Timeless Perspective on Data Management by Steve Hoberman.

I remember William Kent, the original author of “Data and Reality” from a presentation he made in 2003, entitled: “The unsolvable identity problem.”

His abstract there read:

The identity problem is intractable. To shed light on the problem, which currently is a swirl of interlocking problems that tend to get tumbled together in any discussion, we separate out the various issues so they can be rationally addressed one at a time as much as possible. We explore various aspects of the problem, pick one aspect to focus on, pose an idealized theoretical solution, and then explore the factors rendering this solution impractical. The success of this endeavor depends on our agreement that the selected aspect is a good one to focus on, and that the idealized solution represents a desirable target to try to approximate as well as we can. If we achieve consensus here, then we at least have a unifying framework for coordinating the various partial solutions to fragments of the problem.

I haven’t read the “new” version of “Data and Reality” (just ordered a copy) but I don’t recall the original needing much in the way of changes.

The original carried much the same message, that all of our solutions are partial even within a domain, temporary, chronologically speaking, and at best “useful” for some particular purpose. I rather doubt you will find that degree of uncertainty being confessed by the purveyors of any current semantic solution.

I did pull my second edition off the shelf and with free shipping (5-8 days), I should have time to go over my notes and highlights before the “new” version appears.

More to follow.

March 13, 2012

Then BI and Data Science Thinking Are Flawed, Too

Filed under: Identification,Identifiers,Marketing,Subject Identifiers,Subject Identity — Patrick Durusau @ 8:15 pm

Then BI and Data Science Thinking Are Flawed, Too

Steve Miller writes:

I just finished an informative read entitled “Everything is Obvious: *Once You Know the Answer – How Common Sense Fails Us,” by social scientist Duncan Watts.

Regular readers of Open Thoughts on Analytics won’t be surprised I found a book with a title like this noteworthy. I’ve written quite a bit over the years on challenges we face trying to be the rational, objective, non-biased actors and decision-makers we think we are.

So why is a book outlining the weaknesses of day-to-day, common sense thinking important for business intelligence and data science? Because both BI and DS are driven from a science of business framework that formulates and tests hypotheses on the causes and effects of business operations. If the thinking that produces that testable understanding is flawed, then so will be the resulting BI and DS.

According to Watts, common sense is “exquisitely adapted to handling the kind of complexity that arises in everyday situations … But ‘situations’ involving corporations, cultures, markets, nation-states, and global institutions exhibit a very different kind of complexity from everyday situations. And under these circumstances, common sense turns out to suffer from a number of errors that systematically mislead us. Yet because of the way we learn from experience … the failings of commonsense reasoning are rarely apparent to us … The paradox of common sense, therefore, is that even as it helps us make sense of the world, it can actively undermine our ability to understand it.”

The author argues that common sense explanations to complex behavior fail in three ways. The first error is that the mental model of individual behavior is systematically flawed. The second centers on explanations for collective behavior that are even worse, often missing the “emergence” – one plus one equals three – of social behavior. And finally, “we learn less from history than we think we do, and that misperception skews our perception of the future.”

Reminds me of Thinking, Fast and Slow by Daniel Kahneman.

Not that two books with a similar “take” proves anything but you should put them on your reading list.

I wonder when/where our perceptions of CS practices have been skewed?

Or where that has played a role in our decision making about information systems?

February 19, 2012

Identity – The Philosophical Challenge For the Web

Filed under: Identity,Subject Identifiers,Subject Identity — Patrick Durusau @ 8:35 pm

Identity – The Philosophical Challenge For the Web by Matthew Hurst.

From the post:

I work in local search at Microsoft which means, like all those working in this space, I have to deal with an identity crisis on a daily basis. Currently, most local search products – like Bing’s and Google’s – leverage multiple data sets to derive a digital model of the world that users can then interact with. In creating this digital model, multiple statements have to be conflated to form a unified representation. This can be extremely challenging for two reasons. Firstly, the system has to decided when two records are intended to denote the same real world entity. Secondly, the designers of the system have to determine what real world entities are and how to describe them.

For example, if a business moves is that the same business or the closure of one and the opening of another? What does it mean to categorize a business? The cafe in Barnes and Noble is branded Starbucks but isn’t actually part of the Starbucks chain – should is surface as a separate entity or is it ‘hidden’ within the bookshop as an attribute (‘has cafe’)?

Thinking through these hard representational problems is as much part of the transformative trends going on in the tech industry as are those characterized by terms like ‘big data’ and ‘data scientist’.

Questions of identity and how to resolve different multiple references to the same entity have been debated at least since the time of Greek philosophers. Identity (Wikipedia page, see references on the various pages.)

This “philosophical challenge” has been going on for a very long time and so far I haven’t seen any demonstrations that the Web raises new questions.

You need to read Matthew’s identity example in his post.

The songs in question could be said to be instances of the same subject and a reference to that subject would be satisfied with any of those instances. From another point of view, the origin of the instances could be said to distinguish them into different subjects, say for proof of licensing purposes. Other view points are possible. Depends upon the purpose of your criteria of identification.
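A small sketch of the point, with invented song instances: the same set of instances partitions one way under a "same recording" criterion and another way under a "same licensed source" criterion, and neither partition is wrong.

```python
from collections import defaultdict

# Invented instances of "the same song" obtained from different places.
instances = [
    {"file": "track1.mp3", "recording_id": "R1", "source": "ripped-cd"},
    {"file": "track2.mp3", "recording_id": "R1", "source": "store-a"},
    {"file": "track3.mp3", "recording_id": "R1", "source": "store-b"},
]

def partition(instances, criterion):
    groups = defaultdict(list)
    for inst in instances:
        groups[criterion(inst)].append(inst["file"])
    return dict(groups)

print(partition(instances, lambda i: i["recording_id"]))   # one subject: same recording
print(partition(instances, lambda i: i["source"]))         # three subjects: licensing view
```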

January 7, 2012

Fractals in Science, Engineering and Finance (Roughness and Beauty)

Filed under: Fractals,Roughness,Subject Identity — Patrick Durusau @ 3:57 pm

Fractals in Science, Engineering and Finance (Roughness and Beauty) by Benoit B. Mandelbrot.

About the lecture:

Roughness is ubiquitous and a major sensory input of Man. The first step to measure and simulate it was provided by fractal geometry. Illustrative examples will be drawn from the sciences, engineering (the internet) and (more extensively) the variation of financial prices. The beauty of fractals, an unanticipated “premium,” helps in teaching and bridges some chasms between different aspects of knowing and feeling.

Mandelbrot summarizes his career as the pursuit of a theory of roughness.

Discusses the use of the eye as well as the ear in discovery (which I would call identification) of phenomena.

Have you listened to one of your subject identifications lately?

Are subject identifications rough? Or are they the smoothing of roughness?

Do your subjects have self-similarity?

Definitely worth your time.

First seen at: Benoît B. Mandelbrot: Fractals in Science, Engineering and Finance (Roughness and Beauty) over at Computational Legal Studies.

December 27, 2011

Thinking, Fast and Slow

Thinking, Fast and Slow by Daniel Kahneman, Farrar, Straus and Giroux, New York, 2011.

I got a copy of “Thinking, Fast and Slow” for Christmas and it has already proven to be an enjoyable read.

Kahneman says early on (page 28):

The premise of this book is that it is easier to recognize other people’s mistakes than our own.

I thought about that line when I read a note from a friend that topic maps needed more than my:

tagging everything with “Topic Maps….”

Which means I haven’t been clear about the reasons for the breadth of materials I have and will be covering in this blog.

One premise of this blog is that the use and recognition of identifiers is essential for communication.

Another premise of this blog is that it is easier for us to study the use and recognition of identifiers by others, much for the same reasons we can recognize the mistakes of others more easily.

The use and recognition of identifiers by others aren’t mistakes but they may be different from those we would make. In cases where they differ from ours, we have a unique opportunity to study the choices made and the impacts of those choices. And we may learn patterns in those choices that we can eventually see in our own choices.

Understanding the use and recognition of identifiers in a particular circumstance and the requirements for the use and recognition of identifiers, is the first step towards deciding whether topic maps would be useful in some circumstance and in what way?

For example, when processing social security records in the United States, anything other than "bare" identifiers like a social security number may be unnecessary and add load with no corresponding benefit. Aligning social security records with bank records might require reconsidering the judgement to use only social security numbers. (Some information sharing is "against the law." But as the Sheriff in "O Brother, Where Art Thou?" says: "The law is a man made thing." Laws change, or you can commission absurdist interpretations of them.)
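To make that concrete, a tiny sketch with invented records: within a single system the bare identifier may be the whole test, while aligning across sources might demand more. The field names and records are made up.

```python
# Within a single social security system, the bare identifier may be enough...
def same_person_internal(a, b):
    return a["ssn"] == b["ssn"]

# ...but when aligning with bank records, a bare SSN match may be too weak a test.
def same_person_cross_source(ssa_record, bank_record):
    return (ssa_record["ssn"] == bank_record["ssn"]
            and ssa_record["last_name"].lower() == bank_record["last_name"].lower())

ssa = {"ssn": "123-45-6789", "last_name": "Doe"}
bank = {"ssn": "123-45-6789", "last_name": "DOE"}
print(same_person_internal(ssa, bank), same_person_cross_source(ssa, bank))
```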

Topic maps aren’t everywhere but identifiers and recognition of identifiers are.

Understanding identifiers and their recognition will help you choose the most appropriate solution to a problem.

October 28, 2011

Factual Resolve

Factual Resolve

Factual has a new API – Resolve:

From the post:

The Internet is awash with data. Where ten years ago developers had difficulty finding data to power applications, today’s difficulty lies in making sense of its abundance, identifying signal amidst the noise, and understanding its contextual relevance. To address these problems Factual is today launching Resolve — an entity resolution API that makes partial records complete, matches one entity against another, and assists in de-duping and normalizing datasets.

The idea behind Resolve is very straightforward: you tell us what you know about an entity, and we, in turn, tell you everything we know about it. Because data is so commonly fractured and heterogeneous, we accept fragments of an entity and return the matching entity in its entirety. Resolve allows you to do a number of things that will make your data engineering tasks easier:

  • enrich records by populating missing attributes, including category, lat/long, and address
  • de-dupe your own place database
  • convert multiple daily deal and coupon feeds into a single normalized, georeferenced feed
  • identify entities unequivocally by their attributes

For example: you may be integrating data from an app that provides only the name of a place and an imprecise location. Pass what you know to Factual Resolve via a GET request, with the attributes included as JSON-encoded key/value pairs:

I particularly like the line:

identify entities unequivocally by their attributes

I don’t know about the “unequivocally” part but the rest of it rings true. At least in my experience.
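The post's description, send the fragment you know and get the full entity back, translates roughly into a call like the sketch below. The endpoint, parameter names, and response shape are placeholders standing in for the quoted description, not Factual's actual API.

```python
import json
import urllib.parse
import urllib.request

def resolve(fragment, api_key):
    """Send a partial entity as JSON-encoded key/value pairs via GET and return
    candidate matches. Endpoint and parameter names are illustrative only."""
    query = urllib.parse.urlencode({
        "values": json.dumps(fragment),   # e.g. {"name": "Buster's", "latitude": 34.06}
        "KEY": api_key,
    })
    url = f"https://api.example.com/places/resolve?{query}"
    with urllib.request.urlopen(url) as response:
        return json.load(response)

# resolve({"name": "Buster's", "latitude": 34.06, "longitude": -118.40}, "YOUR_KEY")
```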

October 25, 2011

a speed gun for spam

Filed under: Subject Identifiers,Subject Identity — Patrick Durusau @ 7:35 pm

a speed gun for spam

From the post:

Apart from the content there are various features from metadata (like IP etc) which can help tell a spammer and regular user apart. Following are results of some data analysis (done on roughly 8000+ comments) which speak of another feature which proves to be a good discriminator. Hopefully this will aid others fighting spam/abuse (if not already using a similar feature).

(graph omitted)

The discriminator referred above is typing speed. The graph above plots the content length of a comment posted by a user against the (approximate) time he took to write it. If a user posts more than one comment in window of 5-10 minutes, we can consider those comments as consecutive posts. …

An illustration that subject identity tests are limited only by your imagination. From what I understand, very few spammers self-identify using OWL and URLs. So, as in this case, you need other tests to separate them.

A follow-up on this would be to see whether particular spammers have speed patterns in their posts or, looking more broadly across a set of blogs, a posting pattern: that is, they start with blog X and then move down the line. That could be useful for dynamically configuring firewalls to block further content after they hit the first blog.
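A sketch of the discriminator described above: approximate typing speed as the length of a comment divided by the gap since the same user's previous comment inside a short window, then flag implausibly fast posters. The threshold and the data are invented.

```python
def typing_speeds(comments, window=600):
    """comments: list of (user, timestamp_seconds, text), assumed sorted by time.
    Yields (user, chars_per_second) for consecutive comments by the same user
    posted within `window` seconds of each other."""
    last_seen = {}
    for user, ts, text in comments:
        if user in last_seen and 0 < ts - last_seen[user] <= window:
            yield user, len(text) / (ts - last_seen[user])
        last_seen[user] = ts

comments = [
    ("alice", 0,   "Great post, thanks!"),
    ("spam01", 10, "Buy cheap meds online " * 20),
    ("spam01", 12, "Buy cheap meds online " * 20),    # ~440 chars in 2 seconds
    ("alice", 300, "One more thought on sensemaking..."),
]

for user, speed in typing_speeds(comments):
    if speed > 15:                       # implausible sustained chars/sec for a human
        print(f"{user}: {speed:.0f} chars/sec looks like a bot")
```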

You have heard that passwords + keying patterns are used for personal identity?

October 11, 2011

Free Programming Books

Filed under: Language,Programming,Recognition,Subject Identity — Patrick Durusau @ 5:54 pm

Free Programming Books

Despite the title (the author’s update only went so far), there are more than 50 books listed here.

I wouldn’t have tweeted this because, like Lucene turning ten, everyone in the world has already tweeted or retweeted the news of these books.

I seem to be on a run of mostly programming resources today and I thought you might find the list interesting, possibly useful.

Especially those of you interested in pattern matching.

It occurs to me that programming languages and books about programming languages are fit fodder for the same tools used on other texts.

I am sure there probably exists an index with all the “hello, world” examples from various computer languages, but are there more/deeper similarities than the universal initial example?

There was a universe of programming languages prior to “hello, world” and there is certainly a very large one beyond those listed here but one has to start somewhere. So why not with this set?

I think the first question I would ask is the obvious one: Are there groupings of these works, other than the ones noted? What measures would you use and why? What results do you get?

I suppose before that you need to gather up the texts and do whatever cleanup/conversion is required, perhaps a short note on what you did there would be useful.
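One concrete way to attack the first question, sketched with an invented directory of already-converted plain-text books: vectorize the texts with TF-IDF and cluster them, then compare the clusters to the groupings the list already uses. scikit-learn is one choice among many; nothing here depends on it specifically.

```python
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Assumes the books have already been converted to plain text, one file each.
paths = sorted(Path("books_plaintext").glob("*.txt"))
texts = [p.read_text(errors="ignore") for p in paths]

vectors = TfidfVectorizer(stop_words="english", max_features=5000).fit_transform(texts)
labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(vectors)

for cluster in range(6):
    members = [p.name for p, label in zip(paths, labels) if label == cluster]
    print(cluster, members)
```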

What parts were in anticipation of your methods for grouping the texts?

Patience topic map fans, we are getting to the questions of subject recognition.

So, what subjects should we recognize across programming languages? Not tokens or even signatures but subjects. Signatures may be a way of identifying subjects, but can the same subjects have different signatures in distinct languages?

Would recognition of subjects across programming languages assist in learning languages? In developing new languages (what is commonly needed)? In studying the evolution of languages (where did we go right/wrong)? In usefully indexing CS literature? And so on.

And you thought this was a post about “free” programming books. 😉

October 7, 2011

DeepaMehta 3 v0.5 – Property-Less Data Model

Filed under: Software,Subject Identity — Patrick Durusau @ 6:18 pm

DeepaMehta 3 v0.5 – Property-Less Data Model

I started to outline all the issues with the property-less solution but then thought, what a nice classroom exercise!

What do you think are the issues with the “solution?” Write a maximum of three (3) pages with no citations.

September 20, 2011

Silverlight® Visualizations… Changing the Way We Look at Predictive Analytics

Filed under: Analytics,Prediction,Subject Identity — Patrick Durusau @ 7:53 pm

Silverlight® Visualizations… Changing the Way We Look at Predictive Analytics

Webinar: Tuesday, October 18, 2011 10:00 AM – 11:00 AM PDT

Presented by Caroline Junkin, Director of Analytics Solutions for Predixion Software.

That’s about all the webinar form says so I went looking for more information. 😉

Predixion Insight™ Video Library

From that page:

Predixion Software’s video library contains tutorials that explore the predictive analytics features currently available in Predixion Insight™, demonstrations that walk you through various applications for predictive analytics and Webinar Replays.

If subjects can include subjects that some people don’t think exist, then subjects can certainly include subjects we think may exist at some point in the future. And no doubt our references to them will change over time.

September 8, 2011

Summing up Properties with subjectIdentifiers/URLs?

Filed under: Identification,Identifiers,Intelligence,Subject Identifiers,Subject Identity — Patrick Durusau @ 6:06 pm

I was picking tomatoes in the garden when I thought about telling Carol (my wife) the plants are about to stop producing.

Those plants are at a particular address, in the backyard, middle garden bed of three, are of three different varieties, but I am going to sum up those properties by saying: “The tomatoes are about to stop producing.”

It occurred to me that a subjectIdentifier could be assigned to a topic element on the basis of summing up properties of the topic.* That would have the advantage of enabling merging on the basis of subjectIdentifiers as opposed to more complex tests upon properties of a topic.
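A sketch of the summing-up idea: pick the properties you are willing to treat as identifying, canonicalize them, and derive a subjectIdentifier from their digest; topics that produce the same identifier merge on it directly, and the property tests only had to run once, at assignment time. The property choice and URI scheme are mine, for illustration.

```python
import hashlib

def summed_subject_identifier(topic, identifying_properties=("location", "bed", "crop")):
    """Derive a subjectIdentifier by 'summing up' selected properties of a topic."""
    canonical = "|".join(f"{p}={topic.get(p, '').strip().lower()}"
                         for p in sorted(identifying_properties))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return f"http://example.com/sid/{digest}"

t1 = {"location": "backyard", "bed": "middle of three", "crop": "tomatoes", "variety": "Roma"}
t2 = {"location": "Backyard", "bed": "middle of three", "crop": "Tomatoes", "variety": "Brandywine"}

# Different varieties, same summed identity: the two topics merge on the identifier alone.
print(summed_subject_identifier(t1) == summed_subject_identifier(t2))   # True
```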

Disclosure of the basis for assignment of a subjectIdentifier is an interesting question.

It could be that a service wishes to produce subjectIdentifiers and index information based upon complex property measures, producing for consumption the subjectIdentifiers and merge-capable indexes on one or more information sets. The basis for merging would be the competitive edge offered by the service.

If promoting merging with a vendor’s process or format, which is seeking to become the TCP/IP of some area, the basis for merging and tools to assist with it will be supplied.

Or if you are an intelligence agency and you want an inward and outward facing interface that promotes merging of information but does not disclose your internal basis for identification, variants of this technique may be of interest.

*The notion of summing up imposes no prior constraints on the tests used or the location of the information subjected to those tests.

August 18, 2011
