Archive for the ‘Knowledge’ Category

YAGO: A High-Quality Knowledge Base (Open Source)

Saturday, October 28th, 2017

YAGO: A High-Quality Knowledge Base


YAGO is a huge semantic knowledge base, derived from Wikipedia, WordNet, and GeoNames. Currently, YAGO has knowledge of more than 10 million entities (like persons, organizations, cities, etc.) and contains more than 120 million facts about these entities.

YAGO is special in several ways:

  1. The accuracy of YAGO has been manually evaluated, proving a confirmed accuracy of 95%. Every relation is annotated with its confidence value.
  2. YAGO combines the clean taxonomy of WordNet with the richness of the Wikipedia category system, assigning the entities to more than 350,000 classes.
  3. YAGO is an ontology that is anchored in time and space. YAGO attaches a temporal dimension and a spatial dimension to many of its facts and entities.
  4. In addition to a taxonomy, YAGO has thematic domains such as "music" or "science" from WordNet Domains.
  5. YAGO extracts and combines entities and facts from 10 Wikipedias in different languages.

YAGO is developed jointly with the DBWeb group at Télécom ParisTech University.

Before you are too impressed by the numbers, which are impressive, realize that 10 million entities is roughly 3% of the current US population. To say nothing of any other entities we might want to include along with them. It’s a good start and very useful, but realize it is a limited set of entities.

All the source data is available, along with the source code.

Would be interesting to see how useful the entity set is when used with US campaign contribution data.


The Pretence of Knowledge

Friday, October 24th, 2014

The Pretence of Knowledge by Friedrich August von Hayek. (Nobel Prize Lecture in Economics, December 11, 1974)

From the lecture:

The particular occasion of this lecture, combined with the chief practical problem which economists have to face today, have made the choice of its topic almost inevitable. On the one hand the still recent establishment of the Nobel Memorial Prize in Economic Science marks a significant step in the process by which, in the opinion of the general public, economics has been conceded some of the dignity and prestige of the physical sciences. On the other hand, the economists are at this moment called upon to say how to extricate the free world from the serious threat of accelerating inflation which, it must be admitted, has been brought about by policies which the majority of economists recommended and even urged governments to pursue. We have indeed at the moment little cause for pride: as a profession we have made a mess of things.

It seems to me that this failure of the economists to guide policy more successfully is closely connected with their propensity to imitate as closely as possible the procedures of the brilliantly successful physical sciences – an attempt which in our field may lead to outright error. It is an approach which has come to be described as the “scientistic” attitude – an attitude which, as I defined it some thirty years ago, “is decidedly unscientific in the true sense of the word, since it involves a mechanical and uncritical application of habits of thought to fields different from those in which they have been formed.” I want today to begin by explaining how some of the gravest errors of recent economic policy are a direct consequence of this scientistic error.

If you have some time for serious thinking over the weekend, visit or re-visit this lecture.

Substitute “computistic” for “scientistic,” and take capturing semantics as the goal.

Google and other search engines are overwhelming proof that some semantics can be captured by computers, but they are equally evidence of a semantic capture gap.

Any number of proposals exist to capture semantics (ontologies, Description Logic, RDF, OWL), but none is based on an empirical study of how semantics originate, change and function in human society. Such proposals are snapshots of a small group’s understanding of semantics. Your mileage may vary.

Depending on your goals and circumstances, one or more proposals may be useful. But capturing and maintaining semantics without a basis in empirical study of semantics seems like a hit-or-miss proposition.

Or at least historical experience with capturing and maintaining semantics points in that direction.

I first saw this in a tweet by Chris Diehl.

Knowledge Base Completion…

Thursday, February 6th, 2014

Knowledge Base Completion via Search-Based Question Answering by Robert West,


Over the past few years, massive amounts of world knowledge have been accumulated in publicly available knowledge bases, such as Freebase, NELL, and YAGO. Yet despite their seemingly huge size, these knowledge bases are greatly incomplete. For example, over 70% of people included in Freebase have no known place of birth, and 99% have no known ethnicity. In this paper, we propose a way to leverage existing Web-search–based question-answering technology to fill in the gaps in knowledge bases in a targeted way. In particular, for each entity attribute, we learn the best set of queries to ask, such that the answer snippets returned by the search engine are most likely to contain the correct value for that attribute. For example, if we want to find Frank Zappa’s mother, we could ask the query “who is the mother of Frank Zappa”. However, this is likely to return “The Mothers of Invention”, which was the name of his band. Our system learns that it should (in this case) add disambiguating terms, such as Zappa’s place of birth, in order to make it more likely that the search results contain snippets mentioning his mother. Our system also learns how many different queries to ask for each attribute, since in some cases, asking too many can hurt accuracy (by introducing false positives). We discuss how to aggregate candidate answers across multiple queries, ultimately returning probabilistic predictions for possible values for each attribute. Finally, we evaluate our system and show that it is able to extract a large number of facts with high confidence.

I was glad to see this paper was relevant to searching because any paper with Frank Zappa and “The Mothers of Invention” in the abstract deserves to be cited. 😉 I will tell you that story another day.

It’s heavy reading and I have just begun but I wanted to mention something from early in the paper:

We show that it is better to ask multiple queries and aggregate the results, rather than rely on the answers to a single query, since integrating several pieces of evidence allows for more robust estimates of answer correctness.
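To make the aggregation idea concrete, here is a minimal Python sketch. It is not the paper's system: the scoring rule (sum the per-query scores, then normalize) and the candidate scores are my own assumptions, and MOTHER_NAME is a placeholder rather than a value taken from the paper.

```python
from collections import defaultdict

def aggregate_answers(results_per_query):
    """Combine candidate answers from several queries into normalized scores.

    results_per_query: list of dicts, one per query, mapping a candidate
    answer to a (hypothetical) evidence score.
    Returns a dict mapping each candidate to its share of the total score.
    """
    totals = defaultdict(float)
    for candidates in results_per_query:
        for answer, score in candidates.items():
            totals[answer] += score
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {answer: score / grand_total for answer, score in totals.items()}

# Made-up scores from three differently worded queries about Zappa's mother.
results = [
    {"The Mothers of Invention": 3.0, "MOTHER_NAME": 1.0},  # bare query
    {"MOTHER_NAME": 2.0, "The Mothers of Invention": 1.0},  # plus birthplace
    {"MOTHER_NAME": 2.0},                                   # plus birth year
]

scores = aggregate_answers(results)
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

With these invented numbers, the first query on its own favors the band, while the combined evidence favors the placeholder for the mother. That is the robustness the authors describe, in miniature.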

Does the use of multiple queries run counter to the view that querying a knowledge base, be it RDF or topic maps or other, should result in a single answer?

If you were to ask me a non-trivial question five (5) days in a row (same question) you would get at least five different answers. All in response to the same question but eliciting slightly different information.

Should we take the same approach to knowledge bases? Or do we in fact already take that approach by querying search engines with slightly different queries?


I first saw this in a tweet by Stefano Bertolo.

Knowledge Leakage:…

Thursday, August 22nd, 2013

Knowledge Leakage: The Destructive Impact of Failing to Train on ERP Projects by Cushing Anderson.


This IDC study refines the concept of knowledge leakage and the factors that compound and mitigate the impact of knowledge leakage on an IT organization. It also suggests strategies for IT management to reduce the impact of knowledge leakage on organizational performance.

There is a silent killer in every IT organization — knowledge leakage. IT organizations are in a constant state of flux. The IT environment, the staff, and the organizational goals change continuously. At the same time, organizational performance must be as high as possible, but the impact of changing staff and skill leakage can cause 50% of an IT organization’s skills to be lost in six years.

“Knowledge leak is the degradation of skills over time, and it occurs in every organization, every time. It doesn’t discriminate based on operating system or platform, but it can kill organizational performance in as little as a couple of years.” — Cushing Anderson, vice president, IT Education and Certification research

I don’t have an IDC account so I can’t share with you what goodies may be inside this article.

I do think that “knowledge leakage” is a good synonym for “organizational memory.” Or should that be “organizational memory loss?”

I also don’t think that “knowledge leakage” is confined to IT organizations.

Ask the nearest supervisor who has had a long-time administrative assistant retire. That’s real “knowledge leakage.”

The problem with capturing organizational knowledge, the unwritten rules of who to ask, for what and when, is that such rules are almost never written down.

And if they were, how would you find them?

Let me leave you with a hint:

The user writing down the unwritten rules needs to use their vocabulary and not one ordained by IT or your corporate office. And they need to walk you through it so you can add your vocabulary to it.

Or to summarize: Say it your way. Find it your way.

If you are interested, you know how to contact me.

Knowledge, Graphs & 3D CAD Systems

Tuesday, July 16th, 2013

Knowledge, Graphs & 3D CAD Systems by David Bigelow.

I like this riff: Phase 1: collect data, Phase 2: something happens, Phase 3: Profit.

David says we need more detail on phase 2. 😉

Covers making data autonomous, capturing design frameworks, people “mobility,” and data flows.

Demo starts at time mark 28:00 or so.

The most interesting part is the emphasis on not storing all the data in the graph database.

The graph database is used to store relationship information, such as who can access particular data and other relationships between data.
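A rough Python sketch of that division of labor follows. It is my own illustration, not code from the talk: the class, the identifiers and the `external_store` dictionary are all hypothetical stand-ins for a real graph database and a real CAD vault.

```python
from collections import defaultdict

class RelationshipGraph:
    """Holds only typed edges between identifiers; payloads live elsewhere."""

    def __init__(self):
        self._edges = defaultdict(set)  # (source, relation) -> set of targets

    def relate(self, source, relation, target):
        self._edges[(source, relation)].add(target)

    def targets(self, source, relation):
        # .get avoids creating empty entries for unknown keys.
        return set(self._edges.get((source, relation), set()))

# Hypothetical external system that actually holds the bulky data.
external_store = {
    "cad://bracket-v3": "<large CAD model stored outside the graph>",
}

graph = RelationshipGraph()
graph.relate("user:alice", "can_access", "cad://bracket-v3")
graph.relate("cad://bracket-v3", "derived_from", "cad://bracket-v2")

def fetch_if_allowed(user, doc_id):
    """Consult the graph for permission, then fetch from the external store."""
    if doc_id in graph.targets(user, "can_access"):
        return external_store.get(doc_id)
    return None

print(fetch_if_allowed("user:alice", "cad://bracket-v3"))
print(fetch_if_allowed("user:bob", "cad://bracket-v3"))
```

The graph answers "who may see what" and "what came from what"; the model itself never enters it.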

Remarkably similar to some comments I have made recently on using topic maps to supplement other information systems as opposed to replacing them.

Some principles of intelligent tutoring

Tuesday, February 12th, 2013

Some principles of intelligent tutoring by Stellan Ohlsson. (Instructional Science May 1986, Volume 14, Issue 3-4, pp 293-326)


Research on intelligent tutoring systems is discussed from the point of view of providing moment-by-moment adaptation of both content and form of instruction to the changing cognitive needs of the individual learner. The implications of this goal for cognitive diagnosis, subject matter analysis, teaching tactics, and teaching strategies are analyzed. The results of the analyses are stated in the form of principles about intelligent tutoring. A major conclusion is that a computer tutor, in order to provide adaptive instruction, must have a strategy which translates its tutorial goals into teaching actions, and that, as a consequence, research on teaching strategies is central to the construction of intelligent tutoring systems.

Be sure to notice the date: 1986, when you could write:

The computer offers the potential for adapting instruction to the student at a finer grain-level than the one which concerned earlier generations of educational researchers. First, instead of adapting to global traits such as learning style, the computer tutor can, in principle, be programmed to adapt to the student dynamically, during on-going instruction, at each moment in time providing the kind of instruction that will be most beneficial to the student at that time. Said differently, the computer tutor takes a longitudinal, rather than cross-sectional, perspective, focussing on the fluctuating cognitive needs of a single learner over time, rather than on stable inter-individual differences. Second, and even more important, instead of adapting to content-free characteristics of the learner such as learning rate, the computer can, in principle, be programmed to adapt both the content and the form of instruction to the student’s understanding of the subject matter. The computer can be programmed, or so we hope, to generate exactly that question, explanation, example, counter-example, practice problem, illustration, activity, or demonstration which will be most helpful to the learner. It is the task of providing dynamic adaptation of content and form which is the challenge and the promise of computerized instruction.

That was written decades before we were habituated to users adapting to the interface, not the other way around.

More on point, the quote from Ohlsson, Principle of Non-Equifinality of Learning, was preceded by:

But there are no canonical representations of knowledge. Any knowledge domain can be seen from several different points of view, each view showing a different structure, a different set of parts, differently related. This claim, however broad and blunt – almost impolite – it may appear when laid out in print, is I believe, incontrovertible. In fact, the evidence for it is so plentiful that we do not notice it, like the fish in the sea who never thinks about water. For instance, empirical studies of expertise regularly show that human experts differ in their problem solutions (e.g., Prietula and Marchak, 1985); at the other end of the scale, studies of young children tend to show that they invent a variety of strategies even for simple tasks, (e.g., Young, 1976; Svenson and Hedenborg, 1980). As a second instance, consider rational analyses of thoroughly codified knowledge domains such as the arithmetic of rational numbers. The traditional mathematical treatment by Thurstone (1956) is hard to relate to the didactic analysis by Steiner (1969), which, in turn, does not seem to have much in common with the informal, but probing, analyses by Kieren (1976, 1980) – and yet, they are all experts trying to express the meaning of, for instance, “two-thirds”. In short, the process of acquiring a particular subject matter does not converge on a particular representation of that subject matter. This fact has such important implications for instruction that it should be stated as a principle.

The first two sentences capture the essence of topic maps as well as any I have ever seen:

But there are no canonical representations of knowledge. Any knowledge domain can be seen from several different points of view, each view showing a different structure, a different set of parts, differently related.
(emphasis added)

Single knowledge representations, such as those in bank accounting systems, can be very useful. But when multiple banks with different accounting systems try to roll knowledge up to the Federal Reserve, different (not better) representations may be required.

Could even require representations that support robust mappings between different representations.
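As a toy illustration of such mappings, consider the Python sketch below. Everything in it is hypothetical: the two banks, their field names, the units and the shared reporting view are invented for the example, and a real roll-up would be far messier.

```python
# Hypothetical records from two banks that describe the same kind of thing
# with different field names and different units.
bank_a_record = {"acct_no": "A-100", "balance_usd": 2500.0}
bank_b_record = {"accountId": "B-7", "balanceCents": 1250000}

# One mapping per source: how to derive each shared field from a record.
MAPPINGS = {
    "bank_a": {
        "account": lambda r: r["acct_no"],
        "balance_usd": lambda r: r["balance_usd"],
    },
    "bank_b": {
        "account": lambda r: r["accountId"],
        "balance_usd": lambda r: r["balanceCents"] / 100,
    },
}

def to_shared_view(source, record):
    """Project a source record into the shared view without altering it."""
    mapping = MAPPINGS[source]
    return {field: extract(record) for field, extract in mapping.items()}

rolled_up = [
    to_shared_view("bank_a", bank_a_record),
    to_shared_view("bank_b", bank_b_record),
]
total = sum(row["balance_usd"] for row in rolled_up)
print(rolled_up, total)
```

Neither bank changes its own representation. The mapping is a third artifact that can be inspected, corrected or replaced on its own.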

What do you think?

Software fences

Saturday, September 8th, 2012

Software fences by John D. Cook.

A great quote from G. K. Chesterton.

Do reformers of every generation think their forefathers were fools or do reformers have a mistaken belief in “progress?”

Rather than saying “progress,” what if we say we know things “differently” than our forefathers?

Not better or worse, just differently.

Ignorance by Stuart Firestein; It’s Not Rocket Science by Ben Miller – review

Tuesday, July 31st, 2012

Ignorance by Stuart Firestein; It’s Not Rocket Science by Ben Miller – review by Adam Rutherford

From the review, speaking of “Ignorance” by Stuart Firestein, Adam writes:

Stuart Firestein, a teacher and neuroscientist, has written a splendid and admirably short book about the pleasure of finding things out using the scientific method. He smartly outlines how science works in reality rather than in stereotype. His MacGuffin – the plot device to explore what science is – is ignorance, on which he runs a course at Columbia University in New York. Although the word “science” is derived from the Latin scire (to know), this misrepresents why it is the foundation and deliverer of civilisation. Science is to not know but have a method to find out. It is a way of knowing.

Firestein is also quick to dispel the popular notion of the scientific method, more often than not portrayed as a singular thing enshrined in stone. The scientific method is more of a utility belt for ignorance. Certainly, falsification and inductive reasoning are cornerstones of converting unknowns to knowns. But much published research is not hypothesis-driven, or even experimental, and yet can generate robust knowledge. We also invent, build, take apart, think and simply observe. It is, Firestein says, akin to looking for a black cat in a darkened room, with no guarantee the moggy is even present. But the structure of ignorance is crucial, and not merely blind feline fumbling.

The size of your questions is important, and will be determined by how much you know. Therein lies a conundrum of teaching science. Questions based on pure ignorance can be answered with knowledge. Scientific research has to be born of informed ignorance, otherwise you are not finding new stuff out. Packed with real examples and deep practical knowledge, Ignorance is a thoughtful introduction to the nature of knowing, and the joy of curiosity.

Not to slight “It’s Not Rocket Science,” but I am much more sympathetic to discussions of the “…structure of ignorance…” and how we model those structures.

If you are interested in such arguments, consider the Oxford Handbook of Skepticism. I don’t have a copy (you can fix that if you like) but it is reported to have good coverage of the subject of ignorance.

Knowledge Design Patterns

Saturday, June 16th, 2012

Knowledge Design Patterns

John Sowa announced these slides as:

Last week, I presented a 3-hour tutorial on Knowledge Design Patterns at the Semantic Technology Conference in San Francisco. Following are the slides:

The talk was presented on June 4, but these are the June 10th version of the slides. They include a few revisions and extensions, which I added to clarify some of the issues and to answer some of the questions that were asked during the presentation.

And John posted an outline of the 130 slides:

Outline of This Tutorial

1. What are knowledge design patterns?
2. Foundations of ontology.
3. Syllogisms, categorical and hypothetical.
4. Patterns of logic.
5. Combining logic and ontology.
6. Patterns of patterns of patterns.
7. Simplifying the user interface.

Particularly if you have never seen a Sowa presentation, take a look at the slides.

From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications

Friday, May 18th, 2012

From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications

From the post:

Here is the first series of videos from the UC Berkeley Conference: From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications (May 7-11, 2012). Congratulations to the local Organizing committee: Joshua Bloom, Damian Eads, Berian James, Peter Nugent, John Rice, Joseph Richards and Dan Starr for making the meeting happen and putting it all on videos for others to learn from. (in near real time!). The titles of the talks are linked to the presentation slides. The full program which ends tomorrow is here. All the videos are here: Part 1, Part 2, Part 3, Part 4, Part 5.

Posted by Igor Carron at Nuit Blanche.

Finding enough hours to watch all of these is going to be a problem!

Which ones do you like best?

Ontological Conjunctive Query Answering over large, semi-structured knowledge bases

Saturday, February 25th, 2012

Ontological Conjunctive Query Answering over large, semi-structured knowledge bases

From the description:

Ontological Conjunctive Query Answering knows today a renewed interest in knowledge systems that allow for expressive inferences. Most notably in the Semantic Web domain, this problem is known as Ontology-Based Data Access. The problem consists in, given a knowledge base with some factual knowledge (very often a relational database) and universal knowledge (ontology), to check if there is an answer to a conjunctive query in the knowledge base. This problem has been successfully studied in the past, however the emergence of large and semi-structured knowledge bases and the increasing interest on non-relational databases have slightly changed its nature.

This presentation will highlight the following aspects. First, we introduce the problem and the manner we have chosen to address it. We then discuss how the size of the knowledge base impacts our approach. In a second time, we introduce the ALASKA platform, a framework for performing knowledge representation & reasoning operations over heterogeneously stored data. Finally we present preliminary results obtained by comparing efficiency of existing storage systems when storing knowledge bases of different sizes on disk and future implications.
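For readers new to the problem, here is a minimal in-memory Python sketch of conjunctive query answering with a tiny ontology. It is my own illustration and has nothing to do with how ALASKA is implemented: the facts, the single subclass rule and the `?`-prefixed variable convention are all assumptions made for the example.

```python
def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def match(atom, fact, binding):
    """Extend `binding` so that `atom` equals `fact`; return None on failure."""
    if len(atom) != len(fact):
        return None
    new_binding = dict(binding)
    for term, value in zip(atom, fact):
        if is_var(term):
            if new_binding.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new_binding

def answers(query, facts, binding=None):
    """Yield every variable binding that satisfies all atoms of the query."""
    binding = binding or {}
    if not query:
        yield binding
        return
    first, rest = query[0], query[1:]
    for fact in facts:
        extended = match(first, fact, binding)
        if extended is not None:
            yield from answers(rest, facts, extended)

def saturate(facts, subclass_of):
    """Apply 'every X is a Y' rules until no new type facts appear."""
    facts = set(facts)
    while True:
        derived = {
            ("type", entity, subclass_of[cls])
            for (pred, entity, cls) in facts
            if pred == "type" and cls in subclass_of
        }
        if derived <= facts:
            return facts
        facts |= derived

# Factual knowledge (all triples) and universal knowledge (one rule).
facts = {("type", "alice", "Professor"), ("teaches", "alice", "logic101")}
ontology = {"Professor": "Teacher"}

# "Is there a Teacher who teaches something?"
query = [("type", "?x", "Teacher"), ("teaches", "?x", "?c")]

without_ontology = list(answers(query, facts))
with_ontology = list(answers(query, saturate(facts, ontology)))
print(without_ontology, with_ontology)
```

The query has no answer over the raw facts and one answer once the ontology is taken into account. A platform such as the one described has to do this over large, heterogeneous stores rather than a Python set.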

Slides help as always.

Introduces ALASKA – Abstract Logic-based Architecture Storage systems & Knowledge base Analysis.

Its goal is to make it possible to perform OCQA in a logical, generic manner over existing, heterogeneous storage systems.

“ALASKA” is the author’s first acronym.

The results for Oracle software (slide 25) make me suspect the testing protocol. Not that Oracle wins every contest by any means, but such poor performance indicates some issue other than its native capabilities.

Molecules from scratch without the fiendish physics

Sunday, February 12th, 2012

Molecules from scratch without the fiendish physics by Lisa Grossman.

From the post:

But because the equation increases in complexity as more electrons and protons are introduced, exact solutions only exist for the simplest systems: the hydrogen atom, composed of one electron and one proton, and the hydrogen molecule, which has two electrons and two protons.

This complexity rules out the possibility of exactly predicting the properties of large molecules that might be useful for engineering or medicine. “It’s out of the question to solve the Schrödinger equation to arbitrary precision for, say, aspirin,” says von Lilienfeld.

So he and his colleagues bypassed the fiendish equation entirely and turned instead to a computer-science technique.

Machine learning is already widely used to find patterns in large data sets with complicated underlying rules, including stock market analysis, ecology and Amazon’s personalised book recommendations. An algorithm is fed examples (other shoppers who bought the book you’re looking at, for instance) and the computer uses them to predict an outcome (other books you might like). “In the same way, we learn from molecules and use them as previous examples to predict properties of new molecules,” says von Lilienfeld.

His team focused on a basic property: the energy tied up in all the bonds holding a molecule together, the atomisation energy. The team built a database of 7165 molecules with known atomisation energies and structures. The computer used 1000 of these to identify structural features that could predict the atomisation energies.

When the researchers tested the resulting algorithm on the remaining 6165 molecules, it produced atomisation energies within 1 per cent of the true value. That is comparable to the accuracy of mathematical approximations of the Schrödinger equation, which work but take longer to calculate as molecules get bigger (Physical Review Letters, DOI: 10.1103/PhysRevLett.108.058301). (emphasis added)
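The evaluation protocol in that passage (learn from 1000 examples, test on the remaining 6165) can be sketched in a few lines of Python. To be clear about the assumptions: the data below is synthetic, the single numeric "descriptor" and the linear energy formula are invented, and the nearest-neighbor predictor is a stand-in of my own choosing, not the researchers' algorithm.

```python
import random

random.seed(0)

# Synthetic stand-ins for molecules: (descriptor, energy) pairs.
# The linear relationship is invented purely so there is something to learn.
molecules = []
for _ in range(7165):
    descriptor = random.uniform(1.0, 2.0)
    molecules.append((descriptor, 100.0 * descriptor))

# Hold out everything except the first 1000 shuffled examples.
random.shuffle(molecules)
train, test = molecules[:1000], molecules[1000:]

def predict(descriptor):
    """Return the energy of the training example with the closest descriptor."""
    nearest = min(train, key=lambda example: abs(example[0] - descriptor))
    return nearest[1]

# Score only on examples the predictor never saw during "training".
relative_errors = [abs(predict(d) - energy) / energy for d, energy in test]
mean_relative_error = sum(relative_errors) / len(relative_errors)

print(f"train={len(train)} test={len(test)} "
      f"mean relative error={mean_relative_error:.4%}")
```

The point is the protocol, not the predictor: the error is measured on held-out examples, which is what makes a "within 1 per cent" claim meaningful.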

One way to look at this research is to say we have three avenues to discovering the properties of molecules:

  1. Formal logic – but would require far more knowledge than we have at the moment
  2. Schrödinger equation – but that may be intractable for some molecules
  3. Knowledge-based approach – May be less precise than 1 & 2 but works now.

A knowledge-based approach allows us to make progress now. Topic maps can be annotated with other methods, such as math or research results, up to and including formal logic.

The biggest difference with topic maps is that the information you wish to record or act upon is not restricted ahead of time.

To Know, but Not Understand: David Weinberger on Science and Big Data

Wednesday, January 4th, 2012

To Know, but Not Understand: David Weinberger on Science and Big Data

From the introduction:

In an edited excerpt from his new book, Too Big to Know, David Weinberger explains how the massive amounts of data necessary to deal with complex phenomena exceed any single brain’s ability to grasp, yet networked science rolls on.

Well, it is a highly entertaining excerpt, with passages like:

For example, the biological system of an organism is complex beyond imagining. Even the simplest element of life, a cell, is itself a system. A new science called systems biology studies the ways in which external stimuli send signals across the cell membrane. Some stimuli provoke relatively simple responses, but others cause cascades of reactions. These signals cannot be understood in isolation from one another. The overall picture of interactions even of a single cell is more than a human being made out of those cells can understand. In 2002, when Hiroaki Kitano wrote a cover story on systems biology for Science magazine — a formal recognition of the growing importance of this young field — he said: “The major reason it is gaining renewed interest today is that progress in molecular biology … enables us to collect comprehensive datasets on system performance and gain information on the underlying molecules.” Of course, the only reason we’re able to collect comprehensive datasets is that computers have gotten so big and powerful. Systems biology simply was not possible in the Age of Books.

Weinberger slips ’twixt and ’tween philosophy of science, epistemology, various aspects of biology and computational science. Not to mention the odd bald-faced assertion such as: “…the biological system of an organism is complex beyond imagining.” At one time that could have been said about the atom. I think some progress has been made on understanding that last item, or so physicists claim.

Don’t get me wrong, I have a copy on order and look forward to reading it.

But, no single reader will be able to discover all the factual errors and leaps of logic in Too Big to Know. Perhaps a website or wiki, Too Big to Correct?