Archive for November, 2011

Model Thinking

Wednesday, November 30th, 2011

Model Thinking by Scott E. Page.

Marijane sent this link in a comment to my post on Stanford classes.

From the class description:

We live in a complex world with diverse people, firms, and governments whose behaviors aggregate to produce novel, unexpected phenomena. We see political uprisings, market crashes, and a never ending array of social trends. How do we make sense of it?

Models. Evidence shows that people who think with models consistently outperform those who don’t. And, moreover people who think with lots of models outperform people who use only one.

Why do models make us better thinkers?

Models help us to better organize information – to make sense of that fire hose or hairball of data (choose your metaphor) available on the Internet. Models improve our abilities to make accurate forecasts. They help us make better decisions and adopt more effective strategies. They even can improve our ability to design institutions and procedures.

In this class, I present a starter kit of models: I start with models of tipping points. I move on to cover models that explain the wisdom of crowds, models that show why some countries are rich and some are poor, and models that help unpack the strategic decisions of firms and politicians.

The models covered in this class provide a foundation for future social science classes, whether they be in economics, political science, business, or sociology. Mastering this material will give you a huge leg up in advanced courses. They also help you in life.

Here’s how the course will work.

For each model, I present a short, easily digestible overview lecture. Then, I’ll dig deeper. I’ll go into the technical details of the model. Those technical lectures won’t require calculus but be prepared for some algebra. For all the lectures, I’ll offer some questions and we’ll have quizzes and even a final exam. If you decide to do the deep dive, and take all the quizzes and the exam, you’ll receive a certificate of completion. If you just decide to follow along for the introductory lectures to gain some exposure that’s fine too. It’s all free. And it’s all here to help make you a better thinker!

Hope you can join the course this January.

As Marijane says, “…awfully relevant to Topic Maps!”

Common Crawl

Wednesday, November 30th, 2011

Common Crawl

From the webpage:

Common Crawl is a non-profit foundation dedicated to building and maintaining an open crawl of the web, thereby enabling a new wave of innovation, education and research.

As the largest and most diverse collection of information in human history, the web grants us tremendous insight if we can only understand it better. For example, web crawl data can be used to spot trends and identify patterns in politics, economics, health, popular culture and many other aspects of life. It provides an immensely rich corpus for scientific research, technological advancement, and innovative new businesses. It is crucial for our information-based society that the web be openly accessible to anyone who desires to utilize it.

We strive to be transparent in all of our operations and we support nofollow and robots.txt. For more information about the ccBot, please see FAQ. For more information on Common Crawl data and how to access it, please see Data. For access to our open source code, please see our GitHub repository.

The current crawl is reported to be 5 billion pages. That should keep your hard drives spinning enough to help with heating in cold climes!

Looks like a nice place to learn a good bit about searching as well as processing serious sized data.

NPR’s radio series on Big Data

Wednesday, November 30th, 2011

NPR’s radio series on Big Data by David Smith.

Public-radio network NPR has just broadcast a 2-part series about Big Data on its Morning Edition program. For anyone with 10 minutes to spare, it’s a great overview of the impact of Big Data and the data scientists who derive value from the data. Part 1 is about companies that make use of Big Data, and the implications for businesses and individuals. Part 2 is about the demand for data scientists to analyze big data. (Key quote: “Math and Statistics are the sexiest skills around.”) You can listen to both segments online at the links below.

NPR: Following Digital Breadcrumbs To ‘Big Data’ Gold ; The Search For Analysts To Make Sense Of ‘Big Data’

From Revolution Analytics, a great place to hang out for R and other news.

You would have to work for NPR to think: “Math and Statistics are the sexiest skills around” 😉

Seriously, the demand for making sense out of the coming flood of data (you haven’t seen anything yet) is only going to increase. All the “let’s stop while we take 6 months to analyze a particular data set” type solutions are going to be swept away. Analysis is going to be required, but on a cost-benefit basis. And one of the benefits isn’t going to be “works with your software.”

balanced binary search trees exercise for algorithms and data structures class

Wednesday, November 30th, 2011

balanced binary search trees exercise for algorithms and data structures class by René Pickhardt.

From the post:

I created some exercises regarding binary search trees. This time there is no coding involved. My experience from teaching former classes is that many people have a hard time understanding why trees are useful and what the dangers of these trees are. Therefore I have created some straightforward exercises that nevertheless involve some work and will hopefully help the students to better understand and internalize the concepts of binary search trees, which are in my opinion one of the most fundamental and important concepts in a class about algorithms and data structures.
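The danger René alludes to is easy to show in a few lines. A minimal sketch (Python, my own illustration, not part of his exercises): inserting keys in sorted order degenerates a plain BST into a linked list, which is exactly what balancing is meant to prevent.

```python
class Node:
    """A plain (unbalanced) binary search tree node."""
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert key into the BST, returning the (possibly new) root."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root

def height(root):
    if root is None:
        return 0
    return 1 + max(height(root.left), height(root.right))

# Mixed-order insertion keeps the tree shallow...
balanced = None
for k in [4, 2, 6, 1, 3, 5, 7]:
    balanced = insert(balanced, k)

# ...but sorted insertion degenerates it into a linked list,
# which is the danger the exercises are meant to expose.
degenerate = None
for k in [1, 2, 3, 4, 5, 6, 7]:
    degenerate = insert(degenerate, k)

print(height(balanced))    # 3
print(height(degenerate))  # 7
```

Seven keys give height 3 in the good case and height 7 in the bad one; at scale that is the difference between O(log n) and O(n) lookups.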

I visited René’s blog because of the Google n gram post but could not leave without mentioning these exercises.

Great teaching technique!

What parts of topic maps should be illustrated with similar exercises?

PS: Still working on it but I am thinking that the real power of topic maps lies in their lack of precision, or rather that a topic map can be as precise or as loose as need be. No pre-set need to have a decidable outcome. Or perhaps rather, it can have a decidable outcome that is the decidable outcome because I say it is so. 😉

Ad Hoc Normalization II

Wednesday, November 30th, 2011

After writing Ad Hoc Normalization it occurred to me that topic maps offer another form of “ad hoc” normalization.

I don’t know what else you would call merging two topic maps together?

Try that with two relational databases.

So, not only can topic maps maintain “internal” ad hoc normalization but also “external” ad hoc normalization with data sources that were not present at the time of the creation of a topic map.

But there are other forms of normalization.

Recall that Lars Marius talks about the reduction of information items that represent the same subjects. That can only occur when there is a set of information items that obey the same data model and usually the same syntax. I would call that information model normalization. That is, whatever is supported by a particular information model can be normalized.

For relational databases that is normalization by design and for topic maps that is ad hoc normalization (although some of it could be planned in advance as well).

But there is another form of normalization, a theoretical construct: subject-based normalization. I say it is theoretical because in order to instantiate a particular case you have to cross over into the land of information model normalization.

I find subject-based normalization quite useful, mostly because as human designers/authors, we are not constrained by the limits of our machines. We can hold contradictory ideas at the same time without requiring a cold or hot reboot. Subject-based normalization allows us to communicate with other users what we have seen in data and how we need to process it for particular needs.

Download Google n gram data set and neo4j source code for storing it

Wednesday, November 30th, 2011

Download Google n gram data set and neo4j source code for storing it by René Pickhardt.

From the post:

At the end of September I discovered an amazing data set which is provided by Google! It is called the Google n gram data set. Even though the English Wikipedia article about n-grams needs some clean up, it explains nicely what an n-gram is.

The data set is available in several languages and I am sure it is very useful for many tasks in web retrieval, data mining, information retrieval and natural language processing.

I forwarded this data set to two high school students whom I was teaching last summer at the dsa. Now they are working on a project for a German student competition. They are using the n-grams and neo4j to predict sentences and help people to improve typing.

The idea is that once a user has started to type a sentence, statistics about the n-grams can be used to semantically and syntactically correctly predict what the next word will be, and in this way increase the speed of typing by making suggestions to the user. This will be particularly useful with all these mobile devices where typing is really annoying.
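The prediction scheme can be sketched without Neo4j at all: count which words follow which, then rank candidates by frequency. A toy illustration (my own, with an invented corpus), not the students' actual code:

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count word bigrams from a list of sentences."""
    following = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for a, b in zip(words, words[1:]):
            following[a][b] += 1
    return following

def predict_next(following, word, k=3):
    """Suggest the k most frequent words seen after `word`."""
    return [w for w, _ in following[word.lower()].most_common(k)]

corpus = [
    "the quick brown fox",
    "the quick red fox",
    "the slow brown dog",
]
model = train_bigrams(corpus)
print(predict_next(model, "the"))    # ['quick', 'slow']
print(predict_next(model, "quick"))  # ['brown', 'red']
```

Swap the dict for a graph store and the Counter for edge weights and you have roughly the shape of the Neo4j version: words as nodes, "followed-by" as weighted relationships.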

Now, imagine having users mark subjects in texts (highlighting?) and using a sufficient number of such texts to automate the recognition of subjects and their relationships to document authors and other subjects. Does that sound like an easy way to author an ongoing topic map based on the output of an office? Or project?

Feel free to mention at least one prior project that used a very similar technique with texts. If no one does, I will post its name and links to the papers tomorrow.

No Datum is an Island of Serendip

Wednesday, November 30th, 2011

No Datum is an Island of Serendip by Jim Harris.

From the post:

Continuing a series of blog posts inspired by the highly recommended book Where Good Ideas Come From by Steven Johnson, in this blog post I want to discuss the important role that serendipity plays in data — and, by extension, business success.

Let’s start with a brief etymology lesson. The origin of the word serendipity, which is commonly defined as a “happy accident” or “pleasant surprise” can be traced to the Persian fairy tale The Three Princes of Serendip, whose heroes were always making discoveries of things they were not in quest of either by accident or by sagacity (i.e., the ability to link together apparently innocuous facts to come to a valuable conclusion). Serendip was an old name for the island nation now known as Sri Lanka.

“Serendipity,” Johnson explained, “is not just about embracing random encounters for the sheer exhilaration of it. Serendipity is built out of happy accidents, to be sure, but what makes them happy is the fact that the discovery you’ve made is meaningful to you. It completes a hunch, or opens up a door in the adjacent possible that you had overlooked. Serendipitous discoveries often involve exchanges across traditional disciplines. Serendipity needs unlikely collisions and discoveries, but it also needs something to anchor those discoveries. The challenge, of course, is how to create environments that foster these serendipitous connections.”

I don’t disagree about the importance of serendipity but I do wonder about the degree to which we can plan or even facilitate it. At least in terms of software/interfaces, etc.

Remember Malcolm Gladwell and The Tipping Point? It’s a great read but there is one difficulty that I don’t think Malcolm dwells on enough. It is one thing to pick out tipping points (or alleged ones) in retrospect. It is quite another to pick out a tipping point before it occurs and to plan to take advantage of it. There are any number of rationalist explanations for various successes, but they are all after-the-fact constructs that serve particular purposes.

I do think we can make serendipity more likely by exposing people to a variety of information that makes the realization of connections between information more likely. That isn’t to say that serendipity will happen, just that we can create circumstances for people that will make the conditions ripe for it.

Apache Zookeeper 3.3.4

Wednesday, November 30th, 2011

Apache Zookeeper 3.3.4

From the post:

Apache ZooKeeper release 3.3.4 is now available: this is a fix release covering 22 issues, 9 of which were considered blockers. Some of the more serious issues include:

  • ZOOKEEPER-1208 Ephemeral nodes may not be removed after the client session is invalidated
  • ZOOKEEPER-961 Watch recovery fails after disconnection when using chroot connection option
  • ZOOKEEPER-1049 Session expire/close flooding renders heartbeats to delay significantly
  • ZOOKEEPER-1156 Log truncation truncating log too much – can cause data loss
  • ZOOKEEPER-1046 Creating a new sequential node incorrectly results in a ZNODEEXISTS error
  • ZOOKEEPER-1097 Quota is not correctly rehydrated on snapshot reload
  • ZOOKEEPER-1117 zookeeper 3.3.3 fails to build with gcc >= 4.6.1 on Debian/Ubuntu

In case you are unfamiliar with Zookeeper:

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them, which makes them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed. (from Apache Zookeeper)

More Google Cluster Data

Wednesday, November 30th, 2011

More Google Cluster Data

From the post:

Google has a strong interest in promoting high quality systems research, and we believe that providing information about real-life workloads to the academic community can help.

In support of this we published a small (7-hour) sample of resource-usage information from a Google production cluster in 2010 (research blog on Google Cluster Data). Approximately a dozen researchers at UC Berkeley, CMU, Brown, NCSU, and elsewhere have made use of it.

Recently, we released a larger dataset. It covers a longer period of time (29 days) for a larger cell (about 11k machines) and includes significantly more information, including:

I remember Robert Barta describing the use of topic maps for systems administration. This data set could give some insight into the design of a topic map for cluster management.

What subjects and relationships would you recognize, how and why?

If you are looking for employment, this might be a good way to attract Google’s attention. (Hint to Google: Releasing interesting data sets could be a way to vet potential applicants in realistic situations.)

Wakanda
Tuesday, November 29th, 2011

Wakanda
From the documentation page:

Wakanda is an open-source platform that allows you to develop business web applications. It provides a unified stack running on JavaScript from end-to-end:

  • Cross-platform and cloud-ready on the back end
  • Fully functional, go-anywhere desktop, mobile and tablet apps on the front end

You gain the ability to create browser-based data applications that are as fast, stable, and capable as native client/server solutions are on the desktop.


Notice that no code is generated behind your back: no SQL statements, no binaries, …
What you write is what you get.

I’m not sure that’s a good thing, but the name rang a bell and I found an earlier post, Berlin Buzzwords 2011, that just has a slide deck on it.

It’s in Developer Preview 2 (is that pre-pre-alpha or some other designation?) now.

Comments? Anyone looking at this for interface type issues?

I’m the first to admit that most interfaces disappoint, but that isn’t because of the underlying technology. Most interfaces disappoint because they are designed around the underlying technology rather than around their users.

A well-designed green screen would find faster acceptance than any number of current interfaces. (Note I said a well-designed green screen.)

Apache OpenNLP 1.5.2-incubating

Tuesday, November 29th, 2011

From the announcement of the release of Apache OpenNLP 1.5.2-incubating:

The Apache OpenNLP team is pleased to announce the release of version 1.5.2-incubating of Apache OpenNLP.

The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution.

The OpenNLP 1.5.2-incubating binary and source distributions are available for download from our download page:

The OpenNLP library is distributed by Maven Central as well. See the Maven Dependency page for more details:

This release contains a couple of new features, improvements and bug fixes. The maxent trainer can now run in multiple threads to utilize multi-core CPUs, configurable feature generation was added to the name finder, the perceptron trainer was refactored and improved, machine learners can now be configured with many more options via a parameter file, and evaluators can print out detailed evaluation information.

Additionally the release contains the following noteworthy changes:

  • Improved the white space handling in the Sentence Detector and its training code
  • Added more cross validator command line tools
  • Command line handling code has been refactored
  • Fixed problems with the new build
  • Now uses fast token class feature generation code by default
  • Added support for BioNLP/NLPBA 2004 shared task data
  • Removal of old and deprecated code
  • Dictionary case sensitivity support is now done properly
  • Support for OSGi

For a complete list of fixed bugs and improvements please see the RELEASE_NOTES file included in the distribution.

18th International Conference on Knowledge Engineering and Knowledge Management (EKAW 2012)

Tuesday, November 29th, 2011

18th International Conference on Knowledge Engineering and Knowledge Management (EKAW 2012)

Important Submission Dates:

  • Abstract Submission: April 18th, 2012
  • Full Paper Submission: April 25th, 2012
  • Notification: June 6th, 2012
  • Camera-Ready: June 30th, 2012
  • Tutorial & Workshop Proposal: May 16th, 2012
  • Demo Submission: July 2nd, 2012
  • Demo Notification: July 23rd, 2012
  • Demo Camera-Ready Version: August 4th, 2012
  • Poster Submission: August 13th, 2012
  • Poster Notification: September 3rd, 2012

Somewhere further up on the page they said:

Galway, Ireland at the Aula Maxima located in the National University of Ireland Galway Quadrangle from October 8-12, 2012.

I don’t know. With a name like Aula Maxima I was expecting something a bit more impressive. Still, it’s Ireland and so a lot to be said for the location, impressive buildings or no.

From the call for papers:

The 18th International Conference on Knowledge Engineering and Knowledge Management is concerned with all aspects of eliciting, acquiring, modeling and managing knowledge, and its role in the construction of knowledge-intensive systems and services for the semantic web, knowledge management, e-business, natural language processing, intelligent information integration, etc.

The special focus of the 18th edition of EKAW will be on “Knowledge Engineering and Knowledge Management that matters”. We are explicitly calling for papers that have a potentially high impact on a specific community or application domain (e.g. pharmacy and life sciences), as well as for papers which report on the development or evaluation of publicly available data sets relevant for a large number of applications. Moreover, we welcome contributions dealing with problems specific to modeling and maintenance of real-world data or knowledge, such as scalability and robustness of knowledge-based applications, or privacy and provenance issues related to organizational knowledge management.

In addition to the main research track, EKAW 2012 will feature a tutorial and workshop program, as well as a poster and demo track. Moreover, there will be a Doctoral Consortium giving new PhD students a possibility to present their research proposals, and to get feedback on methodological and practical aspects of their planned dissertation.

The proceedings of the conference will be published by Springer Verlag in the LNCS series. The LNCS volume will contain the contributed research papers as well as descriptions of the demos presented at the conference. Papers published at any of the workshops will be published in dedicated workshop proceedings.

EKAW 2012 welcomes papers dealing with theoretical, methodological, experimental, and application-oriented aspects of knowledge engineering and knowledge management. In particular, but not exclusively, we solicit papers about methods, tools and methodologies relevant with regard to the following topics:

Ad Hoc Normalization

Tuesday, November 29th, 2011

I really should not start reading Date over the weekend. It puts me in a relational frame of mind and I start thinking of explanations of topic maps in terms of the relational model.

For example, take his definition of:

First normal form: A relvar is in 1NF if and only if, in every legal value of that relvar, every tuple contains exactly one value for each attribute. (page 358)

Second normal form: (definition assuming only one candidate key, which we assume is the primary key): a relvar is in 2NF if and only if it is in 1NF and every nonkey attribute is irreducibly dependent on the primary key. (page 361)

Third normal form: (definition assuming only one candidate key, which we assume is the primary key): A relvar is in 3NF if and only if it is in 2NF and every nonkey attribute is nontransitively dependent on the primary key. (page 363)

Third normal form (even more informal definition): A relvar is in third normal form (3NF) if and only if, for all time, each tuple consists of a primary key value that identifies some entity, together with a set of zero or more mutually independent values that describe that entity in some way.
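To make the definitions concrete, here is a toy check (Python, my own construction, borrowing Date's familiar suppliers-and-parts flavor) that tests whether one set of attributes functionally determines another; a nonkey attribute determined by only part of a composite key signals a 2NF violation:

```python
# Toy illustration of a 2NF violation: in SP(supplier, part, qty, city),
# the nonkey attribute `city` depends only on `supplier`, i.e. on part
# of the composite primary key (supplier, part).
rows = [
    ("S1", "P1", 300, "London"),
    ("S1", "P2", 200, "London"),
    ("S2", "P1", 100, "Paris"),
]

def determines(rows, lhs, rhs):
    """True if the attribute positions in lhs functionally determine
    the attribute positions in rhs over the given rows."""
    seen = {}
    for row in rows:
        key = tuple(row[i] for i in lhs)
        val = tuple(row[i] for i in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True

# `city` (index 3) is determined by `supplier` (index 0) alone...
print(determines(rows, [0], [3]))  # True
# ...but not by `part` (index 1) alone:
print(determines(rows, [1], [3]))  # False
# So `city` is reducibly dependent on the key (supplier, part):
# the relvar is in 1NF but not 2NF, and should be split.
```

The classic fix is to decompose into SP(supplier, part, qty) and S(supplier, city), which is precisely the design-stage work the post goes on to contrast with topic maps.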

Does that mean that topic maps support ad hoc normalization? That is, we don’t have to design in normalization before we start writing the topic map but can decide on what subjects need to be “normalized,” that is, represented by topics that are reduced to a single representative, after we have started writing the topic map.

Try that with a relational database and tables of any complexity. If you don’t get it right at the design stage, fixing it becomes more expensive as time goes by.

Not a “dig” at relational databases. If your domain is that slow changing and other criteria point to a relational solution, by all means, use one. Performance numbers are hard to beat.

On the other hand, if you need “normalization” and yet you have a rapidly changing environment that is subject to exploration and mappings across domains, you should give topic maps a hard look. Ask for “Ad Hoc Normalization” by name. 😉

PS: I suspect this is what Lars Marius meant by Topic Maps Data Model (TMDM) 6. Merging, 6.1 General:

A central operation in Topic Maps is that of merging, a process applied to a topic map in order to eliminate redundant topic map constructs in that topic map. This clause specifies in which situations merging shall occur, but the rules given here are insufficient to ensure that all redundant information is removed from a topic map.

Any change to a topic map that causes any set to contain two information items equal to each other shall be followed by the merging of those two information items according to the rules given below for the type of information item to which the two equal information items belong.

But I wasn’t “hearing” “…eliminate redundant topic map constructs…” as “normalization.”

Similarity as Association?

Tuesday, November 29th, 2011

I was listening to Ian Robinson’s recent presentation on Dr. Who and Neo4j when Ian remarked that similarity could be modeled as a relationship.

It seemed like an off-hand remark at the time but it struck me as having immediate relevance to using Neo4j with topic maps.

One of my concerns for using Neo4j with topic maps has been the TMDM specification of merging topic items as:

1. Create a new topic item C.
2. Replace A by C wherever it appears in one of the following properties of an information item: [topics], [scope], [type], [player], and [reifier].
3. Repeat for B.
4. Set C’s [topic names] property to the union of the values of A and B’s [topic names] properties.
5. Set C’s [occurrences] property to the union of the values of A and B’s [occurrences] properties.
6. Set C’s [subject identifiers] property to the union of the values of A and B’s [subject identifiers] properties.
7. Set C’s [subject locators] property to the union of the values of A and B’s [subject locators] properties.
8. Set C’s [item identifiers] property to the union of the values of A and B’s [item identifiers] properties.
(TMDM, 6.2 Merging Topic Items)
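Steps 4 through 8 read naturally as set unions. A minimal sketch in Python (my own, ignoring the pointer-replacement steps 2 and 3, with invented subject identifier URIs):

```python
def merge_topics(a, b):
    """Sketch of TMDM 6.2 steps 4-8: the merged topic C takes the
    union of A's and B's set-valued properties. Steps 1-3 (replacing
    references to A and B with C elsewhere in the map) are omitted."""
    c = {}
    for prop in ("topic names", "occurrences",
                 "subject identifiers", "subject locators",
                 "item identifiers"):
        c[prop] = a.get(prop, set()) | b.get(prop, set())
    return c

a = {"topic names": {"ZooKeeper"},
     "subject identifiers": {"http://example.org/zookeeper"}}
b = {"topic names": {"Apache ZooKeeper"},
     "subject identifiers": {"http://example.org/zookeeper"}}
c = merge_topics(a, b)
print(sorted(c["topic names"]))  # ['Apache ZooKeeper', 'ZooKeeper']
```

The unions themselves are cheap; it is the omitted steps 2 and 3, rewriting every reference to A and B, that carry the performance cost discussed below.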

Obviously the TMDM is specifying an end result and not how you get there but still, there has to be a mechanism by which a query that finds A or B also results in the “union of the values of A and B’s [topic names] properties.” (And the other operations specified by the TMDM here and elsewhere.)

Ian’s reference to similarity being modeled as a relationship made me realize that similarity relationships could be created between nodes that share the same [subject identifiers] property value (and other conditions for merging). Thus, when querying a topic map, there should be the main query, followed by a query for “sameness” relationships for any returned objects.

This avoids the performance “hit” of having to update pointers to information items that are literally re-written with new identifiers. Not to mention that processing the information items that will be presented to the user as one object could even be off-loaded onto the client, with a further savings in server side processing.

There is an added bonus to this approach, particularly for complex merging conditions beyond the TMDM. Since the directed edges have properties, it would be possible to dynamically specify merging conditions beyond those of the TMDM based on those properties. Which means that “merging” operations could be “unrolled” as it were.

Or would that be “rolled down?” Thinking that a user could step through each addition of a “merging” condition and observe the values as they were added, along with their source. Perhaps even set “break points” as in debugging software.
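One way the sameness-relationship idea might look, sketched with plain Python dicts standing in for Neo4j nodes and relationships (the node data is invented for illustration): the query follows "same_as" edges at read time instead of rewriting any node.

```python
# Nodes and "same_as" edges standing in for a property graph.
nodes = {
    1: {"name": "Doctor Who"},
    2: {"name": "Dr. Who"},
    3: {"name": "Dalek"},
}
same_as = {(1, 2)}  # created when a merge condition was detected

def sameness_closure(node_id):
    """All node ids reachable over same_as edges (either direction)."""
    result, frontier = {node_id}, [node_id]
    while frontier:
        n = frontier.pop()
        for a, b in same_as:
            for other in ((b,) if a == n else (a,) if b == n else ()):
                if other not in result:
                    result.add(other)
                    frontier.append(other)
    return result

def view(node_id):
    """Present the merged object to the client without rewriting
    any node or updating any pointers."""
    return {nodes[n]["name"] for n in sameness_closure(node_id)}

print(sorted(view(1)))  # ['Doctor Who', 'Dr. Who']
print(sorted(view(3)))  # ['Dalek']
```

Nothing is re-identified: querying either node 1 or node 2 yields the same merged view, and the closure computation could just as easily run client-side.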

Will have to think about this some more and work up some examples in Neo4j. Comments/suggestions most welcome!

PS: You know, if this works, Neo4j already has a query language, Cypher. I don’t know if Cypher supports declaration of routines that can be invoked as parts of queries, but I will investigate that possibility. Just to keep users from having to write really complex queries to gather up all the information items on a subject. That won’t help people using other key/value stores but there are some interesting possibilities there as well. Will depend on the use cases and nature of “merging” requirements.

Deep Learning

Tuesday, November 29th, 2011

Deep Learning… moving beyond shallow machine learning since 2006!

From the webpage:

Deep Learning is a new area of Machine Learning research, which has been introduced with the objective of moving Machine Learning closer to one of its original goals: Artificial Intelligence.

This website is intended to host a variety of resources and pointers to information about Deep Learning. In these pages you will find

  • a reading list
  • links to software
  • datasets
  • a discussion forum
  • as well as tutorials and cool demos

I encountered this site via its Deep Learning Tutorial, which is only one of the tutorial-type resources available on its Tutorials page.

I mention that because the Deep Learning Tutorial looks like it would be of interest to anyone doing data or entity mining.

A Common GPU n-Dimensional Array for Python and C

Tuesday, November 29th, 2011

A Common GPU n-Dimensional Array for Python and C by Frédéric Bastien, Arnaud Bergeron, Pascal Vincent and Yoshua Bengio

From the webpage:

Currently there are multiple incompatible array/matrix/n-dimensional base object implementations for GPUs. This hinders the sharing of GPU code and causes duplicate development work. This paper proposes and presents a first version of a common GPU n-dimensional array (tensor) named GpuNdArray that works with both CUDA and OpenCL. It will be usable from python, C and possibly other languages.

Apologies, all I can give you today is a pointer to the accepted papers for Big Learning, Day 1, first paper, which promises a PDF soon.

I didn’t check the PDF link yesterday when I saw it. My bad.

Anyway, there are a lot of other interesting papers at this site and I will update this entry when this paper appears. The conference is December 16-17, 2011 so it may not be too long of a wait.

R 2.14.0 (Great Pumpkin) Released!

Tuesday, November 29th, 2011

R 2.14.0 Released! from One R Tip A Day.

A new version of R is out and this post points to tips to make your upgrade easier.

In case you are wondering what is included in the new release, see NEWS. (At the CRAN repository at r-project.) It’s too large to reproduce here.

Financial Data Analysis and Modeling with R (AMATH 542)

Tuesday, November 29th, 2011

Financial Data Analysis and Modeling with R (AMATH 542)

From the webpage:

This course is an in-depth hands-on introduction to the R statistical programming language for computational finance. The course will focus on R code and code writing, R packages, and R software development for statistical analysis of financial data including topics on factor models, time series analysis, and portfolio analytics.

Topics include:

  • The R Language. Syntax, data types, resources, packages and history
  • Graphics in R. Plotting and visualization
  • Statistical analysis of returns. Fat-tailed skewed distributions, outliers, serial correlation
  • Financial time series modeling. Covariance matrices, AR, VecAR
  • Factor models. Linear regression, LS and robust fits, test statistics, model selection
  • Multidimensional models. Principal components, clustering, classification
  • Optimization methods. QP, LP, general nonlinear
  • Portfolio optimization. Mean-variance optimization, out-of-sample back testing
  • Bootstrap methods. Non-parametric, parametric, confidence intervals, tests
  • Portfolio analytics. Performance and risk measures, style analysis

A quick summary:

Status: Open
Start Date: 1/4/2012
End Date: 3/19/2012
Credits: 4 Credits
Learning Format: Online
Location: Web
Cost: $3,300

Particularly if your employer is paying for it, this might be a good way to pick up some R skills for financial data work. And R will be useful if you want to mine financial data for topic map purposes. Although, transparency and finance aren’t two concepts that occur together very often. In my experience, setting disclosure requirements means people can walk as close to the disclosure line as they dare.

In other words, disclosure requirements function as disclosure limits, with the really interesting stuff just on the other side of the line.

“Yoda Conditions”, “Pokémon Exception Handling” and other programming classics

Tuesday, November 29th, 2011

“Yoda Conditions”, “Pokémon Exception Handling” and other programming classics

Topic maps haven’t matured enough to have mainstream idioms pass into their usage.

But, when I saw:

Shrug Report – a bug report with no error message or repro steps and only a vague description of the problem. Usually contains the phrase “doesn’t work.”

I knew Lars Marius would appreciate at least some of these and that’s close enough to relevance. 😉

Interesting papers coming up at NIPS’11

Monday, November 28th, 2011

Interesting papers coming up at NIPS’11

Yaroslav Bulatov has tracked down papers that have been accepted for NIPS’11. Not abstracts or summaries but the actual papers.

Well worth a visit to take advantage of his efforts.

While looking at the NIPS’11 site (will post that tomorrow) I ran across a paper on a proposal for a “…array/matrix/n-dimensional base object implementations for GPUs.” Will post that tomorrow as well.

New Insights from Text Analytics

Monday, November 28th, 2011

New Insights from Text Analytics by Themos Kalafatis.

From the post:

“I have been trying repeatedly to solve my billing problem through customer care. I first talked with someone called Mrs Jane Doe. She said she should transfer my call to another representative from the sales department. Yet another rep from the sales department informed me that i should be talking with the Billing department instead. Unfortunately my bad experience of being transferred through various representatives was not over because the Billing department informed me that i should speak to the……”

Currently Text Analytics software will identify key elements of the above text but a very important piece of information goes unnoticed. It is the sequence of events which takes place:

(Jane Doe => Sales Dept =>Billing Dept =>…)

Is your software capturing sequences?

If not, how would you go about doing it?
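
One hedged sketch of an answer: if the hand-off points follow simple surface patterns, a single regex pass can recover the ordering. A production system would use a trained named-entity recognizer; the pattern below is an illustration only.

```python
import re

# Hypothetical patterns for hand-off points in a complaint narrative:
# a person's name after an honorific, or a capitalized department.
HANDOFF = re.compile(
    r"(Mrs?\.?\s+[A-Z][a-z]+\s+[A-Z][a-z]+|[A-Z][a-z]+\s+[Dd]epartment)")

def handoff_sequence(text):
    """Return the ordered sequence of people/departments mentioned."""
    seq = []
    for match in HANDOFF.finditer(text):
        item = match.group(1)
        if not seq or seq[-1] != item:   # collapse immediate repeats
            seq.append(item)
    return seq

complaint = ("I first talked with Mrs Jane Doe. She transferred my call to the "
             "Sales department. The Sales department said I should speak with "
             "the Billing department instead.")
print(" => ".join(handoff_sequence(complaint)))
# Mrs Jane Doe => Sales department => Billing department
```

The sequence, once captured, is exactly the kind of ordered association a topic map would need to represent.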

And once captured, how do you represent it in a topic map?

PS: I would have isolated more segments in the sequence. How about you?

Oyster Software On Sourceforge!

Monday, November 28th, 2011

Some of you may recall my comments on Oyster: A Configurable ER Engine, a configurable entity resolution engine.

Software wasn’t available when that post was written but it is now, along with work on a GUI for the software.

Oyster Entity Resolution (SourceForge).

BTW, the “complete” download does not include the GUI.

It is important to also download the GUI for two reasons:

1) It is the only documentation for the project, and

2) The GUI generates the XML files needed to use the Oyster software.

There is no documentation of the XML format, such as a schema (I asked).

Contributing a schema to the project would be a nice thing to do.


Vectorwise – Worth a Look?

Monday, November 28th, 2011

I ran across the review Vectorwise – Worth a Look? by Steve Miller and was wondering if anyone else had seen it or reviewed Vectorwise?

His admittedly incomplete evaluation found:

The results I’ve gathered so far for my admittedly non-strenuous tests are nonetheless encouraging. My first experiment was loading a 600,000 row fact table with nine small star lookups from csv files. The big table load takes two seconds and all the queries I’ve attempted, even ones with a six table join, complete in a second or two.

The second test involved loading a 10 million+ row, four attribute, csv stock performance data set along with a 3000 record lookup. The big table imports in 7 seconds. Group-by queries that join to the lookup complete in under three seconds.

There is a 30-day free trial version for Windows.

BTW, can anyone forward me the technical white paper on Vectorwise? The last white paper I signed up for was a marketing document and not even a very good one of those.

3D Ecosystem Globe Grows on #cop17 Tweets

Monday, November 28th, 2011

CNN Ecosphere: 3D Ecosystem Globe Grows on #cop17 Tweets from information aesthetics

From the post:

CNN's Ecosphere, by Minivegas and Stinkdigital, is a real-time Twitter visualization that aims to reveal how the online discussion is evolving around the topic of climate change. More specifically, the visualization aggregates all Twitter messages on the topic of #cop17 (in case you wonder, this is an abbreviation for "The 17th Conference of the Parties (COP17) to the United Nations Framework Convention on Climate Change (UNFCCC)").

The online visualization consists of an interactive 3D globe, described as a “lush digital ecosystem” that closely resembles the look and behavior of real plants and trees in nature. In practice, the virtual plants in the 3D Ecosphere grow from those tweets that are tagged with #COP17. Each tweet about climate change feeds into a plant representing that specific topic or discussion, causing it to grow a little more.

Don't know that the DoD (US) would be interested in that level of transparency, but one can imagine such an info-graphic that tracks inventory by service, item, etc. and shows the last point at which inventory was accounted for. Would give the GAO a place to start looking for some of it.

What is a Dashboard?

Monday, November 28th, 2011

What is a Dashboard? – Defining dashboards, visual analysis tools and other data presentation media by Alexander ‘Sandy’ Chiang.

From the post:

To reiterate, there are typically four types of presentation media: dashboards, visual analysis tools, scorecards, and reports. These are all visual representations of data that help people identify correlations, trends, outliers (anomalies), patterns, and business conditions. However, they all have their own unique attributes.

What do you think? Are there four for business purposes or do other domains offer more choices? If so, how would you distinguish them from those defined here?

Just curious. I can imagine flogging one of these to a business client who was choosing based on experience with these four choices. Hard to choose what you have not seen. But beyond that, say in government circles, do these hold true?

3 surprising facts about the computation of scalar products

Monday, November 28th, 2011

3 surprising facts about the computation of scalar products by Daniel Lemire.

From the post:

The speed of many algorithms depends on how quickly you can multiply matrices or compute distances. In turn, these computations depend on the scalar product. Given two arrays such as (1,2) and (5,3), the scalar product is the sum of products 1 × 5 + 2 × 3. We have strong incentives to compute the scalar product as quickly as possible.

Sorry, can’t tell you the three things because that would ruin the surprise. 😉 See Daniel’s blog for the details.
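
Without spoiling Daniel's three facts: the definition itself is two lines of code, and even a naive version illustrates a well-known wrinkle, namely that floating-point accumulation order matters.

```python
import math

def dot(a, b):
    """Naive scalar product: sum of pairwise products."""
    return sum(x * y for x, y in zip(a, b))

print(dot([1, 2], [5, 3]))   # 1*5 + 2*3 = 11

# Summing a tiny term between two huge cancelling terms loses it in
# naive left-to-right accumulation; math.fsum tracks the lost bits
# and returns a correctly rounded result.
a = [1e16, 1.0, -1e16]
b = [1.0, 1.0, 1.0]
print(dot(a, b))                               # naive: 0.0
print(math.fsum(x * y for x, y in zip(a, b)))  # fsum: 1.0
```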

Parsing Wikipedia Articles: Wikipedia Extractor and Cloud9

Monday, November 28th, 2011

Parsing Wikipedia Articles: Wikipedia Extractor and Cloud9 by Ryan Rosario.

From the post:

Lately I have been doing a lot of work with the Wikipedia XML dump as a corpus. Wikipedia provides a wealth of information to researchers in easy-to-access formats including XML, SQL and HTML dumps for all language properties. Some of the data freely available from the Wikimedia Foundation include

  • article content and template pages
  • article content with revision history (huge files)
  • article content including user pages and talk pages
  • redirect graph
  • page-to-page link lists: redirects, categories, image links, page links, interwiki etc.
  • image metadata
  • site statistics

The above resources are available not only for Wikipedia, but for other Wikimedia Foundation projects such as Wiktionary, Wikibooks and Wikiquotes.

All of that is available, but it lacks any consistent use of syntax. Ryan stumbles upon Wikipedia Extractor, which has pluses and minuses, an example of the latter being that it is really slow. Things look up for Ryan when he is reminded of Cloud9, which is designed for a MapReduce environment.

Read the post to see how things turned out for Ryan using Cloud9.

Depending on your needs, Wikipedia URLs are a start on subject identifiers, although you will probably need to create some for your particular domain.
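
If all you need is titles and raw article text, the streaming pattern is simple enough to sketch. A hedged Python example (not Ryan's setup, which used Wikipedia Extractor and Cloud9; real dumps are namespaced and enormous):

```python
import io
import xml.etree.ElementTree as ET

def _local(tag):
    """Strip any XML namespace: '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]

def iter_pages(xml_file):
    """Stream (title, text) pairs from a MediaWiki-style XML dump,
    clearing each <page> after yielding so memory use stays flat."""
    for _, elem in ET.iterparse(xml_file, events=("end",)):
        if _local(elem.tag) == "page":
            title = text = None
            for child in elem.iter():
                if _local(child.tag) == "title":
                    title = child.text
                elif _local(child.tag) == "text":
                    text = child.text
            yield title, text
            elem.clear()   # free memory as we go

# Toy stand-in for a dump fragment; a real dump is far larger.
sample = io.BytesIO(
    b"<mediawiki>"
    b"<page><title>Topic map</title><revision><text>A topic map is..."
    b"</text></revision></page>"
    b"<page><title>XML</title><revision><text>XML is..."
    b"</text></revision></page>"
    b"</mediawiki>")

print([title for title, _ in iter_pages(sample)])
# ['Topic map', 'XML']
```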

Surrogate Learning

Monday, November 28th, 2011

Surrogate Learning – From Feature Independence to Semi-Supervised Classification by Sriharsha Veeramachaneni and Ravi Kumar Kondadadi.


We consider the task of learning a classifier from the feature space $X$ to the set of classes $Y = \{0, 1\}$, when the features can be partitioned into class-conditionally independent feature sets $X_1$ and $X_2$. We show that the class-conditional independence can be used to represent the original learning task in terms of 1) learning a classifier from $X_2$ to $X_1$ (in the sense of estimating the probability $P(x_1|x_2)$) and 2) learning the class-conditional distribution of the feature set $X_1$. This fact can be exploited for semi-supervised learning because the former task can be accomplished purely from unlabeled samples. We present experimental evaluation of the idea in two real world applications.

The two “real world” applications are ones you are likely to encounter:


Our problem consisted of merging each of ≈ 20,000 physician records, which we call the update database, to the record of the same physician in a master database of ≈ $10^6$ records.

Our old friends record linkage and entity resolution. The solution depends upon a clever choice of features for application of the technique. (The thought occurs to me that a repository of data analysis snippets for particular techniques would be as valuable, if not more so, than the techniques themselves. Techniques come and go. Data analysis and the skills it requires goes on and on.)


Sentence classification is often a preprocessing step for event or relation extraction from text. One of the challenges posed by sentence classification is the diversity in the language for expressing the same event or relationship. We present a surrogate learning approach to generating paraphrases for expressing the merger-acquisition (MA) event between two organizations in financial news. Our goal is to find paraphrase sentences for the MA event from an unlabeled corpus of news articles, that might eventually be used to train a sentence classifier that discriminates between MA and non-MA sentences. (Emphasis added. This is one of the issues in the legal track at TREC.)

This test was against 700000 financial news records.

Both tests were quite successful.

Surrogate learning looks interesting for a range of NLP applications.
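
A sketch of why $P(x_1|x_2)$, which needs only unlabeled data, carries class information (this is just the law of total probability, not the paper's full derivation): since $x_1$ and $x_2$ are class-conditionally independent,

$$P(x_1 \mid x_2) = \sum_{y \in \{0,1\}} P(x_1 \mid y)\, P(y \mid x_2).$$

If $x_1$ is binary and the class-conditional rates $P(x_1 = 1 \mid y)$ can be estimated from a small labeled sample (and differ between classes), this mixture can be inverted:

$$P(y = 1 \mid x_2) = \frac{P(x_1 = 1 \mid x_2) - P(x_1 = 1 \mid y = 0)}{P(x_1 = 1 \mid y = 1) - P(x_1 = 1 \mid y = 0)},$$

so a regression of $x_1$ on $x_2$ over unlabeled data doubles as a classifier for $y$.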

Template-Based Information Extraction without the Templates

Monday, November 28th, 2011

Template-Based Information Extraction without the Templates by Nathanael Chambers and Dan Jurafsky.


Standard algorithms for template-based information extraction (IE) require predefined template schemas, and often labeled data, to learn to extract their slot fillers (e.g., an embassy is the Target of a Bombing template). This paper describes an approach to template-based IE that removes this requirement and performs extraction without knowing the template structure in advance. Our algorithm instead learns the template structure automatically from raw text, inducing template schemas as sets of linked events (e.g., bombings include detonate, set off, and destroy events) associated with semantic roles. We also solve the standard IE task, using the induced syntactic patterns to extract role fillers from specific documents. We evaluate on the MUC-4 terrorism dataset and show that we induce template structure very similar to hand-created gold structure, and we extract role fillers with an F1 score of .40, approaching the performance of algorithms that require full knowledge of the templates.

Can you say association?

Definitely points towards a pipeline approach to topic map authoring. To abuse the term, perhaps a “dashboard” that allows selection of data sources followed by the construction of workflows with preliminary analysis being displayed at “breakpoints” in the processing. No particular reason why stages have to be wired together other than tradition.

Just looking a little bit into the future, imagine that some entities weren't being recognized at a high enough rate. So you shift that part of the data to several thousand human entity processors and take the average of their results, which is higher than what you were getting, and feed that back into the system. You could have knowledge workers who work full time but shift from job to job, performing tasks too difficult to program effectively.
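
The workflow idea can be made concrete in miniature. Everything here is hypothetical (stage names, hook signature); the point is only that stages as plain functions plus an inspection hook give you "breakpoints" without any traditional hard wiring between stages:

```python
def run_pipeline(data, stages, breakpoint_hook=None):
    """Run data through (name, function) stages, calling
    breakpoint_hook(name, intermediate) after each one so a human
    (or dashboard) can inspect, correct, or veto intermediate results."""
    for name, stage in stages:
        data = stage(data)
        if breakpoint_hook is not None:
            data = breakpoint_hook(name, data)
    return data

# Hypothetical stages for a toy extraction workflow.
stages = [
    ("tokenize", lambda text: text.split()),
    ("lowercase", lambda toks: [t.lower() for t in toks]),
    ("dedupe", lambda toks: sorted(set(toks))),
]

def inspect(name, value):
    print(f"after {name}: {value}")
    return value   # a real hook could route hard cases to human workers

result = run_pipeline("Bombing at Embassy bombing", stages, inspect)
print(result)   # ['at', 'bombing', 'embassy']
```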

6th International Symposium on Intelligent Distributed Computing – IDC 2012

Sunday, November 27th, 2011

6th International Symposium on Intelligent Distributed Computing – IDC 2012

Important Dates:

Full paper submission: April 10, 2012
Notification of acceptance: May 10, 2012
Final (camera ready) paper due: June 1, 2012
Symposium: September 24-26, 2012

From the call for papers:

Intelligent computing covers a hybrid palette of methods and techniques derived from classical artificial intelligence, computational intelligence, multi-agent systems, and others. Distributed computing studies systems that contain loosely-coupled components running on different networked computers and that communicate and coordinate their actions by message transfer. The emergent field of intelligent distributed computing is expected to pose special challenges of adaptation and fruitful combination of results of both areas, with a great impact on the development of new-generation intelligent distributed information systems. The aim of this symposium is to bring together researchers involved in intelligent distributed computing to allow cross-fertilization and synergy of ideas and to enable advancement of research in the field.

The symposium welcomes submissions of original papers concerning all aspects of intelligent distributed computing, ranging from concepts and theoretical developments to advanced technologies and innovative applications. Paper acceptance and publication will be judged based on relevance to the symposium theme, clarity of presentation, originality and accuracy of results and proposed solutions.