Archive for the ‘Probalistic Models’ Category

Inferring Social Rank in…

Wednesday, March 13th, 2013

Inferring Social Rank in an Old Assyrian Trade Network by David Bamman, Adam Anderson, Noah A. Smith.

Abstract:

We present work in jointly inferring the unique individuals as well as their social rank within a collection of letters from an Old Assyrian trade colony in K\”ultepe, Turkey, settled by merchants from the ancient city of Assur for approximately 200 years between 1950-1750 BCE, the height of the Middle Bronze Age. Using a probabilistic latent-variable model, we leverage pairwise social differences between names in cuneiform tablets to infer a single underlying social order that best explains the data we observe. Evaluating our output with published judgments by domain experts suggests that our method may be used for building informed hypotheses that are driven by data, and that may offer promising avenues for directed research by Assyriologists.

An example of how digitization of ancient texts enables research other than text searching.

Inferring identity and social rank may be instructive for creation of topic maps from both ancient and modern data sources.

I first saw this in a tweet by Stefano Bertolo.

Dr. Watson?

Wednesday, December 7th, 2011

I got up thinking that there needs to be a project for automated authoring of a topic map and the name, Dr. Watson suddenly occurred to me. After all, Dr. Watson was Sherlock Holmes’ sidekick so it would not be like saying it could stand on its own. Plus there would be some name recognition and/or confusion with the real Dr. Watson, or rather imaginary Dr. Watson of Sherlock Holmes’ fame.

And there would be confusion with the Dr. Watson that is the internal debugger for Windows (MS, I never can remember if the ™ goes on Windows or MS. Not that anyone else would want to call themselves MS. ;-) ) Plus the Watson research center at IBM.

Well, I suspect being an automated, probabilistic topic map authoring system will be enough to distinguish it from the foregoing uses.

Any others that I should be aware of?

I say probabilistic because even with the TMDM’s string matching on URIs, it is only probable that two or more topics actually represent the same topic. It is always possible that a URI has been incorrectly used to identity the subject that a topic represents. And in such cases, the error perpetuates itself across a topic map.

So we start off with the realization that even string matching results in a probability of less than 1.0 (where 1.0 is absolute certainty) that two or more topics represent the same subject.

Since we are already being probabilistic, why not be openly so?

But, before we get into the weeds and details, the project has to have a cool name. (As in not an acronym that is cool and we make up a long name to fit the acronym.)

All those in favor of Dr. Watson, please signify by raising your hands (or the beer you are holding).

More to follow.

Apache Flume incubation wiki

Sunday, October 2nd, 2011

Apache Flume incubation wiki

From the website:

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its main goal is to deliver data from applications to Apache Hadoop’s HDFS. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. The system is centrally managed and allows for intelligent dynamic management. It uses a simple extensible data model that allows for online analytic applications.

A number of resources for Flume.

Will “data flows” as the dominant means of accessing data be a consequence of an environment where a “local copy” of data is no longer meaningful or an enabler of such an environment? Or both?

I think topic maps would do well to develop models for streaming and perhaps probabilistic merging or even probabilistic creation of topics/associations from data streams. Static structures only give the appearance of certainty.

SIGKDD 2011 Conference

Tuesday, September 6th, 2011

A pair of posts from Ryan Rosario on the SIGKDD 2011 Conference.

Day 1 (Graph Mining and David Blei/Topic Models)

Tough sledding on Probabilistic Topic Models but definitely worth the effort to follow.

Days 2/3/4 Summary

Useful summaries and pointers to many additional resources.

If you attended SIGKDD 2011, do you have pointers to other reviews of the conference or other resources?

I added a category for SIGKDD.

Probability Primer – Mathematicalmonk

Monday, August 8th, 2011

Probability Primer – Mathematicalmonk

From the description:

A series of videos giving an introduction to the definitions, notation, and basic concepts one would encounter in a 1st year graduate probability course.

More math videos from the Mathematicalmonk.

I don’t think a presentation style can be “copied” but surely there are lessons we can take from it to apply in other venues.

Suggestions?

Machine Learning and Probabilistic Graphical Models

Thursday, May 12th, 2011

Machine Learning and Probabilistic Graphical Models

From the website:

Instructor: Sargur Srihari Department of Computer Science and Engineering, University at Buffalo

Machine learning is an exciting topic about designing machines that can learn from examples. The course covers the necessary theory, principles and algorithms for machine learning. The methods are based on statistics and probability– which have now become essential to designing systems exhibiting artificial intelligence. The course emphasisizes Bayesian techniques and probabilistic graphical models (PGMs). The material is complementary to a course on Data Mining where statistical concepts are used to analyze data for human, rather than machine, use.

The textbooks for different parts of the course are “Pattern Recognition and Machine Learning” by Chris Bishop (Springer 2006) and “Probabilistic Graphical Models” by Daphne Koller and Nir Friedman (MIT Press 2009).

Lecture slides and some videos of lectures.

Probabilistic Models in the Study of Language

Monday, April 11th, 2011

Probabilistic Models in the Study of Language

From the website:

I’m in the process of writing a textbook on the topic of using probabilistic models in scientific work on language ranging from experimental data analysis to corpus work to cognitive modeling. A current (partial) draft is available here. The intended audience is graduate students in linguistics, psychology, cognitive science, and computer science who are interested in using probabilistic models to study language. Feedback (both comments on existing drafts, and expressed desires for additional material to include!) is more than welcome — send it to rlevy@ucsd.edu.

Just scanning the chapters titles this looks like a useful work for anyone concerned with language issues.

Comparative Study of Probabilistic Logic Languages and Systems

Thursday, April 7th, 2011

Comparative Study of Probabilistic Logic Languages and Systems

This was mentioned in a response to Chris Diehl’s post in his series.

Good source of software/information.

(I checked, all the links work. That is something these days.)

Toward Topic Search on the Web – Paper

Tuesday, March 8th, 2011

Toward Topic Search on the Web was reported by Infodocket.com.

Report from Microsoft researchers on “…framework that improves web search experiences through the use of a probabilistic knowledge base.”

Interesting report.

Even more so if you think about topic maps as declarative knowledge bases and consider the use of probabilistic knowledge bases as a means to add to the former.

BTW, user satisfaction was used as the criteria for success.

Now that is local semantics.

Probably at just about the right level as well.

Comments?

Probability and Computing

Friday, January 7th, 2011

Probability and Computing

Lecture notes by Ryan O’Donnell for his Fall 2009 course at Carnegie Mellon University.

Probability is an important topic (sorry) in a number of CS areas.

Probabilistic merging is why I mention it here but understanding it more broadly will be useful in other CS areas as well.

A weighted similarity measure for non-conjoint rankings – Post

Monday, December 27th, 2010

A weighted similarity measure for non-conjoint rankings updates the recently published A Similarity Measure for Indefinite Rankings
and provides an implementation.

Abstract from the article:

Ranked lists are encountered in research and daily life, and it is often of interest to compare these lists, even when they are incomplete or have only some members in common. An example is document rankings returned for the same query by different search engines. A measure of the similarity between incomplete rankings should handle non-conjointness, weight high ranks more heavily than low, and be monotonic with increasing depth of evaluation; but no measure satisfying all these criteria currently exists. In this article, we propose a new measure having these qualities, namely rank-biased overlap (RBO). The RBO measure is based on a simple probabilistic user model. It provides monotonicity by calculating, at a given depth of evaluation, a base score that is non-decreasing with additional evaluation, and a maximum score that is non-increasing. An extrapolated score can be calculated between these bounds if a point estimate is required. RBO has a parameter which determines the strength of the weighting to top ranks. We extend RBO to handle tied ranks and rankings of different lengths. Finally, we give examples of the use of the measure in comparing the results produced by public search engines, and in assessing retrieval systems in the laboratory.

There are also some comments about journal publishing delays that should serve as fair warning to other authors.