Archive for the ‘Prediction’ Category

Eight Years of the Republican Weekly Address

Wednesday, January 4th, 2017

We looked at eight years of the Republican Weekly Address by Jesse Rifkin.

From the post:

Every week since Ronald Reagan started the tradition in 1982, the president delivers a weekly address. And every week, the opposition party delivers an address as well.

What can the Weekly Republican Addresses during the Obama era reveal about how the GOP has attempted to portray themselves to the American public, by the public policy topics they discussed and the speakers they featured? To find out, GovTrack Insider analyzed all 407 Weekly Republican Addresses for which we could find data during the Obama era, the first such analysis of the weekly addresses as best we can tell. (See the full list of weekly addresses here.)

Sometimes they discuss the same topic as the president’s weekly address — particularly common if a noteworthy event occurs in the news that week — although other times it’s on an unrelated topic of the party’s choosing. It also features a rotating cast of Republicans delivering the speech, most of them congressional, unlike the White House which has almost always featured President Obama, with Vice President Joe Biden occasionally subbing in.

On the issues, we found that Republicans have almost entirely refrained from discussing such inflammatory social issues as abortion, guns, or same-sex marriage in their weekly addresses, despite how animating such issues are to their base. They also were remarkably silent on Donald Trump until the week before the election.

We also find that while Republicans often get slammed on women’s rights and minority issues, Republican congressional women and African Americans are at least proportionally represented in the weekly addresses, compared to their proportions in Congress, if not slightly over-represented — but Hispanics are notably under-represented.

You have seen credible claims of On Predicting Social Unrest Using Social Media by Rostyslav Korolov, et al., and less credible claims from others, CIA claims it can predict some social unrest up to 5 days ahead.

Rumor has it that the CIA has a Word template named, appropriately enough: theRussiansDidIt. I can neither confirm nor deny that rumor.

Taking credible actors at their word, are you aware of any parallel research on weekly addresses by Congress and following congressional action?

A very lite skimming of the literature on predicting Supreme Court decisions comes up with: Competing Approaches to Predicting Supreme Court Decision Making by Andrew D. Martin, Kevin M. Quinn, Theodore W. Ruger, and Pauline T. Kim (2004), Algorithm predicts US Supreme Court decisions 70% of time by David Kravets (2014), Fantasy Scotus (a Supreme Court fantasy league with cash prizes).

Congressional voting has been studied as well, for instance, Predicting Congressional Voting – Social Identification Trumps Party. (Now there’s an unfortunate headline for searchers.)

Congressional votes are important but so is the progress of bills, the order in which issues are addressed, etc., and it the reflection of those less formal aspects in weekly addresses from congress that could be interesting.

The weekly speeches may be as divorced from any shared reality as comments inserted in the Congressional Record. On the other hand, a partially successful model, other than the timing of donations, may be possible.

Outbrain Challenges the Research Community with Massive Data Set

Sunday, November 13th, 2016

Outbrain Challenges the Research Community with Massive Data Set by Roy Sasson.

From the post:

Today, we are excited to announce the release of our anonymized dataset that discloses the browsing behavior of hundreds of millions of users who engage with our content recommendations. This data, which was released on the Kaggle platform, includes two billion page views across 560 sites, document metadata (such as content categories and topics), served recommendations, and clicks.

Our “Outbrain Challenge” is a call out to the research community to analyze our data and model user reading patterns, in order to predict individuals’ future content choices. We will reward the three best models with cash prizes totaling $25,000 (see full contest details below).

The sheer size of the data we’ve released is unprecedented on Kaggle, the competition’s platform, and is considered extraordinary for such competitions in general. Crunching all of the data may be challenging to some participants—though Outbrain does it on a daily basis.

The rules caution:

The data is anonymized. Please remember that participants are prohibited from de-anonymizing or reverse engineering data or combining the data with other publicly available information.

That would be a more interesting question than the ones presented for the contest.

After the 2016 U.S. presidential election we know that racists, sexists, nationalists, etc., are driven by single factors so assuming you have good tagging, what’s the problem?


Or is human behavior is not only complex but variable?

Good luck!

Election Prediction and STEM [Concealment of Bias]

Wednesday, September 28th, 2016

Election Prediction and STEM by Sheldon H. Jacobson.

From the post:

Every U.S. presidential election attracts the world’s attention, and this year’s election will be no exception. The decision between the two major party candidates, Hillary Clinton and Donald Trump, is challenging for a number of voters; this choice is resulting in third-party candidates like Gary Johnson and Jill Stein collectively drawing double-digit support in some polls. Given the plethora of news stories about both Clinton and Trump, November 8 cannot come soon enough for many.

In the Age of Analytics, numerous websites exist to interpret and analyze the stream of data that floods the airwaves and newswires. Seemingly contradictory data challenges even the most seasoned analysts and pundits. Many of these websites also employ political spin and engender subtle or not-so-subtle political biases that, in some cases, color the interpretation of data to the left or right.

Undergraduate computer science students at the University of Illinois at Urbana-Champaign manage Election Analytics, a nonpartisan, easy-to-use website for anyone seeking an unbiased interpretation of polling data. Launched in 2008, the site fills voids in the national election forecasting landscape.

Election Analytics lets people see the current state of the election, free of any partisan biases or political innuendos. The methodologies used by Election Analytics include Bayesian statistics, which estimate the posterior distributions of the true proportion of voters that will vote for each candidate in each state, given both the available polling data and the states’ previous election results. Each poll is weighted based on its age and its size, providing a highly dynamic forecasting mechanism as Election Day approaches. Because winning a state translates into winning all the Electoral College votes for that state (with Nebraska and Maine using Congressional districts to allocate their Electoral College votes), winning by one vote or 100,000 votes results in the same outcome in the Electoral College race. Dynamic programming then uses the posterior probabilities to compile a probability mass function for the Electoral College votes. By design, Election Analytics cuts through the media chatter and focuses purely on data.

If you have ever taken a social science methodologies course then you know:

Election Analytics lets people see the current state of the election, free of any partisan biases or political innuendos.

is as false as anything uttered by any of the candidates seeking nomination and/or the office of the U.S. presidency since January 1, 2016.

It’s an annoying conceit when you realize that every poll is biased, however clean the subsequent number crunching of the numbers may be.

Bias one step removed isn’t the absence of bias, but the concealment of bias.

Bias? What Bias? We’re Scientific!

Monday, May 23rd, 2016

This ProPublica story by Julia Angwin, Jeff Larson, Surya Mattu and Lauren Kirchner, isn’t short but it is worth your time to not only read, but to download the data and test their analysis for yourself.

Especially if you have the mis-impression that algorithms can avoid bias. Or that clients will apply your analysis with the caution that it deserves.

Finding a bias in software, like finding a bug, is a good thing. But that’s just one, there is no estimate of how many others may exist.

And as you will find, clients may not remember your careful explanation of the limits to your work. Or apply it in ways you don’t anticipate.

Machine Bias – There’s software used across the country to predict future criminals. And it’s biased against blacks.

Here’s the first story to try to lure you deeper into this study:

ON A SPRING AFTERNOON IN 2014, Brisha Borden was running late to pick up her god-sister from school when she spotted an unlocked kid’s blue Huffy bicycle and a silver Razor scooter. Borden and a friend grabbed the bike and scooter and tried to ride them down the street in the Fort Lauderdale suburb of Coral Springs.

Just as the 18-year-old girls were realizing they were too big for the tiny conveyances — which belonged to a 6-year-old boy — a woman came running after them saying, “That’s my kid’s stuff.” Borden and her friend immediately dropped the bike and scooter and walked away.

But it was too late — a neighbor who witnessed the heist had already called the police. Borden and her friend were arrested and charged with burglary and petty theft for the items, which were valued at a total of $80.

Compare their crime with a similar one: The previous summer, 41-year-old Vernon Prater was picked up for shoplifting $86.35 worth of tools from a nearby Home Depot store.

Prater was the more seasoned criminal. He had already been convicted of armed robbery and attempted armed robbery, for which he served five years in prison, in addition to another armed robbery charge. Borden had a record, too, but it was for misdemeanors committed when she was a juvenile.

Yet something odd happened when Borden and Prater were booked into jail: A computer program spat out a score predicting the likelihood of each committing a future crime. Borden — who is black — was rated a high risk. Prater — who is white — was rated a low risk.

Two years later, we know the computer algorithm got it exactly backward. Borden has not been charged with any new crimes. Prater is serving an eight-year prison term for subsequently breaking into a warehouse and stealing thousands of dollars’ worth of electronics.

This analysis demonstrates that malice isn’t required for bias to damage lives. Whether the biases are in software, in its application, in the interpretation of its results, the end result is the same, damaged lives.

I don’t think bias in software is avoidable but here, here no one was even looking.

What role do you think budget justification/profit making played in that blindness to bias?

Big Data To Identify Rogue Employees (Who To Throw Under The Bus)

Thursday, April 9th, 2015

Big Data Algorithm Identifies Rogue Employees by Hugh Son.

From the post:

Wall Street traders are already threatened by computers that can do their jobs faster and cheaper. Now the humans of finance have something else to worry about: Algorithms that make sure they behave.

JPMorgan Chase & Co., which has racked up more than $36 billion in legal bills since the financial crisis, is rolling out a program to identify rogue employees before they go astray, according to Sally Dewar, head of regulatory affairs for Europe, who’s overseeing the effort. Dozens of inputs, including whether workers skip compliance classes, violate personal trading rules or breach market-risk limits, will be fed into the software.

“It’s very difficult for a business head to take what could be hundreds of data points and start to draw any themes about a particular desk or trader,” Dewar, 46, said last month in an interview. “The idea is to refine those data points to help predict patterns of behavior.”

Sounds worthwhile until you realize that $36 billion in legal bills “since the financial crisis” covers a period of seven (7) years, which works out to be about $5 billion per year. Considering that net revenue for 2014 was $21.8 billion, after deducting legal bills, they aren’t doing too badly. 2014 Annual Report

Hugh raises the specter of The Minority Report in terms of predicting future human behavior. True enough but much more likely to discover cues that resulted in prior regulatory notice with cautions to employees to avoid those “tells.” If the trainer reviews three (3) real JPMorgan Chase cases and all of them involve note taking and cell phone records (later traced), how bright do you have to be to get clued in?

People who don’t get clued in will either be thrown under the bus during the next legal crisis or won’t be employed at JPMorgan Chase.

If this were really a question of predicting human behavior the usual concerns about fairness, etc. would obtain. I suspect it is simply churn so that JPMorgan Chase appears to be taking corrective action. Some low level players will be outed, like the Walter Mitty terrorists the FBI keeps capturing in its web of informants. (I am mining some data now to collect those cases for a future post.)

It will be interesting to see if Jamie Dimon electronic trail is included as part of the big data monitoring of employees. Bets anyone?

Federal Data Integration: Dengue Fever

Tuesday, April 7th, 2015

The White House issued a press release today (April 7, 2015) titled: FACT SHEET: Administration Announces Actions To Protect Communities From The Impacts Of Climate Change.

That press release reads in part:

Unleashing Data: As part of the Administration’s Predict the Next Pandemic Initiative, in May 2015, an interagency working group co-chaired by OSTP, the CDC, and the Department of Defense will launch a pilot project to simulate efforts to forecast epidemics of dengue – a mosquito-transmitted viral disease affecting millions of people every year, including U.S. travelers and residents of the tropical regions of the U.S. such as Puerto Rico. The pilot project will consolidate data sets from across the federal government and academia on the environment, disease incidence, and weather, and challenge the research and modeling community to develop predictive models for dengue and other infectious diseases based on those datasets. In August 2015, OSTP plans to convene a meeting to evaluate resulting models and showcase this effort as a “proof-of-concept” for similar forecasting efforts for other infectious diseases.

I tried finding more details on earlier workshops in this effort but limiting the search to “Predict the Next Pandemic Initiative” and the domain to “.gov,” I got two “hits.” One of which was the press release I cite above.

I sent a message (webform) to the White House Office of Science and Technology Policy office and will update you with any additional information that arrives.

Of course my curiosity is about the means used to integrate the data sets. Once integrated, such data sets can be re-used, at least until it is time to integrate additional data sets. Bearing in mind that dirty data can lead to poor decision making, I would rather not duplicate the cleaning of data time after time.

Max Kuhn’s Talk on Predictive Modeling

Monday, March 16th, 2015

Max Kuhn’s Talk on Predictive Modeling

From the post:

Max Kuhn, Director of Nonclinical Statistics of Pfizer and also the author of Applied Predictive Modeling joined us on February 17, 2015 and shared his experience with Data Mining with R.

Max is a nonclinical statistician who has been applying predictive models in the diagnostic and pharmaceutical industries for over 15 years. He is the author and maintainer for a number of predictive modeling packages, including: caret, C50, Cubist and AppliedPredictiveModeling. He blogs about the practice of modeling on his website at

Excellent! (You may need to adjust the sound on the video.)

Support your local user group, particularly those generous enough to post videos and slides for their speakers. It makes a real difference to those unable to travel for one reason or another.

I first saw this in a tweet by NYC Data Science.

Want to win $1,000,000,000 (yes, that’s one billion dollars)?

Wednesday, January 22nd, 2014

Want to win $1,000,000,000 (yes, that’s one billion dollars)? by Ann Drobnis.

The offer is one billion dollars for picking the winners of every game in the NCAA men’s basketball tournament in the Spring of 2014.

Unfortunately, none of the news stories I saw had links back to any authentic information from Quicken Loans and Berkshire Hathaway about the offer.

After some searching I found: Win a Billion Bucks with the Quicken Loans Billion Dollar Bracket Challenge by Clayton Closson, on January 21, 2014 on the Quicken Loans blog. (As far as I can tell it is an authentic post on the QL website.)

From that post:

You could be America’s next billionaire if you’re the grand prize winner of the Quicken Loans Billion Dollar Bracket Challenge. You read that right: one billion. Not one million. Not one hundred million. Not five hundred million. One billion U.S. dollars.

All you have to do is pick a perfect tournament bracket for the upcoming 2014 tournament. That’s it. Guess all the winners of all the games correctly, and Quicken Loans, along with Berkshire Hathaway, will make you a billionaire. The official press release is below. The contest starts March 3, 2014, so we’ll soon have all the info on how and when to enter your perfect bracket.

Good luck, my friends. This is your chance to play in perhaps the biggest sweepstakes in U.S. history. It’s your chance for a billion.

Oh, and by the way, the 20 closest imperfect brackets will win a cool hundred grand to put toward their home (or new home). Plus, in conjunction with the sweepstakes, Quicken Loans will donate $1 million to Detroit and Cleveland nonprofits to help with education of inner city youth.

So, to recap: If you’re perfect, you’ll win a billion. If you’re not perfect, you could win $100,000. The entry period begins Monday, March 3, 2014 and runs until Wednesday, March 19, 2014. Stay tuned on how to enter.

Contest updates at:

The odds against winning are absurd but this has all the markings of a big data project. Historical data, current data on the teams and players, models, prior outcomes to test your models, etc.

I wonder if Watson likes basketball?

Mining the Web to Predict Future Events

Saturday, December 28th, 2013

Mining the Web to Predict Future Events by Kira Radinsky and Eric Horvitz.


We describe and evaluate methods for learning to forecast forthcoming events of interest from a corpus containing 22 years of news stories. We consider the examples of identifying significant increases in the likelihood of disease outbreaks, deaths, and riots in advance of the occurrence of these events in the world. We provide details of methods and studies, including the automated extraction and generalization of sequences of events from news corpora and multiple web resources. We evaluate the predictive power of the approach on real-world events withheld from the system.

The paper starts off well enough:

Mark Twain famously said that “the past does not repeat itself, but it rhymes.” In the spirit of this reflection, we develop and test methods for leveraging large-scale digital histories captured from 22 years of news reports from the New York Times (NYT) archive to make real-time predictions about the likelihoods of future human and natural events of interest. We describe how we can learn to predict the future by generalizing sets of specific transitions in sequences of reported news events, extracted from a news archive spanning the years 1986–2008. In addition to the news corpora, we leverage data from freely available Web resources, including Wikipedia, FreeBase, OpenCyc, and GeoNames, via the LinkedData platform [6]. The goal is to build predictive models that generalize from specific sets of sequences of events to provide likelihoods of future outcomes, based on patterns of evidence observed in near-term newsfeeds. We propose the methods as a means of generating actionable forecasts in advance of the occurrence of target events in the world.

But when it gets down to actual predictions, the experiment predicts:

  • Cholera following flooding in Bangladesh.
  • Riots following police shootings in immigrant/poor neighborhoods.

Both are generally true but I don’t need 22 years worth of New York Times (NYT) archives to make those predictions.

Test offers of predictive advice by asking for specific predictions relevant to your enterprise. Also ask long time staff to make their predictions. Compare the predictions.

Unless the automated solution is significantly better, reward the staff and drive on.

I first saw this in Nat Torkington’s Four short links: 26 December 2013.

Intrade Archive: Data for Posterity

Wednesday, April 3rd, 2013

Intrade Archive: Data for Posterity by Panos Ipeirotis.

From the post:

A few years back, I have done some work on prediction markets. For this line of research, we have been collecting data from Intrade, to perform our experimental analysis. Some of the data is available through the Intrade Archive, a web app that I wrote in order to familiarize myself with the Google App Engine.

In the last few weeks, through, after the effective shutdown of Intrade, I started receiving requests on getting access to the data stored in the Intrade Archive. So, after popular demand, I gathered all the data from the Intrade Archive, and also all the past data that I had about all the Intrade contracts going back to 2003, and I put them all on GitHub for everyone to access and download.

If you don’t know about Intrade, see: How Intrade Works.

Not sure why you would need the data but it is unusual enough to merit notice.

…[D]emocratization of modeling, simulations, and predictions

Sunday, January 27th, 2013

Technical engine for democratization of modeling, simulations, and predictions by Justyna Zander and Pieter J. Mosterman. (Justyna Zander and Pieter J. Mosterman. 2012. Technical engine for democratization of modeling, simulations, and predictions. In Proceedings of the Winter Simulation Conference (WSC ’12). Winter Simulation Conference , Article 228 , 14 pages.)


Computational science and engineering play a critical role in advancing both research and daily-life challenges across almost every discipline. As a society, we apply search engines, social media, and selected aspects of engineering to improve personal and professional growth. Recently, leveraging such aspects as behavioral model analysis, simulation, big data extraction, and human computation is gaining momentum. The nexus of the above facilitates mass-scale users in receiving awareness about the surrounding and themselves. In this paper, an online platform for modeling and simulation (M&S) on demand is proposed. It allows an average technologist to capitalize on any acquired information and its analysis based on scientifically-founded predictions and extrapolations. The overall objective is achieved by leveraging open innovation in the form of crowd-sourcing along with clearly defined technical methodologies and social-network-based processes. The platform aims at connecting users, developers, researchers, passionate citizens, and scientists in a professional network and opens the door to collaborative and multidisciplinary innovations. An example of a domain-specific model of a pick and place machine illustrates how to employ the platform for technical innovation and collaboration.

It is an interesting paper but when speaking of integration of models the authors say:

The integration is performed in multiple manners. Multi-domain tools that become accessible from one common environment using the cloud-computing paradigm serve as a starting point. The next step of integration happens when various M&S execution semantics (and models of computation (cf., Lee and Sangiovanni-Vincentelli 1998; Lee 2010) are merged and model transformations are performed.

That went by too quickly for me. You?

The question of effective semantic integration is an important one.

The U.S. federal government publishes enough data to map where some of the dark data is waiting to be found.

The good, bad or irrelevant data churned out every week, makes the amount of effort required an ever increasing barrier to its use by the public.

Perhaps that is by design?

What do you think?

Prediction API – Machine Learning from Google

Tuesday, January 22nd, 2013

Prediction API – Machine Learning from Google by Istvan Szegedi.

From the post:

One of the exciting APIs among the 50+ APIs offered by Google is the Prediction API. It provides pattern matching and machine learning capabilities like recommendations or categorization. The notion is similar to the machine learning capabilities that we can see in other solutions (e.g. in Apache Mahout): we can train the system with a set of training data and then the applications based on Prediction API can recommend (“predict”) what products the user might like or they can categories spams, etc.

In this post we go through an example how to categorize SMS messages – whether they are spams or valuable texts (“hams”).

Nice introduction to Google’s Prediction API.

A use case for topic map authoring would be to route content to appropriate experts for further evaluation.

Data Mining Book Review: Dance with Chance

Sunday, November 4th, 2012

Data Mining Book Review: Dance with Chance by Sandro Saitta.

From the post:

If you ever worked on time series prediction (forecasting), you should read Dance with Chance. It is written by a statistician, a psychologist and a decision scientist (Makriddakis, Hogarth and Gaba). As it is the case in The Numerati or Super Crunchers, authors explain complex notions to a non-expert audience. I find the book really interesting and provocative.

The main concept of Dance with Chance is the “illusion of control”. It is when you think you control a future event or situation, that is in fact mainly due to chance. This is the opposite of fatalism (when you think you have no control, although you have). The book teaches how to avoid being fooled by this illusion of control. This is a very interesting reading for any data miner, particularly involved with forecasting. The books contains dozens of examples of the limitation of forecasting techniques. For example, it explains the issues of forecasting the stock market and when predictions are due to chance. Authors use a brilliant mix of statistics and psychology to prove their point.

From the review this sounds like an interesting read.

Forecasting can be useful but being aware of its limitations is as well.

Predicting what topics will trend on Twitter [Predicting Merging?]

Friday, November 2nd, 2012

Predicting what topics will trend on Twitter

From the post:

Twitter’s home page features a regularly updated list of topics that are “trending,” meaning that tweets about them have suddenly exploded in volume. A position on the list is highly coveted as a source of free publicity, but the selection of topics is automatic, based on a proprietary algorithm that factors in both the number of tweets and recent increases in that number.

At the Interdisciplinary Workshop on Information and Decision in Social Networks at MIT in November, Associate Professor Devavrat Shah and his student, Stanislav Nikolov, will present a new algorithm that can, with 95 percent accuracy, predict which topics will trend an average of an hour and a half before Twitter’s algorithm puts them on the list — and sometimes as much as four or five hours before.

If you can’t attend the Interdisciplinary Workshop on Information and Decision in Social Networks workshop, which has an exciting final program, try Stanislav Nikolov thesis, Trend or No Trend: A Novel Nonparametric Method for Classifying Time Series.


In supervised classification, one attempts to learn a model of how objects map to labels by selecting the best model from some model space. The choice of model space encodes assumptions about the problem. We propose a setting for model specification and selection in supervised learning based on a latent source model. In this setting, we specify the model by a small collection of unknown latent sources and posit that there is a stochastic model relating latent sources and observations. With this setting in mind, we propose a nonparametric classification method that is entirely unaware of the structure of these latent sources. Instead, our method relies on the data as a proxy for the unknown latent sources. We perform classification by computing the conditional class probabilities for an observation based on our stochastic model. This approach has an appealing and natural interpretation — that an observation belongs to a certain class if it sufficiently resembles other examples of that class.

We extend this approach to the problem of online time series classification. In the binary case, we derive an estimator for online signal detection and an associated implementation that is simple, efficient, and scalable. We demonstrate the merit of our approach by applying it to the task of detecting trending topics on Twitter. Using a small sample of Tweets, our method can detect trends before Twitter does 79% of the time, with a mean early advantage of 1.43 hours, while maintaining a 95% true positive rate and a 4% false positive rate. In addition, our method provides the flexibility to perform well under a variety of tradeoffs between types of error and relative detection time.

This will be interesting in many classification contexts.

Particularly predicting what topics a user will say represent the same subject.

BigML creates a marketplace for Predictive Models

Friday, October 26th, 2012

BigML creates a marketplace for Predictive Models by Ajay Ohri.

From the post:

BigML has created a marketplace for selling Datasets and Models. This is a first (?) as the closest market for Predictive Analytics till now was Rapid Miner’s marketplace for extensions (at



You can make your Dataset public. Mind you: the Datasets we are talking about are BigML’s fancy histograms. This means that other BigML users can look at your Dataset details and create new models based on this Dataset. But they can not see individual records or columns or use it beyond the statistical summaries of the Dataset. Your Source will remain private, so there is no possibility of anyone accessing the raw data.


Now, once you have created a great model, you can share it with the rest of the world. For free or at any price you set.Predictions are paid for in BigML Prediction Credits. The minimum price is ‘Free’ and the maximum price indicated is 100 credits.

Having a public, digital marketplace for data and data analysis has been proposed by many and attempted by more than just a few.

Data is bought and sold today, but not by the digital equivalent of small shop keepers. The shop keepers who changed the face of Europe.

Data is bought and sold today by the digital equivalent of the great feudal lords. Complete with castles (read silos).

Will BigML give rise to a new mercantile class?

Or just as importantly, will you be a member of it or bound to the estate of a feudal lord?

Are Expert Semantic Rules so 1980’s?

Monday, October 8th, 2012

In The Geometry of Constrained Structured Prediction: Applications to Inference and Learning of Natural Language Syntax André Martins proposes advances in inferencing and learning for NLP processing. And it is important work for that reason.

But in his introduction to recent (and rapid) progress in language technologies, the following text caught my eye:

So, what is the driving force behind the aforementioned progress? Essentially, it is the alliance of two important factors: the massive amount of data that became available with the advent of the Web, and the success of machine learning techniques to extract statistical models from the data (Mitchell, 1997; Manning and Schötze, 1999; Schölkopf and Smola, 2002; Bishop, 2006; Smith, 2011). As a consequence, a new paradigm has emerged in the last couple of decades, which directs attention to the data itself, as opposed to the explicit representation of knowledge (Abney, 1996; Pereira, 2000; Halevy et al., 2009). This data-centric paradigm has been extremely fruitful in natural language processing (NLP), and came to replace the classic knowledge representation methodology which was prevalent until the 1980s, based on symbolic rules written by experts. (emphasis added)

Are RDF, Linked Data, topic maps, and other semantic technologies caught in a 1980’s “symbolic rules” paradigm?

Are we ready to make the same break that NLP did, what, thirty (30) years ago now?

To get started on the literature, consider André’s sources:

Abney, S. (1996). Statistical methods and linguistics. In The balancing act: Combining symbolic and statistical approaches to language, pages 1–26. MIT Press, Cambridge, MA.

A more complete citation: Steven Abney. Statistical Methods and Linguistics. In: Judith Klavans and Philip Resnik (eds.), The Balancing Act: Combining Symbolic and Statistical Approaches to Language. The MIT Press, Cambridge, MA. 1996. (Link is to PDF of Abney’s paper.)

Pereira, F. (2000). Formal grammar and information theory: together again? Philosophical Transactions of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences, 358(1769):1239–1253.

I added a pointer to the Philosophical Transactions of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences abstract for the article. You can see it at: Formal grammar and information theory: together again? (PDF file).

Halevy, A., Norvig, P., and Pereira, F. (2009). The unreasonable effectiveness of data. Intelligent Systems, IEEE, 24(2):8–12.

I added a pointer to the Intelligent Systems, IEEE abstract for the article. You can see it at: The unreasonable effectiveness of data (PDF file).

The Halevy article doesn’t have an abstract per se but the ACM reports one as:

Problems that involve interacting with humans, such as natural language understanding, have not proven to be solvable by concise, neat formulas like F = ma. Instead, the best approach appears to be to embrace the complexity of the domain and address it by harnessing the power of data: if other humans engage in the tasks and generate large amounts of unlabeled, noisy data, new algorithms can be used to build high-quality models from the data. [ACM]

That sounds like a challenge to me. You?

PS: I saw the pointer to this thesis at Christophe Lalanne’s A bag of tweets / September 2012

Building a “Data Eye in the Sky”

Saturday, September 22nd, 2012

Building a “Data Eye in the Sky” by Erwin Gianchandani.

From the post:

Nearly a year ago, tech writer John Markoff published a story in The New York Times about Open Source Indicators (OSI), a new program by the Federal government’s Intelligence Advanced Research Projects Activity (IARPA) seeking to automatically collect publicly available data, including Web search queries, blog entries, Internet traffic flows, financial market indicators, traffic webcams, changes in Wikipedia entries, etc., to understand patterns of human communication, consumption, and movement. According to Markoff:

It is intended to be an entirely automated system, a “data eye in the sky” without human intervention, according to the program proposal. The research would not be limited to political and economic events, but would also explore the ability to predict pandemics and other types of widespread contagion, something that has been pursued independently by civilian researchers and by companies like Google.

This past April, IARPA issued contracts to three research teams, providing funding potentially for up to three years, with continuation beyond the first year contingent upon satisfactory progress. At least two of these contracts are now public (following the link):

Erwin reviews what is known about programs at Virginia Tech and BBN Technologies.

And concludes with:

Each OSI research team is being required to make a number of warnings/alerts that will be judged on the basis of lead time, or how early the alert was made; the accuracy of the warning, such as the where/when/what of the alert; and the probability associated with the alert, that is, high vs. very high.

To learn more about the OSI program, check out the IARPA website or a press release issued by Virginia Tech.

Given the complexities of semantics, what has my curiosity up is how “warnings/alerts” are going to be judged?

Recalling that “all the lights were blinking red” before 9/11.

If all the traffic lights in the U.S. flashed three (3) times at the same time, without more, it could mean anything from the end of the Mayan calendar to free beer. One just never knows.

Do you have the stats on the oracle at Delphi?

Might be a good baseline for comparison.

Predictive Models: Build once, Run Anywhere

Tuesday, August 21st, 2012

Predictive Models: Build once, Run Anywhere

From the post:

We have released a new version of our open source Python bindings. This new version aims at showing how the BigML API can be used to build predictive models capable of generating predictions locally or remotely. You can get full access to the code at Github and read the full documentation at Read the Docs.

The complete list of updates includes (drum roll, please):

Development Mode

We recently introduced a free sandbox to help developers play with BigML on smaller datasets without being concerned about credits. In the new Python bindings you can use BigML in development mode, and all dataset and models smaller than 1MB can be created for free:

from bigml.api import BigML

api = BigML(dev_mode=True)

A “sandbox” for your machine learning experiments!

Day Nine of a Predictive Coding Narrative: A scary search…

Wednesday, August 8th, 2012

Day Nine of a Predictive Coding Narrative: A scary search for false-negatives, a comparison of my CAR with the Griswold’s, and a moral dilemma by Ralph Losey.

From the post:

In this sixth installment I continue my description, this time covering day nine of the project. Here I do a quality control review of a random sample to evaluate my decision in day eight to close the search.

Ninth Day of Review (4 Hours)

I began by generating a random sample of 1,065 documents from the entire null set (95% +/- 3%) of all documents not reviewed. I was going to review this sample as a quality control test of the adequacy of my search and review project. I would personally review all of them to see if any were False Negatives, in other words, relevant documents, and if relevant, whether any were especially significant or Highly Relevant.

I was looking to see if there were any documents left on the table that should have been produced. Remember that I had already personally reviewed all of the documents that the computer had predicted were like to be relevant (51% probability). I considered the upcoming random sample review of the excluded documents to be a good way to check the accuracy of reliance on the computer’s predictions of relevance.

I know it is not the only way, and there are other quality control measures that could be followed, but this one makes the most sense to me. Readers are invited to leave comments on the adequacy of this method and other methods that could be employed instead. I have yet to see a good discussion of this issue, so maybe we can have one here.

I can appreciate Ralph’s apprehension at a hindsight review of decisions already made. In legal proceedings, decisions are made and they move forward. Some judgements/mistakes can be corrected, others are simply case history.

Days Seven and Eight of a Predictive Coding Narrative [Re-Use of Analysis?]

Wednesday, August 8th, 2012

Days Seven and Eight of a Predictive Coding Narrative: Where I have another hybrid mind-meld and discover that the computer does not know God by Ralph Losey.

From the post:

In this fifth installment I will continue my description, this time covering days seven and eight of the project. As the title indicates, progress continues and I have another hybrid mind-meld moment. I also discover that the computer does not recognize the significance of references to God in an email. This makes sense logically, but is unexpected and kind of funny when encountered in a document review.

Ralph discovered new terms to use for training as the analysis of the documents progressed.

While Ralph captures those for his use, my question would be how to capture what he learned for re-use?

As in re-use by other parties, perhaps in other litigation.

Thinking of reducing the cost of discovery by sharing analysis of data sets, rather than every discovery process starting at ground zero.

Days Five and Six of a Predictive Coding Narrative

Friday, July 27th, 2012

Days Five and Six of a Predictive Coding Narrative: Deep into the weeds and a computer mind-meld moment by Ralph Losey.

From the post:

This is my fourth in a series of narrative descriptions of an academic search project of 699,082 Enron emails and attachments. It started as a predictive coding training exercise that I created for Jackson Lewis attorneys. The goal was to find evidence concerning involuntary employee terminations. The first day of search is described in Day One of a Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron. The second day is described in Day Two of a Predictive Coding Narrative: More Than A Random Stroll Down Memory Lane. The third and fourth days are described in Days Three and Four of a Predictive Coding Narrative: Where I find that the computer is free to disagree.

In this fourth installment I continue to describe what I did in days five and six of the project. In this narrative I go deep into the weeds and describe the details of multimodal search. Near the end of day six I have an affirming hybrid multimodal mind-meld moment, which I try to describe. I conclude by sharing some helpful advice I received from Joseph White, one of Kroll Ontrack’s (KO) experts on predictive coding and KO’s Inview software. Before I launch into the narrative, a brief word about vendor experts. Don’t worry, it is not going to be a commercial for my favorite vendors; more like a warning based on hard experience.

You will learn a lot about predictive analytics and e-discovery from this series of posts but the most important paragraphs I have read thus far:

When talking to the experts, be sure that you understand what they say to you, and never just nod in agreement when you do not really get it. I have been learning and working with new computer software of all kinds for over thirty years, and am not at all afraid to say that I do not understand or follow something.

Often you cannot follow because the explanation is so poor. For instance, often the words I hear from vendor tech experts are too filled with company specific jargon. If what you are being told makes no sense to you, then say so. Keep asking questions until it does. Do not be afraid of looking foolish. You need to be able to explain this. Repeat back to them what you do understand in your own words until they agree that you have got it right. Do not just be a parrot. Take the time to understand. The vendor experts will respect you for the questions, and so will your clients. It is a great way to learn, especially when it is coupled with hands-on experience.

Insisting that experts explain until you understand what is being said will help you avoid costly mistakes and make you more sympathetic to a client’s questions when you are the expert.

The technology and software will change for predictive coding will change beyond recognition in a few short years.

Demanding and giving explanations that “explain” is a skill that will last a lifetime.

Days Three and Four of a Predictive Coding Narrative: Where I find that the computer is free to disagree

Friday, July 27th, 2012

Days Three and Four of a Predictive Coding Narrative: Where I find that the computer is free to disagree by Ralph Losey.

From the post:

This is the third in a series of detailed descriptions of a legal search project. The project was an academic training exercise for Jackson Lewis e-discovery liaisons conducted in May and June 2012. I searched a set of 699,082 Enron emails and attachments for possible evidence pertaining to involuntary employee terminations. The first day of search is described in Day One of a Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron. The second day is described in Day Two of a Predictive Coding Narrative: More Than A Random Stroll Down Memory Lane.

The description of day-two was short, but it was preceded by a long explanation of my review plan and search philosophy, along with a rant in favor of humanity and against over-dependence on computer intelligence. Here I will just stick to the facts of what I did in days three and four of my search using Kroll Ontrack’s (KO) Inview software.

Interesting description of where Ralph and the computer disagree on relevant/irrelevant judgement on documents.

Unless I just missed it, Ralph is only told be the software what rating a document was given, not why the software arrived at that rating. Yes?

If you knew what terms drove a particular rating, it would be interesting to “comment out” those terms in a document to see the impact on its relevance rating.

Day Two of a Predictive Coding Narrative: More Than A Random Stroll Down Memory Lane

Friday, July 13th, 2012

Day Two of a Predictive Coding Narrative: More Than A Random Stroll Down Memory Lane by Ralph Losey.

From the post:

Day One of the search project ended when I completed review of the initial 1,507 machine-selected documents and initiated the machine learning. I mentioned in the Day One narrative that I would explain why the sample size was that high. I will begin with that explanation and then, with the help of William Webber, go deeper into math and statistical sampling than ever before. I will also give you the big picture of my review plan and search philosophy: its hybrid and multimodal. Some search experts disagree with my philosophy. They think I do not go far enough to fully embrace machine coding. They are wrong. I will explain why and rant on in defense of humanity. Only then will I conclude with the Day Two narrative.

More than you are probably going to want to know about sample sizes and their calculation but persevere until you get to the defense of humanity stuff. It is all quite good.

If I had to add a comment on the defense of humanity rant, it would be that machines have a flat view of documents and not the richly textured one of a human reader. While true that machines can rapidly compare document without tiring, they will miss an executive referring to a secretary as his “cupcake.” A reference that would jump out at a human reader. Same text, different result.

Perhaps because in one case the text is being scanned for tokens and in the other case it is being read.

Day One of a Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron

Friday, July 13th, 2012

Day One of a Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron by Ralph Losey.

The start of a series of posts on predictive coding and searching of the Enron emails by a lawyer. A legal perspective is important enough that I will be posting a note about each post in this series as they occur.

A couple of preliminary notes:

I am sure this is the first time that Ralph has used predictive encoding with the Enron emails. On the other hand, I would not take “…this is the first time for X…” sort of claims from any vendor or service organization. 😉

You can see other examples of processing the Enron emails at:

And that is just a “lite” scan. There are numerous other projects that use the Enron email collection.

I wonder if that is because we are naturally nosey?

From the post:

This is the first in a series of narrative descriptions of a legal search project using predictive coding. Follow along while I search for evidence of involuntary employee terminations in a haystack of 699,082 Enron emails and attachments.

Joys and Risks of Being First

To the best of my knowledge, this writing project is another first. I do not think anyone has ever previously written a blow-by-blow, detailed description of a large legal search and review project of any kind, much less a predictive coding project. Experts on predictive coding speak only from a mile high perspective; never from the trenches (you can speculate why). That has been my practice here, until now, and also my practice when speaking about predictive coding on panels or in various types of conferences, workshops, and classes.

There are many good reasons for this, including the main one that lawyers cannot talk about their client’s business or information. That is why in order to do this I had to run an academic project and search and review the Enron data. Many people could do the same. In fact, each year the TREC Legal Track participants do similar search projects of Enron data. But still, no one has taken the time to describe the details of their search, not even the spacey TRECkies (sorry Jason).

A search project like this takes an enormous amount of time. In fact, to my knowledge (Maura, please correct me if I’m wrong), no Legal Track TRECkies have ever recorded and reported the time that they put into the project, although there are rumors. In my narrative I will report the amount of time that I put into the project on a day-by-day basis, and also, sometimes, on a per task basis. I am a lawyer. I live by the clock and have done so for thirty-two years. Time is important to me, even non-money time like this. There is also a not-insignificant amount of time it takes to write it up a narrative like this. I did not attempt to record that.

There is one final reason this has never been attempted before, and it is not trivial: the risks involved. Any narrator who publicly describes their search efforts assumes the risk of criticism from monday morning quarterbacks about how the sausage was made. I get that. I think I can handle the inevitable criticism. A quote that Jason R. Baron turned me on to a couple of years ago helps, the famous line from Theodore Roosevelt in his Man in the Arena speech at the Sorbonne:

It is not the critic who counts: not the man who points out how the strong man stumbles or where the doer of deeds could have done better. The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood, who strives valiantly, who errs and comes up short again and again, because there is no effort without error or shortcoming, but who knows the great enthusiasms, the great devotions, who spends himself for a worthy cause; who, at the best, knows, in the end, the triumph of high achievement, and who, at the worst, if he fails, at least he fails while daring greatly, so that his place shall never be with those cold and timid souls who knew neither victory nor defeat.

I know this narrative is no high achievement, but we all do what we can, and this seems within my marginal capacities.

Predicting link directions via a recursive subgraph-based ranking

Tuesday, June 12th, 2012

Predicting link directions via a recursive subgraph-based ranking by Fangjian Guo, Zimo Yang, and Tao Zhou.


Link directions are essential to the functionality of networks and their prediction is helpful towards a better knowledge of directed networks from incomplete real-world data. We study the problem of predicting the directions of some links by using the existence and directions of the rest of links. We propose a solution by first ranking nodes in a specific order and then predicting each link as stemming from a lower-ranked node towards a higher-ranked one. The proposed ranking method works recursively by utilizing local indicators on multiple scales, each corresponding to a subgraph extracted from the original network. Experiments on real networks show that the directions of a substantial fraction of links can be correctly recovered by our method, which outperforms either purely local or global methods.

This paper focuses mostly on prediction of direction of links, relying on other research for the question of link existence.

I mention it because predicting links and their directions will be important for planning graph database deployments in particular.

It will be a little late to find out when under full load that other modeling choices should have been made. (It is usually under “full load” conditions when retrospectives on modeling choices come up.)

Developing a predictive analytics program doable on a limited budget

Sunday, November 13th, 2011

Developing a predictive analytics program doable on a limited budget

From the post:

Predictive analytics is experiencing what David Menninger, a research director and vice president at Ventana Research Inc., calls “a renewed interest.” And he’s not the only one who is seeing a surge in the number of organizations looking to set up a predictive analytics program.

In September, Hurwitz & Associates, a consulting and market research firm in Needham, Mass., released a report ranking 12 predictive analytics vendors that it views as “strong contenders” in the market. Fern Halper, a Hurwitz partner and the principal researcher for the report, thinks predictive analytics is moving into the user mainstream. She said its growing popularity is being driven by better tools, increased access to high-performance computing resources, reduced storage costs and an economic climate that has businesses hungry for better forecasting.

“Especially in today’s economy, they’re realizing they can’t just look in the rearview mirror and look at what has happened,” said Halper. “They need to look at what can happen and what will happen and become as smart as they can possibly be if they’re going to compete.”

While predictive analytics basks in the limelight, the nuances of developing an effective program are tricky and sometimes can be overwhelming for organizations. But the good news, according to a variety of analysts and consultants, is that finding the right strategy is possible — even on a shoestring budget.

Here are some of their best-practices tips for succeeding on predictive analytics without breaking the bank:

What caught my eye was doable on a limited budget.

Limited budgets aren’t uncommon most of the time and in today’s economy they are down right plentiful. In private and public sectors.

The lessons in this post apply to topic maps. Don’t try to sell converting an entire enterprise or operation to topic maps. Pick some small area of pain or obvious improvement and sell a solution for that part. ROI that they can see this quarter or maybe next. Then build on that experience to propose larger or longer range projects.

Rapid-I: Report the Future

Wednesday, October 19th, 2011

Rapid-I: Report the Future

Source of:

RapidMiner: Professional open source data mining made easy.

Analytical ETL, Data Mining, and Predictive Reporting with a single solution

RapidAnalytics: Collaborative data analysis power.

No 1 in open source business analytics

The key product for business critical predictive analysis

RapidDoc: Webbased solution for document retrieval and analysis.

Classify text, identify trends as well as emerging topics

Easy to use and configure

From About Rapid-I:

Rapid-I provides software, solutions, and services in the fields of predictive analytics, data mining, and text mining. The company concentrates on automatic intelligent analyses on a large-scale base, i.e. for large amounts of structured data like database systems and unstructured data like texts. The open-source data mining specialist Rapid-I enables other companies to use leading-edge technologies for data mining and business intelligence. The discovery and leverage of unused business intelligence from existing data enables better informed decisions and allows for process optimization.

The main product of Rapid-I, the data analysis solution RapidMiner is the world-leading open-source system for knowledge discovery and data mining. It is available as a stand-alone application for data analysis and as a data mining engine which can be integrated into own products. By now, thousands of applications of RapidMiner in more than 30 countries give their users a competitive edge. Among the users are well-known companies as Ford, Honda, Nokia, Miele, Philips, IBM, HP, Cisco, Merrill Lynch, BNP Paribas, Bank of America, mobilkom austria, Akzo Nobel, Aureus Pharma, PharmaDM, Cyprotex, Celera, Revere, LexisNexis, Mitre and many medium-sized businesses benefitting from the open-source business model of Rapid-I.

Data mining/analysis is the first part of any topic map project, however large or small. These tools, which I have not (yet) tried, are likely to prove useful in such projects. Comments welcome.

Google Prediction API graduates from labs

Sunday, October 16th, 2011

Google Prediction API graduates from labs, adds new features by Zachary Goldberg, Product Manager.

From the post:

Since the general availability launch of the Prediction API this year at Google I/O, we have been working hard to give every developer access to machine learning in the cloud to build smarter apps. We’ve also been working on adding new features, accuracy improvements, and feedback capability to the API.

Today we take another step by announcing Prediction v1.4. With the launch of this version, Prediction is graduating from Google Code Labs, reflecting Google’s commitment to the API’s development and stability. Version 1.4 also includes two new features:

  • Data Anomaly Analysis
    • One of the hardest parts of building an accurate predictive model is gathering and curating a high quality data set. With Prediction v1.4, we are providing a feature to help you identify problems with your data that we notice during the training process. This feedback makes it easier to build accurate predictive models with proper data.
  • PMML Import
    • PMML has become the de facto industry standard for transmitting predictive models and model data between systems. As of v1.4, the Google Prediction API can programmatically accept your PMML for data transformations and preprocessing.
    • The PMML spec is vast and covers many, many features. You can find more details about the specific features that the Google Prediction API supports here.

(I added a paragraph break in the first text block for readability. It should be re-written but I am quoting.)

Suggest you take a close look at the features of PMML that Google does not support. Quite an impressive array of non-support.

Top Scoring Pairs for Feature Selection in Machine Learning and Applications to Cancer Outcome Prediction

Friday, September 23rd, 2011

Top Scoring Pairs for Feature Selection in Machine Learning and Applications to Cancer Outcome Prediction by Ping Shi, Surajit Ray, Qifu Zhu and Mark A Kon.

BMC Bioinformatics 2011, 12:375 doi:10.1186/1471-2105-12-375 Published: 23 September 2011



The widely used k top scoring pair (k-TSP) algorithm is a simple yet powerful parameter-free classifier. It owes its success in many cancer microarray datasets to an effective feature selection algorithm that is based on relative expression ordering of gene pairs. However, its general robustness does not extend to some difficult datasets, such as those involving cancer outcome prediction, which may be due to the relatively simple voting scheme used by the classifier. We believe that the performance can be enhanced by separating its effective feature selection component and combining it with a powerful classifier such as the support vector machine (SVM). More generally the top scoring pairs generated by the k-TSP ranking algorithm can be used as a dimensionally reduced subspace for other machine learning classifiers.


We developed an approach integrating the k-TSP ranking algorithm (TSP) with other machine learning methods, allowing combination of the computationally efficient, multivariate feature ranking of k-TSP with multivariate classifiers such as SVM. We evaluated this hybrid scheme (k-TSP+SVM) in a range of simulated datasets with known data structures. As compared with other feature selection methods, such as a univariate method similar to Fisher’s discriminant criterion (Fisher), or a recursive feature elimination embedded in SVM (RFE), TSP is increasingly more effective than the other two methods as the informative genes become progressively more correlated, which is demonstrated both in terms of the classification performance and the ability to recover true informative genes. We also applied this hybrid scheme to four cancer prognosis datasets, in which k-TSP+SVM outperforms k-TSP classifier in all datasets, and achieves either comparable or superior performance to that using SVM alone. In concurrence with what is observed in simulation, TSP appears to be a better feature selector than Fisher and RFE in some of the cancer datasets


The k-TSP ranking algorithm can be used as a computationally efficient, multivariate filter method for feature selection in machine learning. SVM in combination with k-TSP ranking algorithm outperforms k-TSP and SVM alone in simulated datasets and in some cancer prognosis datasets. Simulation studies suggest that as a feature selector, it is better tuned to certain data characteristics, i.e. correlations among informative genes, which is potentially interesting as an alternative feature ranking method in pathway analysis.

Knowing the tools that are already in use in bioinformatics will help you design topic map applications of interest to those in that field. And this is a very nice combination of methods to study on its own.

Online Master of Science in Predictive Analytics

Wednesday, September 21st, 2011

Online Master of Science in Predictive Analytics

As businesses seek to maximize the value of vast new stores of available data, Northwestern University’s Master of Science in Predictive Analytics program prepares students to meet the growing demand in virtually every industry for data-driven leadership and problem solving.

Advanced data analysis, predictive modeling, computer-based data mining, and marketing, web, text, and risk analytics are just some of the areas of study offered in the program. As a student in the Master of Science in Predictive Analytics program, you will:

  • Prepare for leadership-level career opportunities by focusing on statistical concepts and practical application
  • Learn from distinguished Northwestern faculty and from the seasoned industry experts who are redefining how data improve decision-making and boost ROI
  • Build statistical and analytic expertise as well as the management and leadership skills necessary to implement high-level, data-driven decisions
  • Earn your Northwestern University master’s degree entirely online

Just so you know, libraries schools were offering mostly online degrees a decade or so ago. Nice to see other disciplines catching up. 😉

It would be interesting to see short courses in subject analysis, as in subject identity and the properties that compose a particular identity, in specific domains.