Archive for the ‘Predictive Analytics’ Category

Big Data To Identify Rogue Employees (Who To Throw Under The Bus)

Thursday, April 9th, 2015

Big Data Algorithm Identifies Rogue Employees by Hugh Son.

From the post:

Wall Street traders are already threatened by computers that can do their jobs faster and cheaper. Now the humans of finance have something else to worry about: Algorithms that make sure they behave.

JPMorgan Chase & Co., which has racked up more than $36 billion in legal bills since the financial crisis, is rolling out a program to identify rogue employees before they go astray, according to Sally Dewar, head of regulatory affairs for Europe, who’s overseeing the effort. Dozens of inputs, including whether workers skip compliance classes, violate personal trading rules or breach market-risk limits, will be fed into the software.

“It’s very difficult for a business head to take what could be hundreds of data points and start to draw any themes about a particular desk or trader,” Dewar, 46, said last month in an interview. “The idea is to refine those data points to help predict patterns of behavior.”

Sounds worthwhile until you realize that $36 billion in legal bills “since the financial crisis” covers a period of seven (7) years, which works out to about $5 billion per year. Considering that net income for 2014 was $21.8 billion after legal bills were deducted, they aren’t doing too badly. (See JPMorgan’s 2014 Annual Report.)

Hugh raises the specter of The Minority Report in terms of predicting future human behavior. True enough, but the software is much more likely to surface cues that drew prior regulatory notice, with cautions to employees to avoid those “tells.” If the trainer reviews three (3) real JPMorgan Chase cases and all of them involve note taking and cell phone records (later traced), how bright do you have to be to get clued in?

People who don’t get clued in will either be thrown under the bus during the next legal crisis or won’t be employed at JPMorgan Chase.

If this were really a question of predicting human behavior the usual concerns about fairness, etc. would obtain. I suspect it is simply churn so that JPMorgan Chase appears to be taking corrective action. Some low level players will be outed, like the Walter Mitty terrorists the FBI keeps capturing in its web of informants. (I am mining some data now to collect those cases for a future post.)

It will be interesting to see if Jamie Dimon’s electronic trail is included as part of the big data monitoring of employees. Bets, anyone?

Federal Data Integration: Dengue Fever

Tuesday, April 7th, 2015

The White House issued a press release today (April 7, 2015) titled: FACT SHEET: Administration Announces Actions To Protect Communities From The Impacts Of Climate Change.

That press release reads in part:

Unleashing Data: As part of the Administration’s Predict the Next Pandemic Initiative, in May 2015, an interagency working group co-chaired by OSTP, the CDC, and the Department of Defense will launch a pilot project to simulate efforts to forecast epidemics of dengue – a mosquito-transmitted viral disease affecting millions of people every year, including U.S. travelers and residents of the tropical regions of the U.S. such as Puerto Rico. The pilot project will consolidate data sets from across the federal government and academia on the environment, disease incidence, and weather, and challenge the research and modeling community to develop predictive models for dengue and other infectious diseases based on those datasets. In August 2015, OSTP plans to convene a meeting to evaluate resulting models and showcase this effort as a “proof-of-concept” for similar forecasting efforts for other infectious diseases.

I tried to find more details on earlier workshops in this effort, but limiting the search to “Predict the Next Pandemic Initiative” and the domain to “.gov,” I got only two “hits,” one of which was the press release cited above.

I sent a message (webform) to the White House Office of Science and Technology Policy office and will update you with any additional information that arrives.

Of course my curiosity is about the means used to integrate the data sets. Once integrated, such data sets can be re-used, at least until it is time to integrate additional data sets. Bearing in mind that dirty data can lead to poor decision making, I would rather not duplicate the cleaning of data time after time.
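The integration step the press release glosses over is exactly where re-use (and cleaning data only once) pays off: once disease-incidence and weather records are keyed to a shared identifier, any model can join them. A minimal, hypothetical sketch (the records, field names, and values below are invented for illustration):

```python
# Toy sketch of the integration step: key two federal-style datasets
# on a shared (location, week) identifier so cleaning happens once
# and the merged result can be re-used by any downstream model.
incidence = {("PR", "2015-W20"): {"dengue_cases": 42}}
weather   = {("PR", "2015-W20"): {"rainfall_mm": 88.5, "mean_temp_c": 27.3}}

merged = {}
for key in incidence.keys() & weather.keys():   # only keys present in both
    merged[key] = {**incidence[key], **weather[key]}

print(merged[("PR", "2015-W20")])
```

Anything not keyed consistently (misspelled place names, mixed week conventions) silently drops out of the intersection, which is one way dirty data turns into poor decisions.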

Max Kuhn’s Talk on Predictive Modeling

Monday, March 16th, 2015

Max Kuhn’s Talk on Predictive Modeling

From the post:

Max Kuhn, Director of Nonclinical Statistics of Pfizer and also the author of Applied Predictive Modeling joined us on February 17, 2015 and shared his experience with Data Mining with R.

Max is a nonclinical statistician who has been applying predictive models in the diagnostic and pharmaceutical industries for over 15 years. He is the author and maintainer of a number of predictive modeling packages, including: caret, C50, Cubist and AppliedPredictiveModeling. He blogs about the practice of modeling on his website.

Excellent! (You may need to adjust the sound on the video.)

Support your local user group, particularly those generous enough to post videos and slides for their speakers. It makes a real difference to those unable to travel for one reason or another.

I first saw this in a tweet by NYC Data Science.

PredictionIO [ML Too Easy? Too Fast?]

Tuesday, March 10th, 2015


From the what is page:

PredictionIO is an open-source Machine Learning server for developers and data scientists to build and deploy predictive applications in a fraction of the time.

The PredictionIO template gallery offers a wide range of predictive engine templates for download, which developers can customize easily. The DASE architecture of an engine is the “MVC for Machine Learning.” It enables developers to build predictive engine components with separation-of-concerns. Data scientists can also swap and evaluate algorithms as they wish. The core part of PredictionIO is an engine deployment platform built on top of Apache Spark. Predictive engines are deployed as distributed web services. In addition, there is an Event Server, a scalable data collection and analytics layer built on top of Apache HBase.

PredictionIO eliminates the friction between software development, data science and production deployment. It takes care of the data infrastructure routine so that your data science team can focus on what matters most.

The most attractive feature of PredictionIO is the ability to configure and test multiple engines with less overhead.

At the same time, I am not altogether sure that “…accelerat[ing] scalable machine learning infrastructure management” is necessarily a good idea.

You may want to remember that the current state of cyberinsecurity, where all programs are suspect and security software may add more bugs than it cures, is a result, in part, of shipping code because “it works,” and not because it is free (or relatively so) of security issues.

I am really not looking forward to machine learning uncertainty like we have cyberinsecurity now.

That isn’t a reflection on PredictionIO but the thought occurred to me because of the emphasis on accelerated use of machine learning.


Friday, December 26th, 2014

Seldon wants to make life easier for data scientists, with a new open-source platform by Martin Bryant.

From the post:

It feels that these days we live our whole digital lives according to mysterious algorithms that predict what we’ll want from apps and websites. A new open-source product could help those building the products we use worry less about writing those algorithms in the first place.

As increasing numbers of companies hire in-house data science teams, there’s a growing need for tools they can work with so they don’t need to build new software from scratch. That’s the gambit behind the launch of Seldon, a new open-source predictions API launching early in the new year.

Seldon is designed to make it easy to plug in the algorithms needed for predictions that can recommend content to customers, offer app personalization features and the like. Aimed primarily at media and e-commerce companies, it will be available both as a free-to-use self-hosted product and a fully hosted, cloud-based version.

If you think Inadvertent Algorithmic Cruelty is a problem, just wait until people who don’t understand the data or the algorithms start using them in prepackaged form.

Packaged predictive analytics are about as safe as arming school crossing guards with .600 Nitro Express rifles to ward off speeders. As attractive as that suggestion sounds, there would be numerous safety concerns.

Different but no less pressing safety concerns abound with packaged predictive analytics. Being disconnected from the actual algorithms, can enterprises claim immunity for race, gender or sexual orientation based discrimination? Hard to prove “intent” when the answers in question were generated in complete ignorance of the algorithmic choices that drove the results.

At least Seldon is open source and so the algorithms can be examined, should you be interested in how results are calculated. But open source algorithms are but one aspect of the problem. What of the data? Blind application of algorithms, even neutral ones, can lead to any number of results. If you let me supply the data, I can give you a guarantee of the results from any known algorithm. “Untouched by human hands” as they say.

When you are given recommendations based on predictive analytics do you ask for the data and/or algorithms? Who in your enterprise can do due diligence to verify the results? Who is on the line for bad decisions based on poor predictive analytics?

I first saw this in a tweet by Gregory Piatetsky.

Want to win $1,000,000,000 (yes, that’s one billion dollars)?

Wednesday, January 22nd, 2014

Want to win $1,000,000,000 (yes, that’s one billion dollars)? by Ann Drobnis.

The offer is one billion dollars for picking the winners of every game in the NCAA men’s basketball tournament in the Spring of 2014.

Unfortunately, none of the news stories I saw had links back to any authentic information from Quicken Loans and Berkshire Hathaway about the offer.

After some searching I found: Win a Billion Bucks with the Quicken Loans Billion Dollar Bracket Challenge by Clayton Closson, on January 21, 2014 on the Quicken Loans blog. (As far as I can tell it is an authentic post on the QL website.)

From that post:

You could be America’s next billionaire if you’re the grand prize winner of the Quicken Loans Billion Dollar Bracket Challenge. You read that right: one billion. Not one million. Not one hundred million. Not five hundred million. One billion U.S. dollars.

All you have to do is pick a perfect tournament bracket for the upcoming 2014 tournament. That’s it. Guess all the winners of all the games correctly, and Quicken Loans, along with Berkshire Hathaway, will make you a billionaire. The official press release is below. The contest starts March 3, 2014, so we’ll soon have all the info on how and when to enter your perfect bracket.

Good luck, my friends. This is your chance to play in perhaps the biggest sweepstakes in U.S. history. It’s your chance for a billion.

Oh, and by the way, the 20 closest imperfect brackets will win a cool hundred grand to put toward their home (or new home). Plus, in conjunction with the sweepstakes, Quicken Loans will donate $1 million to Detroit and Cleveland nonprofits to help with education of inner city youth.

So, to recap: If you’re perfect, you’ll win a billion. If you’re not perfect, you could win $100,000. The entry period begins Monday, March 3, 2014 and runs until Wednesday, March 19, 2014. Stay tuned on how to enter.

Contest updates at:

The odds against winning are absurd but this has all the markings of a big data project. Historical data, current data on the teams and players, models, prior outcomes to test your models, etc.
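How absurd the odds are is easy to quantify: a 63-game bracket (ignoring the play-in games) has 2^63 possible fill-ins, so a coin-flipper faces roughly 1 in 9.2 quintillion odds. Even granting a picker a (generous, assumed) 70% accuracy per game, a perfect run remains wildly unlikely:

```python
# Odds of a perfect 63-game NCAA bracket: each game has two possible
# winners, so there are 2**63 distinct brackets.
total_brackets = 2 ** 63
print(f"{total_brackets:,}")  # 9,223,372,036,854,775,808

# A hypothetical picker who calls each game correctly 70% of the time:
p_skilled = 0.70 ** 63
print(f"skilled picker's chance: 1 in {1 / p_skilled:,.0f}")
```

Which is exactly why big data helps at the margin but cannot rescue the bet.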

I wonder if Watson likes basketball?

Mining the Web to Predict Future Events

Saturday, December 28th, 2013

Mining the Web to Predict Future Events by Kira Radinsky and Eric Horvitz.

From the abstract:

We describe and evaluate methods for learning to forecast forthcoming events of interest from a corpus containing 22 years of news stories. We consider the examples of identifying significant increases in the likelihood of disease outbreaks, deaths, and riots in advance of the occurrence of these events in the world. We provide details of methods and studies, including the automated extraction and generalization of sequences of events from news corpora and multiple web resources. We evaluate the predictive power of the approach on real-world events withheld from the system.

The paper starts off well enough:

Mark Twain famously said that “the past does not repeat itself, but it rhymes.” In the spirit of this reflection, we develop and test methods for leveraging large-scale digital histories captured from 22 years of news reports from the New York Times (NYT) archive to make real-time predictions about the likelihoods of future human and natural events of interest. We describe how we can learn to predict the future by generalizing sets of specific transitions in sequences of reported news events, extracted from a news archive spanning the years 1986–2008. In addition to the news corpora, we leverage data from freely available Web resources, including Wikipedia, FreeBase, OpenCyc, and GeoNames, via the LinkedData platform [6]. The goal is to build predictive models that generalize from specific sets of sequences of events to provide likelihoods of future outcomes, based on patterns of evidence observed in near-term newsfeeds. We propose the methods as a means of generating actionable forecasts in advance of the occurrence of target events in the world.

But when it gets down to actual predictions, the experiment predicts:

  • Cholera following flooding in Bangladesh.
  • Riots following police shootings in immigrant/poor neighborhoods.

Both are generally true, but I don’t need 22 years’ worth of New York Times (NYT) archives to make those predictions.
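The core generalization step in the paper can be caricatured in a few lines: count how often a target event follows a precursor in historical event sequences and use the ratio as a naive forecast. This is a toy sketch with invented sequences; the actual system layers ontologies, Web knowledge bases, and many more features on top of this idea.

```python
# Toy version of the core idea: estimate P(target | precursor) from
# counts of adjacent event pairs in historical sequences.
# The sequences below are invented for illustration.
from collections import Counter

histories = [
    ["flood", "cholera"],
    ["flood", "aid"],
    ["flood", "cholera"],
    ["drought", "famine"],
    ["flood", "cholera"],
]

follows = Counter()
precursor_counts = Counter()
for seq in histories:
    for a, b in zip(seq, seq[1:]):   # adjacent event pairs
        follows[(a, b)] += 1
        precursor_counts[a] += 1

p = follows[("flood", "cholera")] / precursor_counts["flood"]
print(f"P(cholera | flood) = {p:.2f}")  # 3 of 4 flood sequences
```

With counts this coarse, the model can only recover the “generally true” regularities noted above, which is the point of asking for predictions more specific than your staff could make unaided.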

Test offers of predictive advice by asking for specific predictions relevant to your enterprise. Also ask long time staff to make their predictions. Compare the predictions.

Unless the automated solution is significantly better, reward the staff and drive on.

I first saw this in Nat Torkington’s Four short links: 26 December 2013.

Predictive Analytics 101

Thursday, October 17th, 2013

Predictive Analytics 101 by Ravi Kalakota.

From the post:

Insight, not hindsight is the essence of predictive analytics. How organizations instrument, capture, create and use data is fundamentally changing the dynamics of work, life and leisure.

I strongly believe that we are on the cusp of a multi-year analytics revolution that will transform everything.

Using analytics to compete and innovate is a multi-dimensional issue. It ranges from simple (reporting) to complex (prediction).

Reporting on what is happening in your business right now is the first step to making smart business decisions. This is the core of KPI scorecards or business intelligence (BI). The next level of analytics maturity takes this a step further. Can you understand what is taking place (BI) and also anticipate what is about to take place (predictive analytics)?

By automatically delivering relevant insights to end-users, managers and even applications, predictive decision solutions aim to reduce the need for business users to understand the ‘how’ and let them focus on the ‘why.’ The end goal of predictive analytics = [Better outcomes, smarter decisions, actionable insights, relevant information].

How you execute this varies by industry and information supply chain (Raw Data -> Aggregated Data -> Contextual Intelligence -> Analytical Insights (reporting vs. prediction) -> Decisions (Human or Automated Downstream Actions)).

There are four types of data analysis:

    • Simple summation and statistics
    • Predictive (forecasting)
    • Descriptive (business intelligence and data mining)
    • Prescriptive (optimization and simulation)

Predictive analytics leverages four core techniques to turn data into valuable, actionable information:

  1. Predictive modeling
  2. Decision Analysis and Optimization
  3. Transaction Profiling
  4. Predictive Search
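As a minimal illustration of the “Predictive (forecasting)” rung of that ladder, a least-squares trend line fit to past observations and extrapolated one period ahead is about the simplest predictive model there is. The sales figures below are made up for illustration:

```python
# Minimal "predictive (forecasting)" example: fit a least-squares line
# to past monthly sales and extrapolate one month ahead.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx   # slope, intercept

months = [1, 2, 3, 4, 5, 6]
sales  = [100, 108, 115, 124, 131, 140]   # invented figures
slope, intercept = fit_line(months, sales)
forecast = slope * 7 + intercept
print(f"forecast for month 7: {forecast:.1f}")
```

Everything past this point (decision analysis, optimization, profiling) builds on models of roughly this shape, just with far more inputs.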

This post is a very good introduction to predictive analytics.

You may have to do some hand holding to get executives through it but they will be better off for it.

When you need support for more training of executives, use this graphic from Ravi’s post:

[Graphic from Ravi’s post: “useful data gap”]

That startled even me. 😉

UC Irvine Extension Announces Winter Predictive Analytics and Info System Courses

Wednesday, December 12th, 2012

UC Irvine Extension Announces Winter Predictive Analytics and Info System Courses

From the post:

Predictive Analytics Certificate Program:

This program is designed for professionals who are using or wish to use Predictive Analytics to optimize business performance at a variety of levels. UC Irvine Extension is offering the following webinar and two courses during winter quarter:

Predictive Analytics Special Topic Webinar: Text Analytics & Text Mining (Jan. 15, 11:30 a.m. to 12:30 p.m., PST) – This free webinar will provide participants with the introductory concepts of text analytics and text mining that are used to recognize how stored, unstructured data represents an extremely valuable source of business information.

Course: Effective Data Preparation (Jan. 7 to Feb. 24) – This online course will address how to extract stored data elements, transform their formats, and derive new relationships among them, in order to produce a dataset suitable for analytical modeling. Course instructor Dr. Robert Nisbet, chief scientist at Smogfarm, which studies crowd psychology, will provide attendees with the skills to produce a fully processed data set compatible for building powerful predictive models.

Course: Text Analytics & Text Mining (Jan. 28 to March 24) – This new online course instructed by Dr. Gary Miner, author of Handbook of Statistical Analysis & Data Mining Applications and Practical Text Mining, will focus on basic concepts of textual information including tokenization and part-of-speech tagging. The course will expose participants to practical techniques for text extraction and text mining, document clustering and classification, information retrieval, and the enhancement of structured data.

Just so you know, the webinar is free but Effective Data Preparation and Text Analytics & Text Mining are $695.00 each.

I am always made more curious by the omission of the most obvious questions from an FAQ, or by the placement of that information in very non-prominent places.

I suspect well worth the price but why not be up front with the charges?

BigML creates a marketplace for Predictive Models

Friday, October 26th, 2012

BigML creates a marketplace for Predictive Models by Ajay Ohri.

From the post:

BigML has created a marketplace for selling Datasets and Models. This is a first (?) as the closest market for Predictive Analytics till now was Rapid Miner’s marketplace for extensions (at



You can make your Dataset public. Mind you: the Datasets we are talking about are BigML’s fancy histograms. This means that other BigML users can look at your Dataset details and create new models based on this Dataset. But they can not see individual records or columns or use it beyond the statistical summaries of the Dataset. Your Source will remain private, so there is no possibility of anyone accessing the raw data.


Now, once you have created a great model, you can share it with the rest of the world. For free or at any price you set. Predictions are paid for in BigML Prediction Credits. The minimum price is ‘Free’ and the maximum price indicated is 100 credits.

Having a public, digital marketplace for data and data analysis has been proposed by many and attempted by more than just a few.

Data is bought and sold today, but not by the digital equivalent of small shop keepers. The shop keepers who changed the face of Europe.

Data is bought and sold today by the digital equivalent of the great feudal lords. Complete with castles (read silos).

Will BigML give rise to a new mercantile class?

Or just as importantly, will you be a member of it or bound to the estate of a feudal lord?

Fall Lineup: Protest Monitoring, Bin Laden Letters Analysis, … [Defensive Big Data (DBD)]

Friday, August 24th, 2012

Protest Monitoring, Bin Laden Letters Analysis, and Building Custom Applications

OK, not “Fall Lineup” in the TV sense. 😉

Webinars from Recorded Future in September, 2012.

All start at 11 AM EST.

These webinars should help you learn how data mining looks for clues or how to not leave clues.

Is the term “Defensive Big Data” (DBD) in common usage?

Think of using Mahout to analyze email traffic so you can reshape your emails to resemble messages that are routinely ignored.

Predictive Models: Build once, Run Anywhere

Tuesday, August 21st, 2012

Predictive Models: Build once, Run Anywhere

From the post:

We have released a new version of our open source Python bindings. This new version aims at showing how the BigML API can be used to build predictive models capable of generating predictions locally or remotely. You can get full access to the code at Github and read the full documentation at Read the Docs.

The complete list of updates includes (drum roll, please):

Development Mode

We recently introduced a free sandbox to help developers play with BigML on smaller datasets without being concerned about credits. In the new Python bindings you can use BigML in development mode, and all dataset and models smaller than 1MB can be created for free:

from bigml.api import BigML

api = BigML(dev_mode=True)

A “sandbox” for your machine learning experiments!

Day Nine of a Predictive Coding Narrative: A scary search…

Wednesday, August 8th, 2012

Day Nine of a Predictive Coding Narrative: A scary search for false-negatives, a comparison of my CAR with the Griswold’s, and a moral dilemma by Ralph Losey.

From the post:

In this sixth installment I continue my description, this time covering day nine of the project. Here I do a quality control review of a random sample to evaluate my decision in day eight to close the search.

Ninth Day of Review (4 Hours)

I began by generating a random sample of 1,065 documents from the entire null set (95% +/- 3%) of all documents not reviewed. I was going to review this sample as a quality control test of the adequacy of my search and review project. I would personally review all of them to see if any were False Negatives, in other words, relevant documents, and if relevant, whether any were especially significant or Highly Relevant.

I was looking to see if there were any documents left on the table that should have been produced. Remember that I had already personally reviewed all of the documents that the computer had predicted were likely to be relevant (51% probability). I considered the upcoming random sample review of the excluded documents to be a good way to check the accuracy of reliance on the computer’s predictions of relevance.

I know it is not the only way, and there are other quality control measures that could be followed, but this one makes the most sense to me. Readers are invited to leave comments on the adequacy of this method and other methods that could be employed instead. I have yet to see a good discussion of this issue, so maybe we can have one here.

I can appreciate Ralph’s apprehension at a hindsight review of decisions already made. In legal proceedings, decisions are made and they move forward. Some judgements/mistakes can be corrected, others are simply case history.
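Ralph doesn’t show the arithmetic behind his 1,065-document sample, but the standard sample-size formula for estimating a proportion, at 95% confidence with a ±3% margin and worst-case variance (p = 0.5), lands within a rounding convention of it once you apply the finite-population correction (the population size below is an assumption, roughly the null set):

```python
# Standard sample size for estimating a proportion:
#   n0 = z^2 * p * (1 - p) / e^2, then a finite-population correction.
# 95% confidence (z = 1.96), +/-3% margin of error, worst-case p = 0.5.
import math

z, p, e = 1.96, 0.5, 0.03
n0 = (z ** 2) * p * (1 - p) / e ** 2       # infinite-population size
N = 698_000                                # assumed size of the null set
n = n0 / (1 + (n0 - 1) / N)                # finite-population correction
print(math.ceil(n0), math.ceil(n))
```

Different calculators round the corrected figure slightly differently, which is presumably where 1,065 rather than 1,066 comes from.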

Days Seven and Eight of a Predictive Coding Narrative [Re-Use of Analysis?]

Wednesday, August 8th, 2012

Days Seven and Eight of a Predictive Coding Narrative: Where I have another hybrid mind-meld and discover that the computer does not know God by Ralph Losey.

From the post:

In this fifth installment I will continue my description, this time covering days seven and eight of the project. As the title indicates, progress continues and I have another hybrid mind-meld moment. I also discover that the computer does not recognize the significance of references to God in an email. This makes sense logically, but is unexpected and kind of funny when encountered in a document review.

Ralph discovered new terms to use for training as the analysis of the documents progressed.

While Ralph captures those for his use, my question would be how to capture what he learned for re-use?

As in re-use by other parties, perhaps in other litigation.

Thinking of reducing the cost of discovery by sharing analysis of data sets, rather than every discovery process starting at ground zero.

Big Data Machine Learning: Patterns for Predictive Analytics

Tuesday, July 31st, 2012

Big Data Machine Learning: Patterns for Predictive Analytics by Ricky Ho.

A DZone “refcard” and as you might expect, a bit “slim” to cover predictive analytics. Still, printed in full color it would make a nice handout on predictive analytics for a general audience.

What would you add to make a “refcard” on a particular method?

Or for that matter, what would you include to make a “refcard” on popular government resources? Can you name all the fields on the campaign disclosure files? Thought not.

Predictive analytics might not have predicted the Aurora shooter

Saturday, July 28th, 2012

Predictive analytics might not have predicted the Aurora shooter by Robert L. Mitchell.

From the post:

Could aggressive data mining by law enforcement prevent heinous crimes, such as the recent mass murder in Aurora, CO., by catching killers before they can act?

The Aurora shooter certainly left a long trail of transactions. In the two months leading up to the crime he bought more than 6,000 rounds of ammunition, several guns, head-to-toe ballistic protective gear and accelerants and other chemicals used to build homemade explosives. These purchases were made from both online ecommerce sites and brick and mortar stores, and more than 50 packages were sent to his apartment, according to news reports.

Robert injects a note of sanity into recent discussions about data mining and the Aurora shooting by quoting Dean Abbott of Abbott Analytics as saying:

Much as we’d like to think we can solve the problem with technology, it turns out that there is no magic bullet. “Something like this could be valuable,” Abbott says. “I just don’t think it’s obvious that it would be fruitful.”

That would make a good movie script but not much else. (Oh, wait, there is such a movie, Minority Report.)

Predictive analytics are useful in the aggregate, but we already knew that from the Foundation Trilogy (or you could ask your local sociologist).

Days Five and Six of a Predictive Coding Narrative

Friday, July 27th, 2012

Days Five and Six of a Predictive Coding Narrative: Deep into the weeds and a computer mind-meld moment by Ralph Losey.

From the post:

This is my fourth in a series of narrative descriptions of an academic search project of 699,082 Enron emails and attachments. It started as a predictive coding training exercise that I created for Jackson Lewis attorneys. The goal was to find evidence concerning involuntary employee terminations. The first day of search is described in Day One of a Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron. The second day is described in Day Two of a Predictive Coding Narrative: More Than A Random Stroll Down Memory Lane. The third and fourth days are described in Days Three and Four of a Predictive Coding Narrative: Where I find that the computer is free to disagree.

In this fourth installment I continue to describe what I did in days five and six of the project. In this narrative I go deep into the weeds and describe the details of multimodal search. Near the end of day six I have an affirming hybrid multimodal mind-meld moment, which I try to describe. I conclude by sharing some helpful advice I received from Joseph White, one of Kroll Ontrack’s (KO) experts on predictive coding and KO’s Inview software. Before I launch into the narrative, a brief word about vendor experts. Don’t worry, it is not going to be a commercial for my favorite vendors; more like a warning based on hard experience.

You will learn a lot about predictive analytics and e-discovery from this series of posts but the most important paragraphs I have read thus far:

When talking to the experts, be sure that you understand what they say to you, and never just nod in agreement when you do not really get it. I have been learning and working with new computer software of all kinds for over thirty years, and am not at all afraid to say that I do not understand or follow something.

Often you cannot follow because the explanation is so poor. For instance, often the words I hear from vendor tech experts are too filled with company specific jargon. If what you are being told makes no sense to you, then say so. Keep asking questions until it does. Do not be afraid of looking foolish. You need to be able to explain this. Repeat back to them what you do understand in your own words until they agree that you have got it right. Do not just be a parrot. Take the time to understand. The vendor experts will respect you for the questions, and so will your clients. It is a great way to learn, especially when it is coupled with hands-on experience.

Insisting that experts explain until you understand what is being said will help you avoid costly mistakes and make you more sympathetic to a client’s questions when you are the expert.

The technology and software for predictive coding will change beyond recognition in a few short years.

Demanding and giving explanations that “explain” is a skill that will last a lifetime.

Days Three and Four of a Predictive Coding Narrative: Where I find that the computer is free to disagree

Friday, July 27th, 2012

Days Three and Four of a Predictive Coding Narrative: Where I find that the computer is free to disagree by Ralph Losey.

From the post:

This is the third in a series of detailed descriptions of a legal search project. The project was an academic training exercise for Jackson Lewis e-discovery liaisons conducted in May and June 2012. I searched a set of 699,082 Enron emails and attachments for possible evidence pertaining to involuntary employee terminations. The first day of search is described in Day One of a Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron. The second day is described in Day Two of a Predictive Coding Narrative: More Than A Random Stroll Down Memory Lane.

The description of day-two was short, but it was preceded by a long explanation of my review plan and search philosophy, along with a rant in favor of humanity and against over-dependence on computer intelligence. Here I will just stick to the facts of what I did in days three and four of my search using Kroll Ontrack’s (KO) Inview software.

Interesting description of where Ralph and the computer disagree on relevant/irrelevant judgement on documents.

Unless I just missed it, Ralph is only told by the software what rating a document was given, not why the software arrived at that rating. Yes?

If you knew what terms drove a particular rating, it would be interesting to “comment out” those terms in a document to see the impact on its relevance rating.
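A toy version of that “comment out a term” experiment is easy to sketch with a bag-of-words scorer. The term weights below are invented; commercial tools like Inview do not expose theirs, which is exactly the complaint:

```python
# Toy term-sensitivity check: score a document with a bag-of-words
# weight model, then "comment out" each weighted term to see how much
# it contributed to the relevance score. Weights are hypothetical.
import re

WEIGHTS = {"termination": 2.0, "severance": 1.5, "involuntary": 1.2, "meeting": 0.1}

def score(text, excluded=()):
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(WEIGHTS.get(t, 0.0) for t in tokens if t not in excluded)

doc = "Involuntary termination and severance terms discussed at the meeting."
base = score(doc)
for term in WEIGHTS:
    delta = base - score(doc, excluded={term})
    print(f"{term:12s} contributes {delta:.1f} of {base:.1f}")
```

Being able to run even this crude a sensitivity test against a production relevance engine would answer the “why this rating?” question directly.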

Day Two of a Predictive Coding Narrative: More Than A Random Stroll Down Memory Lane

Friday, July 13th, 2012

Day Two of a Predictive Coding Narrative: More Than A Random Stroll Down Memory Lane by Ralph Losey.

From the post:

Day One of the search project ended when I completed review of the initial 1,507 machine-selected documents and initiated the machine learning. I mentioned in the Day One narrative that I would explain why the sample size was that high. I will begin with that explanation and then, with the help of William Webber, go deeper into math and statistical sampling than ever before. I will also give you the big picture of my review plan and search philosophy: it’s hybrid and multimodal. Some search experts disagree with my philosophy. They think I do not go far enough to fully embrace machine coding. They are wrong. I will explain why and rant on in defense of humanity. Only then will I conclude with the Day Two narrative.

More than you are probably going to want to know about sample sizes and their calculation but persevere until you get to the defense of humanity stuff. It is all quite good.

If I had to add a comment on the defense of humanity rant, it would be that machines have a flat view of documents, not the richly textured one of a human reader. While it is true that machines can rapidly compare documents without tiring, they will miss an executive referring to a secretary as his “cupcake” — a reference that would jump out at a human reader. Same text, different result.

Perhaps because in one case the text is being scanned for tokens and in the other case it is being read.

Day One of a Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron

Friday, July 13th, 2012

Day One of a Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron by Ralph Losey.

The start of a series of posts on predictive coding and searching of the Enron emails by a lawyer. A legal perspective is important enough that I will be posting a note about each post in this series as they occur.

A couple of preliminary notes:

I am sure this is the first time that Ralph has used predictive coding with the Enron emails. On the other hand, I would not take “…this is the first time for X…” sort of claims from any vendor or service organization. 😉

You can see other examples of processing the Enron emails at:

And that is just a “lite” scan. There are numerous other projects that use the Enron email collection.

I wonder if that is because we are naturally nosey?

From the post:

This is the first in a series of narrative descriptions of a legal search project using predictive coding. Follow along while I search for evidence of involuntary employee terminations in a haystack of 699,082 Enron emails and attachments.

Joys and Risks of Being First

To the best of my knowledge, this writing project is another first. I do not think anyone has ever previously written a blow-by-blow, detailed description of a large legal search and review project of any kind, much less a predictive coding project. Experts on predictive coding speak only from a mile high perspective; never from the trenches (you can speculate why). That has been my practice here, until now, and also my practice when speaking about predictive coding on panels or in various types of conferences, workshops, and classes.

There are many good reasons for this, including the main one that lawyers cannot talk about their client’s business or information. That is why in order to do this I had to run an academic project and search and review the Enron data. Many people could do the same. In fact, each year the TREC Legal Track participants do similar search projects of Enron data. But still, no one has taken the time to describe the details of their search, not even the spacey TRECkies (sorry Jason).

A search project like this takes an enormous amount of time. In fact, to my knowledge (Maura, please correct me if I’m wrong), no Legal Track TRECkies have ever recorded and reported the time that they put into the project, although there are rumors. In my narrative I will report the amount of time that I put into the project on a day-by-day basis, and also, sometimes, on a per task basis. I am a lawyer. I live by the clock and have done so for thirty-two years. Time is important to me, even non-money time like this. There is also a not-insignificant amount of time it takes to write it up a narrative like this. I did not attempt to record that.

There is one final reason this has never been attempted before, and it is not trivial: the risks involved. Any narrator who publicly describes their search efforts assumes the risk of criticism from monday morning quarterbacks about how the sausage was made. I get that. I think I can handle the inevitable criticism. A quote that Jason R. Baron turned me on to a couple of years ago helps, the famous line from Theodore Roosevelt in his Man in the Arena speech at the Sorbonne:

It is not the critic who counts: not the man who points out how the strong man stumbles or where the doer of deeds could have done better. The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood, who strives valiantly, who errs and comes up short again and again, because there is no effort without error or shortcoming, but who knows the great enthusiasms, the great devotions, who spends himself for a worthy cause; who, at the best, knows, in the end, the triumph of high achievement, and who, at the worst, if he fails, at least he fails while daring greatly, so that his place shall never be with those cold and timid souls who knew neither victory nor defeat.

I know this narrative is no high achievement, but we all do what we can, and this seems within my marginal capacities.

BigML 0.3.1 Release

Friday, July 6th, 2012

BigML 0.3.1 Release

From the webpage:

An open source binding to the public BigML API


BigML makes machine learning easy by taking care of the details required to add data-driven decisions and predictive power to your company. Unlike other machine learning services, BigML creates beautiful predictive models that can be easily understood and interacted with.

These BigML Python bindings allow you to interact with the API for BigML. You can use it to easily create, retrieve, list, update, and delete BigML resources (i.e., sources, datasets, models and predictions).

There’s that phrase again, predictive models.

Don’t people read patent literature anymore? 😉 I don’t care for absurdist fiction so I tend to avoid it: people claiming invention after having a patent lawyer write up common art in legal prose. Good for patent lawyers, bad for researchers and true inventors.

Predictive Analytics World

Saturday, June 30th, 2012

Predictive Analytics World

I mention a patent on “predictive coding” and now a five (5) day conference on predictive analytics?

The power of blogging? Or self-delusion. Your call. 😉

Seriously, if you are interested in predictive analytics, this looks like a good opportunity to learn more.

It has all the earmarks of a “vendor” conference so I predict you will be spending money but the contacts and basic information should be worth your while.

Suggestions of other predictive analytic resources that aren’t vendor posturing and useful as general introduction?

Reasoning that if it is information, then you should be using a topic map to either trail blaze or navigate it.

Predictive Coding Patented, E-Discovery World Gets Jealous

Wednesday, June 27th, 2012

Predictive Coding Patented, E-Discovery World Gets Jealous by Christopher Danzig

From the post:

The normally tepid e-discovery world felt a little extra heat of competition yesterday. Recommind, one of the larger e-discovery vendors, announced Wednesday that it was issued a patent on predictive coding (which Gabe Acevedo, writing in these pages, named the Big Legal Technology Buzzword of 2011).

In a nutshell, predictive coding is a relatively new technology that allows large chunks of document review to be automated, a.k.a. done mostly by computers, with less need for human management.

Some of Recommind’s competitors were not happy about the news. See how they responded (grumpily), and check out what Recommind’s General Counsel had to say about what this means for everyone who uses e-discovery products….

Predictive coding has received a lot of coverage recently as a new way to save buckets of money during document review (a seriously expensive endeavor, for anyone who just returned to Earth).

I am always curious why a patent, or even a patent number, is cited but no link to the patent is given.

In case you are curious, it is patent 7,933,859.

The abstract reads:

Systems and methods for analyzing documents are provided herein. A plurality of documents and user input are received via a computing device. The user input includes hard coding of a subset of the plurality of documents, based on an identified subject or category. Instructions stored in memory are executed by a processor to generate an initial control set, analyze the initial control set to determine at least one seed set parameter, automatically code a first portion of the plurality of documents based on the initial control set and the seed set parameter associated with the identified subject or category, analyze the first portion of the plurality of documents by applying an adaptive identification cycle, and retrieve a second portion of the plurality of documents based on a result of the application of the adaptive identification cycle test on the first portion of the plurality of documents.

If that sounds familiar to you, you are not alone.

Predictive coding, developed over the last forty years, is an excellent feed into a topic map. As a matter of fact, it isn’t hard to imagine a topic map seeding and being augmented by a predictive coding process.

I also mention it as a caution that the IP in this area, as in many others, is beset by the ordinary being approved as innovation.
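Stripped of the legal prose, the abstract describes what the machine learning literature has long called pool-based active learning: hand-code a seed set, train a model, route the top-ranked unreviewed documents back to a human coder, and retrain. A toy sketch — the document pool and the stand-in “reviewer” function are entirely invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

pool = [
    "notice of involuntary termination for employee 4411",
    "termination of the gas supply contract",
    "severance terms following an employee termination",
    "holiday party planning committee notes",
    "employee termination checklist from HR",
    "pipeline maintenance schedule",
]

def reviewer(doc):  # stand-in for the human coder's judgment
    return 1 if "employee" in doc and "termination" in doc else 0

vec = TfidfVectorizer()
X = vec.fit_transform(pool)

reviewed = {0: reviewer(pool[0]), 3: reviewer(pool[3])}  # hand-coded seed set
for cycle in range(2):  # "adaptive identification cycles"
    ids = sorted(reviewed)
    clf = LogisticRegression().fit(X[ids], [reviewed[i] for i in ids])
    scores = clf.predict_proba(X)[:, 1]
    # route the highest-scoring unreviewed document to the reviewer
    unreviewed = [i for i in range(len(pool)) if i not in reviewed]
    pick = max(unreviewed, key=lambda i: scores[i])
    reviewed[pick] = reviewer(pool[pick])

print(f"documents human-coded so far: {sorted(reviewed)}")
```

That the entire workflow fits in twenty lines of commodity library calls says something about how “inventive” the underlying idea is.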

A topic map would be ideal to trace claims, prior art and to attach analysis to a patent. I saw several patents assigned to Recommind and some pending applications. When I have a moment I will post a listing with links to those documents.

I first saw this at Beyond Search.

Predictive Analytics: Evaluate Model Performance

Sunday, June 24th, 2012

Predictive Analytics: Evaluate Model Performance by Ricky Ho.

Ricky finishes his multi-part series on models for machine learning with the one question left hanging:

OK, so which model should I use?

In previous posts, we have discussed various machine learning techniques including linear regression with regularization, SVM, Neural network, Nearest neighbor, Naive Bayes, Decision Tree and Ensemble models. How do we pick which model is the best ? or even whether the model we pick is better than a random guess ? In this posts, we will cover how we evaluate the performance of the model and what can we do next to improve the performance.

Best guess with no model

First of all, we need to understand the goal of our evaluation. Are we trying to pick the best model ? Are we trying to quantify the improvement of each model ? Regardless of our goal, I found it is always useful to think about what the baseline should be. Usually the baseline is what is your best guess if you don’t have a model.

For classification problem, one approach is to do a random guess (with uniform probability) but a better approach is to guess the output class that has the largest proportion in the training samples. For regression problem, the best guess will be the mean of output of training samples.

Ricky walks you through the steps and code to make an evaluation of each model.

It is always better to have evidence that your choices were better than a coin flip.
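The quoted baselines are easy to make concrete. A small sketch with made-up numbers (NumPy assumed): majority class for classification, mean of the training output for regression.

```python
import numpy as np

# Classification baseline: always guess the majority training class.
y_train_cls = np.array([0, 0, 0, 1, 1])  # class 0 dominates
y_test_cls = np.array([0, 1, 0, 0])

majority = np.bincount(y_train_cls).argmax()
baseline_acc = np.mean(y_test_cls == majority)  # fraction guessed correctly

# Regression baseline: always guess the mean of the training outputs.
y_train_reg = np.array([10.0, 12.0, 14.0])
y_test_reg = np.array([11.0, 13.0])

mean_guess = y_train_reg.mean()
baseline_rmse = np.sqrt(np.mean((y_test_reg - mean_guess) ** 2))

print(f"majority class: {majority}, baseline accuracy: {baseline_acc:.2f}")
print(f"mean guess: {mean_guess}, baseline RMSE: {baseline_rmse:.2f}")
```

Any model worth keeping has to beat these numbers; if it can't, the coin flip was cheaper.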

Although, I am mindful of the wealth-adviser story in “Thinking, Fast and Slow” by Daniel Kahneman, where he was given eight years of investment outcomes for 28 wealth advisers. The results indicated there was no correlation between “skill” and the outcomes. Luck, not skill, was being rewarded with bonuses.

The results were ignored by both management and advisers as inconsistent with their “…personal impressions from experience.” (pp. 215-216)

Do you think the same can be said of search results? Just curious.

Predictive Analytics: Generalized Linear Regression [part 3]

Sunday, June 3rd, 2012

Predictive Analytics: Generalized Linear Regression by Ricky Ho.

From the post:

In the previous 2 posts, we have covered how to visualize input data to explore strong signals as well as how to prepare input data to a form that is situation for learning. In this and subsequent posts, I’ll go through various machine learning techniques to build our predictive model.

  1. Linear regression
  2. Logistic regression
  3. Linear and Logistic regression with regularization
  4. Neural network
  5. Support Vector Machine
  6. Naive Bayes
  7. Nearest Neighbor
  8. Decision Tree
  9. Random Forest
  10. Gradient Boosted Trees

There are two general types of problems that we are interested in this discussion; Classification is about predicting a category (value that is discrete, finite with no ordering implied) while Regression is about predicting a numeric quantity (value is continuous, infinite with ordering).

For classification problem, we use the “iris” data set and predict its “species” from its “width” and “length” measures of sepals and petals. Here is how we setup our training and testing data.

Ricky walks you through linear regression, logistic regression and linear and logistic regression with regularization.
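Ricky's walkthrough is in R; for readers who prefer Python, here is a rough scikit-learn equivalent of the iris setup he describes (the split ratio and random seed are my own choices):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Predict iris species from sepal/petal measurements.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.2f}")
```

The iris data set is forgiving — almost any of the techniques on Ricky's list will score well on it, which is exactly why it makes a good first example.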

Predictive Analytics: Data Preparation [part 2]

Friday, May 18th, 2012

Predictive Analytics: Data Preparation by Ricky Ho.

From the post:

As a continuation of my last post on predictive analytics, in this post I will focus in describing how to prepare data for the training the predictive model., I will cover how to perform necessary sampling to ensure the training data is representative and fit into the machine processing capacity. Then we validate the input data and perform necessary cleanup on format error, fill-in missing values and finally transform the collected data into our defined set of input features.

Different machine learning model will have its unique requirement in its input and output data type. Therefore, we may need to perform additional transformation to fit the model requirement

Part 2 of Ricky’s posts on predictive analytics.
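The steps Ricky lists — sampling, filling in missing values, transforming the data into model-ready features — can be sketched with pandas. The column names and values below are invented for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [34, np.nan, 51, 29, 45, np.nan],
    "region": ["east", "west", "east", "south", "west", "east"],
    "spend":  [120.0, 80.0, np.nan, 60.0, 200.0, 90.0],
})

# Sample to fit processing capacity while staying representative.
sample = df.sample(frac=0.8, random_state=1)

# Fill in missing numeric values (median is a common simple choice).
sample = sample.fillna({"age": sample["age"].median(),
                        "spend": sample["spend"].median()})

# Transform the categorical column into one-hot input features.
features = pd.get_dummies(sample, columns=["region"])
print(features.head())
```

Different models want different encodings — a tree-based model could take the category codes directly, while a linear model generally needs the one-hot expansion shown here.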

Predictive Analytics: Overview and Data visualization

Friday, May 18th, 2012

Predictive Analytics: Overview and Data visualization by Ricky Ho.

From the post:

I plan to start a series of blog post on predictive analytics as there is an increasing demand on applying machine learning technique to analyze large amount of raw data. This set of technique is very useful to me and I think they should be useful to other people as well. I will also going through some coding example in R. R is a statistical programming language that is very useful for performing predictive analytic tasks. In case you are not familiar with R, here is a very useful link to get some familiarity in R.

Predictive Analytics is a specialize data processing techniques focusing in solving the problem of predicting future outcome based on analyzing previous collected data. The processing cycle typically involves two phases of processing:

  1. Training phase: Learn a model from training data
  2. Predicting phase: Deploy the model to production and use that to predict the unknown or future outcome

The whole lifecycle of training involve the following steps.

Ricky has already posted part 2 but I am going to create separate entries for them. Mostly to make sure I don’t miss any of his posts.
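The two phases Ricky names can be sketched end to end: train a model, persist it as a deployment artifact, then load it in a “production” step to predict on new input. This is a generic illustration, not Ricky's code — the model choice and data are my own:

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Training phase: learn a model from training data.
X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0).fit(X, y)
blob = pickle.dumps(model)  # the artifact you would ship to production

# Predicting phase: load the deployed model, predict an unseen input.
deployed = pickle.loads(blob)
print(deployed.predict([[5.1, 3.5, 1.4, 0.2]]))  # one new measurement
```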


Data mining opens the door to predictive neuroscience (Google Hazing Rituals)

Tuesday, April 17th, 2012

Data mining opens the door to predictive neuroscience

From the post:

Ecole Polytechnique Fédérale de Lausanne (EPFL) researchers have discovered rules that relate the genes that a neuron switches on and off to the shape of that neuron, its electrical properties, and its location in the brain.

The discovery, using state-of-the-art computational tools, increases the likelihood that it will be possible to predict much of the fundamental structure and function of the brain without having to measure every aspect of it.

That in turn makes modeling the brain in silico — the goal of the proposed Human Brain Project — a more realistic, less Herculean, prospect.

The fulcrum of predictive analytics is finding the “basis” for prediction and within what measurement of error.

Curious how that would work in an employment situation?

Rather than Google’s intellectual hazing rituals, project a thirty-minute questionnaire on Google hires against their evaluations at six-month intervals. Give prospective hires the same questionnaire and then “up” or “down” decisions on hiring. Likely to be as accurate as the current rituals.

Updated Google Prediction API

Saturday, March 17th, 2012

Updated Google Prediction API

From the post:

Although we can’t reliably compare its future-predicting abilities to a crystal ball, the Google Prediction API unlocks a powerful mechanism to use machine learning in your applications.

The Prediction API allows developers to train their own predictive models, taking advantage of Google’s world-class machine learning algorithms. It can be used for all sorts of classification and recommendation problems from spam detection to message routing decisions. In the latest release, the Prediction API has added more detailed debugging information on trained models and a new App Engine sample, which illustrates how to use the Google Prediction API for the Java and Python runtimes.

To help App Engine developers get started with the prediction API, we’ve published an article and walkthrough detailing how to create and manage predictive models in App Engine apps with simple authentication using OAuth2 and service accounts. Check out the walkthrough and let us know what you think on the group. Happy coding!

OK, so what do I do when I leave my crystal ball at home?

Oh, that is why this is on the “cloud” I suppose. 😉

Are you using the Google Prediction API? Would appreciate hearing from satisfied/unsatisfied users. Certainly the sort of thing that could be important in authoring/curating a topic map.

Target, Pregnancy and Predictive Analytics (parts 1 and 2)

Thursday, March 1st, 2012

Dean Abbott wrote a pair of posts on a New York Times article about Target predicting if customers are pregnant.

Target, Pregnancy and Predictive Analytics (part 1)

Target, Pregnancy and Predictive Analytics (part 2)

Read both. I truly liked his conclusion that models give us the patterns in data but it is up to us to “recognize” the patterns as significant.

BTW, I do wonder what the difference is between the New York Times snooping for secrets to sell newspapers and Target doing the same to sell products. If you know, please give a shout!