Archive for the ‘Prediction’ Category

Intrade Archive: Data for Posterity

Wednesday, April 3rd, 2013

Intrade Archive: Data for Posterity by Panos Ipeirotis.

From the post:

A few years back, I have done some work on prediction markets. For this line of research, we have been collecting data from Intrade, to perform our experimental analysis. Some of the data is available through the Intrade Archive, a web app that I wrote in order to familiarize myself with the Google App Engine.

In the last few weeks, through, after the effective shutdown of Intrade, I started receiving requests on getting access to the data stored in the Intrade Archive. So, after popular demand, I gathered all the data from the Intrade Archive, and also all the past data that I had about all the Intrade contracts going back to 2003, and I put them all on GitHub for everyone to access and download.

If you don’t know about Intrade, see: How Intrade Works.

Not sure why you would need the data but it is unusual enough to merit notice.

…[D]emocratization of modeling, simulations, and predictions

Sunday, January 27th, 2013

Technical engine for democratization of modeling, simulations, and predictions by Justyna Zander and Pieter J. Mosterman. (Justyna Zander and Pieter J. Mosterman. 2012. Technical engine for democratization of modeling, simulations, and predictions. In Proceedings of the Winter Simulation Conference (WSC ’12). Winter Simulation Conference , Article 228 , 14 pages.)

Abstract:

Computational science and engineering play a critical role in advancing both research and daily-life challenges across almost every discipline. As a society, we apply search engines, social media, and selected aspects of engineering to improve personal and professional growth. Recently, leveraging such aspects as behavioral model analysis, simulation, big data extraction, and human computation is gaining momentum. The nexus of the above facilitates mass-scale users in receiving awareness about the surrounding and themselves. In this paper, an online platform for modeling and simulation (M&S) on demand is proposed. It allows an average technologist to capitalize on any acquired information and its analysis based on scientifically-founded predictions and extrapolations. The overall objective is achieved by leveraging open innovation in the form of crowd-sourcing along with clearly defined technical methodologies and social-network-based processes. The platform aims at connecting users, developers, researchers, passionate citizens, and scientists in a professional network and opens the door to collaborative and multidisciplinary innovations. An example of a domain-specific model of a pick and place machine illustrates how to employ the platform for technical innovation and collaboration.

It is an interesting paper but when speaking of integration of models the authors say:

The integration is performed in multiple manners. Multi-domain tools that become accessible from one common environment using the cloud-computing paradigm serve as a starting point. The next step of integration happens when various M&S execution semantics (and models of computation (cf., Lee and Sangiovanni-Vincentelli 1998; Lee 2010) are merged and model transformations are performed.

That went by too quickly for me. You?

The question of effective semantic integration is an important one.

The U.S. federal government publishes enough data to map where some of the dark data is waiting to be found.

The good, bad or irrelevant data churned out every week, makes the amount of effort required an ever increasing barrier to its use by the public.

Perhaps that is by design?

What do you think?

Prediction API – Machine Learning from Google

Tuesday, January 22nd, 2013

Prediction API – Machine Learning from Google by Istvan Szegedi.

From the post:

One of the exciting APIs among the 50+ APIs offered by Google is the Prediction API. It provides pattern matching and machine learning capabilities like recommendations or categorization. The notion is similar to the machine learning capabilities that we can see in other solutions (e.g. in Apache Mahout): we can train the system with a set of training data and then the applications based on Prediction API can recommend (“predict”) what products the user might like or they can categories spams, etc.

In this post we go through an example how to categorize SMS messages – whether they are spams or valuable texts (“hams”).

Nice introduction to Google’s Prediction API.

A use case for topic map authoring would be to route content to appropriate experts for further evaluation.

Data Mining Book Review: Dance with Chance

Sunday, November 4th, 2012

Data Mining Book Review: Dance with Chance by Sandro Saitta.

From the post:

If you ever worked on time series prediction (forecasting), you should read Dance with Chance. It is written by a statistician, a psychologist and a decision scientist (Makriddakis, Hogarth and Gaba). As it is the case in The Numerati or Super Crunchers, authors explain complex notions to a non-expert audience. I find the book really interesting and provocative.

The main concept of Dance with Chance is the “illusion of control”. It is when you think you control a future event or situation, that is in fact mainly due to chance. This is the opposite of fatalism (when you think you have no control, although you have). The book teaches how to avoid being fooled by this illusion of control. This is a very interesting reading for any data miner, particularly involved with forecasting. The books contains dozens of examples of the limitation of forecasting techniques. For example, it explains the issues of forecasting the stock market and when predictions are due to chance. Authors use a brilliant mix of statistics and psychology to prove their point.

From the review this sounds like an interesting read.

Forecasting can be useful but being aware of its limitations is as well.

Predicting what topics will trend on Twitter [Predicting Merging?]

Friday, November 2nd, 2012

Predicting what topics will trend on Twitter

From the post:

Twitter’s home page features a regularly updated list of topics that are “trending,” meaning that tweets about them have suddenly exploded in volume. A position on the list is highly coveted as a source of free publicity, but the selection of topics is automatic, based on a proprietary algorithm that factors in both the number of tweets and recent increases in that number.

At the Interdisciplinary Workshop on Information and Decision in Social Networks at MIT in November, Associate Professor Devavrat Shah and his student, Stanislav Nikolov, will present a new algorithm that can, with 95 percent accuracy, predict which topics will trend an average of an hour and a half before Twitter’s algorithm puts them on the list — and sometimes as much as four or five hours before.

If you can’t attend the Interdisciplinary Workshop on Information and Decision in Social Networks workshop, which has an exciting final program, try Stanislav Nikolov thesis, Trend or No Trend: A Novel Nonparametric Method for Classifying Time Series.

Abstract:

In supervised classification, one attempts to learn a model of how objects map to labels by selecting the best model from some model space. The choice of model space encodes assumptions about the problem. We propose a setting for model specification and selection in supervised learning based on a latent source model. In this setting, we specify the model by a small collection of unknown latent sources and posit that there is a stochastic model relating latent sources and observations. With this setting in mind, we propose a nonparametric classification method that is entirely unaware of the structure of these latent sources. Instead, our method relies on the data as a proxy for the unknown latent sources. We perform classification by computing the conditional class probabilities for an observation based on our stochastic model. This approach has an appealing and natural interpretation — that an observation belongs to a certain class if it sufficiently resembles other examples of that class.

We extend this approach to the problem of online time series classification. In the binary case, we derive an estimator for online signal detection and an associated implementation that is simple, efficient, and scalable. We demonstrate the merit of our approach by applying it to the task of detecting trending topics on Twitter. Using a small sample of Tweets, our method can detect trends before Twitter does 79% of the time, with a mean early advantage of 1.43 hours, while maintaining a 95% true positive rate and a 4% false positive rate. In addition, our method provides the flexibility to perform well under a variety of tradeoffs between types of error and relative detection time.

This will be interesting in many classification contexts.

Particularly predicting what topics a user will say represent the same subject.

BigML creates a marketplace for Predictive Models

Friday, October 26th, 2012

BigML creates a marketplace for Predictive Models by Ajay Ohri.

From the post:

BigML has created a marketplace for selling Datasets and Models. This is a first (?) as the closest market for Predictive Analytics till now was Rapid Miner’s marketplace for extensions (at http://rapidupdate.de:8180/UpdateServer/faces/index.xhtml)

From http://blog.bigml.com/2012/10/25/worlds-first-predictive-marketplace/

SELL YOUR DATA

You can make your Dataset public. Mind you: the Datasets we are talking about are BigML’s fancy histograms. This means that other BigML users can look at your Dataset details and create new models based on this Dataset. But they can not see individual records or columns or use it beyond the statistical summaries of the Dataset. Your Source will remain private, so there is no possibility of anyone accessing the raw data.

SELL YOUR MODEL

Now, once you have created a great model, you can share it with the rest of the world. For free or at any price you set.Predictions are paid for in BigML Prediction Credits. The minimum price is ‘Free’ and the maximum price indicated is 100 credits.

Having a public, digital marketplace for data and data analysis has been proposed by many and attempted by more than just a few.

Data is bought and sold today, but not by the digital equivalent of small shop keepers. The shop keepers who changed the face of Europe.

Data is bought and sold today by the digital equivalent of the great feudal lords. Complete with castles (read silos).

Will BigML give rise to a new mercantile class?

Or just as importantly, will you be a member of it or bound to the estate of a feudal lord?

Are Expert Semantic Rules so 1980′s?

Monday, October 8th, 2012

In The Geometry of Constrained Structured Prediction: Applications to Inference and Learning of Natural Language Syntax André Martins proposes advances in inferencing and learning for NLP processing. And it is important work for that reason.

But in his introduction to recent (and rapid) progress in language technologies, the following text caught my eye:

So, what is the driving force behind the aforementioned progress? Essentially, it is the alliance of two important factors: the massive amount of data that became available with the advent of the Web, and the success of machine learning techniques to extract statistical models from the data (Mitchell, 1997; Manning and Schötze, 1999; Schölkopf and Smola, 2002; Bishop, 2006; Smith, 2011). As a consequence, a new paradigm has emerged in the last couple of decades, which directs attention to the data itself, as opposed to the explicit representation of knowledge (Abney, 1996; Pereira, 2000; Halevy et al., 2009). This data-centric paradigm has been extremely fruitful in natural language processing (NLP), and came to replace the classic knowledge representation methodology which was prevalent until the 1980s, based on symbolic rules written by experts. (emphasis added)

Are RDF, Linked Data, topic maps, and other semantic technologies caught in a 1980′s “symbolic rules” paradigm?

Are we ready to make the same break that NLP did, what, thirty (30) years ago now?

To get started on the literature, consider André’s sources:

Abney, S. (1996). Statistical methods and linguistics. In The balancing act: Combining symbolic and statistical approaches to language, pages 1–26. MIT Press, Cambridge, MA.

A more complete citation: Steven Abney. Statistical Methods and Linguistics. In: Judith Klavans and Philip Resnik (eds.), The Balancing Act: Combining Symbolic and Statistical Approaches to Language. The MIT Press, Cambridge, MA. 1996. (Link is to PDF of Abney’s paper.)

Pereira, F. (2000). Formal grammar and information theory: together again? Philosophical Transactions of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences, 358(1769):1239–1253.

I added a pointer to the Philosophical Transactions of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences abstract for the article. You can see it at: Formal grammar and information theory: together again? (PDF file).

Halevy, A., Norvig, P., and Pereira, F. (2009). The unreasonable effectiveness of data. Intelligent Systems, IEEE, 24(2):8–12.

I added a pointer to the Intelligent Systems, IEEE abstract for the article. You can see it at: The unreasonable effectiveness of data (PDF file).

The Halevy article doesn’t have an abstract per se but the ACM reports one as:

Problems that involve interacting with humans, such as natural language understanding, have not proven to be solvable by concise, neat formulas like F = ma. Instead, the best approach appears to be to embrace the complexity of the domain and address it by harnessing the power of data: if other humans engage in the tasks and generate large amounts of unlabeled, noisy data, new algorithms can be used to build high-quality models from the data. [ACM]

That sounds like a challenge to me. You?

PS: I saw the pointer to this thesis at Christophe Lalanne’s A bag of tweets / September 2012

Building a “Data Eye in the Sky”

Saturday, September 22nd, 2012

Building a “Data Eye in the Sky” by Erwin Gianchandani.

From the post:

Nearly a year ago, tech writer John Markoff published a story in The New York Times about Open Source Indicators (OSI), a new program by the Federal government’s Intelligence Advanced Research Projects Activity (IARPA) seeking to automatically collect publicly available data, including Web search queries, blog entries, Internet traffic flows, financial market indicators, traffic webcams, changes in Wikipedia entries, etc., to understand patterns of human communication, consumption, and movement. According to Markoff:

It is intended to be an entirely automated system, a “data eye in the sky” without human intervention, according to the program proposal. The research would not be limited to political and economic events, but would also explore the ability to predict pandemics and other types of widespread contagion, something that has been pursued independently by civilian researchers and by companies like Google.

This past April, IARPA issued contracts to three research teams, providing funding potentially for up to three years, with continuation beyond the first year contingent upon satisfactory progress. At least two of these contracts are now public (following the link):

Erwin reviews what is known about programs at Virginia Tech and BBN Technologies.

And concludes with:

Each OSI research team is being required to make a number of warnings/alerts that will be judged on the basis of lead time, or how early the alert was made; the accuracy of the warning, such as the where/when/what of the alert; and the probability associated with the alert, that is, high vs. very high.

To learn more about the OSI program, check out the IARPA website or a press release issued by Virginia Tech.

Given the complexities of semantics, what has my curiosity up is how “warnings/alerts” are going to be judged?

Recalling that “all the lights were blinking red” before 9/11.

If all the traffic lights in the U.S. flashed three (3) times at the same time, without more, it could mean anything from the end of the Mayan calendar to free beer. One just never knows.

Do you have the stats on the oracle at Delphi?

Might be a good baseline for comparison.

Predictive Models: Build once, Run Anywhere

Tuesday, August 21st, 2012

Predictive Models: Build once, Run Anywhere

From the post:

We have released a new version of our open source Python bindings. This new version aims at showing how the BigML API can be used to build predictive models capable of generating predictions locally or remotely. You can get full access to the code at Github and read the full documentation at Read the Docs.

The complete list of updates includes (drum roll, please):

Development Mode

We recently introduced a free sandbox to help developers play with BigML on smaller datasets without being concerned about credits. In the new Python bindings you can use BigML in development mode, and all dataset and models smaller than 1MB can be created for free:

from bigml.api import BigML

api = BigML(dev_mode=True)

A “sandbox” for your machine learning experiments!

Day Nine of a Predictive Coding Narrative: A scary search…

Wednesday, August 8th, 2012

Day Nine of a Predictive Coding Narrative: A scary search for false-negatives, a comparison of my CAR with the Griswold’s, and a moral dilemma by Ralph Losey.

From the post:

In this sixth installment I continue my description, this time covering day nine of the project. Here I do a quality control review of a random sample to evaluate my decision in day eight to close the search.

Ninth Day of Review (4 Hours)

I began by generating a random sample of 1,065 documents from the entire null set (95% +/- 3%) of all documents not reviewed. I was going to review this sample as a quality control test of the adequacy of my search and review project. I would personally review all of them to see if any were False Negatives, in other words, relevant documents, and if relevant, whether any were especially significant or Highly Relevant.

I was looking to see if there were any documents left on the table that should have been produced. Remember that I had already personally reviewed all of the documents that the computer had predicted were like to be relevant (51% probability). I considered the upcoming random sample review of the excluded documents to be a good way to check the accuracy of reliance on the computer’s predictions of relevance.

I know it is not the only way, and there are other quality control measures that could be followed, but this one makes the most sense to me. Readers are invited to leave comments on the adequacy of this method and other methods that could be employed instead. I have yet to see a good discussion of this issue, so maybe we can have one here.

I can appreciate Ralph’s apprehension at a hindsight review of decisions already made. In legal proceedings, decisions are made and they move forward. Some judgements/mistakes can be corrected, others are simply case history.

Days Seven and Eight of a Predictive Coding Narrative [Re-Use of Analysis?]

Wednesday, August 8th, 2012

Days Seven and Eight of a Predictive Coding Narrative: Where I have another hybrid mind-meld and discover that the computer does not know God by Ralph Losey.

From the post:

In this fifth installment I will continue my description, this time covering days seven and eight of the project. As the title indicates, progress continues and I have another hybrid mind-meld moment. I also discover that the computer does not recognize the significance of references to God in an email. This makes sense logically, but is unexpected and kind of funny when encountered in a document review.

Ralph discovered new terms to use for training as the analysis of the documents progressed.

While Ralph captures those for his use, my question would be how to capture what he learned for re-use?

As in re-use by other parties, perhaps in other litigation.

Thinking of reducing the cost of discovery by sharing analysis of data sets, rather than every discovery process starting at ground zero.

Days Five and Six of a Predictive Coding Narrative

Friday, July 27th, 2012

Days Five and Six of a Predictive Coding Narrative: Deep into the weeds and a computer mind-meld moment by Ralph Losey.

From the post:

This is my fourth in a series of narrative descriptions of an academic search project of 699,082 Enron emails and attachments. It started as a predictive coding training exercise that I created for Jackson Lewis attorneys. The goal was to find evidence concerning involuntary employee terminations. The first day of search is described in Day One of a Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron. The second day is described in Day Two of a Predictive Coding Narrative: More Than A Random Stroll Down Memory Lane. The third and fourth days are described in Days Three and Four of a Predictive Coding Narrative: Where I find that the computer is free to disagree.

In this fourth installment I continue to describe what I did in days five and six of the project. In this narrative I go deep into the weeds and describe the details of multimodal search. Near the end of day six I have an affirming hybrid multimodal mind-meld moment, which I try to describe. I conclude by sharing some helpful advice I received from Joseph White, one of Kroll Ontrack’s (KO) experts on predictive coding and KO’s Inview software. Before I launch into the narrative, a brief word about vendor experts. Don’t worry, it is not going to be a commercial for my favorite vendors; more like a warning based on hard experience.

You will learn a lot about predictive analytics and e-discovery from this series of posts but the most important paragraphs I have read thus far:

When talking to the experts, be sure that you understand what they say to you, and never just nod in agreement when you do not really get it. I have been learning and working with new computer software of all kinds for over thirty years, and am not at all afraid to say that I do not understand or follow something.

Often you cannot follow because the explanation is so poor. For instance, often the words I hear from vendor tech experts are too filled with company specific jargon. If what you are being told makes no sense to you, then say so. Keep asking questions until it does. Do not be afraid of looking foolish. You need to be able to explain this. Repeat back to them what you do understand in your own words until they agree that you have got it right. Do not just be a parrot. Take the time to understand. The vendor experts will respect you for the questions, and so will your clients. It is a great way to learn, especially when it is coupled with hands-on experience.

Insisting that experts explain until you understand what is being said will help you avoid costly mistakes and make you more sympathetic to a client’s questions when you are the expert.

The technology and software will change for predictive coding will change beyond recognition in a few short years.

Demanding and giving explanations that “explain” is a skill that will last a lifetime.

Days Three and Four of a Predictive Coding Narrative: Where I find that the computer is free to disagree

Friday, July 27th, 2012

Days Three and Four of a Predictive Coding Narrative: Where I find that the computer is free to disagree by Ralph Losey.

From the post:

This is the third in a series of detailed descriptions of a legal search project. The project was an academic training exercise for Jackson Lewis e-discovery liaisons conducted in May and June 2012. I searched a set of 699,082 Enron emails and attachments for possible evidence pertaining to involuntary employee terminations. The first day of search is described in Day One of a Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron. The second day is described in Day Two of a Predictive Coding Narrative: More Than A Random Stroll Down Memory Lane.

The description of day-two was short, but it was preceded by a long explanation of my review plan and search philosophy, along with a rant in favor of humanity and against over-dependence on computer intelligence. Here I will just stick to the facts of what I did in days three and four of my search using Kroll Ontrack’s (KO) Inview software.

Interesting description of where Ralph and the computer disagree on relevant/irrelevant judgement on documents.

Unless I just missed it, Ralph is only told be the software what rating a document was given, not why the software arrived at that rating. Yes?

If you knew what terms drove a particular rating, it would be interesting to “comment out” those terms in a document to see the impact on its relevance rating.

Day Two of a Predictive Coding Narrative: More Than A Random Stroll Down Memory Lane

Friday, July 13th, 2012

Day Two of a Predictive Coding Narrative: More Than A Random Stroll Down Memory Lane by Ralph Losey.

From the post:

Day One of the search project ended when I completed review of the initial 1,507 machine-selected documents and initiated the machine learning. I mentioned in the Day One narrative that I would explain why the sample size was that high. I will begin with that explanation and then, with the help of William Webber, go deeper into math and statistical sampling than ever before. I will also give you the big picture of my review plan and search philosophy: its hybrid and multimodal. Some search experts disagree with my philosophy. They think I do not go far enough to fully embrace machine coding. They are wrong. I will explain why and rant on in defense of humanity. Only then will I conclude with the Day Two narrative.

More than you are probably going to want to know about sample sizes and their calculation but persevere until you get to the defense of humanity stuff. It is all quite good.

If I had to add a comment on the defense of humanity rant, it would be that machines have a flat view of documents and not the richly textured one of a human reader. While true that machines can rapidly compare document without tiring, they will miss an executive referring to a secretary as his “cupcake.” A reference that would jump out at a human reader. Same text, different result.

Perhaps because in one case the text is being scanned for tokens and in the other case it is being read.

Day One of a Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron

Friday, July 13th, 2012

Day One of a Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron by Ralph Losey.

The start of a series of posts on predictive coding and searching of the Enron emails by a lawyer. A legal perspective is important enough that I will be posting a note about each post in this series as they occur.

A couple of preliminary notes:

I am sure this is the first time that Ralph has used predictive encoding with the Enron emails. On the other hand, I would not take “…this is the first time for X…” sort of claims from any vendor or service organization. ;-)

You can see other examples of processing the Enron emails at:

And that is just a “lite” scan. There are numerous other projects that use the Enron email collection.

I wonder if that is because we are naturally nosey?

From the post:

This is the first in a series of narrative descriptions of a legal search project using predictive coding. Follow along while I search for evidence of involuntary employee terminations in a haystack of 699,082 Enron emails and attachments.

Joys and Risks of Being First

To the best of my knowledge, this writing project is another first. I do not think anyone has ever previously written a blow-by-blow, detailed description of a large legal search and review project of any kind, much less a predictive coding project. Experts on predictive coding speak only from a mile high perspective; never from the trenches (you can speculate why). That has been my practice here, until now, and also my practice when speaking about predictive coding on panels or in various types of conferences, workshops, and classes.

There are many good reasons for this, including the main one that lawyers cannot talk about their client’s business or information. That is why in order to do this I had to run an academic project and search and review the Enron data. Many people could do the same. In fact, each year the TREC Legal Track participants do similar search projects of Enron data. But still, no one has taken the time to describe the details of their search, not even the spacey TRECkies (sorry Jason).

A search project like this takes an enormous amount of time. In fact, to my knowledge (Maura, please correct me if I’m wrong), no Legal Track TRECkies have ever recorded and reported the time that they put into the project, although there are rumors. In my narrative I will report the amount of time that I put into the project on a day-by-day basis, and also, sometimes, on a per task basis. I am a lawyer. I live by the clock and have done so for thirty-two years. Time is important to me, even non-money time like this. There is also a not-insignificant amount of time it takes to write it up a narrative like this. I did not attempt to record that.

There is one final reason this has never been attempted before, and it is not trivial: the risks involved. Any narrator who publicly describes their search efforts assumes the risk of criticism from monday morning quarterbacks about how the sausage was made. I get that. I think I can handle the inevitable criticism. A quote that Jason R. Baron turned me on to a couple of years ago helps, the famous line from Theodore Roosevelt in his Man in the Arena speech at the Sorbonne:

It is not the critic who counts: not the man who points out how the strong man stumbles or where the doer of deeds could have done better. The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood, who strives valiantly, who errs and comes up short again and again, because there is no effort without error or shortcoming, but who knows the great enthusiasms, the great devotions, who spends himself for a worthy cause; who, at the best, knows, in the end, the triumph of high achievement, and who, at the worst, if he fails, at least he fails while daring greatly, so that his place shall never be with those cold and timid souls who knew neither victory nor defeat.

I know this narrative is no high achievement, but we all do what we can, and this seems within my marginal capacities.

Predicting link directions via a recursive subgraph-based ranking

Tuesday, June 12th, 2012

Predicting link directions via a recursive subgraph-based ranking by Fangjian Guo, Zimo Yang, and Tao Zhou.

Abstract:

Link directions are essential to the functionality of networks and their prediction is helpful towards a better knowledge of directed networks from incomplete real-world data. We study the problem of predicting the directions of some links by using the existence and directions of the rest of links. We propose a solution by first ranking nodes in a specific order and then predicting each link as stemming from a lower-ranked node towards a higher-ranked one. The proposed ranking method works recursively by utilizing local indicators on multiple scales, each corresponding to a subgraph extracted from the original network. Experiments on real networks show that the directions of a substantial fraction of links can be correctly recovered by our method, which outperforms either purely local or global methods.

This paper focuses mostly on prediction of direction of links, relying on other research for the question of link existence.

I mention it because predicting links and their directions will be important for planning graph database deployments in particular.

It will be a little late to find out when under full load that other modeling choices should have been made. (It is usually under “full load” conditions when retrospectives on modeling choices come up.)

Developing a predictive analytics program doable on a limited budget

Sunday, November 13th, 2011

Developing a predictive analytics program doable on a limited budget

From the post:

Predictive analytics is experiencing what David Menninger, a research director and vice president at Ventana Research Inc., calls “a renewed interest.” And he’s not the only one who is seeing a surge in the number of organizations looking to set up a predictive analytics program.

In September, Hurwitz & Associates, a consulting and market research firm in Needham, Mass., released a report ranking 12 predictive analytics vendors that it views as “strong contenders” in the market. Fern Halper, a Hurwitz partner and the principal researcher for the report, thinks predictive analytics is moving into the user mainstream. She said its growing popularity is being driven by better tools, increased access to high-performance computing resources, reduced storage costs and an economic climate that has businesses hungry for better forecasting.

“Especially in today’s economy, they’re realizing they can’t just look in the rearview mirror and look at what has happened,” said Halper. “They need to look at what can happen and what will happen and become as smart as they can possibly be if they’re going to compete.”

While predictive analytics basks in the limelight, the nuances of developing an effective program are tricky and sometimes can be overwhelming for organizations. But the good news, according to a variety of analysts and consultants, is that finding the right strategy is possible — even on a shoestring budget.

Here are some of their best-practices tips for succeeding on predictive analytics without breaking the bank:

What caught my eye was doable on a limited budget.

Limited budgets aren’t uncommon most of the time and in today’s economy they are down right plentiful. In private and public sectors.

The lessons in this post apply to topic maps. Don’t try to sell converting an entire enterprise or operation to topic maps. Pick some small area of pain or obvious improvement and sell a solution for that part. ROI that they can see this quarter or maybe next. Then build on that experience to propose larger or longer range projects.

Rapid-I: Report the Future

Wednesday, October 19th, 2011

Rapid-I: Report the Future

Source of:

RapidMiner: Professional open source data mining made easy.

Analytical ETL, Data Mining, and Predictive Reporting with a single solution

RapidAnalytics: Collaborative data analysis power.

No 1 in open source business analytics

The key product for business critical predictive analysis

RapidDoc: Webbased solution for document retrieval and analysis.

Classify text, identify trends as well as emerging topics

Easy to use and configure

From About Rapid-I:

Rapid-I provides software, solutions, and services in the fields of predictive analytics, data mining, and text mining. The company concentrates on automatic intelligent analyses on a large-scale base, i.e. for large amounts of structured data like database systems and unstructured data like texts. The open-source data mining specialist Rapid-I enables other companies to use leading-edge technologies for data mining and business intelligence. The discovery and leverage of unused business intelligence from existing data enables better informed decisions and allows for process optimization.

The main product of Rapid-I, the data analysis solution RapidMiner is the world-leading open-source system for knowledge discovery and data mining. It is available as a stand-alone application for data analysis and as a data mining engine which can be integrated into own products. By now, thousands of applications of RapidMiner in more than 30 countries give their users a competitive edge. Among the users are well-known companies as Ford, Honda, Nokia, Miele, Philips, IBM, HP, Cisco, Merrill Lynch, BNP Paribas, Bank of America, mobilkom austria, Akzo Nobel, Aureus Pharma, PharmaDM, Cyprotex, Celera, Revere, LexisNexis, Mitre and many medium-sized businesses benefitting from the open-source business model of Rapid-I.

Data mining/analysis is the first part of any topic map project, however large or small. These tools, which I have not (yet) tried, are likely to prove useful in such projects. Comments welcome.

Google Prediction API graduates from labs

Sunday, October 16th, 2011

Google Prediction API graduates from labs, adds new features by Zachary Goldberg, Product Manager.

From the post:

Since the general availability launch of the Prediction API this year at Google I/O, we have been working hard to give every developer access to machine learning in the cloud to build smarter apps. We’ve also been working on adding new features, accuracy improvements, and feedback capability to the API.

Today we take another step by announcing Prediction v1.4. With the launch of this version, Prediction is graduating from Google Code Labs, reflecting Google’s commitment to the API’s development and stability. Version 1.4 also includes two new features:

  • Data Anomaly Analysis
    • One of the hardest parts of building an accurate predictive model is gathering and curating a high quality data set. With Prediction v1.4, we are providing a feature to help you identify problems with your data that we notice during the training process. This feedback makes it easier to build accurate predictive models with proper data.
  • PMML Import
    • PMML has become the de facto industry standard for transmitting predictive models and model data between systems. As of v1.4, the Google Prediction API can programmatically accept your PMML for data transformations and preprocessing.
    • The PMML spec is vast and covers many, many features. You can find more details about the specific features that the Google Prediction API supports here.

(I added a paragraph break in the first text block for readability. It should be re-written but I am quoting.)

Suggest you take a close look at the features of PMML that Google does not support. Quite an impressive array of non-support.

Top Scoring Pairs for Feature Selection in Machine Learning and Applications to Cancer Outcome Prediction

Friday, September 23rd, 2011

Top Scoring Pairs for Feature Selection in Machine Learning and Applications to Cancer Outcome Prediction by Ping Shi, Surajit Ray, Qifu Zhu and Mark A Kon.

BMC Bioinformatics 2011, 12:375 doi:10.1186/1471-2105-12-375 Published: 23 September 2011

Abstract:

Background

The widely used k top scoring pair (k-TSP) algorithm is a simple yet powerful parameter-free classifier. It owes its success in many cancer microarray datasets to an effective feature selection algorithm that is based on relative expression ordering of gene pairs. However, its general robustness does not extend to some difficult datasets, such as those involving cancer outcome prediction, which may be due to the relatively simple voting scheme used by the classifier. We believe that the performance can be enhanced by separating its effective feature selection component and combining it with a powerful classifier such as the support vector machine (SVM). More generally the top scoring pairs generated by the k-TSP ranking algorithm can be used as a dimensionally reduced subspace for other machine learning classifiers.

Results

We developed an approach integrating the k-TSP ranking algorithm (TSP) with other machine learning methods, allowing combination of the computationally efficient, multivariate feature ranking of k-TSP with multivariate classifiers such as SVM. We evaluated this hybrid scheme (k-TSP+SVM) in a range of simulated datasets with known data structures. As compared with other feature selection methods, such as a univariate method similar to Fisher’s discriminant criterion (Fisher), or a recursive feature elimination embedded in SVM (RFE), TSP is increasingly more effective than the other two methods as the informative genes become progressively more correlated, which is demonstrated both in terms of the classification performance and the ability to recover true informative genes. We also applied this hybrid scheme to four cancer prognosis datasets, in which k-TSP+SVM outperforms k-TSP classifier in all datasets, and achieves either comparable or superior performance to that using SVM alone. In concurrence with what is observed in simulation, TSP appears to be a better feature selector than Fisher and RFE in some of the cancer datasets

Conclusions

The k-TSP ranking algorithm can be used as a computationally efficient, multivariate filter method for feature selection in machine learning. SVM in combination with k-TSP ranking algorithm outperforms k-TSP and SVM alone in simulated datasets and in some cancer prognosis datasets. Simulation studies suggest that as a feature selector, it is better tuned to certain data characteristics, i.e. correlations among informative genes, which is potentially interesting as an alternative feature ranking method in pathway analysis.

Knowing the tools that are already in use in bioinformatics will help you design topic map applications of interest to those in that field. And this is a very nice combination of methods to study on its own.

Online Master of Science in Predictive Analytics

Wednesday, September 21st, 2011

Online Master of Science in Predictive Analytics

As businesses seek to maximize the value of vast new stores of available data, Northwestern University’s Master of Science in Predictive Analytics program prepares students to meet the growing demand in virtually every industry for data-driven leadership and problem solving.

Advanced data analysis, predictive modeling, computer-based data mining, and marketing, web, text, and risk analytics are just some of the areas of study offered in the program. As a student in the Master of Science in Predictive Analytics program, you will:

  • Prepare for leadership-level career opportunities by focusing on statistical concepts and practical application
  • Learn from distinguished Northwestern faculty and from the seasoned industry experts who are redefining how data improve decision-making and boost ROI
  • Build statistical and analytic expertise as well as the management and leadership skills necessary to implement high-level, data-driven decisions
  • Earn your Northwestern University master’s degree entirely online

Just so you know, libraries schools were offering mostly online degrees a decade or so ago. Nice to see other disciplines catching up. ;-)

It would be interesting to see short courses in subject analysis, as in subject identity and the properties that compose a particular identity, in specific domains.

Silverlight® Visualizations… Changing the Way We Look at Predictive Analytics

Tuesday, September 20th, 2011

Silverlight® Visualizations… Changing the Way We Look at Predictive Analytics

Webinar: Tuesday, October 18, 2011 10:00 AM – 11:00 AM PDT

Presented by Caroline Junkin, Director of Analytics Solutions for Predixion Software.

That’s about all the webinar form says so I went looking for more information. ;-)

Predixion Insight™ Video Library

From that page:

Predixion Software’s video library contains tutorials that explore the predictive analytics features currently available in Predixion Insight™, demonstrations that walk you through various applications for predictive analytics and Webinar Replays.

If subjects can include subjects that some people don’t think exist, then subjects can certainly include subjects we think may exist at some point in the future. And no doubt our references to them will change over time.

The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition

Sunday, April 17th, 2011

The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition

by Trevor Hastie, Robert Tibshirani and Jerome Friedman.

The full pdf of the latest printing is available at this site.

Strongly recommend that if you find the text useful, that you ask your library to order the print version.

From the website:

During the past decade has been an explosion in computation and information technology. With it has come vast amounts of data in a variety of fields such as medicine, biology, finance, and marketing. The challenge of understanding these data has led to the development of new tools in the field of statistics, and spawned new areas such as data mining, machine learning, and bioinformatics. Many of these tools have common underpinnings but are often expressed with different terminology. This book descibes the important ideas in these areas in a common conceptual framework. While the approach is statistical, the emphasis is on concepts rather than mathematics. Many examples are given, with a liberal use of color graphics. It should be a valuable resource for statisticians and anyone interested in data mining in science or industry. The book’s coverage is broad, from supervised learning (prediction) to unsupervised learning. The many topics include neural networks, support vector machines, classification trees and boosting–the first comprehensive treatment of this topic in any book.

This major new edition features many topics not covered in the original, including graphical models, random forests, ensemble methods, least angle regression & path algorithms for the lasso, non-negative matrix factorization and spectral clustering. There is also a chapter on methods for “wide” data (italics p bigger than n), including multiple testing and false discovery rates.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman are professors of statistics at Stanford University. They are prominent researchers in this area: Hastie and Tibshirani developed generalized additive models and wrote a popular book of that title. Hastie wrote much of the statistical modeling software in S-PLUS and invented principal curves and surfaces. Tibshirani proposed the Lasso and is co-author of the very successful {italics An Introduct ion to the Bootstrap}. Friedman is the co-inventor of many data-mining tools including CART, MARS, and projection pursuit.