Archive for the ‘Analytics’ Category

The new analytic stack…

Sunday, March 31st, 2013

The new analytic stack is all about management, transparency and users by George Mathew.

On transparency:

Predictive analytics are essential for data-driven leaders to craft their next best decision. There are a variety of techniques across the predictive and statistical spectrums that help businesses better understand the not too distant future. Today’s biggest challenge for predictive analytics is that it is delivered in a very black-box fashion. As business leaders rely more on predictive techniques to make great data-driven decisions, there needs to be much more of a clear-box approach.

Analytics need to be packaged with self-description of data lineage, derivation of how calculations were made and an explanation of the underlying math behind any embedded algorithms. This is where I think analytics need to shift in the coming years; quickly moving away from black-box capabilities, while deliberately putting decision makers back in the driver’s seat. That’s not just about analytic output, but how it was designed, its underlying fidelity and its inherent lineage — so that trusting in analytics isn’t an act of faith.

Now there’s an opportunity for topic maps.

Data lineage, derivations, math, etc. all have their own “logics” and the “logic” of how they are assembled for a particular use.

You could debate how to formalize those logics and might eventually reach agreement, years after the need has passed.

Or, you could use a topic map to declare the subjects and relationships important for your analytics today.

And merge them with the logics you devise for tomorrow’s analytics.

…The Analytical Sandbox [Topic Map Sandbox?]

Thursday, March 28th, 2013

Analytics Best Practices: The Analytical Sandbox by Rick Sherman.

From the post:

So this situation sounds familiar, and you are wondering if you need an analytical sandbox…

The goal of an analytical sandbox is to enable business people to conduct discovery and situational analytics. This platform is targeted for business analysts and “power users” who are the go-to people that the entire business group uses when they need reporting help and answers. This target group is the analytical elite of the enterprise.

The analytical elite have been building their own makeshift sandboxes, referred to as data shadow systems or spreadmarts. The intent of the analytical sandbox is to provide the dedicated storage, tools and processing resources to eliminate the need for the data shadow systems.

Rick outlines what he thinks is needed for an analytical sandbox.

What would you include in a topic map sandbox?

Calculate Return On Analytics Investment! [TM ROI/ROA?]

Monday, February 25th, 2013

Excellent Analytics Tip #22: Calculate Return On Analytics Investment! by Avinash Kaushik.

From the post:

Analysts: Put up or shut up time!

This blog is centered around creating incredible digital experiences powered by qualitative and quantitative data insights. Every post is about unleashing the power of digital analytics (the potent combination of data, systems, software and people). But we’ve never stopped to consider this question:

What is the return on investment (ROI) of digital analytics? What is the incremental revenue impact on the company’s bottom-line for the investment in data, systems and people?

Isn’t it amazing? We’ve not pointed the sexy arrow of accountability on ourselves!

Let’s fix that in this post. Let’s calculate the ROI of digital analytics. Let’s show, with real numbers (!) and a mathematical formula (oh, my!), that we are worth it!

We shall do that in two parts.

In part one, my good friend Jesse Nichols will present his wonderful formula for computing ROA (return on analytics).

In part two, we are going to build on the formula and create a model (ok, spreadsheet :) ) that you can use to compute ROA for your own company. We’ll have a lot of detail in the model. It contains a sample computation you can use to build your own. It also contains multiple tabs full of specific computations of revenue incrementality delivered for various analytical efforts (Paid Search, Email Marketing, Attribution Analysis, and more). It also has one tab so full of awesomeness, you are going to have to download it to bathe in its glory.

Bottom-line: The model will give you the context you need to shine the bright sunshine of Madam Accountability on your own analytics practice.

Ready? (It is okay if you are scared. :) ).

Would this work for measuring topic map ROI/ROA?

What other measurement techniques would you suggest?
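As a starting point for that question, here is a minimal sketch of the textbook return-on-investment calculation. It is not Jesse Nichols’s ROA formula (the post’s formula is not reproduced here), and the function name and sample figures are hypothetical.

```python
def return_on_investment(incremental_revenue, total_cost):
    """Textbook ROI: net gain divided by cost, returned as a fraction."""
    if total_cost <= 0:
        raise ValueError("total_cost must be positive")
    return (incremental_revenue - total_cost) / total_cost


# Hypothetical figures for a topic map project (not taken from the post).
cost = 50_000.0      # data, systems and people
revenue = 80_000.0   # incremental revenue attributed to the project
roi = return_on_investment(revenue, cost)
print(f"ROI: {roi:.0%}")
```

The arithmetic is the easy part. Deciding how much incremental revenue can honestly be attributed to the topic map is where the measurement question really lies.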

OpenGamma updates its open source financial analytics platform [TM Opportunity in 2013]

Sunday, December 23rd, 2012

OpenGamma updates its open source financial analytics platform

From the post:

OpenGamma has released version 1.2 of its open source financial analytic and risk management platform. Released as Apache 2.0 licensed open source in April, the Java-based platform offers an architecture for delivering real-time available trading and risk analytics for front-office-traders, quants, and risk managers.
Version 1.2 includes a newly rebuilt beta of a new web GUI offering multi-pane analytics views with drag and drop panels, independent pop-out panels, multi-curve and surface viewers, and intelligent tab handling. Copy and paste is now more extensive and is capable of handling complex structures.
Underneath, the Analytics Library has been expanded to include support for Credit Default Swaps, Extended Futures, Commodity Futures and Options databases, and equity volatility surfaces. Data Management has improved robustness with schema checking on production systems and an auto-upgrade tool being added to handle restructuring of the futures/forwards database. The market and reference data’s live system now uses OpenGamma’s own component system. The Excel Integration module has also been enhanced and thanks to a backport now works with Excel 2003. A video shows the Excel module in action:

Integration with OpenGamma is billed by OpenGamma as:

While true green-field development does exist in financial services, it’s exceptionally rare. Firms already have a variety of trade processing, analytics, and risk systems in place. They may not support your current requirements, or may be lacking in capabilities/flexibility; but no firm can or should simply throw them all away and start from scratch.

We think risk technology architecture should be designed to use and complement systems already supporting traders and risk managers. Whether proprietary or vendor solutions, considerable investments have been made in terms of time and money. Discarding them and starting from scratch risks losing valuable data and insight, and adds to the cost of rebuilding.

That being said, a primary goal of any project rethinking analytics or risk computation needs to be the elimination of all the problems siloed, legacy systems have: duplication of technology, lack of transparency, reconciliation difficulties, inefficient IT resourcing, etc.

The OpenGamma Platform was built from scratch specifically to integrate with any legacy data source, analytics library, trading system, or market data feed. Once that integration is done against our rich set of APIs and network endpoints, you can make use of it across any project based on the OpenGamma Platform.

A very valuable approach to integration, being able to access legacy or even current data sources.

But that leaves the undocumented semantics of data from those feeds on the cutting room floor.

The unspoken semantics of data from integrated feeds is like dry rot just waiting to make its presence known.

Suddenly and at the worst possible moment.

Compare that to documented data identity and semantics, which enables reliable re-use/merging of data from multiple sources.


So we are clear, I am not suggesting a topic maps platform with financial analytics capabilities.

I am suggesting incorporation of topic map capabilities into existing applications, such as OpenGamma.

That would take data integration to a whole new level.

Continuum Unleashes Anaconda on Python Analytics Community

Tuesday, December 4th, 2012

Continuum Unleashes Anaconda on Python Analytics Community

From the post:

Python-based data analytics solutions and services company, Continuum Analytics, today announced the release of the latest version of Anaconda, its collection of libraries for Python that includes Numba Pro, IOPro and wiseRF all in one package.

Anaconda enables large-scale data management, analysis, and visualization for business intelligence, scientific analysis, engineering, machine learning, and more. The latest release, version 1.2.1, includes improved performance and feature enhancements for Numba Pro and IOPro.

Available for Windows, Mac OS X and Linux, Anaconda packages more than 80 popular numerical and scientific Python libraries used by scientists, engineers and data analysts, with a single integrated and flexible installer. The company says its goal is to seamlessly support switching between multiple versions of Python and other packages, via a “Python environments” feature that allows mixing and matching different versions of Python, Numpy and Scipy.

New features and upgrades in the latest version of Anaconda include performance and feature enhancements to Numba Pro and IOPro and an improved conda command. In addition, Continuum has added Qt to Linux versions and has also added mdp, MLTK and pytest.

Oh, you might like the Continuum Analytics link.

And the direct Anaconda link as well.

I expect people to go elsewhere after reading my analysis or finding a resource of interest.

Isn’t that what the web is supposed to be about?

AXLE: Advanced Analytics for Extremely Large European Databases

Monday, December 3rd, 2012

AXLE: Advanced Analytics for Extremely Large European Databases

From the webpage:

The objectives of the AXLE project are to greatly improve the speed and quality of decision making on real-world data sets. AXLE aims to make these improvements generally available through high quality open source implementations via the PostgreSQL and Orange products.

The project started in early November 2012. I will be checking back to see what is proposed for PostgreSQL and/or Orange.

Fantasy Analytics

Saturday, November 10th, 2012

Fantasy Analytics by Jeff Jonas.

From the post:

Sometimes it just amazes me what people think is computable given their actual observation space. At times you have to look them in the eye and tell them they are living in fantasyland.

Jeff’s post will have you rolling on the floor!

Except that you can think of several industry and government IT projects that would fit seamlessly into his narrative.

The TSA doesn’t need “bomb” written on the outside of your carry-on luggage. They have “observers” who are watching passengers to identify terrorists. Their score so far? 0.

Which means really clever terrorists are eluding these brooding “observers.”

The explanation could not be, after spending $millions on training, salaries, etc., that the concept of observers spotting terrorists is absurd.

They might recognize a suicide vest but most TSA employees can do that.

I am printing out Jeff’s post to keep on my desk.

To share with clients who are asking for absurd things.

If they don’t “get it,” I can thank them for their time and move on to more intelligent clients.

Who will complain less about being specific, appreciate the results and be good references for future business.

I first saw this in a tweet by Jeffrey Carr.

Design a Twitter Like Application with Nati Shalom

Thursday, November 1st, 2012

Design a Twitter Like Application with Nati Shalom

From the description:

Design a large scale NoSQL/DataGrid application similar to Twitter with Nati Shalom.

The use case is solved with Gigaspaces and Cassandra but other NoSQL and DataGrids solutions could be used.

Slides: xebia-video.s3-website-eu-west-1.amazonaws.com/2012-02/realtime-analytics-for-big-data-a-twitter-case-study-v2-ipad.pdf

If you enjoyed the posts I pointed to at: Building your own Facebook Realtime Analytics System, you will enjoy the video. (Same author.)

Not to mention Nati teaches patterns, the specific software being important but incidental.

Up to Date on Open Source Analytics

Tuesday, October 23rd, 2012

Up to Date on Open Source Analytics by Steve Miller.

Steve updates his Wintel laptop with the latest releases of open source analytics tools.

Steve’s list:

What’s on your list?

I first saw this mentioned at KDNuggets.

Bigger Than A Bread Box

Thursday, October 18th, 2012

Hortonworks & Teradata: More Than Just an Elephant in a Box by Jim Walker.

I’m not going to wake up Christmas morning to find:

Teradata

But in case you are in the market for a big analytics hardware/software appliance, Jim writes:

Today our partner, Teradata, announced availability of the Teradata Aster Big Analytics Appliance, which packages our Hortonworks Data Platform (HDP) with Teradata Aster on a machine that is ready to plug in and bring big data value in hours.

There is more to this appliance than meets the eye… it is not just a simple packaging of software on hardware. Teradata and Hortonworks engineers have been working together for months tying our solutions together and optimizing them for an appliance. This solution gives an analyst the ability to leverage big data (social media, Web clickstream, call center, and other types of customer interaction data) in their analysis and all the while use the tools they are already familiar with. It is analytics and data discovery/exploration with big data (or HDP) inside… all on an appliance that can be operational in hours.

Not just anyone can do this

This is an engineered solution. Many analytics tools are building their solutions on top of Hadoop using Hive and HiveQL. This is a great approach but it lacks integration of metadata and metadata exchange. With the appliance we have extended a new approach using HCatalog and the Teradata SQL-H product. SQL-H is a conduit that allows new analysis to be created and schema changes to be adopted within Hadoop from Teradata. Analysts are abstracted completely from the Hadoop environment so they can focus on what they do best… analyze. All of this is enabled by an innovation provided by HCatalog, which enables this metadata exchange.

Shortcut to Big Data Exploration

In the appliance, Aster provides over 50 pre-built functions that allow analysts to perform segmentation, transformations and even pre-packaged marketing analytics. With this package, these valuable functions can now be applied to big data in Hadoop. This shortens the time it takes for an analyst to explore and discover value in big data. And if the pre-packaged functions aren’t explicit enough, Teradata Aster also provides an environment to create MapReduce functions that can be executed in HDP.

Just as well.

Red doesn’t really go with my office decor. Runs more towards the hulking black server tower, except for the artificial pink tree in the corner. ;-)

R for Business Analytics

Thursday, October 4th, 2012

R for Business Analytics by A. Ohri.

I haven’t seen this volume, yet, but have read and cited Ohri’s blog, Decision Stats, often enough to have high expectations!

From the publisher’s blurb:

R for Business Analytics looks at some of the most common tasks performed by business analysts and helps the user navigate the wealth of information in R and its 4000 packages. With this information the reader can select the packages that can help process the analytical tasks with minimum effort and maximum usefulness. The use of Graphical User Interfaces (GUI) is emphasized in this book to further cut down and bend the famous learning curve in learning R. This book is aimed to help you kick-start with analytics including chapters on data visualization, code examples on web analytics and social media analytics, clustering, regression models, text mining, data mining models and forecasting. The book tries to expose the reader to a breadth of business analytics topics without burying the user in needless depth. The included references and links allow the reader to pursue business analytics topics.

This book is aimed at business analysts with basic programming skills for using R for Business Analytics. Note the scope of the book is neither statistical theory nor graduate level research for statistics, but rather it is for business analytics practitioners. Business analytics (BA) refers to the field of exploration and investigation of data generated by businesses. Business Intelligence (BI) is the seamless dissemination of information through the organization, which primarily involves business metrics both past and current for the use of decision support in businesses. Data Mining (DM) is the process of discovering new patterns from large data using algorithms and statistical methods. To differentiate between the three, BI is mostly current reports, BA is models to predict and strategize and DM matches patterns in big data. The R statistical software is the fastest growing analytics platform in the world, and is established in both academia and corporations for robustness, reliability and accuracy.

When you have seen it, please check back and post your comments.

Thanks!

The Top 10 Challenges in Extreme-Scale Visual Analytics [Human Bottlenecks and Parking Meters]

Thursday, August 30th, 2012

The Top 10 Challenges in Extreme-Scale Visual Analytics by Pak Chung Wong, Han-Wei Shen, Christopher R. Johnson, Chaomei Chen, and Robert B. Ross. (Link to PDF. IEEE Computer Graphics and Applications, July-Aug. 2012, pp. 63–67)

The top 10 challenges are:

  1. In Situ Interactive Analysis
  2. User-Driven Data Reduction
  3. Scalability and Multilevel Hierarchy
  4. Representing Evidence and Uncertainty
  5. Heterogeneous-Data Fusion
  6. Data Summarization and Triage for Interactive Query
  7. Analytics of Temporally Evolved Features
  8. The Human Bottleneck
  9. Design and Engineering Development
  10. The Renaissance of Conventional Wisdom

I was amused by #8: The Human Bottleneck, which reads:

Experts predict that all major high-performance computing (HPC) components—power, memory, storage, bandwidth, concurrence, and so on—will improve performance by a factor of 3 to 4,444 by 2018 [2]. Human cognitive capability will certainly remain constant. One challenge is to find alternative ways to compensate for human cognitive weaknesses.

It isn’t clear to me how speed at counting 0s and 1s is an indicator of “human cognitive weakness.”

Parking meters stand in the weather day and night. I don’t take that as a commentary on human endurance.

Do you?

Going Beyond the Numbers:…

Friday, August 24th, 2012

Going Beyond the Numbers: How to Incorporate Textual Data into the Analytics Program by Cindi Thompson.

From the post:

Leveraging the value of text-based data by applying text analytics can help companies gain competitive advantage and an improved bottom line, yet many companies are still letting their document repositories and external sources of unstructured information lie fallow.

That’s no surprise, since the application of analytics techniques to textual data and other unstructured content is challenging and requires a relatively unfamiliar skill set. Yet applying business and industry knowledge and starting small can yield satisfying results.

Capturing More Value from Data with Text Analytics

There’s more to data than the numerical organizational data generated by transactional and business intelligence systems. Although the statistics are difficult to pin down, it’s safe to say that the majority of business information for a typical company is stored in documents and other unstructured data sources, not in structured databases. In addition, there is a huge amount of business-relevant information in documents and text that reside outside the enterprise. To ignore the information hidden in text is to risk missing opportunities, including the chance to:

  • Capture early signals of customer discontent.
  • Quickly target product deficiencies.
  • Detect fraud.
  • Route documents to those who can effectively leverage them.
  • Comply with regulations such as XBRL coding or redaction of personally identifiable information.
  • Better understand the events, people, places and dates associated with a large set of numerical data.
  • Track competitive intelligence.

To be sure, textual data is messy and poses difficulties.

But, as Cindi points out, there are golden benefits in those hills of textual data.

Building LinkedIn’s Real-time Activity Data Pipeline

Thursday, August 16th, 2012

Building LinkedIn’s Real-time Activity Data Pipeline by Ken Goodhope, Joel Koshy, Jay Kreps, Neha Narkhede, Richard Park, Jun Rao, and Victor Yang Ye. (pdf)

Abstract:

One trend in the implementation of modern web systems is the use of activity data in the form of log or event messages that capture user and server activity. This data is at the heart of many internet systems in the domains of advertising, relevance, search, recommendation systems, and security, as well as continuing to fulfill its traditional role in analytics and reporting. Many of these uses place real-time demands on data feeds. Activity data is extremely high volume and real-time pipelines present new design challenges. This paper discusses the design and engineering problems we encountered in moving LinkedIn’s data pipeline from a batch-oriented file aggregation mechanism to a real-time publish-subscribe system called Kafka. This pipeline currently runs in production at LinkedIn and handles more than 10 billion message writes each day with a sustained peak of over 172,000 messages per second. Kafka supports dozens of subscribing systems and delivers more than 55 billion messages to these consumer processes each day. We discuss the origins of this system, missteps on the path to real-time, and the design and engineering problems we encountered along the way.

More details on Kafka (see Choking Cassandra Bolt).

What if you think about message feeds as being pipelines that are large enough to see and configure?

Chip level pipelines are more efficient but harder to configure.

Perhaps passing messages is efficient and flexible enough for a class of use cases.
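To make the “pipelines large enough to see and configure” idea concrete, here is a toy in-memory publish-subscribe sketch. It is not Kafka and does not use any Kafka API; the class name MessageBus, the topic name and the sample messages are all hypothetical.

```python
from collections import defaultdict


class MessageBus:
    """Toy in-memory publish-subscribe bus (illustration only, not Kafka)."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Deliver the message to every subscriber of the topic, in
        # subscription order.
        for callback in self._subscribers[topic]:
            callback(message)


bus = MessageBus()
page_views = []
audit_log = []

# Two independent consumers are configured onto the same feed.
bus.subscribe("activity", page_views.append)
bus.subscribe("activity", lambda m: audit_log.append(m["user"]))

bus.publish("activity", {"user": "alice", "event": "page_view"})
bus.publish("activity", {"user": "bob", "event": "search"})
```

Adding or removing a consumer is a one-line change that the producer never sees, which is the flexibility being traded against chip-level efficiency.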

R2RML: RDB to RDF Mapping Language

Tuesday, August 14th, 2012

R2RML: RDB to RDF Mapping Language from the RDB2RDF Working Group.

From the news:

This document describes R2RML, a language for expressing customized mappings from relational databases to RDF datasets. Such mappings provide the ability to view existing relational data in the RDF data model, expressed in a structure and target vocabulary of the mapping author’s choice. R2RML mappings are themselves RDF graphs and written down in Turtle syntax. R2RML enables different types of mapping implementations. Processors could, for example, offer a virtual SPARQL endpoint over the mapped relational data, or generate RDF dumps, or offer a Linked Data interface. Comments are welcome through 15 September. (emphasis added)

Subscribe (prior to commenting).

Comments to: public-rdb2rdf-comments@w3.org.
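For readers new to the idea, the kind of row-to-triple mapping the draft describes can be sketched in a few lines. This is not R2RML syntax and not an R2RML processor (per the draft, real mappings are themselves RDF graphs written in Turtle); the table, column names and base IRI below are hypothetical.

```python
# Hypothetical rows from a relational table EMP(empno, ename, deptno).
rows = [
    {"empno": 7369, "ename": "SMITH", "deptno": 10},
    {"empno": 7499, "ename": "ALLEN", "deptno": 20},
]

BASE = "http://example.com/"  # assumed base IRI, for illustration only


def row_to_triples(row):
    """Map one row to N-Triples-style strings using fixed templates."""
    subject = f"<{BASE}employee/{row['empno']}>"
    name = row["ename"]
    department = f"<{BASE}department/{row['deptno']}>"
    return [
        f'{subject} <{BASE}ns#name> "{name}" .',
        f"{subject} <{BASE}ns#department> {department} .",
    ]


triples = [t for row in rows for t in row_to_triples(row)]
for t in triples:
    print(t)
```

The templates are where the mapping author’s choice of structure and target vocabulary shows up. A mapping language moves that choice out of code and into a declarative, shareable document.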

Analyzing 20,000 Comments

Thursday, July 19th, 2012

Analyzing 20,000 Comments

First, congratulations on Chandoo.org reaching its 20,000th comment!

Second, the post does not release the data (email addresses, etc.) so it also doesn’t include the code.

Thinking of this as an exercise in analytics, which of the measures applied should lead to changes in behavior?

After all, we don’t mine data simply because we can.

What goals would you suggest and how would we measure meeting them based on the analysis described here?

Hadoop: A Powerful Weapon for Retailers

Friday, July 13th, 2012

Hadoop: A Powerful Weapon for Retailers

From the post:

With big data basking in the limelight, it is no surprise that large retailers have been closely watching its development… and more power to them! By learning to effectively utilize big data, retailers can significantly mold the market to their advantage, making themselves more competitive and increasing the likelihood that they will come out on top as a successful retailer. Now that there are open source analytical platforms like Hadoop, which allow for unstructured data to be transformed and organized, large retailers are able to make smart business decisions using the information they collect about customers’ habits, preferences, and needs.

As IT industry analyst Jeff Kelly explained on Wikibon, “Big Data combined with sophisticated business analytics have the potential to give enterprises unprecedented insights into customer behavior and volatile market conditions, allowing them to make data-driven business decisions faster and more effectively than the competition.” Predicting what customers want to buy, without a doubt, affects how many products they want to buy (especially if retailers add on a few of those wonderful customer discounts). Not only will big data analytics prove financially beneficial, it will also present the opportunity for customers to have a more individualized shopping experience.

This all sounds very promising but the difficulty lies in the fact that there are many channels in the consumer business now, such as online, in-store, call centers, mobile, social, etc., each with its own target-marketing advantage. In order for retailers to thrive in the market, they must learn to manage and hone in on all (or at least most) of these facets of business, which can be difficult if you keep in mind the amount of data that each channel generates. Sam Sliman, president at Optimal Solutions Integration, summarizes it perfectly: “Transparency rules the day. Inconsistency turns customers away. Retailer missteps can be glaring and costly.” By making fast market decisions, retailers can increase sales, win and maintain customers, improve margins, and boost market share, but this can really only be done with the right business analytics tools.

Interesting but I disagree with “…but the difficulty lies in the fact that there are many channels in the consumer business now, such as online, in-store, call centers, mobile, social, etc., each with its own target-marketing advantage.”

That can be a difficulty, if you are not technically capable of effectively using information from different channels.

But there is a more fundamental difficulty. Having the capacity to use multiple channels of information is no guarantee of effective use of those channels of information.

You could buy your programming department a Cray supercomputer but that doesn’t mean they can make good use of it.

The same is true for collecting or having the software to process “big data.”

The real difficulty is the shortage of analytical skills to explore and exploit data. Computers and software can enhance but not create those analytical skills.

Analytical skills are powerful weapons for retailers.

Subverting Ossified Departments [Moving beyond name calling]

Saturday, July 7th, 2012

Brian Sommer has written on why analytics will not lead to new revenue streams, improved customer service, better stock options or other signs of salvation:

The Ossified Organization Won’t ‘Get’ Analytics (part 1 of 3)

How Tough Will Analytics Be in Ossified Firms? (Part 2 of 3)

Analytics and the Nimble Organization (part 3 of 3)

Why most firms won’t profit from analytics:

… Every day, companies already get thousands of ideas for new products, process innovations, customer interaction improvements, etc. and they fail to act on them. The rationale for this lack of movement can be:

- That’s not the way we do things here

- It’s a good idea but it’s just not us

- It’s too big of an idea

- It will be too disruptive

- We’d have to change so many things

- I don’t know who would be responsible for such a change

And, of course,

- It’s not my job

So if companies don’t act on the numerous, free suggestions from current customers and suppliers, why are they so deluded into thinking that IT-generated, analytic insights will actually fare better? They’re kidding themselves.

[part 1]

What Brian describes in amusing and great detail are all failures that no amount of IT, analytics or otherwise, can address. Not a technology problem. Not even an organization (as in form) issue.

It is a personnel issue. You can either retrain (which I find unlikely to succeed) or you can get new personnel. It really is that simple. And with a glutted IT market, now would be the time to recruit an IT department not wedded to current practices. But you would need to do the same in accounting, marketing, management, etc.

But calling a department “ossified” is just name calling. You have to move beyond name calling to establish a bottom line reason for change.

Assuming you have access, topic maps can help you integrate data across departments that don’t usually interchange data. So you can make the case for particular changes in terms of bottom line expenses.

Here is a true story with the names omitted and the context changed a bit:

Assume you are a publisher of journals, with both institutional and personal subscriptions. One of the things that all periodical publishers have to address is claims for “missing” issues. It happens: mail room mistakes, postal system errors, issues simply lost in transit, etc. Subscribers send in claims for those missing issues.

Some publishers maintain records of all subscriptions, including any correspondence and records, which are consulted by some full time staffer who answers all “claim” requests. One argument being there is a moral obligation to make sure non-subscribers don’t get an issue to which they are not entitled. Seriously, I have heard that argument made.

Analytics and topic maps could combine the subscription records with claim records and expenses for running the claims operation to show the expense of detailed claim service. Versus the cost of having the mail room toss another copy back to the requester. (Our printing cost was $3.00/copy so the math wasn’t the hard part.)
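A back-of-the-envelope version of that comparison might look like the sketch below. Only the $3.00 printing cost comes from the story; the claim volume, staff cost and number of rejected claims are hypothetical placeholders for the figures you would pull from the integrated records.

```python
PRINTING_COST_PER_COPY = 3.00  # from the story above

# Hypothetical figures, for illustration only.
claims_per_year = 2_000
rejected_claims = 100            # claims a detailed check would refuse
staff_cost_per_year = 45_000.00  # full-time staffer who checks each claim


def cost_with_verification(claims, rejected, staff_cost, copy_cost):
    # Pay the staffer, then reprint only for the claims that pass the check.
    return staff_cost + (claims - rejected) * copy_cost


def cost_without_verification(claims, copy_cost):
    # The mail room simply sends another copy for every claim.
    return claims * copy_cost


verified = cost_with_verification(
    claims_per_year, rejected_claims, staff_cost_per_year, PRINTING_COST_PER_COPY
)
resend = cost_without_verification(claims_per_year, PRINTING_COST_PER_COPY)
savings = verified - resend
print(f"Verify every claim: ${verified:,.2f}")
print(f"Just resend:        ${resend:,.2f}")
print(f"Annual difference:  ${savings:,.2f}")
```

With numbers of this shape, the copies “saved” by checking claims are worth far less than the checking costs.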

Topic maps help integrate the data you “obtain” from other departments. Just enough to make your point. Don’t have to integrate all the data, just enough to win the argument. Until the next argument comes along and you take a bit bigger bite of the apple.

Agile organizations are run by people agile enough to take control of them.

You can wait for permission from an ossified organization or you can use topic maps to take the first “bite.”

Your move.

PS: If you have investments in journal publishing you might want to check on claims handling.

Open-Source R software driving Big Data analytics in government

Monday, June 11th, 2012

Open-Source R software driving Big Data analytics in government by David Smith.

From the post:

As government agencies and departments expand their capabilities for collecting information, the volume and complexity of digital data stored for public purposes is far outstripping departments’ ability to make sense of it all. Even worse, with data siloed within individual departments and little cross-agency collaboration, untold hours and dollars are being spent on data collection and storage with little return on investment in the form of information-based products and services for the public good.

But that may now be starting to change, with the Obama administration’s Big Data Research and Development Initiative.

In fact, the administration has had a Big Data agenda since its earliest days, with the appointment of Aneesh Chopra as the nation’s first chief technology officer in 2009. (Chopra passed the mantle to current CTO Todd Park in March.) One of Chopra’s first initiatives was the creation of data.gov, a vehicle to make government data and open-source tools available in a timely and accessible format for a community of citizen data scientists to make sense of it all.

For example, independent statistical analysis of data released by data.gov revealed a flaw in the 2000 Census results that apparently went unnoticed by the Census Bureau.

David goes on to give some other examples of the use of R with government data.

The US federal government is diverse enough that its IT solutions will be diverse as well. But R will be familiar to some potential clients.

I first saw this at the Revolutions blog on R.

Real-time Analytics with HBase [Aggregation is a form of merging.]

Monday, June 11th, 2012

Real-time Analytics with HBase

From the post:

Here are slides from another talk we gave at both Berlin Buzzwords and at HBaseCon in San Francisco last month. In this presentation Alex describes one approach to real-time analytics with HBase, which we use at Sematext via HBaseHUT. If you like these slides you will also like HBase Real-time Analytics Rollbacks via Append-based Updates.

The slides come in a long and short version. Both are very good but I suggest the long version.

I particularly liked the “Background: pre-aggregation” slide (8 in the short version, 9 in the long version).

Aggregation as a form of merging.

What information is lost as part of aggregation? (That assumes we know the aggregation process. Without that, can’t say what is lost.)

What information (read subjects/relationships) do we want to preserve through an aggregation process?

What properties should those subjects/relationships have?

(Those are topic map design/modeling questions.)
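To make the questions above concrete, here is a minimal Python sketch. The event tuples and both function names are hypothetical, invented for illustration and not taken from the slides or from HBaseHUT. The first function merges events into bare per-minute counts, which loses who did what. The second merges on the same key but keeps the subjects we decided in advance to preserve.

```python
from collections import defaultdict

# Hypothetical raw events: (minute, user, page). Illustrative data only.
events = [
    (0, "alice", "/home"),
    (0, "bob", "/home"),
    (0, "alice", "/docs"),
    (1, "carol", "/home"),
]

def aggregate_counts(events):
    """Merge all events that share a minute into a single count.

    The users and pages are discarded, so they cannot be recovered later.
    """
    counts = defaultdict(int)
    for minute, _user, _page in events:
        counts[minute] += 1
    return dict(counts)

def aggregate_preserving(events):
    """Merge by minute but keep the subjects chosen for preservation."""
    merged = defaultdict(lambda: {"count": 0, "users": set(), "pages": set()})
    for minute, user, page in events:
        bucket = merged[minute]
        bucket["count"] += 1
        bucket["users"].add(user)
        bucket["pages"].add(page)
    return dict(merged)
```

Even the second version loses something: it no longer records which user visited which page. Deciding whether that relationship matters is exactly the modeling question.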

Data hoarding and bias among big challenges in big data and analytics

Monday, June 4th, 2012

Data hoarding and bias among big challenges in big data and analytics by Linda Tucci.

From the post:

Hype aside, exploiting big data and analytics will matter hugely to companies’ future performance, remaking whole industries and spawning new ones. The list of challenges is long, however. They range from the well-documented paucity of data scientists available to crunch that big data, to more intractable but less-mentioned problems rooted in human nature.

One of the latter is humans’ tendency to hoard data. Another is their tendency to hold on to preconceived beliefs even when the data screams otherwise. That was the consensus of a panel of data experts speaking on big data and analytics at the recent MIT Sloan CIO Symposium in Cambridge, Mass. Another landmine? False hope. There is no final truth in big data and analytics, as the enterprises that do big data well already know. Iteration is all, the panel agreed.

Moreover, except for the value of iteration, CIOs can forget about best practices. Emerging so-called next practices are about the best companies can lean on as they dive into big data, said computer scientist Michael Chui, San Francisco-based senior fellow at the McKinsey Global Institute, the research arm of New York-based McKinsey & Co. Inc.

“The one thing we know that doesn’t work: Wait five years until the perfect data warehouse is ready,” said Chui, who’s an author of last year’s massive McKinsey report on the value of big data.

Seeing data quality in relative terms

In fact, obsessing over data quality is one of the first hurdles many companies have to overcome if they hope to use big data effectively, Chui said. Data accuracy is of paramount importance in banks’ financial statements. Messy data, however, contains patterns that can highlight business problems or provide insights that generate significant value, as laid out in a related story about the symposium panel, “Seize big data and analytics or fall behind, MIT panel says.”

Issues that you will have to face in the creation of topic maps, big data or no.

Build your own twitter like real time analytics – a step by step guide

Friday, May 25th, 2012

Build your own twitter like real time analytics – a step by step guide

Where else but High Scalability would you find a “how-to” article like this one? Complete with guide and source code.

Good DIY project for the weekend.

Major social networking platforms like Facebook and Twitter have developed their own architectures for handling the need for real-time analytics on huge amounts of data. However, not every company has the need or resources to build their own Twitter-like solution.

In this example we have taken the same Twitter/Facebook-like blueprint, and made it simple enough for developers to implement. We have taken the following approach in our implementation:

  1. Use In Memory Data Grid (XAP) for handling the real time stream data-processing.
  2. BigData data-base (Cassandra) for storing the historical data and manage the trend analytics
  3. Use Cloudify (cloudifysource.org) for managing and automating the deployment on private or public cloud

The example demonstrates a simple case of word count analytics. It uses Spring Social to plug in to real Twitter feeds. The solution is designed to efficiently cope with getting and processing the large volume of tweets. First, we partition the tweets so that we can process them in parallel, but we have to decide on how to partition them efficiently. Partitioning by user might not be sufficiently balanced, therefore we decided to partition by the tweet ID, which we assume to be globally unique.

Then we need to persist and process the data with low latency, and for this we store the tweets in memory.
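The partition-then-count idea in the quoted passage can be sketched in a few lines of Python. This is not the XAP, Cassandra or Spring Social code from the guide. The function names are hypothetical, and routing by tweet ID modulo the partition count is an assumption about how "partition by the tweet ID" might be done.

```python
from collections import Counter

def partition_for(tweet_id, num_partitions):
    """Pick a partition from the tweet ID (assumed routing rule: modulo)."""
    return tweet_id % num_partitions

def partition_tweets(tweets, num_partitions):
    """Group (tweet_id, text) pairs into num_partitions in-memory lists."""
    partitions = [[] for _ in range(num_partitions)]
    for tweet_id, text in tweets:
        partitions[partition_for(tweet_id, num_partitions)].append(text)
    return partitions

def count_words(texts):
    """Word count for one partition; each partition could run in parallel."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts

def word_count(tweets, num_partitions=4):
    """Count words per partition, then merge the partial counts."""
    total = Counter()
    for part in partition_tweets(tweets, num_partitions):
        total += count_words(part)
    return total
```

Because tweet IDs are unique and roughly sequential, the modulo spreads tweets evenly, whereas partitioning by user would pile a prolific account's tweets onto one partition.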

Automated harvesting of tweets has real potential, even with clear text transmission. Or perhaps because of it.

Uncertainty Principle for Serendipity?

Tuesday, May 22nd, 2012

Curt Monash writes in Cool analytic stories:

There are several reasons it’s hard to confirm great analytic user stories. First, there aren’t as many jaw-dropping use cases as one might think. For as I wrote about performance, new technology tends to make things better, but not radically so. After all, if its applications are …

… all that bloody important, then probably people have already been making do to get it done as best they can, even in an inferior way.

Further, some of the best stories are hard to confirm; even the famed beer/diapers story isn’t really true. Many application areas are hard to nail down due to confidentiality, especially but not only in such “adversarial” domains as anti-terrorism, anti-spam, or anti-fraud.

How will we “know” when better data display/mining techniques enable more serendipity?

Anecdotal stories about serendipity abound.

Measuring serendipity requires knowing: (rate of serendipitous discoveries x importance of serendipitous discoveries)/ opportunity for serendipitous discoveries.

Need to add in a multiplier effect for the impact that one serendipitous discovery may have in creating opportunities for other serendipitous discoveries (a serendipitous criticality point) and probably some other things I have overlooked.
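Written out as a rough sketch, with symbols introduced here purely for illustration: r is the rate of serendipitous discoveries, i their importance, o the opportunity for them, and m the multiplier. Applying m as a simple factor of at least 1 is an assumption, since the text only says a multiplier is needed, not how it enters.

```latex
S = \frac{r \times i}{o},
\qquad
S_{\text{adjusted}} = m \cdot \frac{r \times i}{o}, \quad m \ge 1
```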

What would you add to the equation?

Realizing that we may be staring at the “right” answer and never realize it.

How’s that for an uncertainty principle?

Lavastorm Desktop Public

Friday, May 18th, 2012

Lavastorm Desktop Public

Lavastorm Desktop Public is a powerful, visual and easy-to-use tool for anyone combining and analyzing data. A free version of our award winning Lavastorm Desktop software, the Public edition allows you to harness the power of our enterprise-class analytics engine right on your desktop. You’ll love Lavastorm Desktop Public if you want to:

  • Get more productive by reducing the time to create analytics by 90% or more compared to underpowered analytic tools, such as Excel or Access
  • Stop flying blind by unifying data locked away in silos or scattered on your desktop
  • Eliminate time spent waiting for others to integrate data or implement new analytics
  • Gain greater control for analyzing data against complex business logic and for manipulating data from Excel, CSV or ASCII files

First time I have encountered it.

Suggestions/comments?

Reading Other People’s Mail For Fun and Profit

Tuesday, May 8th, 2012

Bob Gourley writes much better content than he does titles: Osama Bin Laden Letters Analyzed: A rapid assessment using Recorded Future’s temporal analytic technologies and intelligence analysis tools. (Sorry Bob.)

Bob writes:

The Analysis Intelligence site provides open source analysis and information on a variety of topics based on the temporal analytic technology and intelligence analysis tools of Recorded Future. Shortly after the release of 175 pages of documents from the Combating Terrorism Center (CTC) a very interesting assessment was posted on the site. This assessment sheds light on the nature of these documents and also highlights some of the important context that the powerful capabilities of Recorded Future can provide.

The analysis by Recorded Future is succinct and well done so I cite most of it below. I’ll conclude with some of my own thoughts as an experienced intelligence professional and technologist on some of the “So What” of this assessment.

If you are interested in analytics, particularly visual analytics, you will really appreciate this piece.

Recorded Future has a post on the US Presidential Election. Just to be on the safe side, I would “fuzz” the data when it got close to the election. ;-)

HBase Real-time Analytics & Rollbacks via Append-based Updates

Sunday, April 29th, 2012

HBase Real-time Analytics & Rollbacks via Append-based Updates by Alex Baranau.

From the post:

In this part 1 of a 3-part post series we’ll describe how we use HBase at Sematext for real-time analytics and how we can perform data rollbacks by using an append-only updates approach.

Some bits of this topic were already covered in Deferring Processing Updates to Increase HBase Write Performance and some were briefly presented at BerlinBuzzwords 2011 (video). We will also talk about some of the ideas below during HBaseCon-2012 in late May (see Real-time Analytics with HBase). The approach described in this post is used in our production systems (SPM & SA) and the implementation was open-sourced as HBaseHUT project.

Problem we are Solving

While HDFS & MapReduce are designed for massive batch processing and with the idea of data being immutable (write once, read many times), HBase includes support for additional operations such as real-time and random read/write/delete access to data records. HBase performs its basic job very well, but there are times when developers have to think at a higher level about how to utilize HBase capabilities for specific use-cases. HBase is a great tool with good core functionality and implementation, but it does require one to do some thinking to ensure this core functionality is used properly and optimally. The use-case we’ll be working with in this post is a typical data analytics system where:

  • new data are continuously streaming in
  • data are processed and stored in HBase, usually as time-series data
  • processed data are served to users who can navigate through most recent data as well as dig deep into historical data

Although the above points frame the use-case relatively narrowly, the approach and its implementation that we’ll describe here are really more general and applicable to a number of other systems, too. The basic issues we want to solve are the following:

  • increase record update throughput. Ideally, despite high volume of incoming data, changes can be applied in real-time. Usually, due to the limitations of the “normal HBase update”, which requires Get+Put operations, updates are applied using batch-processing approach (e.g. as MapReduce jobs). This, of course, is anything but real-time: incoming data is not immediately seen. It is seen only after it has been processed.
  • ability to roll back changes in the served data. Human errors or any other issues should not permanently corrupt data that the system serves.
  • ability to fetch data interactively (i.e. fast enough for impatient humans). When one navigates through a small amount of recent data, as well as when selected time interval spans years, the retrieval should be fast.

Here is what we consider an “update”:

  • addition of a new record if no record with the same key exists
  • update of an existing record with a particular key

See anything familiar? That resembles your use cases?

The proffered solution may not fit your use case(s) but this is an example of exploring a solution. Not fitting a problem to a solution. Not the same thing.

HBase Real-time Analytics & Rollbacks via Append-based Updates Part 2 is available. Solution uses HBaseHUT. Really informative graphics in part 2 as well.

Very interested in seeing Part 3!

Text Analytics Summit Europe – highlights and reflections

Sunday, April 29th, 2012

Text Analytics Summit Europe – highlights and reflections by Tony Russell-Rose.

Earlier this week I had the privilege of attending the Text Analytics Summit Europe at the Royal Garden Hotel in Kensington. Some of you may of course recognise this hotel as the base for Justin Bieber’s recent visit to London, but sadly (or is that fortunately?) he didn’t join us. Next time, maybe…

Ranking reasons to attend:

  • #1 Text Analytics Summit Europe – meet other attendees, presentations
  • #2 Kensington Gardens and Hyde Park (been there, it is more impressive than you can imagine)
  • #N +1 Justin Bieber being in London (or any other location)

I was disappointed by the lack of links to slides or videos of the presentations.

Tony’s post does have pointers to people and resources you may have missed.

Question: Do you think “text analytics” and “data mining” are different? If so, how?

The wrong way: Worst best practices in ‘big data’ analytics programs

Sunday, April 22nd, 2012

The wrong way: Worst best practices in ‘big data’ analytics programs

Rick Sherman writes:

“Big data” analytics is hot. Read any IT publication or website and you’ll see business intelligence (BI) vendors and their systems integration partners pitching products and services to help organizations implement and manage big data analytics systems. The ads and the big data analytics press releases and case studies that vendors are rushing out might make you think it’s easy — that all you need for a successful deployment is a particular technology.

If only it were that simple. While BI vendors are happy to tell you about their customers who are successfully leveraging big data for analytics uses, they’re not so quick to discuss those who have failed. There are many potential reasons why big data analytics projects fall short of their goals and expectations. You can find lots of advice on big data analytics best practices; below are some worst practices for big data analytics programs so you know what to avoid.

Rick gives seven reasons why “big data” analytics projects fail:

  1. “If we build it, they will come.”
  2. Assuming that the software will have all the answers.
  3. Not understanding that you need to think differently.
  4. Forgetting all the lessons of the past.
  5. Not having the requisite business and analytical expertise.
  6. Treating the project like it’s a science experiment.
  7. Promising and trying to do too much.

Seven reasons that should be raised when the NSA Money Trap project fails.

Because no one has taken responsibility for those seven issues.

Or asked the contractors: What about your failed “big data” analytics projects?

Simple enough question.

Do you ask that question?

The Trend Point

Tuesday, April 10th, 2012

The Trend Point

Described by a “sister” publication as:

ArnoldIT has rolled out The Trend Point information service. Published Monday through Friday, the information service focuses on the intersection of open source software and next-generation analytics. The approach will be for the editors and researchers to identify high-value source documents and then encapsulate these documents into easily-digested articles and stories. In addition, critical commentary, supplementary links, and important facts from the source document are provided. Unlike a news aggregation service run by automated agents, librarians and researchers use the ArnoldIT Overflight tools to track companies, concepts, and products. The combination of human-intermediated research with Overflight provides an executive or business professional with a quick, easy, and free way to keep track of important developments in open source analytics. There is no charge for the service.

I was looking for something different to say other than just reporting a new data stream and found this under the “about” link:

I write for fee columns for Enterprise Technology Management, Information Today, Online Magazine, and KMWorld plus a few occasional items. My content reaches somewhere between one and three people each month.

I started to monetize Beyond Search in 2008. I have expanded our content services to white papers about a search, content processing or analytics. These reports are prepared for a client. The approach is objective and we include information that makes these documents suitable for the client’s marketing and sales efforts. Clients work closely with the Beyond Search professional to help ensure that the message is on target and clear. Rates are set to be within reach of organizations regardless of their size.

You can get coverage in this or one of our other information services, but we charge for our time. Stated another way: If you want a story about you, your company, or your product, you will be expected to write a check or pay via PayPal. We do not do news. We do this. (emphasis added to the first paragraph)

For some reason, I would have expected Stephen E. Arnold to reach more than …between one and three people each month. That sounds low to me. ;-)

The line “We do not do news.” makes me wonder what the University of Southampton paid to have a four-page document described as a “dissertation.” See: New Paper: Linked Data Strategy for Global Identity. Or for that matter, what will it cost to get into “The Trend Point?”

Thoughts?

We’re Not Very Good Statisticians

Monday, March 26th, 2012

We’re Not Very Good Statisticians by Steve Miller.

From the post:

I’ve received several emails/comments about my recent series of blogs on Duncan Watts’ interesting book “Everything is Obvious: *Once You Know the Answer — How Common Sense Fails Us.” Watts’ thesis is that the common sense that generally guides us well for life’s simple, mundane tasks often fails miserably when decisions get more complicated.

Three of the respondents suggested I take a look at “Thinking Fast and Slow,” by psychologist Daniel Kahneman, who along with the late economist Amos Tversky, was awarded the Nobel Prize in Economic Sciences for “seminal work in psychology that challenged the rational model of judgment and decision making.”

Steve’s post and the ones to follow are worth a close read.

When data, statistical or otherwise, agrees with me, I take that as a sign to evaluate it very carefully. Your mileage may vary.