Another Word For It
Patrick Durusau on Topic Maps and Semantic Diversity

October 18, 2012

Bigger Than A Bread Box

Filed under: Analytics,BigData,Hortonworks — Patrick Durusau @ 10:38 am

Hortonworks & Teradata: More Than Just an Elephant in a Box by Jim Walker.

I’m not going to wake up Christmas morning to find:

Teradata

But in case you are in the market for a big analytics hardware/software appliance, Jim writes:

Today our partner, Teradata, announced availability of the Teradata Aster Big Analytics Appliance, which packages our Hortonworks Data Platform (HDP) with Teradata Aster on a machine that is ready to plug in and bring big data value in hours.

There is more to this appliance than meets the eye… it is not just a simple packaging of software on hardware. Teradata and Hortonworks engineers have been working together for months tying our solutions together and optimizing them for an appliance. This solution gives an analyst the ability to leverage big data (social media, Web clickstream, call center, and other types of customer interaction data) in their analysis and all the while use the tools they are already familiar with. It is analytics and data discovery/exploration with big data (or HDP) inside… all on an appliance that can be operational in hours.

Not just anyone can do this

This is an engineered solution. Many analytics tools are building their solutions on top of Hadoop using Hive and HiveQL. This is a great approach but it lacks integration of metadata and metadata exchange. With the appliance we have extended a new approach using HCatalog and the Teradata SQL-H product. SQL-H is a conduit that allows new analysis to be created and schema changes to be adopted within Hadoop from Teradata. Analysts are abstracted completely from the Hadoop environment so they can focus on what they do best… analyze. All of this is enabled by an innovation provided by HCatalog, which enables this metadata exchange.

Shortcut to Big Data Exploration

In the appliance, Aster provides over 50 pre-built functions that allow analysts to perform segmentation, transformations and even pre-packaged marketing analytics. With this package, these valuable functions can now be applied to big data in Hadoop. This shortens the time it takes for an analyst to explore and discover value in big data. And if the pre-packaged functions aren’t explicit enough, Teradata Aster also provides an environment to create MapReduce functions that can be executed in HDP.

Just as well.

Red doesn’t really go with my office decor. Runs more towards the hulking black server tower, except for the artificial pink tree in the corner. 😉

October 4, 2012

R for Business Analytics

Filed under: Analytics,R — Patrick Durusau @ 2:11 pm

R for Business Analytics by A. Ohri.

I haven’t seen this volume, yet, but have read and cited Ohri’s blog, Decision Stats, often enough to have high expectations!

From the publisher’s blurb:

R for Business Analytics looks at some of the most common tasks performed by business analysts and helps the user navigate the wealth of information in R and its 4000 packages. With this information the reader can select the packages that can help process the analytical tasks with minimum effort and maximum usefulness. The use of Graphical User Interfaces (GUIs) is emphasized in this book to further cut down and bend the famous learning curve of learning R. This book is aimed at helping you kick-start analytics, including chapters on data visualization, code examples on web analytics and social media analytics, clustering, regression models, text mining, data mining models and forecasting. The book tries to expose the reader to a breadth of business analytics topics without burying the user in needless depth. The included references and links allow the reader to pursue these topics further.

This book is aimed at business analysts with basic programming skills for using R for Business Analytics. Note the scope of the book is neither statistical theory nor graduate level research for statistics, but rather it is for business analytics practitioners. Business analytics (BA) refers to the field of exploration and investigation of data generated by businesses. Business Intelligence (BI) is the seamless dissemination of information through the organization, which primarily involves business metrics both past and current for the use of decision support in businesses. Data Mining (DM) is the process of discovering new patterns from large data using algorithms and statistical methods. To differentiate between the three, BI is mostly current reports, BA is models to predict and strategize and DM matches patterns in big data. The R statistical software is the fastest growing analytics platform in the world, and is established in both academia and corporations for robustness, reliability and accuracy.

When you have seen it, please check back and post your comments.

Thanks!

August 30, 2012

The Top 10 Challenges in Extreme-Scale Visual Analytics [Human Bottlenecks and Parking Meters]

Filed under: Analytics,Interface Research/Design,Visualization — Patrick Durusau @ 2:04 pm

The Top 10 Challenges in Extreme-Scale Visual Analytics by Pak Chung Wong, Han-Wei Shen, Christopher R. Johnson, Chaomei Chen, and Robert B. Ross. (Link to PDF. IEEE Computer Graphics and Applications, July-Aug. 2012, pp. 63–67)

The top 10 challenges are:

  1. In Situ Interactive Analysis
  2. User-Driven Data Reduction
  3. Scalability and Multilevel Hierarchy
  4. Representing Evidence and Uncertainty
  5. Heterogeneous-Data Fusion
  6. Data Summarization and Triage for Interactive Query
  7. Analytics of Temporally Evolved Features
  8. The Human Bottleneck
  9. Design and Engineering Development
  10. The Renaissance of Conventional Wisdom

I was amused by #8: The Human Bottleneck, which reads:

Experts predict that all major high-performance computing (HPC) components—power, memory, storage, bandwidth, concurrence, and so on—will improve performance by a factor of 3 to 4,444 by 2018 [2]. Human cognitive capability will certainly remain constant. One challenge is to find alternative ways to compensate for human cognitive weaknesses.

It isn’t clear to me how speed at counting 0s and 1s is an indicator of “human cognitive weakness.”

Parking meters stand in the weather day and night. I don’t take that as a commentary on human endurance.

Do you?

August 24, 2012

Going Beyond the Numbers:…

Filed under: Analytics,Text Analytics,Text Mining — Patrick Durusau @ 1:39 pm

Going Beyond the Numbers: How to Incorporate Textual Data into the Analytics Program by Cindi Thompson.

From the post:

Leveraging the value of text-based data by applying text analytics can help companies gain competitive advantage and an improved bottom line, yet many companies are still letting their document repositories and external sources of unstructured information lie fallow.

That’s no surprise, since the application of analytics techniques to textual data and other unstructured content is challenging and requires a relatively unfamiliar skill set. Yet applying business and industry knowledge and starting small can yield satisfying results.

Capturing More Value from Data with Text Analytics

There’s more to data than the numerical organizational data generated by transactional and business intelligence systems. Although the statistics are difficult to pin down, it’s safe to say that the majority of business information for a typical company is stored in documents and other unstructured data sources, not in structured databases. In addition, there is a huge amount of business-relevant information in documents and text that reside outside the enterprise. To ignore the information hidden in text is to risk missing opportunities, including the chance to:

  • Capture early signals of customer discontent.
  • Quickly target product deficiencies.
  • Detect fraud.
  • Route documents to those who can effectively leverage them.
  • Comply with regulations such as XBRL coding or redaction of personally identifiable information.
  • Better understand the events, people, places and dates associated with a large set of numerical data.
  • Track competitive intelligence.

To be sure, textual data is messy and poses difficulties.

But, as Cindi points out, there are golden benefits in those hills of textual data.
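To make “starting small” concrete, here is a toy sketch in Python (mine, not Cindi’s method; the keyword list and tickets are invented) that flags possible customer discontent with nothing more than a keyword set:

  import string

  # Crude discontent flag: any ticket sharing a word with this set gets a second look.
  DISCONTENT = {"cancel", "refund", "frustrated", "broken", "unacceptable"}

  tickets = [
      "The new release is great, thanks!",
      "Third outage this week. This is unacceptable, I want a refund.",
  ]

  def words(text):
      # Lowercase and strip punctuation before tokenizing.
      return set(text.lower().translate(str.maketrans("", "", string.punctuation)).split())

  flagged = [t for t in tickets if DISCONTENT & words(t)]
  print(flagged)  # only the second ticket is flagged

Crude, but it is the kind of quick result that earns the budget for real text analytics.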

August 16, 2012

Building LinkedIn’s Real-time Activity Data Pipeline

Filed under: Aggregation,Analytics,Data Streams,Kafka,Systems Administration — Patrick Durusau @ 1:21 pm

Building LinkedIn’s Real-time Activity Data Pipeline by Ken Goodhope, Joel Koshy, Jay Kreps, Neha Narkhede, Richard Park, Jun Rao, and Victor Yang Ye. (pdf)

Abstract:

One trend in the implementation of modern web systems is the use of activity data in the form of log or event messages that capture user and server activity. This data is at the heart of many internet systems in the domains of advertising, relevance, search, recommendation systems, and security, as well as continuing to fulfill its traditional role in analytics and reporting. Many of these uses place real-time demands on data feeds. Activity data is extremely high volume and real-time pipelines present new design challenges. This paper discusses the design and engineering problems we encountered in moving LinkedIn’s data pipeline from a batch-oriented file aggregation mechanism to a real-time publish-subscribe system called Kafka. This pipeline currently runs in production at LinkedIn and handles more than 10 billion message writes each day with a sustained peak of over 172,000 messages per second. Kafka supports dozens of subscribing systems and delivers more than 55 billion messages to these consumer processes each day. We discuss the origins of this system, missteps on the path to real-time, and the design and engineering problems we encountered along the way.

More details on Kafka (see Choking Cassandra Bolt).

What if you think about message feeds as being pipelines that are large enough to see and configure?

Chip level pipelines are more efficient but harder to configure.

Perhaps passing messages is efficient and flexible enough for a class of use cases.
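For a feel of the publish-subscribe model the paper describes, here is a minimal sketch using the kafka-python client (the broker address, topic name and event fields are my inventions, not LinkedIn’s code):

  import json
  from kafka import KafkaProducer, KafkaConsumer

  # Producer side: emit an activity event (e.g. a page view) to a topic.
  producer = KafkaProducer(
      bootstrap_servers="localhost:9092",
      value_serializer=lambda event: json.dumps(event).encode("utf-8"),
  )
  producer.send("page-views", {"member_id": 42, "page": "/jobs", "ts": 1345000000})
  producer.flush()

  # Consumer side: any number of downstream systems can subscribe independently.
  consumer = KafkaConsumer(
      "page-views",
      bootstrap_servers="localhost:9092",
      auto_offset_reset="earliest",
      value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
  )
  for message in consumer:
      print(message.value)  # {'member_id': 42, 'page': '/jobs', 'ts': 1345000000}
      break

The point of the design is exactly this decoupling: producers never know how many subscribers are reading the stream.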

August 14, 2012

R2RML: RDB to RDF Mapping Language

Filed under: Analytics,BigData,R2RML,RDB,RDF — Patrick Durusau @ 3:29 pm

R2RML: RDB to RDF Mapping Language from the RDB2RDF Working Group.

From the news:

This document describes R2RML, a language for expressing customized mappings from relational databases to RDF datasets. Such mappings provide the ability to view existing relational data in the RDF data model, expressed in a structure and target vocabulary of the mapping author’s choice. R2RML mappings are themselves RDF graphs and written down in Turtle syntax. R2RML enables different types of mapping implementations. Processors could, for example, offer a virtual SPARQL endpoint over the mapped relational data, or generate RDF dumps, or offer a Linked Data interface. Comments are welcome through 15 September. (emphasis added)

Subscribe (prior to commenting).

Comments to: public-rdb2rdf-comments@w3.org.
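To see how small an R2RML mapping can be, here is a sketch (my table and column names, loosely modeled on the spec’s EMP example) that parses one with rdflib, underlining that the mapping is itself an RDF graph written in Turtle:

  from rdflib import Graph

  mapping_ttl = """
  @prefix rr: <http://www.w3.org/ns/r2rml#> .
  @prefix ex: <http://example.com/ns#> .

  ex:EmpMap
      rr:logicalTable [ rr:tableName "EMP" ] ;
      rr:subjectMap [
          rr:template "http://example.com/employee/{EMPNO}" ;
          rr:class ex:Employee
      ] ;
      rr:predicateObjectMap [
          rr:predicate ex:name ;
          rr:objectMap [ rr:column "ENAME" ]
      ] .
  """

  graph = Graph()
  graph.parse(data=mapping_ttl, format="turtle")
  print(len(graph), "triples in the mapping graph")

A processor reading this mapping would emit one ex:Employee resource per EMP row, which is all “viewing relational data as RDF” amounts to in the simple case.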

July 19, 2012

Analyzing 20,000 Comments

Filed under: Analytics,Data Mining — Patrick Durusau @ 7:34 am

Analyzing 20,000 Comments

First, congratulations on Chandoo.org reaching its 20,000th comment!

Second, the post does not release the data (email addresses, etc.) so it also doesn’t include the code.

Thinking of this as an exercise in analytics, which of the measures applied should lead to changes in behavior?

After all, we don’t mine data simply because we can.

What goals would you suggest and how would we measure meeting them based on the analysis described here?

July 13, 2012

Hadoop: A Powerful Weapon for Retailers

Filed under: Analytics,Data Science,Hadoop — Patrick Durusau @ 4:15 pm

Hadoop: A Powerful Weapon for Retailers

From the post:

With big data basking in the limelight, it is no surprise that large retailers have been closely watching its development… and more power to them! By learning to effectively utilize big data, retailers can significantly mold the market to their advantage, making themselves more competitive and increasing the likelihood that they will come out on top as a successful retailer. Now that there are open source analytical platforms like Hadoop, which allow for unstructured data to be transformed and organized, large retailers are able to make smart business decisions using the information they collect about customers’ habits, preferences, and needs.

As IT industry analyst Jeff Kelly explained on Wikibon, “Big Data combined with sophisticated business analytics have the potential to give enterprises unprecedented insights into customer behavior and volatile market conditions, allowing them to make data-driven business decisions faster and more effectively than the competition.” Predicting what customers want to buy, without a doubt, affects how many products they want to buy (especially if retailers add on a few of those wonderful customer discounts). Not only will big data analytics prove financially beneficial, it will also present the opportunity for customers to have a more individualized shopping experience.

This all sounds very promising but the difficulty lies in the fact that there are many channels in the consumer business now, such as online, in-store, call centers, mobile, social, etc., each with its own target-marketing advantage. In order for retailers to thrive in the market, they must learn to manage and hone in on all (or at least most) of these facets of business, which can be difficult if you keep in mind the amount of data that each channel generates. Sam Sliman, president at Optimal Solutions Integration, summarizes it perfectly: “Transparency rules the day. Inconsistency turns customers away. Retailer missteps can be glaring and costly.” By making fast market decisions, retailers can increase sales, win and maintain customers, improve margins, and boost market share, but this can really only be done with the right business analytics tools.

Interesting but I disagree with “…but the difficulty lies in the fact that there are many channels in the consumer business now, such as online, in-store, call centers, mobile, social, etc., each with its own target-marketing advantage.”

That can be a difficulty, if you are not technically capable of effectively using information from different channels.

But there is a more fundamental difficulty. Having the capacity to use multiple channels of information is no guarantee of effective use of those channels of information.

You could buy your programming department a Cray supercomputer but that doesn’t mean they can make good use of it.

Same is true for collecting or having the software to process “big data.”

The real difficulty is the shortage of analytical skills to explore and exploit data. Computers and software can enhance but not create those analytical skills.

Analytical skills are powerful weapons for retailers.

July 7, 2012

Subverting Ossified Departments [Moving beyond name calling]

Filed under: Analytics,Business Intelligence,Marketing,Topic Maps — Patrick Durusau @ 10:21 am

Brian Sommer has written on why analytics will not lead to new revenue streams, improved customer service, better stock options or other signs of salvation:

The Ossified Organization Won’t ‘Get’ Analytics (part 1 of 3)

How Tough Will Analytics Be in Ossified Firms? (Part 2 of 3)

Analytics and the Nimble Organization (part 3 of 3)

Why most firms won’t profit from analytics:

… Every day, companies already get thousands of ideas for new products, process innovations, customer interaction improvements, etc. and they fail to act on them. The rationale for this lack of movement can be:

– That’s not the way we do things here

– It’s a good idea but it’s just not us

– It’s too big of an idea

– It will be too disruptive

– We’d have to change so many things

– I don’t know who would be responsible for such a change

And, of course,

– It’s not my job

So if companies don’t act on the numerous, free suggestions from current customers and suppliers, why are they so deluded into thinking that IT-generated, analytic insights will actually fare better? They’re kidding themselves.

[part 1]

What Brian describes in amusing and great detail are all failures that no amount of IT, analytics or otherwise, can address. Not a technology problem. Not even an organization (as in form) issue.

It is a personnel issue. You can either retrain (which I find unlikely to succeed) or you can get new personnel. It really is that simple. And with a glutted IT market, now would be the time to recruit an IT department not wedded to current practices. But you would need to do the same in accounting, marketing, management, etc.

But calling a department “ossified” is just name calling. You have to move beyond name calling to establish a bottom line reason for change.

Assuming you have access, topic maps can help you integrate data across departments that don’t usually interchange data. So you can make the case for particular changes in terms of bottom line expenses.

Here is a true story with the names omitted and the context changed a bit:

Assume you are a publisher of journals, with both institutional and personal subscriptions. One of the things that all periodical publishers have to address are claims for “missing” issues. It happens, mail room mistakes, postal system errors, simply lost in transit, etc. Subscribers send in claims for those missing issues.

Some publishers maintain records of all subscriptions, including any correspondence and records, which are consulted by some full time staffer who answers all “claim” requests. One argument being there is a moral obligation to make sure non-subscribers don’t get an issue to which they are not entitled. Seriously, I have heard that argument made.

Analytics and topic maps could combine the subscription records with claim records and expenses for running the claims operation to show the expense of detailed claim service. Versus the cost of having the mail room toss another copy back to the requester. (Our printing cost was $3.00/copy so the math wasn’t the hard part.)
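A back-of-the-envelope version of that argument (the $3.00/copy figure is from the example; the claim volume, rejection rate and staffing cost are invented purely for illustration):

  # Compare the two claims policies with rough, assumed numbers.
  claims_per_year = 5_000
  cost_per_copy = 3.00            # printing cost, from the example above
  clerk_cost_per_year = 45_000    # assumed fully loaded cost of a full-time claims clerk
  rejection_rate = 0.10           # assume verification turns away 10% of claims

  verify_every_claim = clerk_cost_per_year + claims_per_year * (1 - rejection_rate) * cost_per_copy
  just_mail_a_copy = claims_per_year * cost_per_copy

  print(f"Verify first: ${verify_every_claim:,.0f} per year")   # $58,500
  print(f"Just mail it: ${just_mail_a_copy:,.0f} per year")     # $15,000

Even with generous assumptions the verification operation costs several times what simply mailing the copy does, and that is the bottom-line number the integrated data lets you put on the table.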

Topic maps help integrate the data you “obtain” from other departments. Just enough to make your point. Don’t have to integrate all the data, just enough to win the argument. Until the next argument comes along and you take a bit bigger bite of the apple.

Agile organizations are run by people agile enough to take control of them.

You can wait for permission from an ossified organization or you can use topic maps to take the first “bite.”

Your move.

PS: If you have investments in journal publishing you might want to check on claims handling.

June 11, 2012

Open-Source R software driving Big Data analytics in government

Filed under: Analytics,BigData,R — Patrick Durusau @ 4:22 pm

Open-Source R software driving Big Data analytics in government by David Smith.

From the post:

As government agencies and departments expand their capabilities for collecting information, the volume and complexity of digital data stored for public purposes is far outstripping departments’ ability to make sense of it all. Even worse, with data siloed within individual departments and little cross-agency collaboration, untold hours and dollars are being spent on data collection and storage with little return on investment in the form of information-based products and services for the public good.

But that may now be starting to change, with the Obama administration’s Big Data Research and Development Initiative.

In fact, the administration has had a Big Data agenda since its earliest days, with the appointment of Aneesh Chopra as the nation’s first chief technology officer in 2009. (Chopra passed the mantle to current CTO Todd Park in March.) One of Chopra’s first initiatives was the creation of data.gov, a vehicle to make government data and open-source tools available in a timely and accessible format for a community of citizen data scientists to make sense of it all.

For example, independent statistical analysis of data released by data.gov revealed a flaw in the 2000 Census results that apparently went unnoticed by the Census Bureau.

David goes on to give some other examples of the use of R with government data.

The US federal government is diverse enough that its IT solutions will be diverse as well. But R will be familiar to some potential clients.

I first saw this at the Revolutions blog on R.

Real-time Analytics with HBase [Aggregation is a form of merging.]

Filed under: Aggregation,Analytics,HBase — Patrick Durusau @ 4:21 pm

Real-time Analytics with HBase

From the post:

Here are slides from another talk we gave at both Berlin Buzzwords and at HBaseCon in San Francisco last month. In this presentation Alex describes one approach to real-time analytics with HBase, which we use at Sematext via HBaseHUT. If you like these slides you will also like HBase Real-time Analytics Rollbacks via Append-based Updates.

The slides come in a long and short version. Both are very good but I suggest the long version.

I particularly liked the “Background: pre-aggregation” slide (8 in the short version, 9 in the long version).

Aggregation as a form of merging.

What information is lost as part of aggregation? (That assumes we know the aggregation process. Without that, we can’t say what is lost.)

What information (read subjects/relationships) do we want to preserve through an aggregation process?

What properties should those subjects/relationships have?

(Those are topic map design/modeling questions.)
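A toy illustration of those questions (Python, with invented events): pre-aggregation merges records that share a key, and whatever is not in the key is exactly what you lose.

  from collections import Counter

  events = [
      {"ts": "2012-06-11T14:05", "user": "alice", "page": "/pricing"},
      {"ts": "2012-06-11T14:40", "user": "bob",   "page": "/pricing"},
      {"ts": "2012-06-11T15:10", "user": "alice", "page": "/docs"},
  ]

  # Merge every event in the same hour into one count...
  hourly = Counter(e["ts"][:13] for e in events)
  print(hourly)  # Counter({'2012-06-11T14': 2, '2012-06-11T15': 1})

  # ...and the user and page subjects are gone. If those relationships matter,
  # they have to survive in the aggregation key.
  hourly_by_page = Counter((e["ts"][:13], e["page"]) for e in events)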

June 4, 2012

Data hoarding and bias among big challenges in big data and analytics

Filed under: Analytics,BigData — Patrick Durusau @ 4:33 pm

Data hoarding and bias among big challenges in big data and analytics by Linda Tucci.

From the post:

Hype aside, exploiting big data and analytics will matter hugely to companies’ future performance, remaking whole industries and spawning new ones. The list of challenges is long, however. They range from the well-documented paucity of data scientists available to crunch that big data, to more intractable but less-mentioned problems rooted in human nature.

One of the latter is humans’ tendency to hoard data. Another is their tendency to hold on to preconceived beliefs even when the data screams otherwise. That was the consensus of a panel of data experts speaking on big data and analytics at the recent MIT Sloan CIO Symposium in Cambridge, Mass. Another landmine? False hope. There is no final truth in big data and analytics, as the enterprises that do big data well already know. Iteration is all, the panel agreed.

Moreover, except for the value of iteration, CIOs can forget about best practices. Emerging so-called next practices are about the best companies can lean on as they dive into big data, said computer scientist Michael Chui, San Francisco-based senior fellow at the McKinsey Global Institute, the research arm of New York-based McKinsey & Co. Inc.

“The one thing we know that doesn’t work: Wait five years until the perfect data warehouse is ready,” said Chui, who’s an author of last year’s massive McKinsey report on the value of big data.

Seeing data quality in relative terms

In fact, obsessing over data quality is one of the first hurdles many companies have to overcome if they hope to use big data effectively, Chui said. Data accuracy is of paramount importance in banks’ financial statements. Messy data, however, contains patterns that can highlight business problems or provide insights that generate significant value, as laid out in a related story about the symposium panel, “Seize big data and analytics or fall behind, MIT panel says.”

Issues that you will have to face in the creation of topic maps, big data or no.

May 25, 2012

Build your own twitter like real time analytics – a step by step guide

Filed under: Analytics,Cassandra,Tweets — Patrick Durusau @ 4:22 am

Build your own twitter like real time analytics – a step by step guide

Where else but High Scalability would you find a “how-to” article like this one? Complete with guide and source code.

Good DIY project for the weekend.

Major social networking platforms like Facebook and Twitter have developed their own architectures for handling the need for real-time analytics on huge amounts of data. However, not every company has the need or resources to build their own Twitter-like solution.

In this example we have taken the same Twitter/Facebook-like blueprint, and made it simple enough for developers to implement. We have taken the following approach in our implementation:

  1. Use In Memory Data Grid (XAP) for handling the real time stream data-processing.
  2. BigData database (Cassandra) for storing the historical data and managing the trend analytics
  3. Use Cloudify (cloudifysource.org) for managing and automating the deployment on private or public cloud

The example demonstrates a simple case of word count analytics. It uses Spring Social to plug in to real Twitter feeds. The solution is designed to efficiently cope with getting and processing the large volume of tweets. First, we partition the tweets so that we can process them in parallel, but we have to decide on how to partition them efficiently. Partitioning by user might not be sufficiently balanced, therefore we decided to partition by the tweet ID, which we assume to be globally unique.

Then we need to persist and process the data with low latency, and for this we store the tweets in memory.
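The partitioning decision is the interesting part. Here it is reduced to a few lines of Python (not the XAP/Cassandra code from the guide; the tweet IDs and texts are invented):

  from collections import Counter

  NUM_PARTITIONS = 4
  partitions = [Counter() for _ in range(NUM_PARTITIONS)]

  tweets = [
      {"id": 987650001, "text": "big data is big"},
      {"id": 987650002, "text": "real time analytics with cassandra"},
  ]

  for tweet in tweets:
      p = tweet["id"] % NUM_PARTITIONS          # balanced because IDs are globally unique
      partitions[p].update(tweet["text"].lower().split())

  # The global trend is just the merge of the per-partition counters.
  totals = sum(partitions, Counter())
  print(totals.most_common(3))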

Automated harvesting of tweets has real potential, even with clear text transmission. Or perhaps because of it.

May 22, 2012

Uncertainty Principle for Serendipity?

Filed under: Analytics,Data Integration — Patrick Durusau @ 10:22 am

Curt Monash writes in Cool analytic stories

There are several reasons it’s hard to confirm great analytic user stories. First, there aren’t as many jaw-dropping use cases as one might think. For as I wrote about performance, new technology tends to make things better, but not radically so. After all, if its applications are …

… all that bloody important, then probably people have already been making do to get it done as best they can, even in an inferior way.

Further, some of the best stories are hard to confirm; even the famed beer/diapers story isn’t really true. Many application areas are hard to nail down due to confidentiality, especially but not only in such “adversarial” domains as anti-terrorism, anti-spam, or anti-fraud.

How will we “know” when better data display/mining techniques enable more serendipity?

Anecdotal stories about serendipity abound.

Measuring serendipity requires knowing: (rate of serendipitous discoveries × importance of serendipitous discoveries) / opportunities for serendipitous discoveries.

Need to add in a multiplier effect for the impact that one serendipitous discovery may have to create opportunities for other serendipitous discoveries (a serendipitous criticality point) and probably some other things I have overlooked.
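Putting that in symbols (my notation; no more rigorous than the sentences above):

  S = m \cdot \frac{r \cdot i}{o}

where r is the rate of serendipitous discoveries, i their average importance, o the opportunities for them, and m >= 1 is a criticality multiplier for discoveries that open up further opportunities.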

What would you add to the equation?

Realizing that we may be staring at the “right” answer and never know it.

How’s that for an uncertainty principle?

May 18, 2012

Lavastorm Desktop Public

Filed under: Analytics,Lavastorm Desktop Public — Patrick Durusau @ 4:09 pm

Lavastorm Desktop Public

Lavastorm Desktop Public is a powerful, visual and easy-to-use tool for anyone combining and analyzing data. A free version of our award winning Lavastorm Desktop software, the Public edition allows you to harness the power of our enterprise-class analytics engine right on your desktop. You’ll love Lavastorm Desktop Public if you want to:

  • Get more productive by reducing the time to create analytics by 90% or more compared to underpowered analytic tools, such as Excel or Access
  • Stop flying blind by unifying data locked away in silos or scattered on your desktop
  • Eliminate time spent waiting for others to integrate data or implement new analytics
  • Gain greater control for analyzing data against complex business logic and for manipulating data from Excel, CSV or ASCII files

First time I have encountered it.

Suggestions/comments?

May 8, 2012

Reading Other People’s Mail For Fun and Profit

Filed under: Analytics,Data Analysis,Intelligence — Patrick Durusau @ 6:16 pm

Bob Gourley writes much better content than he does titles: Osama Bin Laden Letters Analyzed: A rapid assessment using Recorded Future’s temporal analytic technologies and intelligence analysis tools. (Sorry Bob.)

Bob writes:

The Analysis Intelligence site provides open source analysis and information on a variety of topics based on the temporal analytic technology and intelligence analysis tools of Recorded Future. Shortly after the release of 175 pages of documents from the Combating Terrorism Center (CTC) a very interesting assessment was posted on the site. This assessment sheds light on the nature of these documents and also highlights some of the important context that the powerful capabilities of Recorded Future can provide.

The analysis by Recorded Future is succinct and well done so I cite most of it below. I’ll conclude with some of my own thoughts as an experienced intelligence professional and technologist on some of the “So What” of this assessment.

If you are interested in analytics, particularly visual analytics, you will really appreciate this piece.

Recorded Future has a post on the US Presidential Election. Just to be on the safe side, I would “fuzz” the data when it got close to the election. 😉

April 29, 2012

HBase Real-time Analytics & Rollbacks via Append-based Updates

Filed under: Analytics,HBase — Patrick Durusau @ 3:21 pm

HBase Real-time Analytics & Rollbacks via Append-based Updates by Alex Baranau.

From the post:

In this part 1 of a 3-part post series we’ll describe how we use HBase at Sematext for real-time analytics and how we can perform data rollbacks by using an append-only updates approach.

Some bits of this topic were already covered in Deferring Processing Updates to Increase HBase Write Performance and some were briefly presented at BerlinBuzzwords 2011 (video). We will also talk about some of the ideas below during HBaseCon-2012 in late May (see Real-time Analytics with HBase). The approach described in this post is used in our production systems (SPM & SA) and the implementation was open-sourced as HBaseHUT project.

Problem we are Solving

While HDFS & MapReduce are designed for massive batch processing and with the idea of data being immutable (write once, read many times), HBase includes support for additional operations such as real-time and random read/write/delete access to data records. HBase performs its basic job very well, but there are times when developers have to think at a higher level about how to utilize HBase capabilities for specific use-cases. HBase is a great tool with good core functionality and implementation, but it does require one to do some thinking to ensure this core functionality is used properly and optimally. The use-case we’ll be working with in this post is a typical data analytics system where:

  • new data are continuously streaming in
  • data are processed and stored in HBase, usually as time-series data
  • processed data are served to users who can navigate through most recent data as well as dig deep into historical data

Although the above points frame the use-case relatively narrowly, the approach and its implementation that we’ll describe here are really more general and applicable to a number of other systems, too. The basic issues we want to solve are the following:

  • increase record update throughput. Ideally, despite the high volume of incoming data, changes can be applied in real time. Usually, due to the limitations of the “normal HBase update”, which requires Get+Put operations, updates are applied using a batch-processing approach (e.g. as MapReduce jobs). This, of course, is anything but real-time: incoming data is not immediately seen. It is seen only after it has been processed.
  • ability to roll back changes in the served data. Human errors or any other issues should not permanently corrupt data that the system serves.
  • ability to fetch data interactively (i.e. fast enough for impatient humans). When one navigates through a small amount of recent data, as well as when the selected time interval spans years, the retrieval should be fast.

Here is what we consider an “update”:

  • addition of a new record if no record with the same key exists
  • update of an existing record with a particular key
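To see why appends make both fast writes and rollback cheap, here is a plain-Python sketch of the idea (not the HBaseHUT API; the keys and values are invented):

  from collections import defaultdict

  log = defaultdict(list)           # key -> [(timestamp, delta), ...]

  def update(key, ts, delta):
      log[key].append((ts, delta))  # no Get before the Put, so writes stay fast

  def read(key, as_of=None):
      # Fold the appended deltas; a cut-off timestamp acts as a rollback point.
      return sum(d for ts, d in log[key] if as_of is None or ts <= as_of)

  update("page:/pricing", 100, 1)
  update("page:/pricing", 200, 1)
  update("page:/pricing", 300, 5)               # suppose this write was a bad import

  print(read("page:/pricing"))                  # 7
  print(read("page:/pricing", as_of=250))       # 2, i.e. rolled back past the bad write

In the real system the folding happens at read time or during compaction, but the shape of the trick is the same.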

See anything familiar? Does that resemble your use cases?

The proffered solution may not fit your use case(s) but this is an example of exploring a solution. Not fitting a problem to a solution. Not the same thing.

HBase Real-time Analytics & Rollbacks via Append-based Updates Part 2 is available. Solution uses HBaseHUT. Really informative graphics in part 2 as well.

Very interested in seeing Part 3!

Text Analytics Summit Europe – highlights and reflections

Filed under: Analytics,Natural Language Processing,Text Analytics — Patrick Durusau @ 2:01 pm

Text Analytics Summit Europe – highlights and reflections by Tony Russell-Rose.

Earlier this week I had the privilege of attending the Text Analytics Summit Europe at the Royal Garden Hotel in Kensington. Some of you may of course recognise this hotel as the base for Justin Bieber’s recent visit to London, but sadly (or is that fortunately?) he didn’t join us. Next time, maybe…

Ranking reasons to attend:

  • #1 Text Analytics Summit Europe – meet other attendees, presentations
  • #2 Kensington Gardens and Hyde Park (been there, it is more impressive than you can imagine)
  • #N +1 Justin Bieber being in London (or any other location)

I was disappointed by the lack of links to slides or videos of the presentations.

Tony’s post does have pointers to people and resources you may have missed.

Question: Do you think “text analytics” and “data mining” are different? If so, how?

April 22, 2012

The wrong way: Worst best practices in ‘big data’ analytics programs

Filed under: Analytics,BigData,Government,Government Data — Patrick Durusau @ 7:07 pm

The wrong way: Worst best practices in ‘big data’ analytics programs

Rick Sherman writes:

“Big data” analytics is hot. Read any IT publication or website and you’ll see business intelligence (BI) vendors and their systems integration partners pitching products and services to help organizations implement and manage big data analytics systems. The ads and the big data analytics press releases and case studies that vendors are rushing out might make you think it’s easy — that all you need for a successful deployment is a particular technology.

If only it were that simple. While BI vendors are happy to tell you about their customers who are successfully leveraging big data for analytics uses, they’re not so quick to discuss those who have failed. There are many potential reasons why big data analytics projects fall short of their goals and expectations. You can find lots of advice on big data analytics best practices; below are some worst practices for big data analytics programs so you know what to avoid.

Rick gives seven reasons why “big data” analytics projects fail:

  1. “If we build it, they will come.”
  2. Assuming that the software will have all the answers.
  3. Not understanding that you need to think differently.
  4. Forgetting all the lessons of the past.
  5. Not having the requisite business and analytical expertise.
  6. Treating the project like it’s a science experiment.
  7. Promising and trying to do too much.

Seven reasons that should be raised when the NSA Money Trap project fails.

Because no one has taken responsibility for those seven issues.

Or asked the contractors: What about your failed “big data” analytics projects?

Simple enough question.

Do you ask that question?

April 10, 2012

The Trend Point

Filed under: Analytics,Blogs,Open Source — Patrick Durusau @ 6:45 pm

The Trend Point

Described by a “sister” publication as:

ArnoldIT has rolled out The Trend Point information service. Published Monday through Friday, the information service focuses on the intersection of open source software and next-generation analytics. The approach will be for the editors and researchers to identify high-value source documents and then encapsulate these documents into easily-digested articles and stories. In addition, critical commentary, supplementary links, and important facts from the source document are provided. Unlike a news aggregation service run by automated agents, librarians and researchers use the ArnoldIT Overflight tools to track companies, concepts, and products. The combination of human-intermediated research with Overflight provides an executive or business professional with a quick, easy, and free way to keep track of important developments in open source analytics. There is no charge for the service.

I was looking for something different to say other than just reporting a new data stream and found this under the “about” link:

I write for fee columns for Enterprise Technology Management, Information Today, Online Magazine, and KMWorld plus a few occasional items. My content reaches somewhere between one and three people each month.

I started to monetize Beyond Search in 2008. I have expanded our content services to white papers about a search, content processing or analytics. These reports are prepared for a client. The approach is objective and we include information that makes these documents suitable for the client’s marketing and sales efforts. Clients work closely with the Beyond Search professional to help ensure that the message is on target and clear. Rates are set to be within reach of organizations regardless of their size.

You can get coverage in this or one of our other information services, but we charge for our time. Stated another way: If you want a story about you, your company, or your product, you will be expected to write a check or pay via PayPal. We do not do news. We do this. (emphasis added to the first paragraph)

For some reason, I would have expected Stephen E. Arnold to reach more than …between one and three people each month. That sounds low to me. 😉

The line “We do not do news” makes me wonder what the University of Southampton paid to have a four page document described as a “dissertation.” See: New Paper: Linked Data Strategy for Global Identity. Or for that matter, what will it cost to get into “The Trend Point?”

Thoughts?

March 26, 2012

We’re Not Very Good Statisticians

Filed under: Analytics,Statistics — Patrick Durusau @ 6:36 pm

We’re Not Very Good Statisticians by Steve Miller.

From the post:

I’ve received several emails/comments about my recent series of blogs on Duncan Watts’ interesting book “Everything is Obvious: *Once You Know the Answer — How Common Sense Fails Us.” Watts’ thesis is that the common sense that generally guides us well for life’s simple, mundane tasks often fails miserably when decisions get more complicated.

Three of the respondents suggested I take a look at “Thinking Fast and Slow,” by psychologist Daniel Kahneman, who along with the late economist Amos Tversky, was awarded the Nobel Prize in Economic Sciences for “seminal work in psychology that challenged the rational model of judgment and decision making.”

Steve’s post and the ones to follow are worth a close read.

When data, statistical or otherwise, agrees with me, I take that as a sign to evaluate it very carefully. Your mileage may vary.

March 25, 2012

Tesseract – Fast Multidimensional Filtering for Coordinated Views

Filed under: Analytics,Dataset,Filters,Multivariate Statistics,Visualization — Patrick Durusau @ 7:16 pm

Tesseract – Fast Multidimensional Filtering for Coordinated Views

From the post:

Tesseract is a JavaScript library for filtering large multivariate datasets in the browser. Tesseract supports extremely fast (<30ms) interaction with coordinated views, even with datasets containing a million or more records; we built it to power analytics for Square Register, allowing merchants to slice and dice their payment history fluidly.

Since most interactions only involve a single dimension, and then only small adjustments are made to the filter values, incremental filtering and reducing is significantly faster than starting from scratch. Tesseract uses sorted indexes (and a few bit-twiddling hacks) to make this possible, dramatically increasing the performance of live histograms and top-K lists. For more details on how Tesseract works, see the API reference.
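The sorted-index trick is easy to see in miniature. A Python toy (not Tesseract’s JavaScript API; the payment amounts are invented):

  import bisect

  payments = [{"amount": a} for a in (5, 12, 3, 40, 7, 25, 18)]

  # Build the sorted index for the "amount" dimension once.
  amounts = sorted(p["amount"] for p in payments)

  def count_in_range(lo, hi):
      # Each filter adjustment is two binary searches, not a rescan of every record.
      return bisect.bisect_right(amounts, hi) - bisect.bisect_left(amounts, lo)

  print(count_in_range(5, 20))  # 4 payments between 5 and 20 inclusive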

Are you ready to “slice and dice” your data set?

January 27, 2012

Analytics with MongoDB (commercial opportunity here)

Filed under: Analytics,Data,Data Analysis,MongoDB — Patrick Durusau @ 4:35 pm

Analytics with MongoDB

Interesting enough slide deck on analytics with MongoDB.

Relies on custom programming and then closes with this punchline (along with others, slide #41):

  • If you’re a business analyst you have a problem
    • better be BFF with some engineer 🙂

I remember when word processing required a lot of “dot” commands and editing markup languages with little or no editor support. Twenty years (has it been that long?) later and business analysts are doing word processing, markup and damned near print shop presentation without working close to the metal.

Can anyone name any products that have made large sums of money making it possible for business analysts and others to perform those tasks?

If so, ask yourself if you would like to have a piece of the action that frees business analysts from script kiddie engineers?

Even if a general application is out of reach at present, imagine writing access routines for common public data sites.

Create a market for the means to import and access particular data sets.
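For a sense of what such an access routine looks like, here is a pymongo sketch (the database, collection and field names are mine) of the kind of query a thin GUI could sit on top of:

  from pymongo import MongoClient

  client = MongoClient("mongodb://localhost:27017")
  events = client["analytics"]["events"]

  # Daily page-view counts, computed server-side by the aggregation framework.
  pipeline = [
      {"$match": {"type": "pageview"}},
      {"$group": {"_id": "$date", "views": {"$sum": 1}}},
      {"$sort": {"_id": 1}},
  ]

  for row in events.aggregate(pipeline):
      print(row["_id"], row["views"])

Wrap a few of these behind menus and input boxes and the business analyst no longer needs a BFF engineer for routine questions.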

December 17, 2011

IBM Redbooks Reveals Content Analytics

Filed under: Analytics,Data Mining,Entity Extraction,Text Analytics — Patrick Durusau @ 6:31 am

IBM Redbooks Reveals Content Analytics

From Beyond Search:

IBM Redbooks has put out some juicy reading for the azure chip consultants wanting to get smart quickly with IBM Content Analytics Version 2.2: Discovering Actionable Insight from Your Content. The sixteen chapters of this book take the reader from an overview of IBM content analytics, through understanding the details, to troubleshooting tips. The above link provides an abstract of the book, as well as links to download it as a PDF, view in HTML/Java, or order a hardcopy.

Abstract:

With IBM® Content Analytics Version 2.2, you can unlock the value of unstructured content and gain new business insight. IBM Content Analytics Version 2.2 provides a robust interface for exploratory analytics of unstructured content. It empowers a new class of analytical applications that use this content. Through content analysis, IBM Content Analytics provides enterprises with tools to better identify new revenue opportunities, improve customer satisfaction, and provide early problem detection.

To help you achieve the most from your unstructured content, this IBM Redbooks® publication provides in-depth information about Content Analytics. This book examines the power and capabilities of Content Analytics, explores how it works, and explains how to design, prepare, install, configure, and use it to discover actionable business insights.

This book explains how to use the automatic text classification capability, from the IBM Classification Module, with Content Analytics. It explains how to use the LanguageWare® Resource Workbench to create custom annotators. It also explains how to work with the IBM Content Assessment offering to timely decommission obsolete and unnecessary content while preserving and using content that has business value.

The target audience of this book is decision makers, business users, and IT architects and specialists who want to understand and use their enterprise content to improve and enhance their business operations. It is also intended as a technical guide for use with the online information center to configure and perform content analysis with Content Analytics.

The cover article points out the Redbooks have an IBM slant, which isn’t surprising. When you need big iron for an enterprise project, that IBM is one of a handful of possible players isn’t surprising either.

October 27, 2011

AnalyticBridge

Filed under: Analytics,Bibliography,Data Analysis — Patrick Durusau @ 4:45 pm

AnalyticBridge: A Social Network for Analytic Professionals

Some interesting resources, possibly useful groups.

Anyone with experience with this site?

October 19, 2011

Rapid-I: Report the Future

Filed under: Analytics,Data Mining,Document Classification,Prediction — Patrick Durusau @ 3:15 pm

Rapid-I: Report the Future

Source of:

RapidMiner: Professional open source data mining made easy.

Analytical ETL, Data Mining, and Predictive Reporting with a single solution

RapidAnalytics: Collaborative data analysis power.

No 1 in open source business analytics

The key product for business critical predictive analysis

RapidDoc: Webbased solution for document retrieval and analysis.

Classify text, identify trends as well as emerging topics

Easy to use and configure

From About Rapid-I:

Rapid-I provides software, solutions, and services in the fields of predictive analytics, data mining, and text mining. The company concentrates on automatic intelligent analyses on a large-scale base, i.e. for large amounts of structured data like database systems and unstructured data like texts. The open-source data mining specialist Rapid-I enables other companies to use leading-edge technologies for data mining and business intelligence. The discovery and leverage of unused business intelligence from existing data enables better informed decisions and allows for process optimization.

The main product of Rapid-I, the data analysis solution RapidMiner is the world-leading open-source system for knowledge discovery and data mining. It is available as a stand-alone application for data analysis and as a data mining engine which can be integrated into your own products. By now, thousands of applications of RapidMiner in more than 30 countries give their users a competitive edge. Among the users are well-known companies such as Ford, Honda, Nokia, Miele, Philips, IBM, HP, Cisco, Merrill Lynch, BNP Paribas, Bank of America, mobilkom austria, Akzo Nobel, Aureus Pharma, PharmaDM, Cyprotex, Celera, Revere, LexisNexis, Mitre and many medium-sized businesses benefitting from the open-source business model of Rapid-I.

Data mining/analysis is the first part of any topic map project, however large or small. These tools, which I have not (yet) tried, are likely to prove useful in such projects. Comments welcome.

September 20, 2011

Silverlight® Visualizations… Changing the Way We Look at Predictive Analytics

Filed under: Analytics,Prediction,Subject Identity — Patrick Durusau @ 7:53 pm

Silverlight® Visualizations… Changing the Way We Look at Predictive Analytics

Webinar: Tuesday, October 18, 2011 10:00 AM – 11:00 AM PDT

Presented by Caroline Junkin, Director of Analytics Solutions for Predixion Software.

That’s about all the webinar form says so I went looking for more information. 😉

Predixion Insight™ Video Library

From that page:

Predixion Software’s video library contains tutorials that explore the predictive analytics features currently available in Predixion Insight™, demonstrations that walk you through various applications for predictive analytics and Webinar Replays.

If subjects can include subjects that some people don’t think exist, then subjects can certainly include subjects we think may exist at some point in the future. And no doubt our references to them will change over time.

September 1, 2011

Greenplum Community

Filed under: Algorithms,Analytics,Machine Learning,SQL — Patrick Durusau @ 6:00 pm

A post by Alex Popescu, Data Scientist Summit Videos, led me to discover the Greenplum Community.

Hosted by Greenplum:

Greenplum is the pioneer of Enterprise Data Cloud™ solutions for large-scale data warehousing and analytics, providing customers with flexible access to all their data for business intelligence and advanced analytics. Greenplum offers industry-leading performance at a low cost for companies managing terabytes to petabytes of data. Data-driven businesses around the world, including NASDAQ, NYSE Euronext, Silver Spring Networks and Zions Bancorporation, have adopted Greenplum Database-based products to support their mission-critical business functions.

Registration (free) brings access to the videos from the Data Scientist Summit.

The “community” is focused on Greenplum software (there is a “community” edition). Do be aware that Greenplum Database CE is a 1.7 GB download. Just so you know.

August 18, 2011

Building data startups: Fast, big, and focused

Filed under: Analytics,BigData,Data,Data Analysis,Data Integration — Patrick Durusau @ 6:54 pm

Building data startups: Fast, big, and focused (O’Reilly original)

Republished by Forbes as:
Data powers a new breed of startup

Based on the talk Building data startups: Fast, Big, and Focused

by Michael E. Driscoll

From the post:

A new breed of startup is emerging, built to take advantage of the rising tides of data across a variety of verticals and the maturing ecosystem of tools for its large-scale analysis.

These are data startups, and they are the sumo wrestlers on the startup stage. The weight of data is a source of their competitive advantage. But like their sumo mentors, size alone is not enough. The most successful of data startups must be fast (with data), big (with analytics), and focused (with services).

Describes the emerging big data stack and says:

The competitive axes and representative technologies on the Big Data stack are illustrated here. At the bottom tier of data, free tools are shown in red (MySQL, Postgres, Hadoop), and we see how their commercial adaptations (InfoBright, Greenplum, MapR) compete principally along the axis of speed; offering faster processing and query times. Several of these players are pushing up towards the second tier of the data stack, analytics. At this layer, the primary competitive axis is scale: few offerings can address terabyte-scale data sets, and those that do are typically proprietary. Finally, at the top layer of the big data stack lies the services that touch consumers and businesses. Here, focus within a specific sector, combined with depth that reaches downward into the analytics tier, is the defining competitive advantage.

The future isn’t going to be getting users to develop topic maps but your use of topic maps (and other tools) to create data products of interest to users.

Think of it as being the difference between selling oil change equipment versus being the local Jiffy Lube. (Sorry, for non-U.S. residents, Jiffy Lube is a chain of oil change and other services. Some 2,000 locations in the North America.) I dare say that Jiffy Lube and its competitors do more auto services than users of oil change equipment.

August 14, 2011

KDnuggets

Filed under: Analytics,Conferences,Data Mining,Humor — Patrick Durusau @ 7:13 pm

KDnuggets

Good site to follow for data mining and analytics resources, ranging from conference announcements, data mining sites and forums, and software to crossword puzzles.

See: Analytics Crossword Puzzle 2.

I like that, it has a timer. One that starts automatically.

Maybe topic maps need a crossword puzzle or two. Pointers? Suggestions for content/clues?
