Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

August 11, 2011

Building The Ocean With Big Data

Filed under: Analytics,BigData,Data Analysis — Patrick Durusau @ 6:33 pm

Building The Ocean With Big Data

From the post:

While working at an agency with a robust analytics group is exciting, it can also be frustrating at times. Clients challenge us with questions that are often difficult to answer with a simple data pull/request. For example, an auto client may ask how digital media is driving auto sales for a specific model in a specific location. Another client may like to better understand how much they need to spend on digital media, and to that end, which media sequencing is most effective (e.g. search -> display -> search -> social, etc.). Questions like these require multiple large sets of data, often in varying formats and time ranges. So the question becomes, with data collection and aggregation more important than ever, what steps can we take to make sure we analyze Big Data in a meaningful way?

Topic maps face the same issue as the analysis of Big Data: where do you start?

If you start with no plan, or a poorly conceived one, you can work very hard for little or no gain. This article, while framed for analysis, has good principles for organizing the analysis or mapping of Big Data.
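The media-sequencing question from the quote gives a feel for what such analysis involves. As a minimal sketch, assuming per-user touchpoint logs already ordered by time (all names and data here are invented for illustration):

```python
from collections import Counter

# Hypothetical, time-ordered touchpoint log: (user_id, channel) events.
events = [
    ("u1", "search"), ("u1", "display"), ("u1", "search"), ("u1", "social"),
    ("u2", "display"), ("u2", "search"),
    ("u3", "search"), ("u3", "display"), ("u3", "search"), ("u3", "social"),
]
conversions = {"u1", "u3"}  # users who went on to buy (invented)

# Group each user's touchpoints into an ordered channel path.
paths = {}
for user, channel in events:
    paths.setdefault(user, []).append(channel)

# Count how often each full path preceded a conversion.
path_counts = Counter(
    " -> ".join(path) for user, path in paths.items() if user in conversions
)

for path, count in path_counts.most_common():
    print(f"{count:3d}  {path}")
```

At Big Data scale the counting moves into a cluster, but the question being asked stays the same, which is exactly why the plan has to come first.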

August 5, 2011

Sentiment Analysis: Machines Are Like Us

Filed under: Analytics,Artificial Intelligence,Classifier,Machine Learning — Patrick Durusau @ 7:07 pm

Sentiment Analysis: Machines Are Like Us

Interesting post but in particular for:

We are very aware of the importance of industry-specific language here at Brandwatch and we do our best to offer language analysis that specialises in industries as much as possible.

We constantly refine our language systems by adding newly trained classifiers (a classifier is the particular system used to detect and analyse the language of a query’s matches – which classifier should be used is determined upon query creation).

We have over 500 classifiers for different industries across the 17 languages we cover.

Did you catch that? Over 500 classifiers for different industries.

In other words, we don’t need a single classifier that does all the heavy lifting on entity recognition for building topic maps. We could, for example, train a classifier for use with all the journals in a field or sub-field. In astronomy, we don’t have to disambiguate all the various uses of “Venus” but can concentrate on the one most likely to be found in a subset of astronomy literature.

By using specialized classifiers, perhaps we can reduce the target for more generalized classifiers to a manageable size.
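To make that concrete, here is a toy sketch of per-industry classifiers selected at query creation, assuming scikit-learn; the industries, training snippets, and labels are invented and far too small for real use:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny per-industry training sets (invented; real ones need far more data).
corpora = {
    "autos":  (["the engine is a dream", "the brakes failed twice"],
               ["positive", "negative"]),
    "hotels": (["the room was spotless", "the wifi kept dropping"],
               ["positive", "negative"]),
}

# One trained classifier per industry, as in the Brandwatch description.
classifiers = {
    industry: make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(X, y)
    for industry, (X, y) in corpora.items()
}

def classify(text, industry):
    # The classifier is chosen by industry, i.e. at query creation.
    return classifiers[industry].predict([text])[0]

print(classify("the engine purrs beautifully", "autos"))
```

The same pattern extends to entity recognition: a classifier trained only on astronomy journals never has to consider the pop-song sense of “Venus.”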

July 19, 2011

Overview: Visualization to Connect the Dots

Filed under: Analytics,Java,Scala,Visualization — Patrick Durusau @ 7:54 pm

Overview is Hiring!

I don’t think I have ever re-posted a job ad but this one merits wide distribution:

We need two Java or Scala ninjas to build the core analytics and visualization components of Overview, and lead the open-source development community. You’ll work in the newsroom at AP’s global headquarters in New York, which will give you plenty of exposure to the very real problems of large document sets.

The exact responsibilities will depend on who we hire, but we imagine that one of these positions will be more focused on user experience and process design, while the other will do the computer science heavy lifting — though both must be strong, productive software engineers. Core algorithms must run on a distributed cluster, and scale to millions of documents. Visualization will be through high-performance OpenGL. And it all has to be simple and obvious for a reporter on deadline who has no time to fight technology. You will be expected to implement complex algorithms from academic references, and expand prototype techniques into a production application.

From the about page:

Overview is an open-source tool to help journalists find stories in large amounts of data, by cleaning, visualizing and interactively exploring large document and data sets. Whether from government transparency initiatives, leaks or Freedom of Information requests, journalists are drowning in more documents than they can ever hope to read.

There are good tools for searching within large document sets for names and keywords, but that doesn’t help find stories we’re not looking for. Overview will display relationships among topics, people, places and dates to help journalists to answer the question, “What’s in there?”

We’re building an interactive system where computers do the visualization, while a human guides the exploration. We will also produce documentation and training to help people learn how to use this system. The goal is to make this capability available to anyone who needs it.

Overview is a project of The Associated Press, supported by the John S. and James L. Knight Foundation as part of its Knight News Challenge. The Associated Press invests its resources to advance the news industry, delivering fast, unbiased news from every corner of the world to all media platforms and formats. The Knight News Challenge is an international contest to fund digital news experiments that use technology to inform and engage communities.

Sounds like a project that is worth supporting to me!

Analytics are great, but subject identity would be more useful.

Apply if you have the skill sets, repost the link, and/or volunteer to carry the good news of topic maps to the project.

Building your own Facebook Realtime Analytics System

Filed under: Analytics — Patrick Durusau @ 7:52 pm

Building your own Facebook Realtime Analytics System

Much more interesting than most of the content I see on Facebook.

From the post:

Recently, I was reading Todd Hoff’s write-up on Facebook’s real-time analytics system. As usual, Todd did an excellent job in summarizing this video from Alex Himel, Engineering Manager at Facebook.

In this first post, I’d like to summarize the case study and consider some things that weren’t mentioned in the summaries. This will lead to an architecture for building your own real-time analytics system for Big Data that might be easier to implement, using Facebook’s experience as a starting point and guide, as well as the experience gathered through recent work with a few GigaSpaces customers. The second post provides a summary of that new approach as well as a pattern and a demo for building your own real-time analytics system.

July 18, 2011

IBM Targets the Future of Social Media Analytics

Filed under: Analytics,Hadoop — Patrick Durusau @ 6:42 pm

IBM Targets the Future of Social Media Analytics

This is from back in April 2011, but I thought it was worthy of a note. The post reads in part:

The new product, called Cognos Consumer Insight, is built upon IBM’s Cognos business intelligence technology along with Hadoop to process the piles of unstructured social media data. According to Deepak Advani, IBM’s VP of predictive analytics, there’s a lot of value in performing text analytics on data derived from Twitter, Facebook and other social forums to determine how companies or their products are faring among consumers. Cognos lets customers view sentiment levels over time to determine how efforts are working, he added, and skilled analysts can augment their Cognos Consumer Insight usage with IBM’s SPSS product to bring predictive analytics into the mix.

The partnership with Yale is designed to address the current dearth of analytic skills among business leaders, Advani said. Although the program will involve training on analytics technologies, Advani explained that business people still need some grounding in analytic theory and thinking rather than just knowing how to use a particular piece of software. “I think the primary goal is for students to learn analytically,” he said, which will help know which technology to put to work on what data, and how.

Within many organizations, he added, the main problem is that they’re not using analytics at the point of decision or across all their business processes. Advani says partnerships like those with Yale will help instill the thought process of using mathematical algorithms instead of gut feelings.

I was with them up to the point that it says: “….instill the thought process of using mathematical algorithms instead of gut feelings.”

I don’t take “analytical thinking” to be limited to mathematical algorithms.

Moreover, we have been down this road before, when Jack Kennedy was president and Robert McNamara was Secretary of Defense. Operations analysis, they called it back then. It was supposed to determine, mathematically, how much equipment was needed at any particular location, with no need to ask for local “gut” opinions. True, some bases don’t need snow plows every year, but when planes are trying to land, they are very nice to have.

If you object that this was an abuse of operations theory, I would have to concede you are correct, but abused it was, on a regular basis.

I suspect the program will be a very good one, along with the software. My only caution concerns any analytical technique that gives an answer at variance with years of experience in a trade. That is at least a reason to pause and ask why.

July 14, 2011

…20 Billion Events Per Day

Filed under: Analytics,HBase — Patrick Durusau @ 4:13 pm

Facebook’s New Realtime Analytics System: HBase to Process 20 Billion Events Per Day

The post covers the use of HBase with pointers to additional comments. Some of the additional analysis caught my eye:

Facebook’s Social Plugins are Roman Empire Management 101. You don’t have to conquer everyone to build an empire. You just have to control everyone with the threat that they could be conquered, while making them realize, oh by the way, there’s lots of money to be made being friendly with Rome. This strategy worked for quite a while, as I recall.

You’ve no doubt seen Social Plugins on websites out in the wild. A social plugin lets you see what your friends have liked, commented on or shared on sites across the web. The idea is that putting social plugins on a site makes the content more engaging. Your friends can see what you are liking, and in turn websites can see what everyone is liking. Content that is engaging gives you more clicks, more likes, and more comments. For a business or brand, or even an individual, the more engaging the content is, the more people see it, the more it pops up in news feeds, and the more it drives traffic to a site.

The formerly lone-wolf web, where content hunters stalked web sites silently and singly, has been turned into a charming little village, where everyone knows your name. That’s the power of social.

Turning content hunters into villagers is quite attractive.

I checked out the reference on Like buttons. You can use the Open Graph protocol but:

When your Web page represents a real-world entity, things like movies, sports teams, celebrities, and restaurants, use the Open Graph protocol to specify information about the entity.

Isn’t a web page at the wrong level of granularity?

This page has already talked about social plugins, Facebook, web pages, Like buttons, HBase, the Roman Empire and several other “entities.”

But:

og:url – The canonical, permanent URL of the page representing the entity. When you use Open Graph tags, the Like button posts a link to the og:url instead of the URL in the Like button code.

Oops. I have to either choose one entity or use the same URL for the Roman Empire as I do for Facebook.

That doesn’t sound like a good solution.

Does it to you?
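For the curious, here is a minimal sketch of pulling the og:* properties out of a page with Python’s standard html.parser; the toy page is invented, but it shows that a page carries exactly one og:url, and so stands for exactly one entity:

```python
from html.parser import HTMLParser

class OGParser(HTMLParser):
    """Collect Open Graph <meta property="og:..." content="..."> tags."""
    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if attrs.get("property", "").startswith("og:"):
            self.og[attrs["property"]] = attrs.get("content")

# A toy page head (invented). One og:url means one entity per page,
# no matter how many entities the page actually discusses.
page = """
<head>
  <meta property="og:title" content="Facebook's Realtime Analytics" />
  <meta property="og:type"  content="article" />
  <meta property="og:url"   content="http://example.com/hbase-post" />
</head>
"""

parser = OGParser()
parser.feed(page)
print(parser.og)
```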

June 27, 2011

Gartner Restates The Obvious, Again

Filed under: Analytics,BigData — Patrick Durusau @ 6:31 pm

The press release, Gartner Says Solving ‘Big Data’ Challenge Involves More Than Just Managing Volumes of Data, did not take anyone interested in ‘Big Data’ by surprise.

From the news release:

Worldwide information volume is growing at a minimum rate of 59 percent annually, and while volume is a significant challenge in managing big data, business and IT leaders must focus on information volume, variety and velocity.

Volume: The increase in data volumes within enterprise systems is caused by transaction volumes and other traditional data types, as well as by new types of data. Too much volume is a storage issue, but too much data is also a massive analysis issue.

Variety: IT leaders have always had an issue translating large volumes of transactional information into decisions — now there are more types of information to analyze — mainly coming from social media and mobile (context-aware). Variety includes tabular data (databases), hierarchical data, documents, e-mail, metering data, video, still images, audio, stock ticker data, financial transactions and more.

Velocity: This involves streams of data, structured record creation, and availability for access and delivery. Velocity means both how fast data is being produced and how fast the data must be processed to meet demand.

While big data is a significant issue, Gartner analysts said the real issue is making sense of big data and finding patterns in it that help organizations make better business decisions.

Whether data is ‘big’ or ‘small,’ the real issue has always been making sense of it and using it to make business decisions. Did anyone ever contend otherwise?

As far as ‘big data,’ I think there are two not entirely obvious impacts it may have on analysis:

1) The streetlamp effect: We have all heard of or seen the cartoon with the guy searching for his car keys under a streetlamp. When someone stops to help and asks where he lost them, he points off into the darkness. When asked why he is searching here, the reply is “The light is better over here.”

With “big data,” there can be a tendency, having collected it, to assume the answer must lie in its analysis. Perhaps so, but having gathered “big data” is no guarantee you have the right big data, or that it is data that can answer the question being posed. Start with your question, not the “big data” you happen to have on hand.

2) The ignored-data effect: Similar to the first, data that does not admit of easy processing, whether because it is semantically diverse or simply not readily available, may be ignored. That can lead to a false sense of confidence in the data that is analyzed. This danger is particularly real when preliminary results from available data confirm current management plans or understandings.

Making sense out of data (big, small, or in-between) has always been the first step in its use in a business decision process. Even non-Gartner clients know that much.

June 23, 2011

Personal Analytics

Filed under: Analytics,Conferences,Data,Data Analysis — Patrick Durusau @ 1:49 pm

Personal Analytics

An O’Reilly Online Strata Conference.

Free

July 12, 2011

16:00 – 18:30 UTC

From the website:

It’s only in the past decade that we’ve become aware of how much of our lives is recorded. From phone companies to merchants, social networks to employers, everyone’s building a record of us―except us. That’s changing. Once, recording every aspect of your life might have seemed obsessive. Now, armed with the latest smartphones and comfortable with visualizations and analytics, life-logging is no longer fringe behavior. In this Strata OLC, we’ll look at the rapidly growing field of personal analytics. We’ll discuss tool stacks for recording lives, and hear surprising stories about what happens when introspection meets technology.

O’Reilly Strata Online is a fast-paced, web-based conference series tackling the impact of a data-driven, always-on world. It combines thorough tutorials, provocative panel discussions, real-world case studies, and deep-dives into technology stacks.

This could be fun, not to mention a model for mini-conferences, perhaps for topic maps.

June 22, 2011

Weave – Web-based Analysis and Visualization Environment

Filed under: Analytics,Geographic Data,Visualization — Patrick Durusau @ 6:40 pm

Weave – Web-based Analysis and Visualization Environment

From the webpage:

Weave (BETA 1.0) is a new web-based visualization platform designed to enable visualization of any available data by anyone for any purpose. Weave is an application development platform supporting multiple levels of user proficiency – novice to advanced – as well as the ability to integrate, disseminate and visualize data at “nested” levels of geography.

Weave has been developed at the Institute for Visualization and Perception Research of the University of Massachusetts Lowell in partnership with the Open Indicators Consortium, a nine member national collaborative of public and nonprofit organizations working to improve access to more and higher quality data.

The installation videos are something to point at if you have users doing their own installations of MySQL, Java, Tomcat, or Flash for any reason.

I would quibble with the installation of Tomcat using “root” and “password,” as the username and password for the admin page of Tomcat. Good security is hard enough to teach without really bad examples of security practices in tutorial materials.

The visualization capabilities look quite nice.

Originally saw this in a tweet from Lutz Maicher.

June 20, 2011

MAD Skills: New Analysis Practices for Big Data

Filed under: Analytics,BigData,Data Integration,SQL — Patrick Durusau @ 3:33 pm

MAD Skills: New Analysis Practices for Big Data by Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M. Hellerstein, and Caleb Welton.

Abstract:

As massive data acquisition and storage becomes increasingly affordable, a wide variety of enterprises are employing statisticians to engage in sophisticated data analysis. In this paper we highlight the emerging practice of Magnetic, Agile, Deep (MAD) data analysis as a radical departure from traditional Enterprise Data Warehouses and Business Intelligence. We present our design philosophy, techniques and experience providing MAD analytics for one of the world’s largest advertising networks at Fox Audience Network, using the Greenplum parallel database system. We describe database design methodologies that support the agile working style of analysts in these settings. We present data-parallel algorithms for sophisticated statistical techniques, with a focus on density methods. Finally, we reflect on database system features that enable agile design and flexible algorithm development using both SQL and MapReduce interfaces over a variety of storage mechanisms.

I found this passage very telling:

These desires for speed and breadth of data raise tensions with Data Warehousing orthodoxy. Inmon describes the traditional view:

There is no point in bringing data … into the data warehouse environment without integrating it. If the data arrives at the data warehouse in an unintegrated state, it cannot be used to support a corporate view of data. And a corporate view of data is one of the essences of the architected environment [13]

Unfortunately, the challenge of perfectly integrating a new data source into an “architected” warehouse is often substantial, and can hold up access to data for months – or in many cases, forever. The architectural view introduces friction into analytics, repels data sources from the warehouse, and as a result produces shallow incomplete warehouses. It is the opposite of the MAD ideal.

Marketing question for topic maps: Do you want a shallow, incomplete data warehouse?

Admittedly there is more to it: topic maps enable the integration of data structures as well as the data itself. Both are subjects in the topic maps view. Not to mention capturing the reasons why certain structures or data were mapped to other structures or data. I think the name for that is an audit trail.

Perhaps we should ask: Does your data integration methodology offer an audit trail?
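An audit trail need not be elaborate. Here is a minimal sketch of a mapping layer that records who mapped what to what, and why; all field names are invented for illustration:

```python
from datetime import datetime, timezone

# Each mapping between source and target fields carries its rationale,
# so the "why" survives alongside the "what". All names are invented.
mappings = []

def map_field(source, target, reason, author):
    mappings.append({
        "source": source,
        "target": target,
        "reason": reason,
        "author": author,
        "when": datetime.now(timezone.utc).isoformat(),
    })

map_field("crm.cust_nm", "warehouse.customer_name",
          "Same subject: both hold the legal customer name per the CRM spec.",
          "analyst1")
map_field("ads.imp_ts", "warehouse.impression_time",
          "Epoch milliseconds converted to UTC timestamps; confirmed with ad ops.",
          "analyst1")

# The audit trail: every mapping, with its justification, on demand.
for m in mappings:
    print(f"{m['source']} -> {m['target']}: {m['reason']}")
```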

(See MADLib for the source code growing out of this effort.)

May 2, 2011

Prediction API: Every app a smart app

Filed under: Analytics,Artificial Intelligence — Patrick Durusau @ 10:32 am

Prediction API: Every app a smart app by Travis Green of the Google Prediction API Team.

From the post:

If you’re looking to make your app smarter and you think machine learning is more complicated than making three API calls, then you’re reading the right blog post.

Today, we are releasing v1.2 of the Google Prediction API, which makes it even easier for preview users to build smarter apps by accessing Google’s advanced machine learning algorithms through a RESTful web service.

I haven’t played with this but could be interested in hearing from someone who has.
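For flavor, the three calls were roughly: start training, poll the training status, and request a prediction. A sketch of that flow with the requests library follows; the endpoint paths and payload shape are illustrative reconstructions, not verified against the actual preview API, so treat them as placeholders:

```python
import requests

BASE = "https://www.googleapis.com/prediction/v1.2"  # illustrative only
AUTH = {"Authorization": "Bearer YOUR_OAUTH_TOKEN"}  # placeholder credential
DATA = "mybucket%2Fmydata"  # training data uploaded to Google Storage (placeholder)

# 1. Start training a model on the uploaded data.
requests.post(f"{BASE}/training?data={DATA}", headers=AUTH)

# 2. Poll until training reports completion.
status = requests.get(f"{BASE}/training/{DATA}", headers=AUTH).json()

# 3. Request a prediction for a new input.
result = requests.post(
    f"{BASE}/training/{DATA}/predict",
    headers=AUTH,
    json={"data": {"input": {"text": ["a review to classify"]}}},
)
print(result.json())
```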

April 6, 2011

Exploring Complex, Dynamic Graph Data

Filed under: Analytics,Graphs,Visualization — Patrick Durusau @ 6:22 pm

Chris Diehl has an interesting series:

Exploring Complex, Dynamic Graph Data, part 1

Exploring Complex, Dynamic Graph Data, part 2

Exploring Complex, Dynamic Graph Data, part 3

According to Chris, Exploratory Data Analysis (EDA) requires:

  • Persistence – Provides a non-volatile representation of the data we intend to explore.
  • Query – Supports filtering and transformation operations to condition the data for analysis.
  • Analysis – Enables the synthesis and execution of complex analytics on the data.
  • Visualization – Facilitates rapid composition of a range of visualizations to interpret results.

Check it out.
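As a toy illustration of those four requirements in miniature, using only sqlite3 and matplotlib (the edge data is invented):

```python
import sqlite3
import matplotlib.pyplot as plt

# Persistence: a non-volatile store for the data under exploration.
db = sqlite3.connect("edges.db")
db.execute("CREATE TABLE IF NOT EXISTS edges (src TEXT, dst TEXT, day INTEGER)")
db.executemany("INSERT INTO edges VALUES (?, ?, ?)",
               [("a", "b", 1), ("a", "c", 1), ("b", "c", 2), ("a", "b", 2),
                ("c", "d", 3), ("a", "d", 3), ("b", "d", 3)])

# Query: filter and condition the data for analysis.
rows = db.execute("SELECT day, COUNT(*) FROM edges GROUP BY day ORDER BY day")
days, counts = zip(*rows)

# Analysis: a (trivial) analytic, edge volume change per day.
growth = [b - a for a, b in zip(counts, counts[1:])]
print("day-over-day change in edge count:", growth)

# Visualization: a quick plot to interpret the result.
plt.plot(days, counts, marker="o")
plt.xlabel("day")
plt.ylabel("edges observed")
plt.title("Dynamic graph activity over time")
plt.show()
```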

March 31, 2011

Unified analysis of streaming news

Filed under: Aggregation,Analytics,Clustering — Patrick Durusau @ 3:38 pm

Unified analysis of streaming news by Amr Ahmed, Qirong Ho, Jacob Eisenstein, and Eric Xing of Carnegie Mellon University, Pittsburgh, USA, and Alexander J. Smola and Choon Hui Teo of Yahoo! Research, Santa Clara, CA, USA.

Abstract:

News clustering, categorization and analysis are key components of any news portal. They require algorithms capable of dealing with dynamic data to cluster, interpret and temporally aggregate news articles. These three tasks are often solved separately. In this paper we present a unified framework to group incoming news articles into temporary but tightly-focused storylines, to identify prevalent topics and key entities within these stories, and to reveal the temporal structure of stories as they evolve. We achieve this by building a hybrid clustering and topic model. To deal with the available wealth of data we build an efficient parallel inference algorithm by sequential Monte Carlo estimation. Time and memory costs are nearly constant in the length of the history, and the approach scales to hundreds of thousands of documents. We demonstrate the efficiency and accuracy on the publicly available TDT dataset and data of a major internet news site.

From the article:

Such an approach combines the strengths of clustering and topic models. We use topics to describe the content of each cluster, and then we draw articles from the associated story. This is a more natural fit for the actual process of how news is created: after an event occurs (the story), several journalists write articles addressing various aspects of the story. While their vocabulary and their view of the story may differ, they will by necessity agree on the key issues related to a story (at least in terms of their vocabulary). Hence, to analyze a stream of incoming news we need to infer a) which (possibly new) cluster could have generated the article and b) which topic mix describes the cluster best.

I single out that part of the paper to remark that the authors first say the vocabulary for a story may vary, and then in the next breath say the vocabulary will agree on the key issues.

Given the success of their results, it may be that news reporting is more homogeneous in its vocabulary than other forms of writing?

Perhaps news compression, where duplicated content is suppressed but the “fact” of reportage is retained, could make an interesting topic map.
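For readers who want to experiment, here is a toy version of the “assign each incoming article to an existing story or start a new one” step, using tf-idf cosine similarity against story centroids. This is a crude stand-in for the paper’s hybrid clustering/topic model, not the authors’ algorithm, and the articles and threshold are invented:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy article stream (invented).
articles = [
    "quake strikes coastal city, thousands evacuated",
    "rescue teams search rubble after coastal quake",
    "central bank raises interest rates amid inflation",
    "markets slide as the interest rate hike surprises traders",
]

# Fit a shared vocabulary up front (a real streaming system could not),
# then feed articles through one at a time.
vec = TfidfVectorizer().fit(articles)
stories = []       # each story is a list of article vectors
THRESHOLD = 0.2    # invented; tune on held-out data

for text in articles:
    v = vec.transform([text]).toarray()[0]
    best, best_sim = None, THRESHOLD
    for i, members in enumerate(stories):
        centroid = np.mean(members, axis=0)
        denom = np.linalg.norm(v) * np.linalg.norm(centroid)
        sim = float(v @ centroid) / denom if denom else 0.0
        if sim > best_sim:
            best, best_sim = i, sim
    if best is None:
        stories.append([v])       # no story is close enough: start a new one
    else:
        stories[best].append(v)   # join the closest existing story

print(len(stories), "storylines found")
```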

March 18, 2011

MADLib

Filed under: Analytics,Machine Learning — Patrick Durusau @ 6:50 pm

MADLib

From the website:

MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine learning methods for structured and unstructured data.

The MADlib mission: to foster widespread development of scalable analytic skills, by harnessing efforts from commercial practice, academic research, and open-source development.

Targeted at PostgreSQL and Greenplum.
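MADlib methods are invoked from SQL inside the database, so the data never ships out to a client for analysis. A sketch of what that looks like from Python via psycopg2, assuming the linear regression interface of recent MADlib releases (function names have varied across versions; the table and column names are invented):

```python
import psycopg2

# Connection parameters are placeholders.
conn = psycopg2.connect("dbname=analytics user=analyst")
cur = conn.cursor()

# Train a linear regression entirely in-database: MADlib writes the
# fitted coefficients to an output table rather than exporting the data.
cur.execute("""
    SELECT madlib.linregr_train(
        'ad_impressions',          -- source table (invented)
        'ad_sales_model',          -- output table for the model
        'sales',                   -- dependent variable
        'ARRAY[1, spend, clicks]'  -- independent variables (1 = intercept)
    );
""")
cur.execute("SELECT coef FROM ad_sales_model;")
print(cur.fetchone())
conn.commit()
```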

March 12, 2011

Social Analytics on MongoDB

Filed under: Analytics,MongoDB,NoSQL — Patrick Durusau @ 6:48 pm

Social Analytics on MongoDB: Patrick Stokes of Buddy Media gives a highly entertaining presentation on MongoDB and its adoption by Buddy Media.

Unfortunately the slides don’t display during the presentation.

Still, it is refreshing in its honesty about the development process.

PS: I have written to ask about where to find the slides.

Update

You can find the slides at: http://www.slideshare.net/pstokes2/social-analytics-with-mongodb

January 28, 2011

Unified Intelligence: Completing the Mosaic of Analytics

Filed under: Analytics,Data Analysis — Patrick Durusau @ 10:15 am

Unified Intelligence: Completing the Mosaic of Analytics

Tuesday, Feb. 15 @ 4 ET

From the announcement:

Seeing the big picture requires a convergence of both structured and unstructured data. While each side of that puzzle presents challenges, the unstructured world poses a wider range of issues that must be resolved before meaningful analysis can be done. However, many organizations are discovering that new technologies can be employed to process and transform this unwieldy data, such that it can be united with the traditional realm of business intelligence to bring new meaning and context to analytics.

Register for this episode of The Briefing Room to learn from veteran Analyst James Taylor about how companies can incorporate unstructured data into their decision systems and processes. Taylor will be briefed by Sid Probstein of Attivio, who will tout his company’s patented technology, the Active Intelligence Engine, which uses inverted indexing and a mathematical graph engine to extract, process and align unstructured data. A host of Attivio connectors allow integration with most analytical and many operational systems, including the capability for hierarchical XML data.

I am not quite sure what a non-mathematical graph engine would look like, but this could be fun.

It is also an opportunity to learn something about how others view the world.
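Inverted indexing, at least, is easy to picture. A minimal sketch with toy documents (invented):

```python
from collections import defaultdict

docs = {
    1: "unstructured data meets business intelligence",
    2: "structured data in the warehouse",
    3: "business intelligence for unstructured text",
}

# Build the inverted index: term -> set of documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(*terms):
    """Documents containing every query term (set intersection)."""
    results = [index.get(t, set()) for t in terms]
    return set.intersection(*results) if results else set()

print(search("unstructured", "intelligence"))  # -> {1, 3}
```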
