Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

October 1, 2013

Data Skepticism in Action

Filed under: Data — Patrick Durusau @ 8:01 pm

I want to call your attention to a headline I saw today:

Research: Big data pays off by Teena Hammond. Summary: “Tech Pro Research’s latest survey shows that 82 percent of those who have implemented big data have seen discernible benefits.”

Some people will read only the summary.

That’s a bad idea, and here’s why:

First, the survey reached only 144 respondents worldwide.

Hmmm, current world population is approximately 7,182,895,100 (it will be higher by the time you check the link).

Not all of them are IT people, but does 144 sound like a representative sample of IT people to you?

Let’s see (all data from 2010):

Database Administrators: 110,800

IT Managers: 310,000

Programmers: 363,100

Systems Analysts: 544,000

Software Developers: 913,000

That’s what? More than 2.2 million IT people just in the United States?

And the survey reached 144 worldwide?

But if you read the pie chart carefully, only 8% of the 144 have implemented Big Data.

I am assuming you have to implement Big Data to claim to see any benefits from Big Data.

Hmmm, 8% of 144 is 11.52, so let’s round that up to 12.

Twelve people reached by the survey have implemented Big Data.

Of those twelve, 82% “report seeing at least some payoff in terms of goals achieved.”

So, 82% of 12 = 9.84, which rounds to 10.
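The whole chain of reasoning fits in three lines of Python:

```python
respondents = 144
implemented = round(0.08 * respondents)   # 11.52 rounds to 12
saw_payoff = round(0.82 * implemented)    # 9.84 rounds to 10
```

Ten people, out of a claimed worldwide survey.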

If the headline had read: Tech Pro Research’s latest survey shows that 10 people worldwide who have implemented big data have seen discernible benefits, would your reaction have been the same?

Yes? No difference? Don’t care?

If you are a Tech Pro Research member, you can get a free copy of the report that uses ten people to draw conclusions about your world.

A Tech Pro Research membership is $299/year.

If you are paying $299/year for ten-person survey results, follow my Donations link and support this blog.

Suggestions on other posts or reports that need a data skeptical review?

Cloudera now supports Accumulo…

Filed under: Accumulo,Cloudera,NSA — Patrick Durusau @ 6:43 pm

Cloudera now supports Accumulo, the NSA’s take on HBase by Derrick Harris.

From the post:

Cloudera will be integrating with the Apache Accumulo database and, according to a press release, “devoting significant internal engineering resources to speed Accumulo’s development.” The National Security Agency created Accumulo and built in fine-grained authentication to ensure only authorized individuals could see any given piece of data. Cloudera’s support could be bittersweet for Sqrrl, an Accumulo startup comprised of former NSA engineers and intelligence experts, which should benefit from a bigger ecosystem but whose sales might suffer if Accumulo makes its way into Cloudera’s Hadoop distribution.
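For readers new to Accumulo: that “fine-grained authentication” works by attaching a visibility expression to each cell, which is checked against a user’s authorizations at scan time. A toy evaluator for the simple ‘&’/‘|’ syntax (real Accumulo also supports parentheses and quoting, omitted here for brevity) might look like:

```python
def visible(expression, authorizations):
    # '|' separates alternatives; '&' joins labels that must all be held.
    # A cell is visible if any alternative is fully satisfied.
    return any(
        all(label in authorizations for label in term.split("&"))
        for term in expression.split("|")
    )
```

So a cell marked `"admin&audit"` stays hidden from a user holding only `{"admin"}`.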

I would think the bittersweet part would be the NSA supporting a design that leaves it with document-level security.

It’s great that they can control access to how many saucers are stolen from White House dinners every year, but document security, other than at the grossest level, goes wanting.

Maybe they haven’t heard of SGML or XML?

If you don’t mind, mention XML in your phone calls every now and again. Maybe if enough people say it, then it will come up on the “big board.”

Apache Aurora

Filed under: Distributed Computing,Distributed Systems,Mesos — Patrick Durusau @ 6:26 pm

Apache Aurora

Apache Aurora entered incubation today!

From the webpage:

Aurora is a service scheduler used to schedule jobs onto Apache Mesos.

Oh, Apache Mesos?

From the webpage:

Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, MPI, Hypertable, Spark, and other applications on a dynamically shared pool of nodes.
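Aurora jobs are described in a Python-based configuration DSL. A hypothetical hello-world job, sketched loosely after the Aurora tutorial (names, resource figures, and structure here are illustrative, not a tested configuration):

```python
# hello_world.aurora -- hypothetical Aurora job definition
hello = Process(
    name = 'hello',
    cmdline = 'echo "hello world"')

hello_task = Task(
    processes = [hello],
    resources = Resources(cpu = 0.1, ram = 16*MB, disk = 16*MB))

jobs = [Service(
    cluster = 'devcluster',
    role = 'www-data',
    environment = 'prod',
    name = 'hello',
    task = hello_task)]
```

Aurora hands the resulting task to Mesos, which finds a node with the requested CPU, RAM, and disk.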

All the wiring is still pretty close to the surface but that’s not going to last long.

Better to learn it now while people still think it is hard. 😉

Get Started with Hadoop

Filed under: BigData,Hadoop,Hortonworks — Patrick Durusau @ 6:16 pm

Get Started with Hadoop

If you want to avoid being a Gartner statistic or hear big data jokes involving the name of your enterprise, this is a page to visit.

Hortonworks, one of the leading contributors to the Hadoop ecosystem, has assembled resources targeted at developers, analysts and systems administrators.

There are videos, tutorials and even a Hadoop sandbox.

All of which are free.
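The classic first exercise in tutorials like these is word count. The map/reduce logic behind it can be sketched in plain Python (Hadoop Streaming would run equivalent scripts over stdin/stdout; this standalone version is just to show the shape of the computation):

```python
from itertools import groupby
from operator import itemgetter

def map_words(lines):
    # Mapper: emit a (word, 1) pair for every word in the input
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reduce_counts(pairs):
    # Hadoop sorts mapper output by key between the phases;
    # the reducer then sums the counts for each word
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)
```

Feeding `["Big data", "big deal"]` through both phases yields counts of 2 for “big” and 1 each for “data” and “deal”.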

The choice is yours: Spend enterprise funds and hope to avoid failure or spend some time and plan for success.

Recursive Deep Models for Semantic Compositionality…

Filed under: Machine Learning,Modeling,Semantic Vectors,Semantics,Sentiment Analysis — Patrick Durusau @ 4:12 pm

Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank by Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng and Christopher Potts.

Abstract:

Semantic word spaces have been very useful but cannot express the meaning of longer phrases in a principled way. Further progress towards understanding compositionality in tasks such as sentiment detection requires richer supervised training and evaluation resources and more powerful models of composition. To remedy this, we introduce a Sentiment Treebank. It includes fine grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences and presents new challenges for sentiment compositionality. To address them, we introduce the Recursive Neural Tensor Network. When trained on the new treebank, this model outperforms all previous methods on several metrics. It pushes the state of the art in single sentence positive/negative classification from 80% up to 85.4%. The accuracy of predicting fine-grained sentiment labels for all phrases reaches 80.7%, an improvement of 9.7% over bag of features baselines. Lastly, it is the only model that can accurately capture the effect of contrastive conjunctions as well as negation and its scope at various tree levels for both positive and negative phrases.

You will no doubt want to see the webpage with the demo.

Along with possibly the data set and the code.

I was surprised by “fine-grained sentiment labels” meaning:

  1. Positive
  2. Somewhat positive
  3. Neutral
  4. Somewhat negative
  5. Negative

But then, for many purposes, subject recognition at that level of granularity may be sufficient.
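The treebank scores each phrase on a continuous [0, 1] scale, and the five labels come from binning those scores. A minimal sketch of the mapping (the equal-width 0.2 cutoffs here are an assumption for illustration):

```python
def sentiment_label(score):
    # Map a phrase sentiment score in [0, 1] to one of five labels
    labels = ["negative", "somewhat negative", "neutral",
              "somewhat positive", "positive"]
    return labels[min(int(score / 0.2), 4)]
```

Under those cutoffs, 0.05 is “negative”, 0.5 is “neutral”, and 0.95 is “positive”.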

Elasticsearch internals: an overview

Filed under: ElasticSearch,Lucene,Search Engines — Patrick Durusau @ 2:50 pm

Elasticsearch internals: an overview by Njal Karevoll.

From the post:

This article gives an overview of the Elasticsearch internals. I will present a 10,000 foot view of the different modules that Elasticsearch is composed of and how we can extend or replace built-in functionality using plugins.

Using Freemind, Njal has created maps of the namespaces and modules of ElasticSearch for your exploration.

The full module view reminds me of SGML productions, except less complicated.

Unicode Standard, Version 6.3

Filed under: Unicode — Patrick Durusau @ 2:32 pm

Unicode Standard, Version 6.3

From the post:

The Unicode Consortium announces Version 6.3 of the Unicode Standard and with it, significantly improved bidirectional behavior. The updated Version 6.3 Unicode Bidirectional Algorithm now ensures that pairs of parentheses and brackets have consistent layout and provides a mechanism for isolating runs of text.

Based on contributions from major browser developers, the updated Bidirectional Algorithm and five new bidi format characters will improve the display of text for hundreds of millions of users of Arabic, Hebrew, Persian, Urdu, and many others. The display and positioning of parentheses will better match the normal behavior that users expect. By using the new methods for isolating runs of text, software will be able to construct messages from different sources without jumbling the order of characters. The new bidi format characters correspond to features in markup (such as in CSS). Overall, these improvements also bring greater interoperability and an improved ability for inserting text and assembling user interface elements.
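The isolation mechanism is simple to use: wrap any run of uncertain direction in the new FSI (U+2068) and PDI (U+2069) format characters, and the bidi algorithm treats it as a single neutral unit. A sketch in Python (the sample message is illustrative):

```python
# Two of the five new bidi format characters in Unicode 6.3
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

def isolate(text):
    # Wrap text of unknown direction so it cannot reorder the
    # characters around it when the message is displayed
    return FSI + text + PDI

# e.g. building a message around an untrusted, possibly right-to-left name
message = "User " + isolate("دمشق") + " logged in"
```

Without the isolate, a right-to-left name can drag neighboring punctuation and words into its run; with it, the surrounding text keeps its order.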

The improvements come with new rigor: the Consortium now offers two reference implementations and greatly improved testing and test data.

In a major enhancement for CJK usage, this new version adds standardized variation sequences for all 1,002 CJK compatibility ideographs. These sequences address a well-known issue of the CJK compatibility ideographs — that they could change their appearance when any process normalized the text. Using the new standardized variation sequences allows authors to write text which will preserve the specific required shapes of these CJK ideographs, even under Unicode normalization.
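The normalization problem is easy to demonstrate in Python: a CJK compatibility ideograph such as U+F900 has a canonical decomposition to the unified ideograph U+8C48, so any normalization rewrites it, while a variation sequence (base ideograph plus a variation selector) has no decomposition and survives intact. The pairing of U+8C48 with VS1 below is for illustration:

```python
import unicodedata

# CJK COMPATIBILITY IDEOGRAPH-F900 canonically decomposes to U+8C48,
# so NFC (or any normalization form) replaces it
compat = "\uF900"
assert unicodedata.normalize("NFC", compat) == "\u8C48"

# A variation sequence -- base ideograph + variation selector -- has no
# decomposition, so normalization leaves it untouched
variation_seq = "\u8C48\uFE00"
assert unicodedata.normalize("NFC", variation_seq) == variation_seq
```

That stability is exactly what lets authors preserve a required glyph shape through round-trips that normalize text.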

Version 6.3 includes other improvements as well:

  • Improved Unihan data to better align with ISO/IEC 10646
  • Better support for Hebrew word break behavior and for ideographic space in line breaking

Get started with Unicode 6.3 today! http://www.unicode.org/versions/Unicode6.3.0/.

Now, there’s an interesting data set!

Much of the convenience you now experience with digital texts is due to the under-appreciated efforts of the Unicode project.
