Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

November 5, 2011

The cool aspects of Odiago WibiData

Filed under: Hadoop,HBase,Wibidata — Patrick Durusau @ 6:42 pm

The cool aspects of Odiago WibiData

From the post:

Christophe Bisciglia and Aaron Kimball have a new company.

  • It’s called Odiago, and is one of my gratifyingly more numerous tiny clients.
  • Odiago’s product line is called WibiData, after the justly popular We Be Sushi restaurants.
  • We’ve agreed on a split exclusive de-stealthing launch. You can read about the company/founder/investor stuff on TechCrunch. But this is the place for — well, for the tech crunch.

WibiData is designed for management of, investigative analytics on, and operational analytics on consumer internet data, the main examples of which are web site traffic and personalization and their analogues for games and/or mobile devices. The core WibiData technology, built on HBase and Hadoop,* is a data management and analytic execution layer.

Still in private beta (you can sign up for notice) but the post covers the infrastructure with enough detail to be enticing.

Just as a tease (on my part):

where you’d have a single value in a relational table, you might have the equivalent of a whole relational table (or at least selection/view) in WibiData-enhanced HBase. For example, if a user visits the same web page ten times, and each time 50 attributes are recorded (including a timestamp), all 500 data – to use the word “data” in its original “plural of datum” sense – would likely be stored in the same WibiData cell.

You need to go read the post to put that in context.
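The quoted example (ten visits, fifty attributes each, all landing in one cell) maps onto HBase's (row, column, timestamp) → value model, where a cell keeps every timestamped version rather than a single value. A toy Python sketch of that layout — names are illustrative, not the WibiData API:

```python
from collections import defaultdict

class ToyWideTable:
    """Toy model of HBase-style storage: each (row, column) cell keeps
    every timestamped version rather than a single value."""
    def __init__(self):
        # cells[(row, column)] -> list of (timestamp, value), newest first
        self.cells = defaultdict(list)

    def put(self, row, column, timestamp, value):
        self.cells[(row, column)].append((timestamp, value))
        self.cells[(row, column)].sort(reverse=True)  # newest version first

    def get_versions(self, row, column):
        return self.cells[(row, column)]

# Ten visits by one user, each recording a 'referrer' attribute:
table = ToyWideTable()
for ts in range(10):
    table.put("user123", "visits:referrer", ts, f"page-{ts}")

# All ten versions live in the same logical cell:
print(len(table.get_versions("user123", "visits:referrer")))  # 10
```

A relational schema would spread those ten visits over ten rows; here they are co-located under one key, which is the "whole relational table in a cell" point the quote is making.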

I keep thinking all the “good” names are gone and then something like WibiData shows up. 😉

I suspect there are going to be a number of lessons to learn from this combination of HBase and Hadoop.

October 23, 2011

HBase Coprocessors – Webinar – 4 November 2011

Filed under: HBase — Patrick Durusau @ 7:20 pm

HBase Coprocessors – Deploy shared functionality directly on the cluster 4 November 2011, 10 AM PT by Lars George.

From the announcement:

The newly added feature of Coprocessors within HBase allows the application designer to move functionality closer to where the data resides. While this sounds like Stored Procedures as known in the RDBMS realm, they have a different set of properties. The distributed nature of HBase adds to the complexity of their implementation, but the client side API allows for an easy, transparent access to their functionality across many servers. This session explains the concepts behind coprocessors and uses examples to show how they can be used to implement data side extensions to the application code.
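The "observer" half of the coprocessor idea — hooks that fire around region operations so extensions run where the data lives — can be sketched in a few lines of Python. This is a toy stand-in, not the real HBase coprocessor API (method names here are illustrative):

```python
class ToyRegion:
    """Toy stand-in for an HBase region: a key/value store with observer
    hooks fired around each operation, mimicking the coprocessor
    'observer' pattern. Names are illustrative, not the HBase API."""
    def __init__(self, observers=None):
        self.store = {}
        self.observers = observers or []

    def put(self, key, value):
        for obs in self.observers:
            value = obs.pre_put(key, value)  # hook may rewrite the value
        self.store[key] = value
        for obs in self.observers:
            obs.post_put(key, value)         # hook sees the committed write

class AuditObserver:
    """Server-side extension: records every write with no client changes."""
    def __init__(self):
        self.log = []
    def pre_put(self, key, value):
        return value
    def post_put(self, key, value):
        self.log.append((key, value))

audit = AuditObserver()
region = ToyRegion(observers=[audit])
region.put("row1", "v1")
print(audit.log)  # [('row1', 'v1')]
```

The transparency point in the announcement is that the client keeps calling plain `put`; the extra behavior is deployed on the server side.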

For background material, you probably want to review:

Advanced HBase by Lars George (courtesy of Alex Popescu’s myNoSQL site). It takes until slide 72 or so to reach coprocessors, but you will learn a lot along the way.

Extending Query support via Coprocessor endpoints, which summarizes the uses of coprocessors as:

Coprocessors can be used for

a) observing server side operations (like the administrative kinds such as Region splits, major-minor compactions , etc) , and

b) client side operations that are eventually triggered on to the Region servers (like CRUD operations).

Another use case is letting the end user deploy his own code (some user-defined functionality) and invoke it directly from the client interface (HTable). The latter functionality is called Coprocessor Endpoints. [I introduced some paragraphing to make this more readable.]

If you have a copy of HBase: The Definitive Guide, review pages 175-199.

October 22, 2011

Cloudera Training Videos

Filed under: Hadoop,HBase,Hive,MapReduce,Pig — Patrick Durusau @ 3:17 pm

Cloudera Training Videos

Cloudera has added several training videos on Hadoop and parts of the Hadoop ecosystem.

You will find:

  • Introduction to HBase – Todd Lipcon
  • Thinking at Scale
  • Introduction to Apache Pig
  • Introduction to Apache MapReduce and HDFS
  • Introduction to Apache Hive
  • Apache Hadoop Ecosystem
  • Hadoop Training Virtual Machine
  • Hadoop Training: Programming with Hadoop
  • Hadoop Training: MapReduce Algorithms

No direct links to the videos because new resources/videos will appear more quickly at the Cloudera site than I will be updating this list.

Now you have something to watch this weekend (Oct. 22-23, 2011) other than reports on and of the World Series! Enjoy!

September 7, 2011

Accumulo Proposal

Filed under: Accumulo,HBase,NoSQL — Patrick Durusau @ 6:49 pm

Accumulo Proposal

From the Apache incubator:

Abstract

Accumulo is a distributed key/value store that provides expressive, cell-level access labels.

Proposal

Accumulo is a sorted, distributed key/value store based on Google’s BigTable design. It is built on top of Apache Hadoop, Zookeeper, and Thrift. It features a few novel improvements on the BigTable design in the form of cell-level access labels and a server-side programming mechanism that can modify key/value pairs at various points in the data management process.

Background

Google published the design of BigTable in 2006. Several other open source projects have implemented aspects of this design including HBase, CloudStore, and Cassandra. Accumulo began its development in 2008.

Rationale

There is a need for a flexible, high performance distributed key/value store that provides expressive, fine-grained access labels. The communities we expect to be most interested in such a project are government, health care, and other industries where privacy is a concern. We have made much progress in developing this project over the past 3 years and believe both the project and the interested communities would benefit from this work being openly available and having open development.

Further explanation of access labels and iterators:

Access Labels

Accumulo has an additional portion of its key that sorts after the column qualifier and before the timestamp. It is called column visibility and enables expressive cell-level access control. Authorizations are passed with each query to control what data is returned to the user. The column visibilities are boolean AND and OR combinations of arbitrary strings (such as “(A&B)|C”) and authorizations are sets of strings (such as {C,D}).
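The visibility/authorization check described above is easy to make concrete: a cell labeled "(A&B)|C" is returned only if the user's authorization set satisfies the expression. A toy Python evaluator (the real Accumulo ColumnVisibility grammar has more to it, e.g. quoting and precedence rules, so treat this as a sketch):

```python
import re

def visible(expression, authorizations):
    """Toy evaluator for Accumulo-style column visibilities.
    Labels are alphanumeric strings combined with & (AND), | (OR)
    and parentheses, e.g. "(A&B)|C"; authorizations are the set of
    labels the querying user holds."""
    # Replace each label with True/False based on the user's auths...
    expr = re.sub(r"[A-Za-z0-9_]+",
                  lambda m: str(m.group(0) in authorizations),
                  expression)
    # ...then evaluate the remaining boolean structure.
    expr = expr.replace("&", " and ").replace("|", " or ")
    return eval(expr)  # fine for a toy; a real evaluator would parse properly

print(visible("(A&B)|C", {"C", "D"}))  # True: the user holds C
print(visible("(A&B)|C", {"A"}))       # False: needs both A and B, or C
```

Every query carries such an authorization set, and the tablet server applies this test cell by cell before anything is returned.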

Iterators

Accumulo has a novel server-side programming mechanism that can modify the data written to disk or returned to the user. This mechanism can be configured for any of the scopes where data is read from or written to disk. It can be used to perform joins on data within a single tablet.
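Accumulo's iterators are essentially composable transforms over the sorted key/value stream, applied server-side at scan or compaction time. A hedged sketch of the composition idea using Python generators (not Accumulo's actual SortedKeyValueIterator API):

```python
def filter_iterator(source, predicate):
    """Drops entries that fail the predicate, before they reach
    disk or the client (the compaction/scan-scope idea)."""
    for key, value in source:
        if predicate(key, value):
            yield key, value

def transform_iterator(source, fn):
    """Rewrites values flowing through the stack."""
    for key, value in source:
        yield key, fn(value)

# A sorted stream of key/value pairs, as a tablet server would see it:
data = [("a", 1), ("b", 2), ("c", 3)]

# Iterators stack: filter first, then transform the survivors.
stack = transform_iterator(
    filter_iterator(iter(data), lambda k, v: v > 1),
    lambda v: v * 10,
)
print(list(stack))  # [('b', 20), ('c', 30)]
```

Because the same stack can be configured at the write path, this is how "modify the data written to disk" and "modify the data returned to the user" are the same mechanism.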

The use case for modifying data written to disk is unclear to me but I suppose the data “returned to the user” involves modification of data for security reasons.

Sponsored in part by the NSA, National Security Agency of the United States.

The access label line of thinking has implications for topic map merging. What if a similar mechanism were fashioned to permit or prevent “merging” based on the access of the user? (Where merging isn’t a file based activity.)

August 28, 2011

The Future of Hadoop

Filed under: Hadoop,HBase — Patrick Durusau @ 7:55 pm

The Future of Hadoop – with Doug Cutting and Jeff Hammerbacher

From the description:

With a community of over 500 contributors, Apache Hadoop and related projects are evolving at an ever increasing rate. Join the co-creator of Apache Hadoop, Doug Cutting, and Cloudera’s Chief Scientist, Jeff Hammerbacher, for a discussion of the most exciting new features being developed by the Apache Hadoop community.

The primary focus of the webinar will be the evolution from the Apache Hadoop kernel to the complete Apache Bigtop platform. We’ll cover important changes in the kernel, especially high availability for HDFS and the separation of cluster resource management and MapReduce job scheduling.

We’ll discuss changes throughout the platform, including support for Snappy-based compression and the Avro data file format in all components, performance and security improvements across all components, and additional supported operating systems. Finally, we’ll discuss new additions to the platform, including Mahout for machine learning and HCatalog for metadata management, as well as important improvements to existing platform components like HBase and Hive.

Both the slides and the recording of this webinar are available but I would go for the recording.

One of the most informative and entertaining webinars I have seen, ever. Cites actual issue numbers and lays out how Hadoop is on the road to becoming a stack of applications that offer a range of data handling and analysis capabilities.

If you are interested in data processing/analysis at any scale, you need to see this webinar.

August 11, 2011

Apache Hadoop and HBase

Filed under: Hadoop,HBase — Patrick Durusau @ 6:31 pm

Apache Hadoop and HBase by Todd Lipcon.

Another introductory slide deck. I don’t know which one will click for any given reader, so I am including it here.

July 29, 2011

State of HBase

Filed under: HBase,NoSQL — Patrick Durusau @ 7:43 pm

State of HBase by Michael Stack (StumbleUpon).

From the abstract:

Attendees will learn about the current state of the HBase project. We’ll review what the community is contributing, some of the more interesting production installs, killer apps on HBase, the on-again, off-again HBase+HDFS love affair, and what the near-future promises. A familiarity with BigTable concepts and Hadoop is presumed.

Catch the latest news on HBase!

July 21, 2011

HBase at YFrog

Filed under: HBase — Patrick Durusau @ 6:26 pm

HBase at YFrog

Alex Popescu’s summary of slides on the use of HBase at YFrog.

Impressive numbers!

July 14, 2011

…20 Billion Events Per Day

Filed under: Analytics,HBase — Patrick Durusau @ 4:13 pm

Facebook’s New Realtime Analytics System: HBase to Process 20 Billion Events Per Day

The post covers the use of HBase with pointers to additional comments. Some of the additional analysis caught my eye:

Facebook’s Social Plugins are Roman Empire Management 101. You don’t have to conquer everyone to build an empire. You just have control everyone with the threat they could be conquered while making them realize, oh by the way, there’s lots of money to be made being friendly with Rome. This strategy worked for quite a while as I recall.

You’ve no doubt seen Social Plugins on websites out the wild. A social plugin lets you see what your friends have liked, commented on or shared on sites across the web. The idea is putting social plugins on a site makes the content more engaging. Your friends can see what you are liking and in turn websites can see what everyone is liking. Content that is engaging gives you more clicks, more likes, and more comments. For a business or brand, or even an individual, the more engaging the content is, the more people see it, the more it pops up in news feeds, the more it drives traffic to a site.

The formerly lone-wolf web, where content hunters stalked web sites silently and singly, has been turned into a charming little village, where everyone knows your name. That’s the power of social.

Turning content hunters into villagers is quite attractive.

I checked out the reference on Like buttons. You can use the Open Graph protocol but:

When your Web page represents a real-world entity, things like movies, sports teams, celebrities, and restaurants, use the Open Graph protocol to specify information about the entity.

Isn’t a web page at the wrong level of granularity?

This page has already talked about social plugins, Facebook, web pages, Like buttons, HBase, the Roman Empire and several other “entities.”

But:

og:url – The canonical, permanent URL of the page representing the entity. When you use Open Graph tags, the Like button posts a link to the og:url instead of the URL in the Like button code.

Oops. I have to either choose one entity or use the same URL for the Roman Empire as I do for Facebook.

That doesn’t sound like a good solution.

Does it to you?

June 3, 2011

IBM InfoSphere BigInsights

Filed under: Avro,BigInsights,Hadoop,HBase,Lucene,Pig,Zookeeper — Patrick Durusau @ 2:32 pm

IBM InfoSphere BigInsights

Two items stand out in the usual laundry list of “easy administration” and “IBM supports open source” list of claims:

The Jaql query language. Jaql, a Query Language for JavaScript Object Notation (JSON), provides the capability to process both structured and non-traditional data. Its SQL-like interface is well suited for quick ramp-up by developers familiar with the SQL language and makes it easier to integrate with relational databases.

….

Integrated installation. BigInsights includes IBM value-added technologies, as well as open source components, such as Hadoop, Lucene, Hive, Pig, Zookeeper, Hbase, and Avro, to name a few.

I guess it must include a “few” things, since the 64-bit Linux download is 398 MB.

Just pointing out its availability. More commentary to follow.

May 21, 2011

HBase 0.90.3

Filed under: HBase — Patrick Durusau @ 5:15 pm

HBase 0.90.3

A bug-fix release of HBase.

May 12, 2011

Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase comparison

Filed under: Cassandra,CouchDB,HBase,MongoDB,NoSQL,Redis,Riak — Patrick Durusau @ 7:56 am

Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase comparison

Good thumb-nail comparison of the major features of all six (6) NoSQL databases by Kristóf Kovács.

Sorry to see that Neo4J didn’t make the comparison.

April 21, 2011

HBase Do’s and Don’ts

Filed under: HBase,NoSQL — Patrick Durusau @ 12:36 pm

HBase Do’s and Don’ts

From the post:

We at Cloudera are big fans of HBase. We love the technology, we love the community and we’ve found that it’s a great fit for many applications. Successful uses of HBase have been well documented and as a result, many organizations are considering whether HBase is a good fit for some of their applications. The impetus for my talk and this follow up blog post is to clarify some of the good applications for HBase, warn against some poor applications and highlight important steps to a successful HBase deployment.

Helpful review of HBase.

March 4, 2011

ApacheCon NA 2011

Filed under: Cassandra,Cloud Computing,Conferences,CouchDB,HBase,Lucene,Mahout,Solr — Patrick Durusau @ 7:17 am

ApacheCon NA 2011

Proposals: Be sure to submit your proposal no later than Friday, 29 April 2011 at midnight Pacific Time.

7-11 November 2011 Vancouver

From the website:

This year’s conference theme is “Open Source Enterprise Solutions, Cloud Computing, and Community Leadership”, featuring dozens of highly-relevant technical, business, and community-focused sessions aimed at beginner, intermediate, and expert audiences that demonstrate specific professional problems and real-world solutions that focus on “Apache and …”:

  • … Enterprise Solutions (from ActiveMQ to Axis2 to ServiceMix, OFBiz to Chemistry, the gang’s all here!)
  • … Cloud Computing (Hadoop, Cassandra, HBase, CouchDB, and friends)
  • … Emerging Technologies + Innovation (Incubating projects such as Libcloud, Stonehenge, and Wookie)
  • … Community Leadership (mentoring and meritocracy, GSoC and related initiatives)
  • … Data Handling, Search + Analytics (Lucene, Solr, Mahout, OODT, Hive and friends)
  • … Pervasive Computing (Felix/OSGi, Tomcat, MyFaces Trinidad, and friends)
  • … Servers, Infrastructure + Tools (HTTP Server, SpamAssassin, Geronimo, Sling, Wicket and friends)

March 1, 2011

NoSQL Databases: Why, what and when

NoSQL Databases: Why, what and when by Lorenzo Alberton.

When I posted RDBMS in the Social Networks Age I did not anticipate returning the very next day with another slide deck from Lorenzo. But, after viewing this slide deck, I just had to post it.

It is a very good overview of NoSQL databases and their underlying principles, with useful graphics as well (as opposed to the other kind).

I am going to have to study his graphic technique in hopes of applying it to the semantic issues that are at the core of topic maps.

February 20, 2011

HSearch NoSQL Search Engine Built on HBase

Filed under: HBase,HSearch,NoSQL — Patrick Durusau @ 10:51 am

HSearch NoSQL Search Engine Built on HBase

HSearch features include:

  • Multi-XML formats
  • Record and document level search access control
  • Continuous index updation
  • Parallel indexing using multi-machines
  • Embeddable inside application
  • A REST-ful Web service gateway that supports XML
  • Auto sharding
  • Auto replication

Original title and link: HSearch: NoSQL Search Engine Built on HBase (NoSQL databases © myNoSQL)

Another entry in the NoSQL arena.

I don’t recall but was parallel querying discussed for TMQL?

February 17, 2011

HBase and Lucene for realtime search

Filed under: HBase,Lucene — Patrick Durusau @ 6:48 am

HBase and Lucene for realtime search

From the post that starts this exciting thread:

I’m curious as to what a ‘good’ approach would be for implementing search in HBase (using Lucene) with the end goal being the integration of realtime search into HBase. I think the use case makes sense as HBase is realtime and has a write-ahead log, performs automatic partitioning, splitting of data, failover, redundancy, etc. These are all things Lucene does not have out of the box, that we’d essentially get for ‘free’.

For starters: Where would be the right place to store Lucene segments or postings? Eg, we need to be able to efficiently perform a linear iteration of the per-term posting list(s).

Thanks!

Jason Rutherglen

This could definitely have legs for exploring data sets, for authoring topic maps, or (assuming a dynamic synonyms table composed of conditions for synonymy) even for acting as a topic map engine.
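Jason's question about where to put posting lists can be made concrete with a toy inverted index keyed per term — roughly the shape that storing Lucene postings as HBase rows would take, with the linear per-term iteration he asks about. Purely illustrative; none of these names come from Lucene or HBase:

```python
from collections import defaultdict

class ToyPostingStore:
    """Toy inverted index keyed per term, sketching what per-term
    posting lists stored as rows in a key/value store might look like."""
    def __init__(self):
        self.postings = defaultdict(list)  # term -> list of doc ids

    def index(self, doc_id, text):
        for term in set(text.lower().split()):
            self.postings[term].append(doc_id)

    def search(self, term):
        # Linear iteration over one term's posting list: the efficient
        # access pattern the thread is concerned with preserving.
        return self.postings[term.lower()]

store = ToyPostingStore()
store.index(1, "HBase is realtime")
store.index(2, "Lucene search on HBase")
print(store.search("hbase"))  # [1, 2]
```

The hard part the thread goes on to discuss is keeping that per-term locality once the postings are spread across automatically split, replicated regions.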

Will keep a close eye on this activity.

January 22, 2011

Advanced HBase – Post

Filed under: HBase,NoSQL — Patrick Durusau @ 7:14 pm

Advanced HBase by Lars George from Alex Popescu’s MyNoSQL blog.

January 20, 2011

HBase 0.90.0 Released: Over 1000 Fixes and Improvements – Post

Filed under: HBase,NoSQL — Patrick Durusau @ 6:21 am

HBase 0.90.0 Released: Over 1000 Fixes and Improvements

From Alex Popescu news that HBase 0.90.0 has been released!

HBase homepage

December 31, 2010

Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase comparison – Post

Filed under: Cassandra,CouchDB,HBase,NoSQL,Redis,Riak — Patrick Durusau @ 11:01 am

Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase comparison

Not enough detail for decision making but a useful overview nonetheless.

November 10, 2010

OpenTSDB

Filed under: HBase,NoSQL — Patrick Durusau @ 12:28 pm

OpenTSDB

From the website:

OpenTSDB is a distributed, scalable Time Series Database (TSDB) written on top of HBase. OpenTSDB was written to address a common need: store, index and serve metrics collected from computer systems (network gear, operating systems, applications) at a large scale, and make this data easily accessible and graphable.

Thanks to HBase’s scalability, OpenTSDB allows you to collect many thousands of metrics from thousands of hosts and applications, at a high rate (every few seconds). OpenTSDB will never delete or downsample data and can easily store billions of data points. As a matter of fact, StumbleUpon uses it to keep track of hundred of thousands of time series and collects over 100 million data points per day in their main production cluster.

Imagine having the ability to quickly plot a graph showing the number of active worker threads in your web servers, the number of threads used by your database, and correlate this with your service’s latency (example below). OpenTSDB makes generating such graphs on the fly a trivial operation, while manipulating millions of data point for very fine grained, real-time monitoring.
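OpenTSDB's data model is simple to sketch: each point is a metric name, a timestamp, a value, and a set of tags (like `host=web1`), and queries select a metric filtered by tags. A toy in-memory version of that shape (illustrative only; OpenTSDB's actual HBase row-key schema is more clever than this):

```python
from collections import namedtuple

# Each stored point: metric name, timestamp, value, and a tag dict.
Point = namedtuple("Point", "metric timestamp value tags")

class ToyTSDB:
    def __init__(self):
        self.points = []

    def put(self, metric, timestamp, value, **tags):
        self.points.append(Point(metric, timestamp, value, tags))

    def query(self, metric, **tags):
        """All points for a metric whose tags include the given ones."""
        return [p for p in self.points
                if p.metric == metric
                and all(p.tags.get(k) == v for k, v in tags.items())]

db = ToyTSDB()
db.put("threads.active", 1000, 42, host="web1")
db.put("threads.active", 1001, 57, host="web2")
print(len(db.query("threads.active", host="web1")))  # 1
print(len(db.query("threads.active")))               # 2
```

The tags are what make the "correlate web server threads with database latency" graphs possible: one metric, sliced by host or service at query time.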

Imagine how a busy sysadmin would react if those metrics were endowed with subject identity and participated in associations with system documentation.

Or metrics of a power distribution center had subject identity so they could tie into multiple emergency/maintenance networks?

Subjects are cheap, subject identity is useful.
(maybe I should make that my tag line, comments?)

***
I first saw this at OpenTSDB: A HBase Scalable Time Series Database by Alex Popescu

October 9, 2010

BigTable Model with Cassandra and HBase – Post

Filed under: Cassandra,HBase,NoSQL — Patrick Durusau @ 6:29 am

BigTable Model with Cassandra and HBase

A non-hand-waving explanation of Cassandra and HBase.

Has anyone tried the column-of-values approach, where subjectIdentifier or subjectLocator is a set of values?
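To make the question above concrete: in the BigTable model a row can hold a whole set of values under one column family by using each value as its own qualifier, so a topic row could carry several subject identifiers without any join. A hypothetical sketch (this is not the schema of HBase, Cassandra, or any topic map engine):

```python
from collections import defaultdict

class ToyTopicRow:
    """Sketch of the 'column of values' idea: each subjectIdentifier is
    its own column qualifier under one family, giving the row set
    semantics for identifiers rather than a single value."""
    def __init__(self):
        self.families = defaultdict(dict)  # family -> {qualifier: value}

    def add_identifier(self, family, identifier):
        # The qualifier IS the identifier; the value is unused.
        self.families[family][identifier] = b""

    def identifiers(self, family):
        return set(self.families[family])

row = ToyTopicRow()
row.add_identifier("subjectIdentifier", "http://example.org/topic/hbase")
row.add_identifier("subjectIdentifier", "http://example.org/topic/hadoop-db")
print(len(row.identifiers("subjectIdentifier")))  # 2
```

Merging two topics would then be a union of qualifier sets, which the wide-column model handles naturally.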

July 22, 2010

Lily – the Scalable NoSQL Content Repository

Filed under: HBase,NoSQL,Solr — Patrick Durusau @ 7:02 am

Lily – the Scalable NoSQL Content Repository

A product prior to customers. What a marketing concept!

Sarcasm to one side, this is a significant development for scalable content storage using NoSQL and for topic maps.

The more data stored in Lily the less findable it will be, particularly across vocabularies.

Traditional blind mappings will work but they will also remain impervious to reliable sharing/scaling.

Topic maps need not be embedded in data storage applications but that could be a key marketing point for some customers. Something to keep in mind while evaluating Lily.
