Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

January 28, 2012

Mavuno: Hadoop-Based Text Mining Toolkit

Filed under: Mahout,Natural Language Processing — Patrick Durusau @ 10:54 pm

Mavuno: A Hadoop-Based Text Mining Toolkit

From the webpage:

Mavuno is an open source, modular, scalable text mining toolkit built upon Hadoop. It supports basic natural language processing tasks (e.g., part of speech tagging, chunking, parsing, named entity recognition), is capable of large-scale distributional similarity computations (e.g., synonym, paraphrase, and lexical variant mining), and has information extraction capabilities (e.g., instance and semantic relation mining). It can easily be adapted to new input formats and text mining tasks.

Just glancing at the documentation I am intrigued by the support for Java regular expressions. More on that this coming week.
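In the meantime, here is a reminder of how far plain java.util.regex goes for simple pattern mining. This is my own sketch in ordinary Java regex syntax, not Mavuno’s pattern format:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SuchAsExtractor {
    // Illustrative "X such as Y" pattern for instance mining;
    // plain java.util.regex, not Mavuno's own pattern language.
    private static final Pattern SUCH_AS =
            Pattern.compile("(\\w+)\\s+such as\\s+(\\w+)");

    public static void main(String[] args) {
        String text = "Databases such as Cassandra scale well, "
                + "while languages such as Ruby favor expressiveness.";
        Matcher m = SUCH_AS.matcher(text);
        while (m.find()) {
            // group(2) is the instance, group(1) the class it was mentioned with
            System.out.println(m.group(2) + " isA " + m.group(1));
        }
    }
}
```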

I first saw this at myNoSQL.

Microsoft’s plan for Hadoop and big data

Filed under: BigData,Hadoop,Microsoft — Patrick Durusau @ 10:54 pm

Microsoft’s plan for Hadoop and big data by Edd Dumbill.

From the post:

Microsoft has placed Apache Hadoop at the core of its big data strategy. It’s a move that might seem surprising to the casual observer, being a somewhat enthusiastic adoption of a significant open source product.

The reason for this move is that Hadoop, by its sheer popularity, has become the de facto standard for distributed data crunching. By embracing Hadoop, Microsoft allows its customers to access the rapidly-growing Hadoop ecosystem and take advantage of a growing talent pool of Hadoop-savvy developers.

Microsoft’s goals go beyond integrating Hadoop into Windows. It intends to contribute the adaptions it makes back to the Apache Hadoop project, so that anybody can run a purely open source Hadoop on Windows.

If MS is taking the data integration road, isn’t that something your company needs to be thinking about?

There is all that data diversity that Hadoop processing is going to uncover, but I have some suggestions about that issue. 😉

Nothing but good can come of MS using Hadoop as an integration data appliance. MS customers will benefit and parts of MS won’t have to worry about stepping on each other, a natural outcome of hard coding into formats. But that is an issue for another day.

About the Performance of Map Reduce Jobs

Filed under: Amazon Web Services AWS,Hadoop,MapReduce — Patrick Durusau @ 10:53 pm

About the Performance of Map Reduce Jobs by Michael Kopp.

From the post:

One of the big topics in the BigData community is Map/Reduce. There are a lot of good blogs that explain what Map/Reduce does and how it works logically, so I won’t repeat it (look here, here and here for a few). Very few of them however explain the technical flow of things, which I at least need in order to understand the performance implications. You can always throw more hardware at a map reduce job to improve the overall time. I don’t like that as a general solution and many Map/Reduce programs can be optimized quite easily, if you know what to look for. And optimizing a large map/reduce job can be instantly translated into ROI!

The Word Count Example

I went over some blogs and tutorials about the performance of Map/Reduce. Here is one that I liked. While there are a lot of good tips out there, none, except the one mentioned, talk about the Map/Reduce program itself. Most dive right into the various Hadoop options to improve distribution and utilization. While this is important, I think we should start with the actual problem we are trying to solve, which means the Map/Reduce job itself.

To make things simple I am using Amazon’s Elastic MapReduce. In my setup I started a new Job Flow with multiple steps for every execution. The Job Flow consisted of one master node and two task nodes. All of them were using the Small Standard instance.

While AWS Elastic Map/Reduce has its drawbacks in terms of startup and file latency (Amazon S3 has a high volatility), it is a very easy and consistent way to execute Map/Reduce jobs without needing to set up your own Hadoop cluster. And you only pay for what you need! I started out with the word count example that you see in every map reduce documentation, tutorial or blog.

Yet another reason (other than avoiding outright failure) for testing your Map/Reduce jobs locally before running them in a pay-for-use environment. The better you understand the job and its requirements, the more likely you are to create an effective and cost-efficient solution.
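For readers who want to follow along, the word count job Kopp starts from looks roughly like this with the Hadoop 0.20 “new” MapReduce API. A minimal sketch of my own, not Kopp’s code:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts for each word. Also usable as a combiner.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // cuts shuffle volume, a cheap job-level optimization
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Run it against a small local input first; the combiner line is an example of the kind of cheap optimization of the job itself that the post is talking about.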

YapMap: Breck’s Fun New Project to Improve Search

Filed under: Interface Research/Design,Search Interface,Searching — Patrick Durusau @ 7:32 pm

YapMap: Breck’s Fun New Project to Improve Search

From the post:

What I like about the user interface is that threads can be browsed easily–I have spent hours on remote controlled airplane forums reading every post because it is quite difficult to find relevant information within a thread. The color coding and summary views are quite helpful in eliminating irrelevant posts.

My first job is to get query spell checking rolling. Next is search optimized for the challenges of thread based postings. The fact that relevance of a post to a query is a function of a thread is very interesting. I will hopefully get to do some discourse analysis as well.

I will continue to run Alias-i/LingPipe. The YapMap involvement is just too fun a project to pass up given that I get to build a fancy search and discovery tool.

What do you think about the thread browsing capabilities?

I am sympathetic to the “reading every post” problem but I am not sure threading helps, at least not completely.

Threading doesn’t help with posters like me who make “off-thread” comments that may be the very ones you are looking for.

Comments about the interface?

Cloudera’s Hadoop Demo VM

Filed under: Hadoop — Patrick Durusau @ 7:31 pm

Cloudera’s Hadoop Demo VM

Cloudera has made Hadoop demo packages for VMware, KVM and VirtualBox.

I tried two of them this weekend and would not recommend either one of them.

1. VMware – After uncompressing the image, it loads and runs easily enough (remember to up the RAM to 2 GB). The only problem comes when you try to run the Hadoop Tutorial as suggested on the image page. The path in the tutorial is wrong for the current release. Rather than 0.20.2-cdh3u1, it should read (in full) /usr/lib/hadoop-0.20/hadoop-0.20.2-cdh3u2-core.jar.

There are other path/directory issues, such as /usr/joe, which is nowhere to be seen in the demo release.

BTW, the xterm defaults to a nearly unreadable blue color for directories. If you try to reset it, you will think there is no change. Try “xterm” to spawn a new xterm window. Your changes will appear there. Think about it for a minute and it will make sense.

2. VirtualBox – Crashes every time I run it. I have three other VMs that work just fine so I suspect it isn’t my setup.

Not encouraging and rather disappointing.

I normally just install Cloudera releases on my Ubuntu box and have always been pleased with the results. Hence the expectation of a good experience with the Demo VM’s.

Demo VM’s are supposed to entice users to experience the full power of Hadoop, not drive them away.

I would either fix the Demo VM’s to work properly and have pre-installed directories and resources to illustrate the use of Hadoop or stop distributing them.

Just a little QA goes a long way.

Functional Programming with High Performance Actors

Filed under: Functional Programming,Graphs — Patrick Durusau @ 7:31 pm

Functional Programming with High Performance Actors

From the introduction:

But with current technologies we end up developing programs with an architecture specific to the problem at hand. And re-architect our programs as the requirements change. This is a labor-intensive approach that minimizes code reuse, particularly when using actors and refactoring method calls to be asynchronous messages or asynchronous messages become method calls. To maximize reuse, an actor should neither know nor care if the exchange of messages with another actor is synchronous or asynchronous. And if most message exchanges are synchronous, we can have many small actors working at a fairly low-level without slowing the program down to a crawl.

Another issue is flow control and its impact on throughput. Messaging is typically one-way, and it takes extra effort to add flow control. But without some flow control, the systems we develop behave badly under load. Also, if you send one message at a time and do not proceed until a response is received, most actor implementations will suffer a severe drop in throughput. What is needed is a messaging system which is more like method calls, while still having a reasonably high throughput rate.

I don’t think there is any doubt that hardware development has greatly out-stripped the ability of software to take full advantage of the additional processing power.

One possible paradigm shift may be extensive use of message passing.

Thoughts on how message passing can be applied to graph processing?
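To make the synchronous/asynchronous distinction concrete, here is a bare-bones mailbox actor in plain Java. It is a sketch of the general idea only, not the library the introduction above describes:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// A minimal actor: one mailbox, one thread draining it. Senders enqueue a message
// and return immediately (an asynchronous exchange); a synchronous exchange would
// simply be a direct method call on the object instead.
public class CountingActor implements Runnable {
    private static final String STOP = "stop";        // poison pill to end the actor

    private final BlockingQueue<String> mailbox = new LinkedBlockingQueue<String>();
    private long count = 0;

    public void tell(String message) {                // non-blocking send
        mailbox.offer(message);
    }

    @Override
    public void run() {
        try {
            while (true) {
                String msg = mailbox.take();          // blocks until a message arrives
                if (STOP.equals(msg)) {
                    System.out.println("messages processed: " + count);
                    return;
                }
                count++;                              // state is touched by this thread only
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        CountingActor actor = new CountingActor();
        Thread t = new Thread(actor, "counting-actor");
        t.start();

        for (int i = 0; i < 1000; i++) {
            actor.tell("increment");                  // fire-and-forget sends
        }
        actor.tell(STOP);
        t.join();                                     // wait for the actor to drain its mailbox
    }
}
```

Note the unbounded queue: there is no flow control here, which is exactly the behaving-badly-under-load problem the quoted introduction warns about. A bounded queue with a blocking put() is the crudest fix.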

Pregel

Filed under: Graphs,Pregel — Patrick Durusau @ 7:31 pm

Pregel by Michael Nielsen.

From the post:

In this post, I describe a simple but powerful framework for distributed computing called Pregel. Pregel was developed by Google, and is described in a 2010 paper written by seven Googlers. In 2009, the Google Research blog announced that the Pregel system was being used in dozens of applications within Google.

Pregel is a framework oriented toward graph-based algorithms. I won’t formally define graph-based algorithms here – we’ll see an example soon enough – but roughly speaking a graph-based algorithm is one which can be easily expressed in terms of the vertices of a graph, and their adjacent edges and vertices. Examples of problems which can be solved by graph-based algorithms include determining whether two vertices in a graph are connected, where there are clusters of connected vertices in a graph, and many other well-known graph problems. As a concrete example, in this post I describe how Pregel can be used to determine the PageRank of a web page.

What makes Pregel special is that it’s designed to scale very easily on a large-scale computer cluster. Typically, writing programs for clusters requires the programmer to get their hands dirty worrying about details of the cluster architecture, communication between machines in the cluster, considerations of fault-tolerance, and so on. The great thing about Pregel is that Pregel programs can be scaled (within limits) automatically on a cluster, without requiring the programmer to worry about the details of distributing the computation. Instead, they can concentrate on the algorithm they want to implement. In this, Pregel is similar to the MapReduce framework. Like MapReduce, Pregel gains this ability by concentrating on a narrow slice of problems. What makes Pregel interesting and different to MapReduce is that it is well-adapted to a somewhat different class of problems.
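For orientation, here is a toy, single-machine rendering of the vertex-centric PageRank computation Nielsen describes. It illustrates the programming model only; it is not Google’s API and nothing here is distributed:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Each superstep, every vertex folds the messages sent to it in the previous
// superstep into its value, then sends new messages along its out-edges.
public class ToyPregelPageRank {
    static final double DAMPING = 0.85;
    static final int SUPERSTEPS = 30;

    public static void main(String[] args) {
        // Tiny directed graph: vertex -> out-neighbors.
        Map<String, List<String>> outEdges = new HashMap<String, List<String>>();
        outEdges.put("A", Arrays.asList("B", "C"));
        outEdges.put("B", Arrays.asList("C"));
        outEdges.put("C", Arrays.asList("A"));

        int n = outEdges.size();
        Map<String, Double> rank = new HashMap<String, Double>();
        Map<String, List<Double>> inbox = new HashMap<String, List<Double>>();
        for (String v : outEdges.keySet()) {
            rank.put(v, 1.0 / n);                       // uniform initial rank
            inbox.put(v, new ArrayList<Double>());
        }

        for (int superstep = 0; superstep < SUPERSTEPS; superstep++) {
            Map<String, List<Double>> nextInbox = new HashMap<String, List<Double>>();
            for (String v : outEdges.keySet()) {
                nextInbox.put(v, new ArrayList<Double>());
            }

            for (String v : outEdges.keySet()) {
                // "compute()" for vertex v: absorb the rank mass received last superstep.
                if (superstep > 0) {
                    double sum = 0.0;
                    for (double m : inbox.get(v)) {
                        sum += m;
                    }
                    rank.put(v, (1.0 - DAMPING) / n + DAMPING * sum);
                }
                // Send this vertex's rank, split evenly, to its out-neighbors.
                double share = rank.get(v) / outEdges.get(v).size();
                for (String w : outEdges.get(v)) {
                    nextInbox.get(w).add(share);
                }
            }
            inbox = nextInbox;                           // the superstep barrier
        }

        System.out.println(rank);                        // converged PageRank estimates
    }
}
```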

What class of problems would you say Pregel is “well-adapted” to solve?

I ask because I am unaware of any data structure that a graph cannot represent. If there is an issue, it isn’t one of representation, at least in theory.

Is it a problem in practice/implementation?

The Harvard Library Innovation Laboratory at Harvard Law School

Filed under: Legal Informatics,Library — Patrick Durusau @ 7:30 pm

The Harvard Library Innovation Laboratory at Harvard Law School

The “Stuff We’re Looking At” sidebar is of particular interest, with a wide range of resources and projects worth exploring.

Any similar library labs/resources that you would suggest?

BTW, The Molecule of Data by Karen Coyle raises a number of points that I think are highly contestable if not provably wrong.

Take a listen and see what you think. I will be posting specific comments this coming week.

January 27, 2012

Cloud deployments, Heroku, Spring Data Neo4j and other cool stuff (Stockholm, Sweden)

Filed under: Cloud Computing,Heroku,Neo4j,Spring Data — Patrick Durusau @ 4:36 pm

Cloud deployments, Heroku, Spring Data Neo4j and other cool stuff

From the announcement:

We will meet up at The Hub (no, not the github unfortunately, though that’s cool too). This time it will be a visit by Peter Neubauer, VP Community at Neo Technology (and maybe some other Neo4j hackers) who will talk about Cloud deployments, Heroku, Spring Data Neo4j etc. This will be a very interesting meetup as we will touch subjects connected to Python, Ruby, Java and what not. Laptops are optional but hey, we won’t stop you from hacking later :).

We also plan on doing a community brain storm session where we can talk about

  • what are the things that we would like to see Neo4j do, things that are missing, things that can be improved
  • how can we help spread the adoption of Neo4j. how to improve your learning

As usual we would love to see people contribute, so if you have something to show or share please let us know and we can modify the agenda. We will start 1 hour earlier than the usual 6:30 on the Friday (so we don’t come between you and your well deserved Friday weekend).

This meetup invite will remain open till 31st of January 2012. So bring your friends, have some beer and discuss graphy things with us.

The RSVP closes 31 January 2012.

Notes, posts, and pointers to the same greatly appreciated!

NOSQL for bioinformatics: Bio4j, a real world use case using Neo4j (Madrid, Spain)

Filed under: Bioinformatics,Neo4j,NoSQL — Patrick Durusau @ 4:35 pm

NOSQL for bioinformatics: Bio4j, a real world use case using Neo4j

Monday, January 30, 2012, 7:00 PM

From the meeting notice:

The world of data is changing. Big Data and NOSQL are bringing new ways of looking at and understanding your data. Prominent in the trend is Neo4j, a graph database that elevates relationships to first-class citizens, uniquely offering a way to model and query highly connected data.

This opens a whole new world of possibilities for a wide range of fields, and bioinformatics is no exception. Quite the opposite, this paradigm provides bioinformaticians with a powerful and intuitive framework for dealing with biological data which by nature is incredibly interconnected.

We’ll give a quick overview of the NOSQL world today, then introducing Neo4j in particular. Afterwards we’ll move to real use cases focusing on the Bio4j project.

I would really love to see this presentation, particularly the Bio4j part.

But, I won’t be in Madrid this coming Monday.

If you are, don’t miss this presentation! Take good notes and blog about it. The rest of us would appreciate it!

Analytics with MongoDB (commercial opportunity here)

Filed under: Analytics,Data,Data Analysis,MongoDB — Patrick Durusau @ 4:35 pm

Analytics with MongoDB

Interesting enough slide deck on analytics with MongoDB.

Relies on custom programming and then closes with this punchline (along with others, slide #41):

  • If you’re a business analyst you have a problem
    • better be BFF with some engineer 🙂

I remember when word processing required a lot of “dot” commands and editing markup languages with little or no editor support. Twenty years (has it been that long?) later and business analysts are doing word processing, markup and damned near print shop presentation without working close to the metal.

Can anyone name any products that have made large sums of money making it possible for business analysts and others to perform those tasks?

If so, ask yourself if you would like a piece of the action that frees business analysts from script kiddie engineers.

Even if a general application is out of reach at present, imagine writing access routines for common public data sites.

Create a market for the means to import and access particular data sets.
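As a toy illustration of such an access routine, here is a sketch using the MongoDB Java driver. The collection and field names are invented for the example:

```java
import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.Mongo;

// Sketch of the kind of "access routine" imagined above: pull records from some
// public data set into MongoDB and hand the analyst a queryable collection.
public class DonationLoader {
    public static void main(String[] args) throws Exception {
        Mongo mongo = new Mongo("localhost");
        DB db = mongo.getDB("publicdata");
        DBCollection donations = db.getCollection("donations");

        // In a real routine these rows would come from the source site's files.
        donations.insert(new BasicDBObject("donor", "SMITH, JANE")
                .append("state", "NY")
                .append("amount", 250)
                .append("year", 2011));

        // The analyst-facing side: a simple query, no custom engineering required.
        DBCursor cursor = donations.find(new BasicDBObject("state", "NY"));
        while (cursor.hasNext()) {
            System.out.println(cursor.next());
        }
        mongo.close();
    }
}
```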

ROMA User-Customizable NoSQL Database in Ruby

Filed under: NoSQL,ROMA,Ruby — Patrick Durusau @ 4:34 pm

ROMA User-Customizable NoSQL Database in Ruby

From the presentation:

  • User-customizable NoSQL database in Ruby
  • Features
    • Key-value model
    • High scalability
    • High availability
    • Fault-tolerance
    • Better throughput
    • And…
  • To meet application-specific needs, ROMA provides
    • Plug-in architecture
    • Domain specific language (DSL) for Plug-in
  • ROMA enables meeting the above need in Rakuten Travel

The ROMA source code: http://github.com/roma/roma/

Reportedly has 70 million users and while that may not be “web scale,” it may scale enough to meet your needs. 😉

Of particular interest are the DSL capabilities. See slides 31-33. Declaring your own commands. Something for other projects to consider.

Countandra

Filed under: NoSQL,Semantic Diversity — Patrick Durusau @ 4:34 pm

Countandra

From the webpage:

Since Aryabhatta invented zero, mathematicians such as John von Neumann have been in pursuit of efficient counting, and architects have constantly built systems that compute counts quicker. In this age of social media, where hundreds of thousands of events take place every second, we were inspired by Twitter’s Rainbird project to develop a distributed counting engine that can scale linearly.

Countandra is a hierarchical distributed counting engine on top of Cassandra (to increment/decrement hierarchical data) and Netty (HTTP-based interface). It provides a complete HTTP-based interface to both posting events and getting queries. The syntax of an event posting is FORMS compatible. The result of a query is emitted in JSON to make it directly manipulable by browsers.

Features

  • Geographically distributed counting.
  • Easy Http Based interface to insert counts.
  • Hierarchical counting such as com.mywebsite.music.
  • Retrieves counts, sums and squares in near real time.
  • Simple HTTP queries provide the desired output in JSON format.
  • Queries can be sliced by period, such as LASTHOUR, LASTYEAR and so on, for MINUTELY, HOURLY, DAILY, MONTHLY values.
  • Queries can be classified for anything in the hierarchy, such as com, com.mywebsite or com.mywebsite.music.
  • Open Source and Ready to Use!
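To make the “easy HTTP based interface” above concrete, here is a hedged sketch of posting an event from Java. The endpoint path and parameter names are placeholders of mine; check the Countandra documentation for the real ones:

```java
import java.io.OutputStreamWriter;
import java.net.HttpURLConnection;
import java.net.URL;

// Only the form-encoded POST mechanics are meant to carry over; the URL and
// parameter names below are assumptions, not Countandra's documented API.
public class PostCountEvent {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:8080/countandra/insert");   // placeholder endpoint
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");

        // Hierarchical category plus a count, form-encoded like a browser FORM submit.
        String body = "category=com.mywebsite.music&count=1";
        OutputStreamWriter out = new OutputStreamWriter(conn.getOutputStream(), "UTF-8");
        out.write(body);
        out.close();

        System.out.println("HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}
```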

Countandra illustrates that not every application need be a general purpose one. Countandra is designed to be a counting engine and to answer defined query types, nothing more.

There is a lesson there for semantic diversity solutions. It is better to attempt to solve part of the semantic diversity issue than to attempt a solution for everyone. At least partial solutions have a chance of being a benefit before being surpassed by changing technologies and semantics.

BTW, Countandra uses a Java long for time values, so in the words of the Unix Time Wikipedia entry:

In the negative direction, this goes back more than twenty times the age of the universe, and so suffices. In the positive direction, whether the approximately 293 billion representable years is truly sufficient depends on the ultimate fate of the universe, but it is certainly adequate for most practical purposes.

Rather than “suffices” and “most practical purposes” I would have said, “is adequate for present purposes” in both cases.

Seismic Data Science: Reflection Seismology and Hadoop

Filed under: Hadoop,Science — Patrick Durusau @ 4:32 pm

Seismic Data Science: Reflection Seismology and Hadoop by Josh Wills.

From the post:

When most people first hear about data science, it’s usually in the context of how prominent web companies work with very large data sets in order to predict clickthrough rates, make personalized recommendations, or analyze UI experiments. The solutions to these problems require expertise with statistics and machine learning, and so there is a general perception that data science is intimately tied to these fields. However, in my conversations at academic conferences and with Cloudera customers, I have found that many kinds of scientists– such as astronomers, geneticists, and geophysicists– are working with very large data sets in order to build models that do not involve statistics or machine learning, and that these scientists encounter data challenges that would be familiar to data scientists at Facebook, Twitter, and LinkedIn.

A nice overview of areas of science that were using “big data” decades before the current flurry of activity. The use of Hadoop in reflection seismology is just one of the fuller examples of that use.

The takeaway I have from this post is that Hadoop skills are going to be in demand across business, science and, one would hope, the humanities.

Alarum – World Wide Shortage of Logicians and Ontologists

Filed under: BigData,Linked Data,Logic,Ontology — Patrick Durusau @ 4:32 pm

Did you know there is an alarming shortage of logicians and ontologists around the world? Apparently everywhere, in all countries.

Came as a complete shock/surprise to me.

I was reading ‘Digital Universe’ to Add 1.8 Zettabytes in 2011 by Rich Miller which says:

More than 1.8 zettabytes of information will be created and stored in 2011, according to the latest IDC Digital Universe Study sponsored by EMC. That’s a mind-boggling figure, equivalent to 1.8 trillion gigabytes – enough information to fill 57.5 billion 32GB Apple iPads. It also illustrates the challenge in storing and managing all that data.

But then I remembered the “state of the Semantic Web” report of 31,634,213,770 triples.

I know it is apples and oranges to some degree but compare the figures for data and linked data:

Data 1,800,000,000,000,000,000,000
Triples 31,634,213,770

Not to mention that the semantics of data is constantly evolving. If not business and scientific data, recall that “texting” was unknown little more than a decade ago.

It is clear that we don’t have enough logicians and ontologists (who have yet to agree on a common upper ontology) to keep up with the increasing flow of data. For that matter, the truth is they have been constantly falling behind for centuries. Systems are proposed, cover some data, only to become data that has to be covered by subsequent systems.

Some options to deal with this crisis:

  • Universal Logician/Ontologist Conscription Act – All 18 year olds world wide have to spend 6 years in the LogoOnto Corps. First four years learning the local flavor of linked data and the last two years coding data.
  • Excess data to /dev/null – Pipe all non-Linked data to /dev/null until logicians/ontologists can catch up. Projected to be sometime after 5500, perhaps late 5500’s. (According to Zager and Evans.)
  • ???

There are other options. Propose yours and/or wait for some suggestions here next week!

Building a Scalable Web Crawler with Hadoop

Filed under: Hadoop,Webcrawler — Patrick Durusau @ 4:31 pm

Building a Scalable Web Crawler with Hadoop

Ahad Rana of Common Crawl presents an architectural view of a web crawler based on Hadoop.

You can access the data from Common Crawl.

But the architecture notes may be useful if you decide to crawl a sub-part of the web and/or you need to crawl “deep web” data in your organization.

Getting Started with Apache Cassandra (realistic data import example)

Filed under: Cassandra,NoSQL — Patrick Durusau @ 4:30 pm

Getting Started with Apache Cassandra

From the post:

If you haven’t begun using Apache Cassandra yet and you wanted a little handholding to help get you started, you’re in luck. This article will help you get your feet wet with Cassandra and show you the basics so you’ll be ready to start developing Cassandra applications in no time.

Why Cassandra?

Do you need a more flexible data model than what’s offered in the relational database world? Would you like to start with a database you know can scale to meet any number of concurrent user connections and/or data volume size and run blazingly fast? Have you been needing a database that has no single point of failure and one that can easily distribute data among multiple geographies, data centers, and the cloud? Well, that’s Cassandra.

Not to pick on Cassandra or this post in particular, but have you noticed that introductory articles have you enter a trivial amount of data as a starting point? Which makes sense, you need to learn the basics, but why not conclude with importing a real data set? Particularly for databases that “scale” so well.

For example, detail how to import campaign donation records from the Federal Election Commission in the United States, which are written in COBOL format. That would give the user a better data set for CQL exercises.
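A sketch of what that import might look like from the Java side, with made-up column positions standing in for the real FEC record layout:

```java
import java.io.BufferedReader;
import java.io.FileReader;

// Sketch of reading a fixed-width ("COBOL style") record file, the kind of import
// exercise suggested above. The column positions are invented for illustration;
// a real loader would take them from the FEC file documentation.
public class FixedWidthImport {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
            if (line.length() < 42) {
                continue;                                 // skip short or blank lines
            }
            // Hypothetical layout: donor name [0,30), state [30,32), amount [32,42)
            String donor = line.substring(0, 30).trim();
            String state = line.substring(30, 32);
            long amount  = Long.parseLong(line.substring(32, 42).trim());
            // Hand each parsed record to the database insert of your choice (CQL, a driver, etc.).
            System.out.printf("%s | %s | %d%n", donor, state, amount);
        }
        in.close();
    }
}
```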

A Full Table Scan of Indexing in NoSQL

Filed under: Indexing,NoSQL — Patrick Durusau @ 4:30 pm

A Full Table Scan of Indexing in NoSQL by Will LaForest (MongoDB).

One slide reads:

What Indexes Can Help Us Do

  • Find the “location” of data
    • Based upon a value
    • Based upon a range
    •  Geospatial
  • Fast checks for existence
    • Uniqueness enforcement
  • Sorting
  • Aggregation
    • Usually covering indexes

The next slide is titled: “Requisite Book Analogy” with an image of a couple of pages from an index.

So, let’s copy out some of those entries and see where they fit into Will’s scheme:

Bears, 75, 223
Beds, good, their moral influence, 184, 186
Bees, stationary civilisation of, 195
Beethoven on Handel, 18
Beginners in art, how to treat them, 195

The entry for Bears, I think, qualifies as finding the “location” of data based on a value.

And I see sorting, but those two are the only aspects of Will’s indexing that I see.

Do you see more?

What I do see is that the index is expressing relationships between subjects (“Beethoven on Handel”) and commenting on what information awaits a reader (“Beds, good, their moral influence”).

A NoSQL index could replicate the strings of these entries but without the richness of this index.

For example, consider the entry:

Aurora Borealis like pedal notes in Handel’s bass, 83

One expects the entry on Handel to contain that reference as well as the one for “Beethoven on Handel.” (I have only the two pages in this image and, as far as I know, I haven’t seen this particular index before.)

Question: How would you use the indexes in MongoDB to represent the richness of these two pages?

Question: Where did MongoDB (or other NoSQL) indexing fail?

It is important to remember that indexes, prior to the auto-generated shallowness of recent decades, were highly skilled acts of authorship that were a value-add for readers.
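One hedged answer to the first question, using the MongoDB Java driver: model each index entry as a document that keeps the editorial gloss and cross-references explicit, then index the fields you query. The field names are mine, not Will’s:

```java
import java.util.Arrays;

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.Mongo;

// Each back-of-book entry becomes a document carrying its gloss, pages and
// cross-subject relationships; B-tree indexes then cover the lookups.
public class BookIndexEntries {
    public static void main(String[] args) throws Exception {
        Mongo mongo = new Mongo("localhost");
        DB db = mongo.getDB("bookindex");
        DBCollection entries = db.getCollection("entries");

        entries.insert(new BasicDBObject("heading", "Beethoven")
                .append("gloss", "on Handel")
                .append("relatedTo", Arrays.asList("Handel"))
                .append("pages", Arrays.asList(18)));
        entries.insert(new BasicDBObject("heading", "Beds")
                .append("gloss", "good, their moral influence")
                .append("pages", Arrays.asList(184, 186)));

        // Indexes give fast lookup by value or range...
        entries.ensureIndex(new BasicDBObject("heading", 1));
        entries.ensureIndex(new BasicDBObject("relatedTo", 1));

        // ...but the editorial judgment in "gloss" and "relatedTo" had to be
        // authored; no index build supplies it. That is the gap noted above.
        System.out.println(entries.findOne(new BasicDBObject("relatedTo", "Handel")));
    }
}
```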

January 26, 2012

Sixth Annual Machine Learning Symposium

Filed under: CS Lectures,Machine Learning — Patrick Durusau @ 6:55 pm

Sixth Annual Machine Learning Symposium sponsored by the New York Academy of Sciences.

There were eighteen (18) presentations and any attempt to summarize on my part would do injustice to one or more of them.

Post your comments and suggestions for which ones I should watch first. Thanks!

Employee productivity: 21 critical minutes (no-line-item (nli) in the budget?)

Filed under: Marketing,Search Engines,Searching — Patrick Durusau @ 6:54 pm

Employee productivity: 21 critical minutes by Gilles ANDRE.

From the post:

Twenty-one minutes a day. That’s how long employees spend each day searching for information they know exists but is hard to find. These 21 minutes cost their company the equivalent of €1,500 per year per employee. That’s an average of two whole working weeks. This particular Mindjet study is, of course, somewhat anecdotal and some research firms such as IDC put the figure as high as €10,000 per year. These findings signal a new challenge facing businesses: employees know that the information is there, but they cannot find it. This stalemate can become extremely costly and, in some cases, can even kill off a business. Are companies really aware of this problem?

(paragraph and graphic omitted)

So far, companies have responded to this rising tide of data by spending money. They have invested large, even enormous sums in solutions to store, secure and access their information – one of the key assets of their business. They have also invested heavily in a range of different applications to meet their operational needs. Yet these same applications have created vast information silos spanning their entire organisation. Interdepartmental communication is stifled and information travels like vehicles on the M25 during rush hour.

The link to Mindjet is to their corporate website and not to the study. Ironically, I did search the Mindjet site, search being the solution Polyspot suggests, and came up empty for “21 minutes.” You would think that would be in the report somewhere as a string.

I suspect 21 minutes would be on the low side of lost employee productivity on a daily basis.

But it isn’t hard to discover why businesses have failed to address that loss in employee productivity.

Take out the latest annual report for your business with a line item budget in it. Examine it carefully and then answer the following question:

At what line item is lost employee productivity reported?

Now imagine that your CIO proposes to make information once found, found for all employees. A mixture of a search engine, indexing, topic map, with a process to keep it updated.

You don’t know the exact figures, but do you think there would be a line item in the budget for such a project?

And, would there be metrics to determine if the project succeeded or failed?

Ah, so, if the business continues to lose employee productivity there is no metric for success or failure and it never shows up as a line item in the budget.

That is the safe position.

At least until the business collapses and/or is overtaken by other companies.

If you are interested in overtaking no-line-item (nli) companies, consider evolving search applications that incorporate topic maps.

Topic maps: Information once found, stays found.

Spring onto Heroku

Filed under: Heroku,Neo4j,Spring Data — Patrick Durusau @ 6:54 pm

Spring onto Heroku by Andreas Kollegger.

From the post:

Deploying your application into the cloud is a great way to scale from “wouldn’t it be cool if..” to giving interviews to Forbes, Fast Company, and Jimmy Fallon. Heroku makes it super easy to provision everything you need, including a Neo4j Add-on. With a few simple adjustments, your Spring Data Neo4j application is ready to take that first step into the cloud.

Let’s walk through the process, assuming this scenario:

Ready? OK, first let’s look at your application.

As one commenter noted, just in time for the Neo4j Challenge!

Measuring User Retention with Hadoop and Hive

Filed under: Hadoop,Hive — Patrick Durusau @ 6:53 pm

Measuring User Retention with Hadoop and Hive by Daniel Russo.

From the post:

The Hadoop ecosystem is comprised of numerous technologies that can work together to provide a powerful and scalable mechanism for analyzing and deriving insight from large quantities of data.

In an effort to showcase the flexibility and raw power of queries that can be performed over large datasets stored in Hadoop, this post is written to demonstrate an example use case. The specific goal is to produce data related to user retention, an important metric for all product companies to analyze and understand.

Compelling demonstration of the power of Hadoop and Hive to measure raw user retention, in an “app” situation.

Question:

User retention isn’t a new issue. Does anyone know what strategies were used to measure it before Hadoop and Hive?

The reason I ask is that prior analysis of user retention may point the way towards data or relationships it wasn’t possible to capture before.

For example, when an app falls into non-use or is uninstalled, what impact (if any) does that have on known “friends” and their use of the app?

Are there any patterns to non-use/uninstalls over short or long periods of time in identifiable groups? (A social behavior type question.)

Neo4j Internals

Filed under: Neo4j — Patrick Durusau @ 6:52 pm

Neo4j Internals

From the description:

At the Neo4j London user group, we’ve seen many talks on how to use Neo4j for exploiting connected data. But how does Neo4j make working with connected data so effective? In this presentation we’ll find out how as Neo4j hacker Tobias Lindaaker takes us on a guided tour through the Neo4j’s internals. We’ll discover how the internal data structures are leveraged to provide fast traversals, how live backups work, and how multiple servers synchronize in an HA cluster. As a Neo4j user you’ll find a working knowledge of the database will give you enough “mechanical sympathy” to make your data really fly. And after this talk you’ll feel confident contributing code that scratches your connected data itches.

Posting the slides for this presentation would be very helpful. Camera work is good but this is the sort of material that needs to be studied in detail.

Interesting comparison between Gremlin and Cypher. Gremlin as a DSL in Groovy has a full programming language available.

I can’t promise that this presentation will make you a better Neo4j user/developer, but it won’t hurt. 😉

New Opportunities for Connected Data (logic, contagion relationships and merging)

Filed under: Logic,Neo4j,NoSQL — Patrick Durusau @ 6:46 pm

New Opportunities for Connected Data by Ian Robinson, Neo Technologies, Inc.

An in depth discussion of relational, NoSQL and graph database views of the world.

I must admit to being surprised when James Frazer’s Golden Bough came up in the presentation. It was used quite effectively as an illustration but I have learned to not expect humanities references or examples in CS presentations. This was a happy exception.

I agree with Ian that the relational world view remains extremely useful but also that it limits the data that can be represented and queried.

Complex relationships between entities simply don’t come up with relational databases because they aren’t easy (if possible) to represent.

I would take Ian’s point a step further and point out that logic, as in RDF and the Semantic Web, is a similar constraint.

Logic can be very useful in any number of areas, just like relational databases, but it only represents a very small slice of the world. A slice of the world that can be represented quite artificially without contradictions, omissions, inconsistencies, or any of the other issues that make logic systems fall over clutching their livers.

BTW, topic mappers need to take a look at timemark 34.26. The representation of the companies who employ workers and the “contagion” relationships. (You will have to watch the video to find out why I say “contagion.” It is worth the time.) Does that suggest to you that I could point topics to a common node based on their possession of some property, say a subject identifier? Such that when I traverse any of those topics I can go to the common node and produce a “merged” result if desired?

I say that because any topic could point to more than one common node, depending upon the world view of an author. That could be very interesting in terms of comparing how authors would merge topics.
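A sketch of that idea with the embedded Neo4j Java API (my own modeling, not an existing topic map library): topics that share a subject identifier point at a common identity node, and traversal from any of them collects the merge candidates.

```java
import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.Transaction;
import org.neo4j.kernel.EmbeddedGraphDatabase;

public class TopicMergeSketch {
    private static final DynamicRelationshipType IDENTIFIED_BY =
            DynamicRelationshipType.withName("IDENTIFIED_BY");

    public static void main(String[] args) {
        GraphDatabaseService db = new EmbeddedGraphDatabase("target/topicmap-db");
        Transaction tx = db.beginTx();
        try {
            // Common node standing in for the subject identifier.
            Node identity = db.createNode();
            identity.setProperty("subjectIdentifier", "http://example.org/subject/handel");

            Node topicA = db.createNode();
            topicA.setProperty("name", "G. F. Handel");
            Node topicB = db.createNode();
            topicB.setProperty("name", "Georg Friedrich Haendel");

            // Both topics claim the same subject identifier.
            topicA.createRelationshipTo(identity, IDENTIFIED_BY);
            topicB.createRelationshipTo(identity, IDENTIFIED_BY);

            // "Merged" view: from the common node, collect every topic pointing at it.
            for (Relationship r : identity.getRelationships(IDENTIFIED_BY, Direction.INCOMING)) {
                System.out.println("merge candidate: " + r.getStartNode().getProperty("name"));
            }
            tx.success();
        } finally {
            tx.finish();
        }
        db.shutdown();
    }
}
```

A topic could of course point at several identity nodes, which is exactly the author-dependent merging the paragraph above has in mind.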

AWS HowTo: Using Amazon Elastic MapReduce with DynamoDB

Filed under: Amazon DynamoDB,Amazon Web Services AWS,MapReduce — Patrick Durusau @ 6:44 pm

AWS HowTo: Using Amazon Elastic MapReduce with DynamoDB by Adam Gray. Adam is a Product Manager on the Elastic MapReduce Team.

From the post:

Apache Hadoop and NoSQL databases are complementary technologies that together provide a powerful toolbox for managing, analyzing, and monetizing Big Data. That’s why we were so excited to provide out-of-the-box Amazon Elastic MapReduce (Amazon EMR) integration with Amazon DynamoDB, providing customers an integrated solution that eliminates the often prohibitive costs of administration, maintenance, and upfront hardware. Customers can now move vast amounts of data into and out of DynamoDB, as well as perform sophisticated analytics on that data, using EMR’s highly parallelized environment to distribute the work across the number of servers of their choice. Further, as EMR uses a SQL-based engine for Hadoop called Hive, you need only know basic SQL while we handle distributed application complexities such as estimating ideal data splits based on hash keys, pushing appropriate filters down to DynamoDB, and distributing tasks across all the instances in your EMR cluster.

In this article, I’ll demonstrate how EMR can be used to efficiently export DynamoDB tables to S3, import S3 data into DynamoDB, and perform sophisticated queries across tables stored in both DynamoDB and other storage services such as S3.

Time to get that AWS account!

Persisting relationship entities in Neo4j

Filed under: Neo4j,Spring Data — Patrick Durusau @ 6:43 pm

Persisting relationship entities in Neo4j by Sunil Prakash Inteti

From the post:

Neo4j is a high-performance, NOSQL graph database with all the features of a mature and robust database. In Neo4j data gets stored in nodes connected to each other by relationship entities that carry their own properties. These relationships are very important in graphs and help to traverse the graph and make decisions. This blog discusses the two ways to persist a relationship between nodes and also the scenarios which suit their respective usage. Spring-data-neo4j by SpringSource gives us the flexibility of using the Spring programming model when working with the Neo4j database. The code examples in this blog will be using spring-data-neo4j.

Excellent post that illustrates how the way a relationship is persisted can make a big difference in performance. Very much design guideline material.
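For a sense of what a property-carrying relationship looks like in Spring Data Neo4j, here is a sketch of the annotations involved. Treat it as an illustration of the shape of the mapping, not a copy of the post’s code:

```java
import org.springframework.data.neo4j.annotation.EndNode;
import org.springframework.data.neo4j.annotation.GraphId;
import org.springframework.data.neo4j.annotation.NodeEntity;
import org.springframework.data.neo4j.annotation.RelationshipEntity;
import org.springframework.data.neo4j.annotation.StartNode;

@NodeEntity
class Person {
    @GraphId Long id;
    String name;
}

@NodeEntity
class Movie {
    @GraphId Long id;
    String title;
}

// The relationship is a first-class entity: it has an id, a start node,
// an end node, and properties of its own (here, a rating).
@RelationshipEntity(type = "RATED")
class Rating {
    @GraphId Long id;
    @StartNode Person person;
    @EndNode Movie movie;
    int stars;
    String comment;
}
```

With a Neo4jTemplate wired up, populating a Rating and handing it to the template’s save method persists both the relationship and its properties.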

Tenzing: A SQL Implementation On The MapReduce Framework

Filed under: MapReduce,SQL,Tenzing — Patrick Durusau @ 6:42 pm

Tenzing: A SQL Implementation On The MapReduce Framework by Biswapesh Chattopadhyay, Liang Lin, Weiran Liu, Sagar Mittal, Prathyusha Aragonda, Vera Lychagina, Younghee Kwon and Michael Wong.

Abstract:

Tenzing is a query engine built on top of MapReduce for ad hoc analysis of Google data. Tenzing supports a mostly complete SQL implementation (with several extensions) combined with several key characteristics such as heterogeneity, high performance, scalability, reliability, metadata awareness, low latency, support for columnar storage and structured data, and easy extensibility. Tenzing is currently used internally at Google by 1000+ employees and serves 10000+ queries per day over 1.5 petabytes of compressed data. In this paper, we describe the architecture and implementation of Tenzing, and present benchmarks of typical analytical queries.

Of the conclusions of the authors:

  • It is possible to create a fully functional SQL engine on top of the MapReduce framework, with extensions that go beyond SQL into deep analytics.
  • With relatively minor enhancements to the MapReduce framework, it is possible to implement a large number of optimizations currently available in commercial database systems, and create a system which can compete with commercial MPP DBMS in terms of throughput and latency.
  • The MapReduce framework provides a combination of high performance, high reliability and high scalability on cheap unreliable hardware, which makes it an excellent platform to build distributed applications that involve doing simple to medium complexity operations on large data volumes.
  • By designing the engine and the optimizer to be aware of the characteristics of heterogeneous data sources, it is possible to create a smart system which can fully utilize the characteristics of the underlying data sources.
The last one is of the most interest to me. Which one interests you the most?

BTW, the authors mention:

We are working on various other enhancements and believe we can cut this time down to less than 5 seconds end-to-end, which is fairly acceptable to the analyst community.

I think the analyst community needs to use 2400 baud modems for a month or two. 😉

Sub-5 second performance is sometimes useful, even necessary. But as a general requirement?

Google: MoreSQL is Real

Filed under: NoSQL — Patrick Durusau @ 6:41 pm

Google: MoreSQL is Real by Williams Edwards.

One comment on the post summarized it:

Super rant that really crystallized my discomfort with the whole NoSQL business... At the end of the day, it’s a ‘war’ between various APIs to access B/B+ trees!

Well, but it is an enjoyable rant, so read it and see for yourself.

I do think one of the advantages of all the hype has been an increase in at least considering different options and data structures. Some of them will be less useful than the ones that are common now, but it only takes one substantial improvement to make it all worthwhile.

Introduction to Graph Databases

Filed under: Graphs,Neo4j — Patrick Durusau @ 7:37 am

Introduction to Graph Databases

Thursday, January 26 at 10:00 PST

From the registration page:

Join this webinar for a fast paced introduction to graph databases, taught by Emil Eifrem, CEO of Neo Technology.

This webinar is for Java developers, but no previous knowledge of graph databases is required.

Learn:

  • use cases for graph databases
  • specific coding techniques for working with a graph database

I hate to post anything early in the day and so break “form” as it were but thought you might need time to register, etc. 😉

January 25, 2012

Introduction data structure for GraphDB

Filed under: GraphDB,Neo4j — Patrick Durusau @ 3:32 pm

Introduction data structure for GraphDB by Shunya Kimura.

Detailed examination of the data structures that manage nodes and relationships between nodes. Highly recommended.
