Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

June 11, 2012

Announcing Revolution R Enterprise 6.0

Filed under: Data Mining,R — Patrick Durusau @ 4:22 pm

Announcing Revolution R Enterprise 6.0

Just in case you missed the announcement:

Revolution Analytics is proud to announce the latest update to our enhanced, production-grade distribution of R, Revolution R Enterprise. This update expands the range of supported computation platforms, adds new Big Data predictive models, and updates to the latest stable release of open source R (2.14.2), which improves performance of the R interpreter by about 30%.

This release expands the range of big-data statistical analysis with support for Generalized Linear Models (GLM). Logistic (Binomial), Poisson, Gamma and Tweedie models are all supported with a high-performance C++ implementation, and you can also model any distribution in the GLM family with a custom link function written in R. Big Data GLM has been a common request from many of our customers, and beta testers have been blown away by the speed of the implementation. For example, here's a Tweedie regression on 8.5 million insurance claims in less than two and a half minutes (skip ahead to 1:10 for the demo):

 

I included the video because it is about as impressive as demos get.

Details about Revolution R Enterprise 6.0 follow in the post.
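
Just to make the announcement concrete, here is a minimal sketch of a Tweedie GLM fit in open source Python (statsmodels) on synthetic insurance-style data. It is not Revolution R's rxGlm, and everything in it (column names, var_power, the data itself) is an illustrative assumption.

    # Sketch only: a Tweedie GLM on made-up claims data with statsmodels,
    # not Revolution R's high-performance implementation.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 10_000
    exposure = rng.uniform(0.1, 1.0, n)
    age = rng.integers(18, 80, n)

    # Mostly-zero claim amounts with an occasional positive payout,
    # the shape of data Tweedie models (1 < var_power < 2) are built for.
    claims = np.where(rng.random(n) < 0.9, 0.0, rng.gamma(2.0, 500.0, n)) * exposure

    X = sm.add_constant(np.column_stack([age, exposure]))
    model = sm.GLM(claims, X, family=sm.families.Tweedie(var_power=1.5))
    print(model.fit().summary())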

Real-time Analytics with HBase [Aggregation is a form of merging.]

Filed under: Aggregation,Analytics,HBase — Patrick Durusau @ 4:21 pm

Real-time Analytics with HBase

From the post:

Here are slides from another talk we gave at both Berlin Buzzwords and at HBaseCon in San Francisco last month. In this presentation Alex describes one approach to real-time analytics with HBase, which we use at Sematext via HBaseHUT. If you like these slides you will also like HBase Real-time Analytics Rollbacks via Append-based Updates.

The slides come in a long and short version. Both are very good but I suggest the long version.

I particularly liked the “Background: pre-aggregation” slide (8 in the short version, 9 in the long version).

Aggregation as a form of merging.

What information is lost as part of aggregation? (That assumes we know the aggregation process. Without that, we can’t say what is lost.)

What information (read subjects/relationships) do we want to preserve through an aggregation process?

What properties should those subjects/relationships have?

(Those are topic map design/modeling questions.)
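
To make the information-loss question concrete, here is a toy Python sketch of pre-aggregation (the event fields and keys are hypothetical): once events are merged into an aggregate keyed only on the hour, the user and page subjects are gone; keying on (hour, page) preserves one of them through the merge. Deciding which keys to keep is exactly the modeling question above.

    # Toy pre-aggregation sketch: what a rollup preserves depends on its keys.
    from collections import Counter

    raw_events = [
        {"hour": "2012-06-11T14", "user": "alice", "page": "/reports"},
        {"hour": "2012-06-11T14", "user": "bob",   "page": "/reports"},
        {"hour": "2012-06-11T14", "user": "alice", "page": "/home"},
        {"hour": "2012-06-11T15", "user": "carol", "page": "/reports"},
    ]

    # Merge on hour alone: fast to query, but "which user" and "which page"
    # are no longer recoverable from the aggregate.
    hourly_hits = Counter(e["hour"] for e in raw_events)

    # Merge on (hour, page): the page subject survives the aggregation.
    hourly_page_hits = Counter((e["hour"], e["page"]) for e in raw_events)

    print(hourly_hits)
    print(hourly_page_hits)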

Scale, Structure, and Semantics

Filed under: Communication,Semantic Web,Semantics — Patrick Durusau @ 4:20 pm

Scale, Structure, and Semantics by Daniel Tunkelang.

From the post:

This morning I had the pleasure to present a keynote address at the Semantic Technology & Business Conference (SemTechBiz). I’ve had a long and warm relationship with the semantic technology community — especially with Marco Neumann and the New York Semantic Web Meetup.

To give you a taste of the slides:

1. Knowledge representation is overrated.

2. Computation is underrated.

3. We have a communication problem.

I find it helpful to think of search/retrieval as asynchronous conversation.

If I can’t continue a conversation, find my place in it, or know what it is about, there is a communication problem.

June 10, 2012

Deconstructing the Google Knowledge Graph

Filed under: Google Knowledge Graph,Identifiers — Patrick Durusau @ 8:17 pm

Deconstructing the Google Knowledge Graph

Mike Bergman has some interesting observations on the Google Knowledge Graph, first on its coverage and then on how it is constructing URLs for nodes in its graph.

I have to second his call for Google to release its identifiers via an API. That would be a real boon for common entities.

I say common entities because having “millions” of identifiers is fairly trivial when you consider the number of objects captured every night by optical astronomers alone. Or sequencing genomes.

Not to discount the value of a common identifier for Lady Gaga but uncommon entities need identifiers too.

Gabriel Hopmans pointed me to this post. (Morpheus)

Inferring General Relations between Network Characteristics from Specific Network Ensembles

Filed under: Networks,Sampling,Topology — Patrick Durusau @ 8:17 pm

Inferring General Relations between Network Characteristics from Specific Network Ensembles by Stefano Cardanobile, Volker Pernice, Moritz Deger, and Stefan Rotter.

Abstract:

Different network models have been suggested for the topology underlying complex interactions in natural systems. These models are aimed at replicating specific statistical features encountered in real-world networks. However, it is rarely considered to which degree the results obtained for one particular network class can be extrapolated to real-world networks. We address this issue by comparing different classical and more recently developed network models with respect to their ability to generate networks with large structural variability. In particular, we consider the statistical constraints which the respective construction scheme imposes on the generated networks. After having identified the most variable networks, we address the issue of which constraints are common to all network classes and are thus suitable candidates for being generic statistical laws of complex networks. In fact, we find that generic, not model-related dependencies between different network characteristics do exist. This makes it possible to infer global features from local ones using regression models trained on networks with high generalization power. Our results confirm and extend previous findings regarding the synchronization properties of neural networks. Our method seems especially relevant for large networks, which are difficult to map completely, like the neural networks in the brain. The structure of such large networks cannot be fully sampled with the present technology. Our approach provides a method to estimate global properties of under-sampled networks in good approximation. Finally, we demonstrate on three different data sets (C. elegans neuronal network, R. prowazekii metabolic network, and a network of synonyms extracted from Roget’s Thesaurus) that real-world networks have statistical relations compatible with those obtained using regression models.

The key insight is that sampling can provide the basis for reliable estimates of global properties of networks too large to fully model.
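
As a crude illustration of that insight (and nothing more; this is not the regression method of the paper), here is a Python sketch that estimates a global property of a graph, its mean degree, from a one-percent node sample:

    # Crude illustration: estimate a global property (mean degree) from a
    # small node sample instead of touching the whole graph.
    import random

    random.seed(1)
    n = 100_000
    adj = {i: set() for i in range(n)}
    for _ in range(300_000):                      # sparse random graph
        a, b = random.randrange(n), random.randrange(n)
        if a != b:
            adj[a].add(b)
            adj[b].add(a)

    true_mean = sum(len(v) for v in adj.values()) / n
    sample = random.sample(range(n), 1_000)       # 1% of the nodes
    est_mean = sum(len(adj[i]) for i in sample) / len(sample)

    print(f"true mean degree {true_mean:.3f}, estimate from sample {est_mean:.3f}")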

I first saw this at: Understanding Complex Relationships: How Global Properties of Networks Become Apparent Locally.

If you don’t already get one of the ScienceDaily newsletters, you should consider it.

NoSQL Standards [query languages – tuples anyone?]

Filed under: NoSQL,Standards — Patrick Durusau @ 8:17 pm

Andrew Oliver writes at InfoWorld: The time for NoSQL standards is now – Like Larry Ellison’s yacht, the RDBMS is sailing into the sunset. But if NoSQL is to take its place, a standard query language and APIs must emerge soon.

A bit dramatic for my taste but a good overview of possible areas for standardization for NoSQL.

Problem: NoSQL query languages are tied to the base format/data structure of their implementation.

For that matter, you could say the same thing about SQL. The query language is tied to the data structure.

I am not sure how you can have a query language that isn’t tied to some notion of structure, even a very abstract one, that a NoSQL implementation could map against its own data structure.

Tuples anyone?

Pointers and resources welcome!
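
One way to read “tuples anyone?” is a query layer that assumes nothing beyond “the data can be exposed as tuples,” with each store mapping its native structure onto that view. A toy Python sketch of the idea; every name in it is hypothetical:

    # Toy sketch: per-store adapters flatten native structures into tuples,
    # and the "query" only ever sees the tuples.
    def doc_store_as_tuples(docs):
        # document store: one dict per record
        for d in docs:
            yield (d["id"], d.get("city"), d.get("age"))

    def wide_row_as_tuples(rows):
        # column-family style: (row key, {column: value})
        for key, cols in rows:
            yield (key, cols.get("city"), cols.get("age"))

    def query(tuples, predicate):
        return [t for t in tuples if predicate(t)]

    docs = [{"id": 1, "city": "Berlin", "age": 34}, {"id": 2, "city": "Austin", "age": 51}]
    rows = [("3", {"city": "Berlin", "age": "29"})]

    in_berlin = lambda t: t[1] == "Berlin"
    print(query(doc_store_as_tuples(docs), in_berlin))
    print(query(wide_row_as_tuples(rows), in_berlin))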

XML to Graph Converter

Filed under: Geoff,Graphs,Neo4j,XML — Patrick Durusau @ 8:17 pm

XML to Graph Converter

From the webpage:

XML data can easily be converted into a graph. Simply paste the XML data into the left-hand side, convert it into Geoff, then view the results in the Neo4j console.

I would have modeled the XML differently, but that is probably a markup prejudice.

Still, an impressive demonstration and worth your time to review.
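
If you want to experiment with the modeling choices yourself, here is a rough Python sketch that walks an XML tree and emits node and edge tuples (it mirrors the converter’s idea, not its actual Geoff output). How you map elements, attributes, and text to nodes is where the modeling decisions, and my markup prejudices, live.

    # Rough sketch: XML elements become nodes, containment becomes edges.
    import xml.etree.ElementTree as ET
    from itertools import count

    xml = """<library>
      <book id="b1"><title>Topic Maps</title></book>
      <book id="b2"><title>Graph Databases</title></book>
    </library>"""

    ids = count()
    nodes, edges = [], []

    def walk(elem, parent_id=None):
        node_id = next(ids)
        props = dict(elem.attrib)
        if elem.text and elem.text.strip():
            props["text"] = elem.text.strip()
        nodes.append((node_id, elem.tag, props))
        if parent_id is not None:
            edges.append((parent_id, "HAS_CHILD", node_id))
        for child in elem:
            walk(child, node_id)

    walk(ET.fromstring(xml))
    print(nodes)
    print(edges)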

Citizen Archivist Dashboard [“…help the next person discover that record”]

Filed under: Archives,Crowd Sourcing,Indexing,Tagging — Patrick Durusau @ 8:15 pm

Citizen Archivist Dashboard

What’s the common theme of these interfaces from the National Archives (United States)?

  • Tag – Tagging is a fun and easy way for you to help make National Archives records found more easily online. By adding keywords, terms, and labels to a record, you can do your part to help the next person discover that record. For more information about tagging National Archives records, follow “Tag It Tuesdays,” a weekly feature on the NARAtions Blog. [includes “missions” (sets of materials for tagging), rated as “beginner,” “intermediate,” and “advanced.” Or you can create your own mission.]
  • Transcribe – By contributing to transcriptions, you can help the National Archives make historical documents more accessible. Transcriptions help in searching for the document as well as in reading and understanding the document. The work you do transcribing a handwritten or typed document will help the next person discover and use that record.

    The transcription tool features over 300 documents ranging from the late 18th century through the 20th century for citizen archivists to transcribe. Documents include letters to a civil war spy, presidential records, suffrage petitions, and fugitive slave case files.

    [A pilot project with 300 documents but one you should follow. Public transcription (crowd-sourced if you want the popular term) of documents has the potential to open up vast archives of materials.]

  • Edit Articles – Our Archives Wiki is an online space for researchers, educators, genealogists, and Archives staff to share information and knowledge about the records of the National Archives and about their research.

    Here are just a few of the ways you may want to participate:

    • Create new pages and edit pre-existing pages
    • Share your research tips
    • Store useful information discovered during research
    • Expand upon a description in our online catalog

    Check out the “Getting Started” page. When you’re ready to edit, you’ll need to log in by creating a username and password.

  • Upload & Share – Calling all researchers! Start sharing your digital copies of National Archives records on the Citizen Archivist Research group on Flickr today.

    Researchers scan and photograph National Archives records every day in our research rooms across the country — that’s a lot of digital images for records that are not yet available online. If you have taken scans or photographs of records you can help make them accessible to the public and other researchers by sharing your images with the National Archives Citizen Archivist Research Group on Flickr.

  • Index the Census – Citizen Archivists, you can help index the 1940 census!

    The National Archives is supporting the 1940 census community indexing project along with other archives, societies, and genealogical organizations. The release of the decennial census is one of the most eagerly awaited record openings. The 1940 census is available to search and browse, free of charge, on the National Archives 1940 Census web site. But, the 1940 census is not yet indexed by name.

    You can help index the 1940 census by joining the 1940 census community indexing project. To get started you will need to download and install the indexing software, register as an indexing volunteer, and download a batch of images to transcribe. When the index is completed, the National Archives will make the named index available for free.

The common theme?

The tagging entry sums it up with: “…you can do your part to help the next person discover that record.”

That’s the “trick” of topic maps. Once a fact about a subject is found, you can preserve your “finding” for the next person.

June 9, 2012

The Power of Open Education Data [Semantic Content ~ 0]

Filed under: Education,Open Data — Patrick Durusau @ 7:19 pm

The Power of Open Education Data by Todd Park and Jim Shelton.

The title implies a description or example of the “power” of Open Education Data.

Here are ten examples of how this post disappoints:

  • …who pledged to provide…
  • …voting with their feet…
  • …can help with…
  • …as fuel to spur…
  • …seeks to (1) work with…
  • …and (2) collaborate with…
  • …will also include efforts…
  • …will enable them to create…
  • …will include work to develop…
  • …which can help fuel…

None of these have happened, just speculation on what might happen, maybe.

Let me call your attention to Consumers and Credit Disclosures: Credit Cards and Credit Insurance (2002) by Thomas A. Durkin, a Federal Reserve study of the impact of the Truth in Lending Act, one of the “major” consumer victories of its day (1968).

From the conclusion:

Conclusively evaluating the direct effects of disclosure legislation like Truth in Lending on either consumer behavior or the functioning of the credit marketplace is never a simple matter because there are always competing explanations for observed phenomena. From consumer surveys over time, however, it seems likely that disclosures required by Truth in Lending have had a favorable effect on the ready availability of information on credit transactions.

Let me save some future Federal Reserve researcher time and effort and observe that with Open Education Data, there will be more information about the cost of higher education available.

What impact that will have on behavior is unknown.

The Power of Open Education Data is a disservice to the data mining, open data, education and other communities. It is specious speculation, beneficial only to those seeking public office and the cronies they appoint.

Qi4j™

Filed under: Domain Driven Design,Qi4j — Patrick Durusau @ 7:16 pm

Qi4j™

From the webpage:

What is Qi4j™?

The short answer is that Qi4j™ is a framework for domain centric application development, including evolved concepts from AOP, DI and DDD.

Qi4j™ is an implementation of Composite Oriented Programming, using the standard Java 5 platform, without the use of any pre-processors or new language elements. Everything you know from Java 5 still applies and you can leverage both your experience and toolkits to become more productive with Composite Oriented Programming today.

Moreover, Qi4j™ enables Composite Oriented Programming on the Java platform, including both Java and Scala as primary languages as well as many of the plethora of languages running on the JVM as bridged languages.

Introducing Qi4j™

Qi4j™ is pronounced “chee for jay”. It is out of scope for this website to explain the many facets and history of Qi, so we refer the interested reader to the lengthy article at Wikipedia. For us, Qi is the force/energy within the body, in this case the Java platform. Something that makes Java so much better, if it is found and channeled into a greater good.

We strongly recommend the background article found in the introduction.

Covering Qi4j in part because Emil Eifrem covers it in his “kicking ass” slides on Neo4j.

But also because I like domain oriented design.

Software should fit a domain and not have domains tormented to fit software.

That does create problems in deciding what to standardize and what to leave unique. Not everyone can afford one-off software.

My suggestion would be to standardize the interchange of data to enable competition on the unique capabilities of software, so long as each application writes back to a standard format.

From entries at the website it appears that Qi4j is emerging from a period of dormancy. Now would be a good time to contribute to the project.

Working with NoSQL Databases [MS TechNet]

Filed under: Microsoft,NoSQL — Patrick Durusau @ 7:16 pm

Working with NoSQL Databases

From Microsoft’s TechNet, an outline listing of NoSQL links and resources.

Has the advantage (over similar resources) of being in English, Deutsch, Italian and Português.

Puppet

Filed under: Marketing,Systems Administration,Systems Research — Patrick Durusau @ 7:15 pm

Puppet

From “What is Puppet?”:

Puppet is IT automation software that helps system administrators manage infrastructure throughout its lifecycle, from provisioning and configuration to patch management and compliance. Using Puppet, you can easily automate repetitive tasks, quickly deploy critical applications, and proactively manage change, scaling from 10s of servers to 1000s, on-premise or in the cloud.

Puppet is available as both open source and commercial software. You can see the differences here and decide which is right for your organization.

How Puppet Works

Puppet uses a declarative, model-based approach to IT automation.

  1. Define the desired state of the infrastructure’s configuration using Puppet’s declarative configuration language.
  2. Simulate configuration changes before enforcing them.
  3. Enforce the deployed desired state automatically, correcting any configuration drift.
  4. Report on the differences between actual and desired states and any changes made enforcing the desired state.

Topic maps seem like a natural for systems administration.

They can capture the experience and judgement of sysadmins that aren’t ever part of printed documentation.

Make sysadmins your allies when introducing topic maps. Part of that will be understanding their problems and concerns.

Being able to intelligently discuss software like Puppet will be a step in the right direction. (Not to mention giving you ideas about topic map applications for systems administration.)

Distributed Systems Tracing with Zipkin [Sampling @ Twitter w/ UI]

Filed under: BigData,Distributed Systems,Sampling,Systems Research,Tracing — Patrick Durusau @ 7:15 pm

Distributed Systems Tracing with Zipkin

From the post:

Zipkin is a distributed tracing system that we created to help us gather timing data for all the disparate services involved in managing a request to the Twitter API. As an analogy, think of it as a performance profiler, like Firebug, but tailored for a website backend instead of a browser. In short, it makes Twitter faster. Today we’re open sourcing Zipkin under the APLv2 license to share a useful piece of our infrastructure with the open source community and gather feedback.

Hmmm, tracing based on the Dapper paper that comes with a web-based UI for a number of requests. Hard to beat that!

Thinking more about the sampling issue, what if I were to sample a very large stream of proxies and decide to only merge a certain percentage and pipe the rest to /dev/null?

For example, I have a UPI feed and that is my base set of “news” proxies. I have feeds from the various newspaper, radio and TV outlets around the United States. If the proxies from the non-UPI feeds are within some distance of the UPI feed proxies, they are simply discarded.

True, I am losing the information about which newspapers carried the stories, or whose bylines consisted of changing the order of the words or dumbing them down, but those may not fall under my requirements.

I would rather have a few dozen very good sources than, say, 70,000 sources that say the same thing.

If you were testing for news coverage or the spread of news stories, your requirements might be different.
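
A minimal sketch of that filtering idea in Python, using word overlap as a stand-in for whatever proxy distance you actually compute; the feeds, the threshold, and the distance function are all assumptions:

    # Keep the UPI proxies as the base set; discard non-UPI items that fall
    # within the distance threshold of something already kept.
    def distance(a, b):
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return 1.0 - len(wa & wb) / len(wa | wb)

    upi = ["senate passes farm bill after long debate"]
    other_feeds = [
        "senate passes farm bill after long debate tuesday",  # near-duplicate
        "local refinery fire forces evacuations",              # genuinely new
    ]

    kept = list(upi)
    for story in other_feeds:
        if all(distance(story, base) > 0.5 for base in kept):
            kept.append(story)   # far enough from everything kept: merge it
        # else: within threshold of an existing proxy, pipe it to /dev/null

    print(kept)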

I first saw this at Alex Popescu’s myNoSQL.

Dapper, a Large-Scale Distributed Systems Tracing Infrastructure [Data Sampling Lessons For “Big Data”]

Filed under: BigData,Distributed Systems,Sampling,Systems Research,Tracing — Patrick Durusau @ 7:14 pm

Dapper, a Large-Scale Distributed Systems Tracing Infrastructure by Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag.

Abstract:

Modern Internet services are often implemented as complex, large-scale distributed systems. These applications are constructed from collections of software modules that may be developed by different teams, perhaps in different programming languages, and could span many thousands of machines across multiple physical facilities. Tools that aid in understanding system behavior and reasoning about performance issues are invaluable in such an environment.

Here we introduce the design of Dapper, Google’s production distributed systems tracing infrastructure, and describe how our design goals of low overhead, application-level transparency, and ubiquitous deployment on a very large scale system were met. Dapper shares conceptual similarities with other tracing systems, particularly Magpie [3] and X-Trace [12], but certain design choices were made that have been key to its success in our environment, such as the use of sampling and restricting the instrumentation to a rather small number of common libraries.

The main goal of this paper is to report on our experience building, deploying and using the system for over two years, since Dapper’s foremost measure of success has been its usefulness to developer and operations teams. Dapper began as a self-contained tracing tool but evolved into a monitoring platform which has enabled the creation of many different tools, some of which were not anticipated by its designers. We describe a few of the analysis tools that have been built using Dapper, share statistics about its usage within Google, present some example use cases, and discuss lessons learned so far.

A very important paper for anyone working with large and complex systems.

With lessons on data sampling as well:

we have found that a sample of just one out of thousands of requests provides sufficient information for many common uses of the tracing data.

You have to wonder, in “data in the petabyte range” cases, how many of them could be reduced to gigabyte (or smaller) size with no loss in accuracy.

Which would reduce storage requirements, increase analysis speed, increase the complexity of analysis, etc.

Have you sampled your “big data” recently?
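
A back-of-the-envelope Python sketch of what Dapper-style 1-in-1,000 sampling does to a simple aggregate (the latency distribution is made up; the point is how close the sampled estimate lands):

    # Estimate mean request latency from a 1-in-1000 sample.
    import random

    random.seed(42)
    latencies_ms = [random.lognormvariate(3.0, 0.8) for _ in range(1_000_000)]

    sample = latencies_ms[::1000]

    full_mean = sum(latencies_ms) / len(latencies_ms)
    sample_mean = sum(sample) / len(sample)
    print(f"full: {len(latencies_ms):,} points, mean {full_mean:.2f} ms")
    print(f"sample: {len(sample):,} points, mean {sample_mean:.2f} ms")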

I first saw this at Alex Popescu’s myNoSQL.

NoSQL Databases

Filed under: NoSQL — Patrick Durusau @ 7:14 pm

NoSQL Databases by Christof Strauch, Stuttgart Media University. (PDF, 149 pages)

An overview and introduction to NoSQL databases. According to a post on High Scalability, Paper: NoSQL Databases – NoSQL Introduction and Overview, the paper was written between June 2010 and February 2011.

As High Scalability notes, the paper is a bit dated but it remains a good general overview of the area.

It does omit graph databases entirely (except for some further reading in the bibliography). To be fair, even a summary of the work on graph databases would be at least as long as this paper, if not longer.

Hadoop Streaming Support for MongoDB

Filed under: Hadoop,Javascript,MapReduce,MongoDB,Python,Ruby — Patrick Durusau @ 7:13 pm

Hadoop Streaming Support for MongoDB

From the post:

MongoDB has some native data processing tools, such as the built-in Javascript-oriented MapReduce framework, and a new Aggregation Framework in MongoDB v2.2. That said, there will always be a need to decouple persistence and computational layers when working with Big Data.

Enter MongoDB+Hadoop: an adapter that allows Apache’s Hadoop platform to integrate with MongoDB.

[graphic omitted]

Using this adapter, it is possible to use MongoDB as a real-time datastore for your application while shifting large aggregation, batch processing, and ETL workloads to a platform better suited for the task.

[graphic omitted]

Well, the engineers at 10gen have taken it one step further with the introduction of the streaming assembly for Mongo-Hadoop.

What does all that mean?

The streaming assembly lets you write MapReduce jobs in languages like Python, Ruby, and JavaScript instead of Java, making it easy for developers that are familiar with MongoDB and popular dynamic programing languages to leverage the power of Hadoop.

I like that, “…popular dynamic programming languages…” 😉

Any improvement that increases usability without requiring religious conversion (using a programming language not your favorite) is a good move.
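
For anyone who hasn’t written one, a Hadoop Streaming job is just a pair of programs that read lines on stdin and write tab-separated key/value pairs on stdout. Here is a minimal word-count pair in Python as a sketch of the convention; the Mongo-Hadoop streaming assembly is what wires stdin/stdout to MongoDB collections, and its exact hooks are in its documentation, not reproduced here.

    # mapper.py -- emit "word<TAB>1" for every word on stdin
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

And the matching reducer, which relies on Hadoop delivering the mapper output sorted by key:

    # reducer.py -- sum counts for consecutive identical keys
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)

    if current_word is not None:
        print(f"{current_word}\t{current_count}")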

Graph DBs Kicking Ass: 16 Internet Years Later

Filed under: Graphs,Neo4j — Patrick Durusau @ 7:12 pm

Yesterday I saw a tweet pointing to Emil Eifrem’s Neo4j or: why graph dbs kick ass slides from 2008.

That’s roughly sixteen (16) years ago in Internet time.

What has changed/stayed the same?

To get you started:

  • Slide 2: “Community Experimentation”

    Has only gotten richer. Approximately 20 graph DB projects, not to mention more specialized graph software.

  • Slide 4: “Trend 1: Data is getting more connected”

    This is a current “big lie” of IT. Users have always seen data as richly connected. Computer representation of connectedness has been and is impoverished. Let’s put the blame where it belongs.

  • Slide 5: “Trend 2: …and more semi-structured”

    Good thing we didn’t have journals, technical papers, newspapers, books, speeches, radio/TV back in the old days. Would have had “semi-structured” data.

  • Slide 6: Performance vs. Information Complexity

    The opposite of “Smiling Bob’s” chart in the Enzyte commercial. It’s a funny chart (in both cases) but more for amusement than serious discussion. (Unless you are an Enzyte customer of course.)

  • Slide 8: Whiteboard friendly?

    I am not sure what Emil has against blackboards. Or non-pictograph writing systems. 😉

  • Slide 11: Food Web of North Atlantic

    If you like food webs, see: Marine Fisheries Food Webs, which is part of the online book: Our Ocean Planet: Oceanography in the 21st Century by Robert Stewart. (Warning: Oceanography isn’t my field. You need to ask a librarian about using this as a source.)

  • Slide 12: A Social Graph

    I would add an STD graph.

  • Slide 43 Neo4j architecture gotchas
    • Focus on the domain (whiteboard friendly – domain first development)
    • Purpose of the domain layer:
      • “an adaptation of the generic node space to a type-safe, object-oriented abstraction expressed in the vocabulary of our domain.” (!)

    I appreciate the candor but don’t “get” the gotcha?

    How is the domain layer binding/limiting my use of the graph that is created?

    A little larger helping of explanation please.

    (Ah! From slide 14, Shut up and show us the code!, to slide 28, Bonus Code: domain model.

    Domain model means nodes, properties, and traversals are all hard coded. Yikes! Fortunately, at least for nodes/properties, you can update with Cypher.)

What else would you update?

Perhaps we need a date property on nodes that represent resources on the Internet?

Any search brings up something old, something new, something borrowed and something blue.

Some offer sort by date/relevance and other “facets” but that’s not really the point is it?

I don’t want to do data mining, I want to do data finding. Not the same thing.

Perhaps graphs can capture data already found and preserve it for the next searcher?
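
As a toy illustration of the date-property idea (plain Python dictionaries standing in for a graph database, nothing Neo4j-specific), recording when a resource was found lets the next searcher separate fresh findings from stale ones:

    # Resource nodes carry a "found" date so later searchers can filter.
    from datetime import date

    nodes = {
        "n1": {"url": "http://example.org/spec",    "found": date(2008, 5, 2)},
        "n2": {"url": "http://example.org/spec-v2", "found": date(2012, 6, 9)},
    }
    edges = [("n1", "SUPERSEDED_BY", "n2")]

    cutoff = date(2011, 1, 1)
    current = [n for n, props in nodes.items() if props["found"] >= cutoff]
    print(current)   # only the recently found resource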

Comments/suggestions?

GraphConnect San Francisco

Filed under: Conferences,Graphs — Patrick Durusau @ 4:51 am

GraphConnect San Francisco


Submissions due: August 1, 2012

Author Notifications: September 1, 2012

Conference: November 5-6, 2012

From the webpage:

Graphs are Everywhere

The NOSQL movement has taken the world by storm, bringing a new coherency and meaning to connected data, and giving developers and technical leads the power to manage modern data at a new speed and size.

Now with the Google Knowledge Graph and Facebook’s Open Graph, graphs have reached a new level of relevance in today’s connected world.

GraphConnect San Francisco is a place where developers, technical decision makers, and thought leaders alike will convene to demonstrate and discuss the power of the graph ecosystem through graph databases, network analysis and social applications.

A young conference but one that has a lot of promise!

Since it hasn’t (yet) developed any conference habits (good or bad), it will be interesting to see if remote attendance/participation is possible. Something very “lite”-weight: streaming audio/video, posted slides, and a Twitter handle announced at the start of each presentation. With a local moderator to read/capture tweets or longer questions, that should work well enough.

Remote attendees won’t have the social advantages of being in San Francisco in early November (which are considerable) but that’s a cost of not attending.

June 8, 2012

Jetpants: a toolkit for huge MySQL topologies [“action” on MySQL]

Filed under: MySQL,SQL — Patrick Durusau @ 8:59 pm

Jetpants: a toolkit for huge MySQL topologies

From the webpage:

Tumblr is one of the largest users of MySQL on the web. At present, our data set consists of over 60 billion relational rows, adding up to 21 terabytes of unique relational data. Managing over 200 dedicated database servers can be a bit of a handful, so naturally we engineered some creative solutions to help automate our common processes.

Today, we’re happy to announce the open source release of Jetpants, Tumblr’s in-house toolchain for managing huge MySQL database topologies. Jetpants offers a command suite for easily cloning replicas, rebalancing shards, and performing master promotions. It’s also a full Ruby library for use in developing custom billion-row migration scripts, automating database manipulations, and copying huge files quickly to multiple remote destinations.

Dynamically resizable range-based sharding allows you to scale MySQL horizontally in a robust manner, without any need for a central lookup service or massive pre-allocation of tiny shards. Jetpants supports this range-based model by providing a fast way to split shards that are approaching capacity or I/O limitations. On our hardware, we can split a 750GB, billion-row pool in half in under six hours.

Jetpants can be obtained via GitHub or RubyGems.

Interested in this type of work? We’re hiring!

I am reminded of the line from The Blues Brothers film when Ray of Ray’s Music Exchange (played by Ray Charles) tells Jake and Elwood, “E-excuse me, uh, I don’t think there’s anything wrong with the action on this piano.”

Doesn’t look like there is anything wrong with the “action” on MySQL. 😉

Certainly worth taking a look.

I first saw this at Alex Popescu’s myNoSQL.

Data Science Summit 2012

Filed under: BigData,Data,Data Science — Patrick Durusau @ 8:58 pm

Data Science Summit 2012

From Greenplum, videos from the most recent data summit are available at the link above.

Riak Handbook, Second Edition [$29 for 154 pages of content]

Filed under: NoSQL,Riak — Patrick Durusau @ 8:57 pm

Riak Handbook, Second Edition, by Mathias Meyer.

From the post:

Basho Technologies today announced the immediate availability of the second edition of Riak Handbook. The significantly updated Riak Handbook includes more than 43 pages of new content covering many of the latest feature enhancements to Riak, Basho’s industry-leading, open-source, distributed database. Riak Handbook is authored by former Basho developer and advocate, Mathias Meyer.

Riak Handbook is a comprehensive, hands-on guide to Riak. The initial release of Riak Handbook focused on the driving forces behind Riak, including Amazon Dynamo, eventual consistency and CAP Theorem. Through a collection of examples and code, Mathias’ Riak Handbook explores the mechanics of Riak, such as storing and retrieving data, indexing, searching and querying data, and sheds a light on Riak in production. The updated handbook expands on previously covered key concepts and introduces new capabilities, including the following:

  • An overview of Riak Control, a new Web-based operations management tool
  • Full coverage on pre- and post-commit hooks, including JavaScript and Erlang examples
  • An entirely new section on deploying Erlang code in a Riak cluster
  • Additional details on secondary indexes
  • Insight into load balancing Riak nodes
  • An introduction to network node planning
  • An introduction to Riak CS, includes Amazon S3 API compatibility

The updated Riak Handbook includes an entirely new section dedicated to popular use cases and is full of examples and code from real-time usage scenarios.

Mathias Meyer is an experienced software developer, consultant and coach from Berlin, Germany. He has worked with database technology leaders such as Sybase and Oracle. He entered into the world of NoSQL in 2008 and joined Basho Technologies in 2010.

I haven’t ordered a copy. The $29.00 for 154-odd pages of content seems a bit steep to me.

Apache Nutch 1.5 Released!

Filed under: Nutch,Search Engines,Searching — Patrick Durusau @ 8:55 pm

Apache Nutch 1.5 Released!

From the homepage:

The 1.5 release of Nutch is now available. This release includes several improvements including upgrades of several major components including Tika 1.1 and Hadoop 1.0.0, improvements to LinkRank and WebGraph elements as well as a number of new plugins covering blacklisting, filtering and parsing to name a few. Please see the list of changes made in this version for a full breakdown of the 50 odd improvements the release boasts. The release is available here.

If you are looking for documentation, may I suggest the Nutch wiki?

June 7, 2012

Reducing Software Highway Friction

Filed under: Hadoop,Lucene,LucidWorks,Solr — Patrick Durusau @ 2:20 pm

Lucid Imagination Search Product Offered in Windows Azure Marketplace

From the post:

Ease of use and flexibility are two key business drivers that are fueling the rapid adoption of cloud computing. The ability to disconnect an application from its supporting architecture provides a new level of business agility that has never before been possible. To ease the move towards this new realm of computing, integrated platforms have begun to emerge that make cloud computing easier to adopt and leverage.

Lucid Imagination, a trusted name in Search, Discovery and Analytics, today announced that its LucidWorks Cloud product has been selected by Microsoft Corp. to be offered as a Search-as-a-Service product in Microsoft’s Windows Azure Marketplace. LucidWorks Cloud is a full cloud service version of its LucidWorks Enterprise platform. LucidWorks Cloud delivers full open source Apache Lucene/Solr community innovation with support and maintenance from the world’s leading experts in open source search. An extensible platform architected for developers, LucidWorks Cloud is the only Solr distribution that provides security, abstraction and pre-built connectors for essential enterprise data sources – along with dramatic ease of use advantages in a well-tested, integrated and documented package.

Example use cases for LucidWorks Cloud include Search-as-a-Service for websites, embedding search into SaaS product offerings, and Prototyping and developing cloud-based search-enabled applications in general.

…..

Highlights of LucidWorks Cloud Search-as-a-Service

  • Sign up for a plan and start building your search application in minutes
  • Well-organized UI makes Apache Lucene/Solr innovation easier to consume and more adaptable to constant change
  • Create multiple search collections and manage them independently
  • Configure index and query settings, fields, stop words, synonyms for each collection
  • Built-in support for Hadoop, Microsoft SharePoint and traditional online content types
  • An open connector framework is available to customize access to other data sources
  • REST API automates and integrates search as a service with an application
  • Well-instrumented dashboard for infrastructure administration, monitoring and reporting
  • Monitored 24×7 by Lucid Development Operations insuring minimum downtime

Source: PR Newswire (http://s.tt/1dzre)

I find this deeply encouraging.

It is a step towards a diverse but reduced friction software highway.

The user community is not well served by uniform models for data, software or UIs.

The user community can be well served by a reduced friction software highway as they move data from application to application.

Microsoft has taken a large step towards a reduced friction software highway today. And it is appreciated!

I Dream of “Jini”

Filed under: Environment,Machine Learning,Smart-Phones — Patrick Durusau @ 2:20 pm

The original title reads: Argus Labs Celebrates The Launch Of The Beta Version Of Jini, The App That Goes Beyond The Check-In, And Unveils 2012 Roadmap For The First Time. See what you think:

Argus Labs, a deep data, machine learning and mobile start-up operating out of Antwerp (Belgium), will celebrate the closed beta of the mobile application the night before LeWeb 2012 at Tiger-Tiger, Haymarket in London’s West-End. From 18th June, registered users will be able to download and start evaluating the first version of the intelligent application, called Jini.

Jini is a personal advisor that helps discover unknown relations and hyper-personalised opportunities. Jini feels best when helping the user out in serendipitous moments, or propose things that respond to the affinity its user has with its environment. Having access to hot opportunities and continuously being ‘in the know’ means a user can boost the quality of offline life.

Jini aims to raise the bar for private social networks by going beyond the check-in, saving the user the effort of doing too many manual actions. Jini applies machine learning with ambient sensing technology, so that the user can focus exclusively on having an awesome social sharing and discovery experience on smart-phones.

During the London launch event users will be able to sign up and exclusively download the first beta release of the app. The number of beta users is limited, so be fast. Argus Labs love to pioneer and will also have some goodies in store for the first 250 beta-users of the app.

See the post for registration information.

I sense a contradiction in “…continuously being ‘in the know’ means a user can boost the quality of offline life.” How am I going to be ‘in the know’ if I am offline?

Still, I suspect there are opportunities here to merge diverse data sets to provide users with “hyper-personalized opportunities,” so long as it doesn’t interrupt one “hyper-personalized” situation to advise of another, potential “hyper-personalized” opportunity.

That would be like a phone call from an ex-girlfriend at an inopportune time. Bad joss.

Principles of Data Mining

Filed under: Data Mining — Patrick Durusau @ 2:19 pm

Principles of Data Mining by David J. Hand, Heikki Mannila and Padhraic Smyth.

Description:

The growing interest in data mining is motivated by a common problem across disciplines: how does one store, access, model, and ultimately describe and understand very large data sets? Historically, different aspects of data mining have been addressed independently by different disciplines. This is the first truly interdisciplinary text on data mining, blending the contributions of information science, computer science, and statistics.

The book consists of three sections. The first, foundations, provides a tutorial overview of the principles underlying data mining algorithms and their application. The presentation emphasizes intuition rather than rigor. The second section, data mining algorithms, shows how algorithms are constructed to solve specific problems in a principled manner. The algorithms covered include trees and rules for classification and regression, association rules, belief networks, classical statistical models, nonlinear models such as neural networks, and local “memory-based” models. The third section shows how all of the preceding analysis fits together when applied to real-world data mining problems. Topics include the role of metadata, how to handle missing data, and data preprocessing.

Another high quality resource if you are learning data mining in a classroom or just adding to your skill set.

The wealth of data, resources such as this book, and free tools have made ignorance of data modeling a “shame on me” proposition.

PDF slides and R code examples on Data Mining and Exploration

Filed under: Data Mining,R — Patrick Durusau @ 2:18 pm

PDF slides and R code examples on Data Mining and Exploration by Yanchang Zhao.

A sampling:

Overview of Data Mining http://www.inf.ed.ac.uk/teaching/courses/dme/2012/slides/datamining_intro4up.pdf

Visualizing Data http://www.inf.ed.ac.uk/teaching/courses/dme/2012/slides/visualisation4up.pdf

Decision trees http://www.inf.ed.ac.uk/teaching/courses/dme/2012/slides/classification4up.pdf

More await your review!

Always label your axes

Filed under: Humor — Patrick Durusau @ 2:18 pm

Always label your axes by Nathan Yau.

It’s visual humor in part, so skip over to Nathan’s blog to see the image. I’ll wait.

Maybe that will help you remember the rule!

Now if I just had something like that for documenting data.

Suggestions?

Data Prospecting

Filed under: Contest,Data Analysis — Patrick Durusau @ 2:17 pm

Derrick Harris, in Kaggle is now crowdsourcing big data creativity, writes about a new product from Kaggle, Kaggle Prospect:

The Kaggle Prospect homepage says:

Kaggle Prospect is an open data exploration and problem identification platform that lets organizations with large datasets solicit proposals from the best minds in our 40,000 strong community of predictive modeling and machine learning experts. The experts will peer-review each other’s ideas and we’ll present you with the short list of what problems your data could answer.

If you are sitting on a gold mine of data, but aren’t sure where to start digging, Kaggle Prospect is the place to start.

Kaggle Prospect has a great deal of promise. Assuming enough users can pry data out of data silos for submission. 😉

If you are not familiar with Kaggle contests, see: Kaggle.

PS: I like the Kaggle headline:

We’re making data science a sport.™

Predictive Analytics: Decision Tree and Ensembles [part 5]

Filed under: Ensemble Methods,Machine Learning — Patrick Durusau @ 2:17 pm

Predictive Analytics: Decision Tree and Ensembles by Ricky Ho.

From the post:

Continuing from my last post walking down the list of machine learning techniques: in this post, I will cover Decision Tree and Ensemble methods. We’ll continue using the iris data we prepared in this earlier post.

Ricky covers decision trees to illustrate early machine learning and continues under ensemble methods to cover Random Forest and Gradient Boosted Trees.

Ricky’s next post will cover performance of the methods he has discussed in this series of posts.
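
Ricky’s series is in R; for readers who prefer Python, the same three methods can be sketched with scikit-learn on the same iris data (default parameters, illustration only, not Ricky’s code):

    # A single decision tree, a random forest, and gradient boosted trees
    # on iris, scored with 5-fold cross-validation.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    models = {
        "decision tree": DecisionTreeClassifier(random_state=0),
        "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
        "gradient boosting": GradientBoostingClassifier(random_state=0),
    }

    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name}: mean accuracy {scores.mean():.3f}")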

Breaking Silos – Carrot or Stick?

Filed under: Data Governance,Data Integration,Data Silos,Silos — Patrick Durusau @ 2:17 pm

Alex Popescu, in Silos Are Built for a Reason, quotes Greg Lowe saying:

In a typical large enterprise, there are competitions for resources and success, competing priorities and lots of irrelevant activities that are happening that can become distractions from accomplishing the goals of the teams.

Another reason silos are built has to do with affiliation. This is by choice, not by edict. By building groups where you share a shared set of goals, you effectively have an area of focus with a group of people interested in the same area and/or outcome.

There are many more reasons and impacts of why silos are built, but I simply wanted to establish that silos are built for a purpose with legitimate business needs in mind.

Alex then responds:

Legitimate? Maybe. Productive? I don’t really think so.

Greg’s original post is: Breaking down silos, what does that mean?

Greg asks about the benefits of breaking down silos:

  • Are the silos mandatory?
  • What would breaking down silos enable in the business?
  • What do silos do to your business today?
  • What incentive is there for these silos to go away?
  • Is your company prepared for transparency?
  • How will leadership deal with “Monday morning quarterbacks?”

As you can see, there are many benefits to silos as well as challenges. By developing a deeper understanding of the silos and why they get created, you can then have a better handle on whether the silos are beneficial or detrimental to the organization.

I would add to Greg’s question list:

  • Which stakeholders benefit from the silos?
  • What is that benefit?
  • Is there a carrot or stick that outweighs that benefit? (in the view of the stakeholder)
  • Do you have the political capital to take the stakeholders on and win?

If your answers are:

  • List of names
  • List of benefits
  • Yes, list of carrots/sticks
  • No

Then you are in good company.

Intelligence silos persist despite the United States being at war with identifiable terrorist groups.

A generalized benefit, or a penalty for failure, isn’t a winning argument for breaking a data silo.

Specific benefits and penalties must matter to stakeholders. Then you have a chance to break a data silo.

Good luck!

