Archive for the ‘Google BigQuery’ Category

Google BigQuery Public Datasets

Wednesday, March 30th, 2016

Google BigQuery Public Datasets

An amazing set of public datasets, from the post:

  • A Social Security Administration dataset that contains all names from Social Security card applications for births that occurred in the United States after 1879.
  • Data collected by the NYC Taxi and Limousine Commission (TLC) that includes trip records from all trips completed in yellow and green taxis in NYC from 2009 to 2015.
  • A dataset that contains all stories and comments from Hacker News since its launch in 2006.
  • A dataset published by the US Department of Health and Human Services that includes all weekly surveillance reports of nationally notifiable diseases for all U.S. cities and states published between 1888 and 2013.
  • A dataset that contains 3.5 million digitized books stretching back two centuries, encompassing the complete English-language public domain collections of the Internet Archive (1.3M volumes) and HathiTrust (2.2 million volumes).
  • This public dataset was created by the National Oceanic and Atmospheric Administration (NOAA) and includes global data obtained from the USAF Climatology Center. This dataset covers GSOD data between 1929 and 2016, collected from over 9000 stations.

I can readily see myself losing serious time in the GDELT Book Corpus!


Spending Time Rolling Your Own or Using Google Tools in Anger?

Wednesday, March 30th, 2016

The question: Spending Time Rolling Your Own or Using Google Tools in Anger? is one faced by many people who have watched computer technology evolve.

You could write your own blogging software or you can use one of the standard distributions.

You could write your own compiler or you can use one of the standard distributions.

You could install and maintain your own machine learning and big data apps, or you can use the tools offered by Google Machine Learning.

Tinkering with your local system until it is “just so” is fun, but it eats into billable time and honestly is a distraction.

I am not promising to immerse myself in the Google-verse, but an honest assessment of where to spend my time is in order.

Google takes Cloud Machine Learning service mainstream by Fausto Ibarra, Director, Product Management.

From the post:

Hundreds of different big data and analytics products and services fight for your attention as it’s one of the most fertile areas of innovation in our industry. And it’s no wonder; the most amazing consumer experiences are driven by insights derived from information. This is an area where Google Cloud Platform has invested almost two decades of engineering, and today at GCP NEXT we’re announcing some of the latest results of that work. This next round of innovation builds on our portfolio of data management and analytics capabilities by adding new products and services in multiple key areas:

Machine Learning:

We’re on a journey to create applications that can see, hear and understand the world around them. Today we’ve taken a major stride forward with the announcement of a new product family: Cloud Machine Learning. Cloud Machine Learning will take machine learning mainstream, giving data scientists and developers a way to build a new class of intelligent applications. It provides access to the same technologies that power Google Now, Google Photos and voice recognition in Google Search as easy to use REST APIs. It enables you to build powerful Machine Learning models on your data using the open-source TensorFlow machine learning library:

Big Data and Analytics:

Doing big data the cloud way means being more productive when building applications, with faster and better insights, without having to worry about the underlying infrastructure. To further this mission, we recently announced the general availability of Cloud Dataproc, our managed Apache Hadoop and Apache Spark service, and we’re adding new services and capabilities today:

Open Source:

Our Cloud Machine Learning offering leverages Google’s cutting edge machine learning and data processing technologies, some of which we’ve recently open sourced:

What, if anything, do you see as a serious omission in this version of the Google-verse?


BigQuery [first 1 TB of data processed each month is free]

Sunday, February 22nd, 2015

BigQuery [first 1 TB of data processed each month is free]

Apologies if this is old news to you but I saw a tweet by GoogleCloudPlatform advertising the “first 1 TB of data processed each month is free” and felt compelled to pass it on.

Like so much news on the Internet, if it is “new” to us, we assume it must be “new” to everyone else. (That is how the warnings of malware that will alter your DNA spread.)

It is a very tempting offer.

Tempting enough that I am going to spend some serious time looking at BigQuery.

What’s your query for BigQuery?

Jump-start your data pipelining into Google BigQuery

Monday, October 7th, 2013

Like they said at Woodstock, “if you don’t think ETL is all that weird,” wait, wasn’t that, “if you don’t think capitalism is all that weird?”

Maybe, maybe not. But in any event, Wally Yau has written guidance on getting the Google Compute Engine up and ready to do some ETL in Jump-start your data pipelining into Google BigQuery.

Or, if you have already “cooked” data, another sample application, Automated File Loader for BigQuery, shows how to load data that will produce your desired results.

Both of these are from: Getting Started with Google BigQuery.

You do know that Google is located in the United States?

Got big JSON? BigQuery expands data import for large scale web apps

Tuesday, October 2nd, 2012

Got big JSON? BigQuery expands data import for large scale web apps by Ryan Boyd, Developer Advocate.

From the post:

JSON is the data format of the web. JSON is used to power most modern websites, is a native format for many NoSQL databases hosting top web applications, and provides the primary data format in many REST APIs. Google BigQuery, our cloud service for ad-hoc analytics on big data, has now added support for JSON and the nested/repeated structure inherent in the data format.

JSON opens the door to a more object-oriented view of your data compared to CSV, the original data format supported by BigQuery. It removes the need for duplication of data required when you flatten records into CSV. Here are some examples of data you might find a JSON format useful for:

  • Log files, with multiple headers and other name-value pairs.
  • User session activities, with information about each activity occurring nested beneath the session record.
  • Sensor data, with variable attributes collected in each measurement.

Nested/repeated data support is one of our most requested features. And while BigQuery’s underlying infrastructure supports it, we’d only enabled it in a limited fashion through M-Lab’s test data. Today, however, developers can use JSON to get any nested/repeated data into and out of BigQuery.
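To make the duplication point concrete, here is a minimal sketch of the "user session activities" case from the list above. The record and its field names are hypothetical, invented for illustration; it only contrasts the two shapes of data and does not call BigQuery at all. It assumes the JSON is written one record per line.

```python
import csv
import io
import json

# Hypothetical session record; every field name here is invented for illustration.
session = {
    "session_id": "s-001",
    "user": "alice",
    "activities": [
        {"action": "view", "page": "/home"},
        {"action": "click", "page": "/pricing"},
        {"action": "view", "page": "/signup"},
    ],
}


def to_csv_rows(record):
    """Flatten one session into CSV rows, repeating the session fields per activity."""
    return [
        [record["session_id"], record["user"], act["action"], act["page"]]
        for act in record["activities"]
    ]


def to_json_line(record):
    """Serialize one session as a single line of JSON, activities nested inside."""
    return json.dumps(record, separators=(",", ":"))


# Build the flattened CSV text in memory.
csv_buffer = io.StringIO()
csv.writer(csv_buffer).writerows(to_csv_rows(session))
csv_text = csv_buffer.getvalue()

# Build the nested JSON line.
json_line = to_json_line(session)

print(csv_text)
print(json_line)
```

In the CSV form the session id and user are repeated on every activity row; in the JSON form they appear once, with the activities nested beneath them.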

It had to happen. “Big JSON,” that is.

My question is when “Bigger Data” is going to catch on?

If you got far enough ahead, say six to nine months, you could copyright something like “Biggest Data” and start collecting fees when it comes into common usage.

Qlikview and Google BigQuery…

Sunday, September 23rd, 2012

Qlikview and Google BigQuery – Data Visualization for Big Data by Istvan Szegedi.

From the post:

Google have launched its BigQuery cloud service in May to support interactive analysis of massive datasets up to billions of rows. Shortly after this launch Qliktech, one of the market leaders in BI solutions who is known for its unique associative architecture based on column store, in-memory database demonstrated a Qlikview Google BigQuery application that provided data visualization using BigQuery as backend. This post is about how Qlikview and Google BigQuery can be integrated to provide easy-to-use data analytics application for business users who work on large datasets.

A “big data” offering to limber you up for the coming week!

A Look At Google BigQuery

Monday, May 21st, 2012

A Look At Google BigQuery

Chris Webb writes:

Over the years I’ve written quite a few posts about Google’s BI capabilities. Google never seems to get mentioned much as a BI tools vendor but to me it’s clear that it’s doing a lot in this area and is consciously building up its capabilities; you only need to look at things like Fusion Tables (check out these recently-added features), Google Refine and of course Google Docs to see that it’s pursuing a self-service, information-worker-led vision of BI that’s very similar to the one that Microsoft is pursuing with PowerPivot and Data Explorer.

Earlier this month Google announced the launch of BigQuery and I decided to take a look. Why would a Microsoft BI loyalist like me want to do this, you ask? Well, there are a number of reasons:

Looks like an even-handed report to me.

See what you think about it and BigQuery.

Google BigQuery and the Github Data Challenge

Wednesday, May 2nd, 2012

Google BigQuery and the Github Data Challenge

Deadline May 21, 2012

From the post:

Github has made data on its code repositories, developer updates, forks etc. from the public GitHub timeline available for analysis, and is offering prizes for the most interesting visualization of the data. Sounds like a great challenge for R programmers! The R language is currently the 26th most popular on GitHub (up from #29 in December), and it would be interesting to visualize the usage of R compared to other languages, for example. The deadline for submissions to the contest is May 21.

Interestingly, GitHub has made this data available on the Google BigQuery service, which is available to the public today. BigQuery was free to use while it was in beta test, but Google is now charging for storage of the data: $0.12 per gigabyte per month, up to $240/month (the service is limited to 2TB of storage – although there is a Premier offering that supports larger data sizes … at a price to be negotiated). While members of the public can run SQL-like queries on the GitHub data for free, Google is charging subscribers to the service 3.5 cents per GB processed in the query: this is measured by the source data accessed (although columns of data not referenced aren't counted); the size of the result set doesn't matter.
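The arithmetic in that pricing can be sketched roughly as follows. The rates ($0.12 per GB per month for storage, 3.5 cents per GB processed, 2 TB storage limit) are the 2012 figures quoted above, not current pricing. The column names and sizes are hypothetical, and 1 TB is taken as 1,000 GB because that is what makes the quoted $240 ceiling work out.

```python
# Rates quoted in the 2012 post above; not current BigQuery pricing.
STORAGE_USD_PER_GB_MONTH = 0.12
QUERY_USD_PER_GB = 0.035
STORAGE_LIMIT_GB = 2000  # 2 TB, assuming 1 TB = 1,000 GB


def monthly_storage_cost(stored_gb):
    """Return the monthly storage charge in USD for the standard offering."""
    if stored_gb > STORAGE_LIMIT_GB:
        raise ValueError("over the 2 TB limit; larger sizes are priced by negotiation")
    return stored_gb * STORAGE_USD_PER_GB_MONTH


def query_cost(column_sizes_gb, referenced_columns):
    """Return the query charge in USD; only columns the query references count."""
    processed_gb = sum(
        size for name, size in column_sizes_gb.items() if name in referenced_columns
    )
    return processed_gb * QUERY_USD_PER_GB


# Hypothetical table layout (column name -> size in GB), for illustration only.
timeline_columns = {"repository_language": 2.0, "created_at": 1.0, "payload": 40.0}

full_storage = monthly_storage_cost(STORAGE_LIMIT_GB)
narrow_query = query_cost(timeline_columns, {"repository_language", "created_at"})
wide_query = query_cost(timeline_columns, set(timeline_columns))

print(f"storage at the 2 TB limit: ${full_storage:.2f}/month")
print(f"query touching two small columns: ${narrow_query:.3f}")
print(f"query touching every column: ${wide_query:.3f}")
```

The gap between the narrow and wide query is the practical lesson: naming only the columns you need is what keeps the per-query charge down, since the size of the result set does not enter into it.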

Watch your costs, but do you have thoughts on how you would visualize the data?

Google BigQuery Service: Big data analytics at Google speed

Tuesday, November 22nd, 2011

Google BigQuery Service: Big data analytics at Google speed

From the post:

Rapidly crunching terabytes of big data can lead to better business decisions, but this has traditionally required tremendous IT investments. Imagine a large online retailer that wants to provide better product recommendations by analyzing website usage and purchase patterns from millions of website visits. Or consider a car manufacturer that wants to maximize its advertising impact by learning how its last global campaign performed across billions of multimedia impressions. Fortune 500 companies struggle to unlock the potential of data, so it’s no surprise that it’s been even harder for smaller businesses.

We developed Google BigQuery Service for large-scale internal data analytics. At Google I/O last year, we opened a preview of the service to a limited number of enterprises and developers. Today we’re releasing some big improvements, and putting one of Google’s most powerful data analysis systems into the hands of more companies of all sizes.

  • We’ve added a graphical user interface for analysts and developers to rapidly explore massive data through a web application.
  • We’ve made big improvements for customers accessing the service programmatically through the API. The new REST API lets you run multiple jobs in the background and manage tables and permissions with more granularity.
  • Whether you use the BigQuery web application or API, you can now write even more powerful queries with JOIN statements. This lets you run queries across multiple data tables, linked by data that tables have in common.
  • It’s also now easy to manage, secure, and share access to your data tables in BigQuery, and export query results to the desktop or to Google Cloud Storage.
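For readers who have not met JOIN before, here is a minimal sketch of what "linked by data that tables have in common" means. It uses Python's built-in sqlite3 module as a stand-in, not BigQuery, and BigQuery's own query syntax may differ. The tables, columns and values are hypothetical, loosely modeled on the online retailer example in the post.

```python
import sqlite3

# In-memory SQLite database as a stand-in; this is not the BigQuery API.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE visits (visitor_id INTEGER, page TEXT);
    CREATE TABLE purchases (visitor_id INTEGER, amount REAL);
    INSERT INTO visits VALUES (1, '/shoes'), (2, '/hats'), (3, '/shoes');
    INSERT INTO purchases VALUES (1, 59.0), (3, 41.0);
    """
)

# Join the two tables on the column they share, visitor_id, then
# total the purchases made by visitors to each page.
rows = conn.execute(
    """
    SELECT v.page, COUNT(*) AS buyers, SUM(p.amount) AS revenue
    FROM visits AS v
    JOIN purchases AS p ON p.visitor_id = v.visitor_id
    GROUP BY v.page
    ORDER BY v.page
    """
).fetchall()

print(rows)
conn.close()
```

Visitor 2 looked at hats but bought nothing, so the inner join drops that row and only the shoes page appears in the result.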

Did I remember to mention that this service is free? 😉 Customers will get 30 days’ notice when that is about to end.

Sorta like an early present isn’t it?

What did you do with Google BigQuery?