Archive for the ‘BigData’ Category

Forbes Vouches For Public Data Sources

Monday, February 26th, 2018

For Forbes readers, a demonstration that uses one of the sources in Bernard Marr’s Big Data And AI: 30 Amazing (And Free) Public Data Sources For 2018 (Forbes, Feb. 26, 2018) adds a ring of authenticity to your data. Marr, and by extension Forbes, has vouched for these data sets.

Beats the hell out of opera, medieval boys choirs, or irises for your demonstration. 😉

These data sets show up everywhere, but a reprint from Forbes to leave with your (hopefully) future client sets your data set apart from others.

Tip: As interesting as it is, I’d skip the CERN Open Data unless you are presenting to physicists. Yes? Hint: Pick something relevant to your audience.

Hadoop® v3.0.0, Pre-1990 Documentation Practice

Saturday, December 16th, 2017

Apache® Hadoop® v3.0.0 General Availability

From the post:

Ubiquitous Open Source enterprise framework maintains decade-long leading role in $100B annual Big Data market

The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 350 Open Source projects and initiatives, today announced Apache® Hadoop® v3.0.0, the latest version of the Open Source software framework for reliable, scalable, distributed computing.

Over the past decade, Apache Hadoop has become ubiquitous within the greater Big Data ecosystem by enabling firms to run and manage data applications on large hardware clusters in a distributed computing environment.

"This latest release unlocks several years of development from the Apache community," said Chris Douglas, Vice President of Apache Hadoop. "The platform continues to evolve with hardware trends and to accommodate new workloads beyond batch analytics, particularly real-time queries and long-running services. At the same time, our Open Source contributors have adapted Apache Hadoop to a wide range of deployment environments, including the Cloud."

"Hadoop 3 is a major milestone for the project, and our biggest release ever," said Andrew Wang, Apache Hadoop 3 release manager. "It represents the combined efforts of hundreds of contributors over the five years since Hadoop 2. I'm looking forward to how our users will benefit from new features in the release that improve the efficiency, scalability, and reliability of the platform."

Apache Hadoop 3.0.0 highlights include:

  • HDFS erasure coding —halves the storage cost of HDFS while also improving data durability;
  • YARN Timeline Service v.2 (preview) —improves the scalability, reliability, and usability of the Timeline Service;
  • YARN resource types —enables scheduling of additional resources, such as disks and GPUs, for better integration with machine learning and container workloads;
  • Federation of YARN and HDFS subclusters transparently scales Hadoop to tens of thousands of machines;
  • Opportunistic container execution improves resource utilization and increases task throughput for short-lived containers. In addition to its traditional, central scheduler, YARN also supports distributed scheduling of opportunistic containers; and 
  • Improved capabilities and performance improvements for cloud storage systems such as Amazon S3 (S3Guard), Microsoft Azure Data Lake, and Aliyun Object Storage System.

… (emphasis in original)
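The “halves the storage cost” claim in the first highlight can be checked with simple arithmetic. This is a rough sketch, assuming the comparison is between classic three-way replication and a Reed-Solomon layout of six data blocks plus three parity blocks. Neither figure appears in the announcement; both are assumptions made for illustration.

```python
def storage_overhead(data_blocks, redundant_blocks):
    """Raw bytes stored per byte of user data."""
    return (data_blocks + redundant_blocks) / data_blocks


# Assumed baseline: each block is stored once plus two extra copies.
replication = storage_overhead(1, 2)

# Assumed erasure-coded layout: six data blocks plus three parity blocks.
erasure = storage_overhead(6, 3)

print(replication, erasure, erasure / replication)
```

Under those assumptions the erasure-coded layout stores 1.5 bytes per byte of data instead of 3, which is where “halves” would come from. It can also lose any three of its nine blocks, against two of three copies under replication.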

Ah, the Hadoop link.

Do you find it odd that use of the leader in the “$100B annual Big Data market” is documented by string comments in scripts and code?

Do you think non-technical management benefits from the documentation so captured?

Or that documentation for field names, routines, etc., can be easily extracted?

If software is maturing in a $100B market, shouldn’t it have mature documentation capabilities as well?

Every NASA Image In One Archive – Crowd Sourced Index?

Monday, April 17th, 2017

NASA Uploaded Every Picture It Has to One Amazing Online Archive by Will Sabel Courtney.

From the post:

Over the last five decades and change, NASA has launched hundreds of men and women from the planet’s surface into the great beyond. But America’s space agency has had an emotional impact on millions, if not billions, of others who’ve never gone past the Kármán Line separating Earth from space, thanks to the images, audio, and video generated by its astronauts and probes. NASA has given us our best glimpses at distant galaxies and nearby planets—and in the process, helped us appreciate our own world even more.

And now, the agency has placed them all in one place for everyone to see:

No, viewing this site will not be considered an excuse for a late tax return. 😉

On the other hand, it’s an impressive bit of work, although a search-only interface seems a bit thin to me.

The API docs don’t offer much comfort:

Name                          Description
q (optional)                  Free text search terms to compare to all indexed metadata.
center (optional)             NASA center which published the media.
description (optional)        Terms to search for in “Description” fields.
keywords (optional)           Terms to search for in “Keywords” fields. Separate multiple values with commas.
location (optional)           Terms to search for in “Location” fields.
media_type (optional)         Media types to restrict the search to. Available types: [“image”, “audio”]. Separate multiple values with commas.
nasa_id (optional)            The media asset’s NASA ID.
photographer (optional)       The primary photographer’s name.
secondary_creator (optional)  A secondary photographer/videographer’s name.
title (optional)              Terms to search for in “Title” fields.
year_start (optional)         The start year for results. Format: YYYY.
year_end (optional)           The end year for results. Format: YYYY.

With no index, your results depend on blindly guessing the metadata entered by a NASA staffer.

Well, for “moon” I would expect “the Moon,” but the results are likely to include moons of other worlds, etc.
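To make the guessing game concrete, here is a minimal sketch that builds a search URL from the parameters listed above. The base URL is an assumption (the API docs quoted here do not show the endpoint), and `build_search_url` is a hypothetical helper, not part of any NASA library. It only constructs the URL; it does not send a request.

```python
from urllib.parse import urlencode

# Assumed endpoint; the quoted docs list parameters but not the base URL.
SEARCH_URL = "https://images-api.nasa.gov/search"

# Parameter names copied from the table in the API docs.
ALLOWED = {
    "q", "center", "description", "keywords", "location", "media_type",
    "nasa_id", "photographer", "secondary_creator", "title",
    "year_start", "year_end",
}


def build_search_url(**params):
    """Return a search URL, rejecting parameters the docs do not list."""
    unknown = set(params) - ALLOWED
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    # Lists become comma-separated values, as the docs ask for
    # keywords and media_type.
    flat = {
        key: ",".join(value) if isinstance(value, (list, tuple)) else str(value)
        for key, value in params.items()
    }
    # Keep commas unescaped so the separator stays readable.
    return f"{SEARCH_URL}?{urlencode(flat, safe=',')}"


moon_url = build_search_url(
    q="moon", media_type=["image"], year_start=1969, year_end=1972
)
keyword_url = build_search_url(keywords=["apollo", "moon"])
print(moon_url)
print(keyword_url)
```

Narrowing by year or media type trims the result set, but it still cannot tell “the Moon” from the moons of other worlds. That distinction lives in metadata you have to guess at.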

Indexing this collection has all the marks of a potential crowd sourcing project:

  1. Easy to access data
  2. Free data
  3. Interesting data
  4. Metadata


Unmet Needs for Analyzing Biological Big Data… [Data Integration #1 – Spells Market Opportunity]

Wednesday, February 15th, 2017

Unmet Needs for Analyzing Biological Big Data: A Survey of 704 NSF Principal Investigators by Lindsay Barone, Jason Williams, David Micklos.


In a 2016 survey of 704 National Science Foundation (NSF) Biological Sciences Directorate principal investigators (BIO PIs), nearly 90% indicated they are currently or will soon be analyzing large data sets. BIO PIs considered a range of computational needs important to their work, including high performance computing (HPC), bioinformatics support, multi-step workflows, updated analysis software, and the ability to store, share, and publish data. Previous studies in the United States and Canada emphasized infrastructure needs. However, BIO PIs said the most pressing unmet needs are training in data integration, data management, and scaling analyses for HPC, acknowledging that data science skills will be required to build a deeper understanding of life. This portends a growing data knowledge gap in biology and challenges institutions and funding agencies to redouble their support for computational training in biology.

In particular, the needs that topic maps can address rank #1, #2, #6, #7, and #10, or as found by the authors:

A majority of PIs—across bioinformatics/other disciplines, larger/smaller groups, and the four NSF programs—said their institutions are not meeting nine of 13 needs (Figure 3). Training on integration of multiple data types (89%), on data management and metadata (78%), and on scaling analysis to cloud/HP computing (71%) were the three greatest unmet needs. High performance computing was an unmet need for only 27% of PIs—with similar percentages across disciplines, different sized groups, and NSF programs.

or graphically (figure 3):

So, cloud, distributed, parallel, pipelining, etc., processing is insufficient?

Pushing undocumented and unintegratable data at ever increasing speeds is impressive but gives no joy?

This report will provoke another round of Esperanto fantasies, that is, the creation of “universal” vocabularies, which if used by everyone and back-mapped to all existing literature, would solve the problem.

The number of Esperanto fantasies and the cost/delay of back-mapping to legacy data defeat all such efforts. Those defeats haven’t prevented repeated funding of such fantasies in the past, present and no doubt the future.

Perhaps those defeats are a question of scope.

That is, rather than even attempting some “universal” interchange of data, why not approach it incrementally?

I suspect the PIs surveyed each had some particular data set in mind when they mentioned data integration (which itself is a very broad term).

Why not seek out, develop and publish data integrations in particular instances, as opposed to attempting to theorize what might work for data yet unseen?

The need topic maps wanted to meet remains unmet. With no signs of lessening.

Opportunity knocks. Will we answer?

Repulsion On A Galactic Scale (Really Big Data/Visualization)

Tuesday, January 31st, 2017

Newly discovered intergalactic void repels Milky Way by Roy Gal.

From the post:

For decades, astronomers have known that our Milky Way galaxy—along with our companion galaxy, Andromeda—is moving through space at about 1.4 million miles per hour with respect to the expanding universe. Scientists generally assumed that dense regions of the universe, populated with an excess of galaxies, are pulling us in the same way that gravity made Newton’s apple fall toward earth.

In a groundbreaking study published in Nature Astronomy, a team of researchers, including Brent Tully from the University of Hawaiʻi Institute for Astronomy, reports the discovery of a previously unknown, nearly empty region in our extragalactic neighborhood. Largely devoid of galaxies, this void exerts a repelling force, pushing our Local Group of galaxies through space.

Astronomers initially attributed the Milky Way’s motion to the Great Attractor, a region of a half-dozen rich clusters of galaxies 150 million light-years away. Soon after, attention was drawn to a much larger structure called the Shapley Concentration, located 600 million light-years away, in the same direction as the Great Attractor. However, there has been ongoing debate about the relative importance of these two attractors and whether they suffice to explain our motion.

The work appears in the January 30 issue of Nature Astronomy and can be found online here.

Additional images, video, and links to previous related productions can be found at

If you are looking for processing/visualization of data on a galactic scale, this work by Yehuda Hoffman, Daniel Pomarède, R. Brent Tully & Hélène M. Courtois hits the spot!

It is also a reminder that when you look up from your social media device, there is a universe waiting to be explored.

Q&A Cathy O’Neil…

Wednesday, January 4th, 2017

Q&A Cathy O’Neil, author of ‘Weapons of Math Destruction,’ on the dark side of big data by Christine Zhang.

From the post:

Cathy O’Neil calls herself a data skeptic. A former hedge fund analyst with a PhD in mathematics from Harvard University, the Occupy Wall Street activist left finance after witnessing the damage wrought by faulty math in the wake of the housing crash.

In her latest book, “Weapons of Math Destruction,” O’Neil warns that the statistical models hailed by big data evangelists as the solution to today’s societal problems, like which teachers to fire or which criminals to give longer prison terms, can codify biases and exacerbate inequalities. “Models are opinions embedded in mathematics,” she writes.

Great interview that hits enough high points to leave you wanting to learn more about Cathy and her analysis.

On that score, try:

Read her mathbabe blog.

Follow @mathbabedotorg.

Read Weapons of math destruction : how big data increases inequality and threatens democracy.

Try her new business: ORCAA [O’Neil Risk Consulting and Algorithmic Auditing].

From the ORCAA homepage:

ORCAA’s mission is two-fold. First, it is to help companies and organizations that rely on time and cost-saving algorithms to get ahead of this wave, to understand and plan for their litigation and reputation risk, and most importantly to use algorithms fairly.

The second half of ORCAA’s mission is this: to develop rigorous methodology and tools, and to set rigorous standards for the new field of algorithmic auditing.

There are bright-line cases (sentencing, housing, hiring discrimination) where “fair” has a binding legal meaning, and where there is legal liability for not being “fair.”

Outside such areas, the search for “fairness” seems quixotic. Clients are entitled to their definitions of “fair” in those areas.

Merry Christmas To All Astronomers! (Pan-STARRS)

Tuesday, December 20th, 2016

The Panoramic Survey Telescopes & Rapid Response System (Pan-STARRS) dropped its data release on December 19, 2016.

Realizing you want to jump straight to the details, check out: PS1 Data Processing procedures.

There is far more to be seen but here’s a shot of the sidebar:


Jim Gray favored the use of astronomical data because it was “big” (this was before “big data” became marketing hype) and it is free.


AI Cultist On Justice System Reform

Wednesday, June 8th, 2016

White House Challenges Artificial Intelligence Experts to Reduce Incarceration Rates by Jason Shueh.

From the post:

The U.S. spends $270 billion on incarceration each year, has a prison population of about 2.2 million and an incarceration rate that’s spiked 220 percent since the 1980s. But with the advent of data science, White House officials are asking experts for help.

On Tuesday, June 7, the White House Office of Science and Technology Policy’s Lynn Overmann, who also leads the White House Police Data Initiative, stressed the severity of the nation’s incarceration crisis while asking a crowd of data scientists and artificial intelligence specialists for aid.

“We have built a system that is too large, and too unfair and too costly — in every sense of the word — and we need to start to change it,” Overmann said, speaking at a Computing Community Consortium public workshop.

She argued that the U.S., a country that has the highest number of incarcerated citizens in the world, is in need of systematic reforms with both data tools to process alleged offenders and at the policy level to ensure fair and measured sentences. As a longtime counselor, advisor and analyst for the Justice Department and at the city and state levels, Overmann said she has studied and witnessed an alarming number of issues in terms of bias and unwarranted punishments.

For instance, she said that statistically, while drug use is about equal between African Americans and Caucasians, African Americans are more likely to be arrested and convicted. They also receive longer prison sentences compared to Caucasian inmates convicted of the same crimes.

Other problems, Overmann said, are due to inflated punishments that far exceed the severity of crimes. She recalled her years spent as an assistant public defender for Florida’s Miami-Dade County Public Defender’s Office as an example.

“I represented a client who was looking at spending 40 years of his life in prison because he stole a lawnmower and a weedeater from a shed in a backyard,” Overmann said, “I had another person who had AIDS and was offered a 15-year sentence for stealing mangos.”

Data and digital tools can help curb such pitfalls by increasing efficiency, transparency and accountability, she said.
… (emphasis added)

Tip for spotting a cultist: Before specifying criteria for success or even understanding a problem, a cultist announces the approach that will succeed.

Calls like this one are a disservice to legitimate artificial intelligence research, to say nothing of experts in criminal justice (unlike Lynn Overmann), who have struggled for decades to improve the criminal justice system.

Yes, Overmann has experience in the criminal justice system, both in legal practice and at a policy level, but that makes her no more of an expert on criminal justice reform than having multiple flat tires makes me an expert on tire design.

Data is not, has not been, nor will it ever be a magic elixir that solves undefined problems posed to it.

White House sponsored AI cheerleading is a disservice to AI practitioners, experts in the field of criminal justice reform and, more importantly, to those impacted by the criminal justice system.

Substitute meaningful problem definitions for the AI pom-poms if this is to be more than a project for resume padding and currying favor with contractors.

300 Terabytes of Raw Collider Data

Saturday, April 23rd, 2016

CERN Just Dropped 300 Terabytes of Raw Collider Data to the Internet by Andrew Liptak.

From the post:

Yesterday, the European Organization for Nuclear Research (CERN) dropped a staggering amount of raw data from the Large Hadron Collider on the internet for anyone to use: 300 terabytes worth.

The data includes 100 TB “of data from proton collisions at 7 TeV, making up half the data collected at the LHC by the CMS detector in 2011.” The release follows another infodump from 2014, and you can take a look at all of this information through the CERN Open Data Portal. Some of the information released is simply the raw data that CERN’s own scientists have been using, while another segment is already processed, with the anticipated audience being high school science courses.

It’s not the same as having your own cyclotron in the backyard with a bubble chamber, but it’s the next best thing!

If you have been looking for “big data” to stretch your limits, this fits the bill nicely.

1880 Big Data Influencers in CSV File

Friday, April 8th, 2016

If you aren’t familiar with Right Relevance, you are missing an amazing resource for cutting through content clutter.

Starting at the default homepage:


You can search for “big data” and the default result screen appears:


If you switch to “people,” the following screen appears:


The “topic score” line moves, so you can require a higher or lesser score for inclusion in the listing. That is helpful if you want only the top people, articles, etc. on a topic or want to reach deeper into the pool of data.

As of yesterday, if you set the “topic score” to the range 70 to 98, the number of people influencers was 1880.

The interface allows you to follow and/or tweet to any of those 1880 people, but only one at a time.

I submitted feedback to Right Relevance on Monday of this week pointing out how useful lists of Twitter handles could be for creating Twitter seed lists, etc., but have not gotten a response.

Part of my query to Right Relevance concerned the failure of a web scraper to match the totals listed in the interface (a far lower number of results than expected).

In the absence of an answer, I continue to experiment with the Web Scraper extension for Chrome to extract data from the site.

Caveat: In order to set the delay for requests in Web Scraper, I have found the settings under “Scrape” ineffectual:


In order to induce enough delay to capture the entire list, I set the delay in the exported sitemap (in JSON) and then imported it into another sitemap. Could have reached the same point by setting the delay under the top selector, which was also set to SelectorElementScroll.

To successfully retrieve the entire list, that delay setting was 16000 milliseconds.
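The export, edit, re-import step can be scripted. What follows is a minimal sketch, assuming the exported sitemap is a JSON object with a `selectors` list whose entries carry `type` and `delay` fields. The sitemap id, start URL, and CSS selectors are placeholders, not the values from my sitemap, and `set_scroll_delay` is a hypothetical helper.

```python
import json


def set_scroll_delay(sitemap_json, delay_ms):
    """Return sitemap JSON with delay_ms applied to every scroll selector."""
    sitemap = json.loads(sitemap_json)
    for selector in sitemap.get("selectors", []):
        # Only the scrolling element selector needs the long pause.
        if selector.get("type") == "SelectorElementScroll":
            selector["delay"] = delay_ms
    return json.dumps(sitemap, indent=2)


# Hypothetical export: the id, URL, and CSS selectors are placeholders.
exported = json.dumps({
    "_id": "bigdata-example",
    "startUrl": ["https://example.com/search?q=big+data"],
    "selectors": [
        {"id": "person", "type": "SelectorElementScroll",
         "parentSelectors": ["_root"], "selector": "div.person",
         "multiple": True, "delay": 0},
        {"id": "handle", "type": "SelectorText",
         "parentSelectors": ["person"], "selector": "span.handle",
         "multiple": False, "delay": 0},
    ],
})

patched = set_scroll_delay(exported, 16000)
print(patched)
```

The printed JSON is what you would paste into a new sitemap through the import option. Only the scroll selector is touched, so the child selectors keep whatever delay they had.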

There may be more performant solutions but since it ran in a separate browser tab and notified me of completion, time wasn’t an issue.

I created a sitemap that obtains the user’s name, Twitter handle and number of Twitter followers, bigdata-right-relevance.txt.

Oh, the promised 1880-big-data-influencers.csv. (File renamed post-scraping due to naming constraints in Web Scraper.)
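If you want a Twitter seed list out of a file like that, a few lines will do. This is a sketch only: the column names `name`, `handle`, and `followers` are assumptions (check the header row of the actual file), the sample rows are made-up placeholders, and `top_handles` is a hypothetical helper.

```python
import csv
import io


def top_handles(csv_text, limit=10):
    """Return the handles with the most followers, largest first."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Follower counts may be scraped with thousands separators.
    rows.sort(key=lambda row: int(row["followers"].replace(",", "")),
              reverse=True)
    # Normalize every handle to a single leading "@".
    return ["@" + row["handle"].lstrip("@") for row in rows[:limit]]


# Placeholder rows, not real accounts.
sample = (
    "name,handle,followers\n"
    "Example One,@example_one,1200\n"
    'Example Two,example_two,"45,000"\n'
    "Example Three,@example_three,300\n"
)

seed_list = top_handles(sample, limit=2)
print(seed_list)
```

For the real file, read it with `open(...)` and pass the text in, after adjusting the column names to match.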

At best I am a casual user of Web Scraper so suggestions for improvements, etc., are greatly appreciated.

Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy

Thursday, April 7th, 2016

Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy by Cathy O’Neil.


From the description at Amazon:

We live in the age of the algorithm. Increasingly, the decisions that affect our lives—where we go to school, whether we get a car loan, how much we pay for health insurance—are being made not by humans, but by mathematical models. In theory, this should lead to greater fairness: Everyone is judged according to the same rules, and bias is eliminated. But as Cathy O’Neil reveals in this shocking book, the opposite is true. The models being used today are opaque, unregulated, and uncontestable, even when they’re wrong. Most troubling, they reinforce discrimination: If a poor student can’t get a loan because a lending model deems him too risky (by virtue of his race or neighborhood), he’s then cut off from the kind of education that could pull him out of poverty, and a vicious spiral ensues. Models are propping up the lucky and punishing the downtrodden, creating a “toxic cocktail for democracy.” Welcome to the dark side of Big Data.

Tracing the arc of a person’s life, from college to retirement, O’Neil exposes the black box models that shape our future, both as individuals and as a society. Models that score teachers and students, sort resumes, grant (or deny) loans, evaluate workers, target voters, set parole, and monitor our health—all have pernicious feedback loops. They don’t simply describe reality, as proponents claim, they change reality, by expanding or limiting the opportunities people have. O’Neil calls on modelers to take more responsibility for how their algorithms are being used. But in the end, it’s up to us to become more savvy about the models that govern our lives. This important book empowers us to ask the tough questions, uncover the truth, and demand change.

Even if you have qualms about Cathy’s position, you have to admit that is a great book cover!

When I was in law school, I had F. Hodge O’Neal for corporation law. He is the O’Neal in O’Neal and Thompson’s Oppression of Minority Shareholders and LLC Members, Rev. 2d.

The publisher’s blurb is rather generous in saying:

Cited extensively, O’Neal and Thompson’s Oppression of Minority Shareholders and LLC Members shows how to take appropriate steps to protect minority shareholder interests using remedies, tactics, and maneuvers sanctioned by federal law. It clarifies the underlying cause of squeeze-outs and suggests proven arrangements for avoiding them.

You could read Oppression of Minority Shareholders and LLC Members that way but when corporate law is taught with war stories from the antics of the robber barons forward, you get the impression that isn’t why people read it.

Not that I doubt Cathy’s sincerity; on the contrary, I think she is very sincere about her warnings.

Where I disagree with Cathy is in thinking democracy is under greater attack now or that inequality is any greater problem than before.

If you read The Half Has Never Been Told: Slavery and the Making of American Capitalism by Edward E. Baptist carefully, you will leave it with deep uncertainty about the relationship of American government (federal, state and local) to any recognizable concept of democracy. Or for that matter to the “equality” of its citizens.

Unlike Cathy as well, I don’t expect that shaming people is going to result in “better” or more “honest” data analysis.

What you can do is arm yourself to do battle on behalf of your “side,” both in terms of exposing data manipulation by others and concealing your own.

Perhaps there is room in the marketplace for a book titled: Suppression of Unfavorable Data. More than hiding data, what data to not collect? How to explain non-collection/loss? How to collect data in the least useful ways?

You would have to write it as a how-to on avoiding these very bad practices, but everyone would know what you meant. Could be the next business management best seller.

People NOT Technology Produce Data ROI

Monday, February 15th, 2016

Too many tools… not enough carpenters! by Nicholas Hartman.

From the webpage:

Don’t let your enterprise make the expensive mistake of thinking that buying tons of proprietary tools will solve your data analytics challenges.

tl;dr = The enterprise needs to invest in core data science skills, not proprietary tools.

Most of the world’s largest corporations are flush with data, but frequently still struggle to achieve the vast performance increases promised by the hype around so called “big data.” It’s not that the excitement around the potential of harvesting all that data was unwarranted, but rather these companies are finding that translating data into information and ultimately tangible value can be hard… really hard.

In your typical new tech-based startup the entire computing ecosystem was likely built from day one around the need to generate, store, analyze and create value from data. That ecosystem was also likely backed from day one with a team of qualified data scientists. Such ecosystems spawned a wave of new data science technologies that have since been productized into tools for sale. Backed by mind-blowingly large sums of VC cash many of these tools have set their eyes on the large enterprise market. A nice landscape of such tools was recently prepared by Matt Turck of FirstMark Capital (host of Data Driven NYC, one of the best data science meetups around).

Consumers stopped paying money for software a long time ago (they now mostly let the advertisers pay for the product). If you want to make serious money in pure software these days you have to sell to the enterprise. Large corporations still spend billions and billions every year on software and data science is one of the hottest areas in tech right now, so selling software for crunching data should be a no-brainer! Not so fast.

The problem is, the enterprise data environment is often nothing like that found within your typical 3-year-old startup. Data can be strewn across hundreds or thousands of systems that don’t talk to each other. Devices like mainframes are still common. Vast quantities of data are generated and stored within these companies, but until recently nobody ever really envisioned ever accessing — let alone analyzing — these archived records. Often, it’s not initially even clear how all the data generated by these systems directly relates to a large blue chip’s core business operations. It does, but a lack of in-house data scientists means that nobody is entirely even sure what data is really there or how it can be leveraged.

I would delete “proprietary” from the above because non-proprietary tools create data problems just as easily.

Thus I would re-write the second quote as:

Tools won’t replace skilled talent, and skilled talent doesn’t typically need many particular tools.

I substituted “particular” tools to avoid religious questions about particular non-proprietary tools.

Understanding data, recognizing where data integration is profitable and where it is a dead loss, creating tests to measure potential ROI, etc., are all tasks of a human data analyst and not any proprietary or non-proprietary tool.

That all enterprise data has some intrinsic value that can be extracted if it were only accessible is an article of religious faith, not business ROI.

If you want business ROI from data, start with human analysts and not the latest buzzwords in technological tools.

What is this Hadoop thing? [What’s Missing From This Picture?]

Friday, January 29th, 2016

Kirk Borne posted this image to Twitter today:


You have seen this image or similar ones. About Hadoop, Big Data, non-IT stuff, etc. You can probably recite the story with your eyes closed, even when you are drunk or stoned. 😉

But today I realized not only who is in the image but who’s missing. Especially in a Hadoop/Big Data context.

Who’s in the image? Customers. They are the blind actors who would not recognize Hadoop in a closet with the light on. They have no idea what relevance Hadoop has to their data and/or any possible benefit to their business problems.

Who’s not in the image? Marketers. They are just out of view in this image. Once they learn a customer has data, they have the solution, Hadoop. “What do you want Hadoop to do exactly?” marketers ask before directing a customer to a particular part of the Hadoop/elephant.

Lo and behold, data salvation is at hand! May the IT gods be praised! We are going to have big data with Hadoop, err, ah, pushing, dragging, well, we’ll get to the specifics later.

The crux of every business problem is a business and not technological need.

You may not be able to store full bandwidth teleconference videos but if you don’t do any video teleconferencing, that’s not really your problem.

If you are already running three shifts making as many HellFire missiles as you can, there isn’t much point in building a recommendation system for your sales department to survey your customers.

Go into every IT conversation with a list of your business needs and require that proposed solutions address those needs, in defined and measurable ways.

You can avoid feeling up an elephant while someone loots your wallet.

Big data ROI in 2016: Put up, or shut up [IT vendors to “become” consumer-centric?]

Friday, January 15th, 2016

Big data ROI in 2016: Put up, or shut up by David Weldon.

From the post:

When it comes to data analytics investments, this is the year to put up, or shut up.

That is the take of analysts at Forrester Research, who collectively expect organizations to take a hard look at their data analytics investments so far, and see some very real returns on those investments. If strong ROI can’t be shown, some data initiatives may see the plug pulled on those projects.

These sober warnings emerge from Forrester’s top business trends forecast for 2016. Rather than a single study or survey on top trends, the Forrester forecast combines the results of 35 separate studies. Carrie Johnson, senior vice president of research at Forrester, discussed the highlights with Information Management, including the growing impatience at many organizations that big data produce big results, and where organizations truly are on the digital transformation journey.

“I think one of the key surprises is that folks in the industry assume that everyone is farther along than they are,” Johnson explains. “Whether it’s with digital transformation, or a transformation to become a customer-obsessed firm, there are very few companies pulling off those initiatives at a wholesale level very well. Worse, many companies in the year ahead will continue to flail a bit with one-off projects and bolt-on strategies, versus true differentiation through transformation.”

Asked why this misconception exists, Johnson notes that “Vendors do tend to paint a rosier picture of adoption in general because it behooves them. Also, every leader in an organization sees their problems, and then sees an article or sees the use of an app by a competitor and thinks, ‘my gosh, these companies are so far ahead of where we are.’ The reality may be that that app may have been an experiment by a really savvy team in the organization, but it’s not necessarily representative of a larger commitment by the organization, both financially and through resources.”

It’s not the first time you have heard data ROI discussed on this blog but when Forrester Research says it, it sounds more important. Moreover, their analysis is the result of thirty-five separate studies.

Empirical verification (the studies) is good to have, but you don’t need an MBA to realize that businesses making decisions on some basis other than ROI aren’t businesses for very long. Or, at least not profitable businesses.

David’s conclusion makes it clear that your ROI is your responsibility:

The good news: “We believe that this is the year that IT leaders — and CIOs in particular … embrace a new way of investing in and running technology that is customer-centric….”

If lack of clarity and a defined ROI for IT is a problem at your business, well, it’s your money.

Data Is Not The Same As Truth:…

Friday, January 15th, 2016

Data Is Not The Same As Truth: Interpretation In The Big Data Era by Kalev Leetaru.

From the post:

One of the most common misconceptions of the “big data” world is that from data comes irrefutable truth. Yet, any given piece of data records only a small fragment of our existence and the same piece of data can often support multiple conclusions depending on how it is interpreted. What does this mean for some of the major trends of the data world?

The notion of data supporting multiple conclusions was captured perhaps most famously in 2013 with a very public disagreement between New York Times reporter John Broder and Elon Musk, CEO of Tesla after Broder criticized certain aspects of the vehicle’s performance during a test drive. Using remotely monitored telemetry data the company recorded during Broder’s test drive, Musk argued that Broder had taken certain steps to purposely minimize the car’s capabilities. Broder, in turn, cited the exact same telemetry data to support his original arguments. How could such polar opposite conclusions be supported by the exact same data?

Read Kalev’s post to see how the same data could support both sides of that argument, not to mention many others.

Recitation of that story from memory should be a requirement of every data science related program as a condition of graduation.

Internalizing that story might quiet some of the claims made for “bigdata” and software to process “bigdata.”

It may be the case that actionable insights can be gained from data, big or small, that your company collects.

However, absent an examination of the data and your needs for analysis of that data, the benefits of processing that data remain unknown.

Think of it this way:

Would you order a train full of a new material that has no known relationship to your current products with no idea how it could be used?

Until someone can make a concrete business case for the use of big or small data, that is exactly what an investment in big data processing is to you today.

If after careful analysis, the projected ROI from specific big data processing has been demonstrated to your satisfaction, go for it.

But until then, keep both hands on your wallet when you hear a siren’s song about big data.

PostgreSQL 9.5: UPSERT, Row Level Security, and Big Data

Thursday, January 7th, 2016

PostgreSQL 9.5: UPSERT, Row Level Security, and Big Data

Let’s reverse the order of the announcement, to be in reader-friendly order:


Press kit

Release Notes

What’s New in 9.5

Edit: I moved my comments above the fold as it were:

Just so you know, the PostgreSQL 9.5 documentation for XMLEXISTS says:

Also note that the SQL standard specifies the xmlexists construct to take an XQuery expression as first argument, but PostgreSQL currently only supports XPath, which is a subset of XQuery.

Apologies, you will have to scroll for the subsection, there was no anchor at

If you are looking to make a major contribution to PostgreSQL, note that XQuery is on the todo list.

Now for all the stuff that you will skip reading anyway. 😉

(I would save the prose for use in reports to management about using or transitioning to PostgreSQL 9.5.)

7 JANUARY 2016: The PostgreSQL Global Development Group announces the release of PostgreSQL 9.5. This release adds UPSERT capability, Row Level Security, and multiple Big Data features, which will broaden the user base for the world’s most advanced database. With these new capabilities, PostgreSQL will be the best choice for even more applications for startups, large corporations, and government agencies.

Annie Prévot, CIO of the CNAF, the French Child Benefits Office, said, “The CNAF is providing services for 11 million persons and distributing 73 billion Euros every year, through 26 types of social benefit schemes. This service is essential to the population and it relies on an information system that must be absolutely efficient and reliable. The CNAF’s information system is satisfyingly based on the PostgreSQL database management system.”


A most-requested feature by application developers for several years, “UPSERT” is shorthand for “INSERT, ON CONFLICT UPDATE”, allowing new and updated rows to be treated the same. UPSERT simplifies web and mobile application development by enabling the database to handle conflicts between concurrent data changes. This feature also removes the last significant barrier to migrating legacy MySQL applications to PostgreSQL.

Developed over the last two years by Heroku programmer Peter Geoghegan, PostgreSQL’s implementation of UPSERT is significantly more flexible and powerful than those offered by other relational databases. The new ON CONFLICT clause permits ignoring the new data, or updating different columns or relations in ways which will support complex ETL (Extract, Transform, Load) toolchains for bulk data loading. And, like all of PostgreSQL, it is designed to be absolutely concurrency-safe and to integrate with all other PostgreSQL features, including Logical Replication.
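For readers who have not met the pattern, here is a minimal sketch of what “insert, or update on conflict” looks like in practice. It runs against an in-memory SQLite database (version 3.24 or later), not PostgreSQL, because SQLite ships with Python and accepts a very similar ON CONFLICT clause. The accounts table and the upsert_account helper are hypothetical names made up for the illustration.

```python
import sqlite3

# SQLite (3.24 or later) stands in for PostgreSQL so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (username TEXT PRIMARY KEY, email TEXT)")

def upsert_account(username, email):
    """Insert a new account, or update the email if the username exists."""
    conn.execute(
        "INSERT INTO accounts (username, email) VALUES (?, ?) "
        "ON CONFLICT (username) DO UPDATE SET email = excluded.email",
        (username, email),
    )

upsert_account("alice", "old@example.com")
upsert_account("bob", "bob@example.com")
upsert_account("alice", "new@example.com")  # conflict: row is updated, no error

accounts = dict(conn.execute("SELECT username, email FROM accounts"))
```

The third call would raise a uniqueness error with a plain INSERT. With the ON CONFLICT clause the existing row is updated instead, which is the “new and updated rows treated the same” behavior the announcement describes.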

Row Level Security

PostgreSQL continues to expand database security capabilities with its new Row Level Security (RLS) feature. RLS implements true per-row and per-column data access control which integrates with external label-based security stacks such as SE Linux. PostgreSQL is already known as “the most secure by default.” RLS cements its position as the best choice for applications with strong data security requirements, such as compliance with PCI, the European Data Protection Directive, and healthcare data protection standards.

RLS is the culmination of five years of security features added to PostgreSQL, including extensive work by KaiGai Kohei of NEC, Stephen Frost of Crunchy Data, and Dean Rasheed. Through it, database administrators can set security “policies” which filter which rows particular users are allowed to update or view. Data security implemented this way is resistant to SQL injection exploits and other application-level security holes.

Big Data Features

PostgreSQL 9.5 includes multiple new features for bigger databases, and for integrating with other Big Data systems. These features ensure that PostgreSQL continues to have a strong role in the rapidly growing open source Big Data marketplace. Among them are:

BRIN Indexing: This new type of index supports creating tiny, but effective indexes for very large, “naturally ordered” tables. For example, tables containing logging data with billions of rows could be indexed and searched in 5% of the time required by standard BTree indexes.

Faster Sorts: PostgreSQL now sorts text and NUMERIC data faster, using an algorithm called “abbreviated keys”. This makes some queries which need to sort large amounts of data 2X to 12X faster, and can speed up index creation by 20X.

CUBE, ROLLUP and GROUPING SETS: These new standard SQL clauses let users produce reports with multiple levels of summarization in one query instead of requiring several. CUBE will also enable tightly integrating PostgreSQL with more Online Analytic Processing (OLAP) reporting tools such as Tableau.

Foreign Data Wrappers (FDWs): These already allow using PostgreSQL as a query engine for other Big Data systems such as Hadoop and Cassandra. Version 9.5 adds IMPORT FOREIGN SCHEMA and JOIN pushdown making query connections to external databases both easier to set up and more efficient.

TABLESAMPLE: This SQL clause allows grabbing a quick statistical sample of huge tables, without the need for expensive sorting.
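To make the “multiple levels of summarization in one query” item in the list above concrete, here is a rough Python sketch of what a ROLLUP computes. It is not how PostgreSQL implements the clause. The SALES rows and function names are hypothetical. It relies on the standard expansion of ROLLUP(a, b) into the grouping sets (a, b), (a) and ().

```python
from collections import defaultdict

# Hypothetical sales rows: (region, product, amount).
SALES = [
    ("east", "widget", 10),
    ("east", "gadget", 5),
    ("west", "widget", 7),
]
COLUMNS = ("region", "product")

def rollup_sets(columns):
    """ROLLUP(a, b) expands to the grouping sets (a, b), (a), ()."""
    return [tuple(columns[:n]) for n in range(len(columns), -1, -1)]

def grouping_sets_sum(rows, columns, sets):
    """Sum the last field of each row once per grouping set.

    Columns left out of a grouping set are reported as None, much as SQL
    reports NULL for them.
    """
    totals = defaultdict(int)
    for row in rows:
        values = dict(zip(columns, row))  # zip stops before the amount field
        amount = row[-1]
        for gset in sets:
            key = tuple(values[c] if c in gset else None for c in columns)
            totals[key] += amount
    return dict(totals)

report = grouping_sets_sum(SALES, COLUMNS, rollup_sets(COLUMNS))
```

One pass over the rows yields the per-product detail, the per-region subtotals and the grand total, which otherwise would take three separate GROUP BY queries.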

“The new BRIN index in PostgreSQL 9.5 is a powerful new feature which enables PostgreSQL to manage and index volumes of data that were impractical or impossible in the past. It allows scalability of data and performance beyond what was considered previously attainable with traditional relational databases and makes PostgreSQL a perfect solution for Big Data analytics,” said Boyan Botev, Lead Database Administrator, Premier, Inc.
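Why can a BRIN index be tiny and still useful? A BRIN (block range) index keeps only a small summary, such as the minimum and maximum value, for each range of table blocks. The following is a toy sketch of that idea in Python, assuming a column that grows with insertion order. The function names, block size and data are all hypothetical, and it makes no claim to mirror PostgreSQL internals.

```python
def build_block_index(values, block_size):
    """Store only (start, min, max) for each consecutive block of rows."""
    index = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        index.append((start, min(block), max(block)))
    return index

def search(values, index, block_size, target):
    """Scan only the blocks whose [min, max] range could hold target."""
    hits = []
    blocks_scanned = 0
    for start, lo, hi in index:
        if lo <= target <= hi:
            blocks_scanned += 1
            for offset, value in enumerate(values[start:start + block_size]):
                if value == target:
                    hits.append(start + offset)
    return hits, blocks_scanned

# Hypothetical "naturally ordered" column, e.g. log timestamps that only grow.
timestamps = list(range(0, 10_000, 2))  # 5,000 rows
BLOCK = 100
index = build_block_index(timestamps, BLOCK)
positions, scanned = search(timestamps, index, BLOCK, 4242)
```

The index holds 50 entries for 5,000 rows and the lookup reads a single block. Shuffle the column and nearly every block range overlaps the target, which is why the announcement limits its claim to “naturally ordered” tables.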

HOBBIT – Holistic Benchmarking of Big Linked Data

Saturday, December 26th, 2015

HOBBIT – Holistic Benchmarking of Big Linked Data

From the “about” page:

HOBBIT is driven by the needs of the European industry. Thus, the project objectives were derived from the needs of the European industry (represented by our industrial partners) in combination with the results of prior and ongoing efforts including BIG, BigDataEurope, LDBC Council and many more. The main objectives of HOBBIT are:

  1. Building a family of industry-relevant benchmarks,
  2. Implementing a generic evaluation platform for the Big Linked Data value chain,
  3. Providing periodic benchmarking results including diagnostics to further the improvement of BLD processing tools,
  4. (Co-)Organizing challenges and events to gather benchmarking results as well as industry-relevant KPIs and datasets,
  5. Supporting companies and academics during the creation of new challenges or the evaluation of tools.

As we found in Avoiding Big Data: More Business Intelligence Than You Would Think, 3/4 of businesses cannot extract value from data they already possess, making any investment in “big data” a sure loser for them.

Which makes me wonder about what “big data” the HOBBIT project intends to use for benchmarking “Big Linked Data?”

Then I saw on the homepage:

The HOBBIT partners such as TomTom, USU, AGT and others will provide more than 25 trillions of sensor data to be benchmarked within the HOBBIT project.

“…25 trillions of sensor data….?” sounds odd until you realize that TomTom is:

TomTom founded in 1991 is a world leader of products for in-car location and navigation products.

OK, so the “Big Linked Data” in question isn’t random “linked data,” but a specialized kind of “linked data.”

That’s less risky than building a human brain with no clear idea of where to start, but it addresses a narrow window on linked data.

The HOBBIT Kickoff meeting Luxembourg 18-19 January 2016 announcement still lacks a detailed agenda.

Why Big Data Fails to Detect Terrorists

Thursday, December 17th, 2015

Kirk Borne tweeted a link to his presentation, Big Data Science for Astronomy & Space and more specifically to slides 24 and 25 on novelty detection, surprise discovery.

Casting about for more resources to point out, I found Novelty Detection in Learning Systems by Stephen Marsland.

The abstract for Stephen’s paper:

Novelty detection is concerned with recognising inputs that differ in some way from those that are usually seen. It is a useful technique in cases where an important class of data is under-represented in the training set. This means that the performance of the network will be poor for those classes. In some circumstances, such as medical data and fault detection, it is often precisely the class that is under-represented in the data, the disease or potential fault, that the network should detect. In novelty detection systems the network is trained only on the negative examples where that class is not present, and then detects inputs that do not fit into the model that it has acquired, that is, members of the novel class.

This paper reviews the literature on novelty detection in neural networks and other machine learning techniques, as well as providing brief overviews of the related topics of statistical outlier detection and novelty detection in biological organisms.

The rest of the paper is very good and worth your time to read but we need not venture beyond the abstract to demonstrate why big data cannot, by definition, detect terrorists.

The root of the terrorist detection problem is summarized in the first sentence:

Novelty detection is concerned with recognising inputs that differ in some way from those that are usually seen.

So, what are the inputs of a terrorist that differ from the inputs usually seen?

That’s a simple enough question.

Previously committing a terrorist suicide attack is a definite tell but it isn’t a useful one.

Obviously the TSA doesn’t know because it has never caught a terrorist, despite its profile and wannabe psychics watching travelers.

You can churn big data 24×7 but if you don’t have a baseline of expected inputs, no input is going to stand out from the others.
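To see how much the baseline matters, consider about the simplest novelty detector there is. This sketch uses a mean and standard deviation threshold, which is my stand-in for the neural network methods Marsland reviews, and the readings are hypothetical. Notice that nothing can be flagged until fit_baseline has been given examples of what is “usually seen.”

```python
from statistics import mean, stdev

def fit_baseline(normal_inputs):
    """Summarize the 'usually seen' inputs as a mean and standard deviation."""
    return mean(normal_inputs), stdev(normal_inputs)

def is_novel(value, baseline, threshold=3.0):
    """Flag a value lying more than `threshold` deviations from the mean."""
    center, spread = baseline
    return abs(value - center) > threshold * spread

# Hypothetical readings that represent normal behaviour.
normal = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
baseline = fit_baseline(normal)

flag_typical = is_novel(10.1, baseline)  # close to the baseline
flag_unusual = is_novel(14.0, baseline)  # far outside the baseline
```

Take away the normal list and there is no center or spread to compare against, so every input looks the same as every other.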

The San Bernardino attackers were not detected, because the inputs didn’t vary enough for the couple to stand out.

Even if they had been selected for close and unconstitutional monitoring of their etraffic, bank accounts, social media, phone calls, etc., there is no evidence that current data techniques would have detected them.

Before you invest or continue paying for big data to detect terrorists, ask the simple questions:

What is your baseline from which variance will signal a terrorist?

How often has it worked?

Once you have a dead terrorist, you can start from the dead terrorist and search your big data, but that’s an entirely different starting point.

Given the weeks, months and years of finger pointing following a terrorist attack, speed really isn’t an issue.

20 Big Data Repositories You Should Check Out [Data Source Checking?]

Wednesday, December 16th, 2015

20 Big Data Repositories You Should Check Out by Vincent Granville.

Vincent lists some additional sources along with a link to Bernard Marr’s original selection.

One of the issues with such lists is that they are rarely maintained.

For example, Bernard listed:


Free, comprehensive social media data is hard to come by – after all their data is what generates profits for the big players (Facebook, Twitter etc) so they don’t want to give it away. However Topsy provides a searchable database of public tweets going back to 2006 as well as several tools to analyze the conversations.

But if you follow the link, you will find it points to:

Use Search on your iPhone, iPad, or iPod touch

With iOS 9, Search lets you look for content from the web, your contacts, apps, nearby places, and more. Powered by Siri, Search offers suggestions and updates results as you type.

That sucks doesn’t it? Expecting to be able to search public tweets back to 2006, along with analytical tools and what you get is a kiddie guide to search on a malware honeypot.

For a fuller explanation or at least the latest news on Topsy, check out: Apple shuts down Twitter analytics service Topsy by Sam Byford, dated December 16, 2015 (that’s today as I write this post).

So, strike Topsy off your list of big data sources.

Rather than bare lists, what big data needs is a curated list of big data sources that does more than list sources. Those sources need to be broken down to data sets to enable big data searchers to find all the relevant data sets and retrieve only those that remain accessible.

Like “link checking” but for big data resources. Data Source Checking?
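A first cut at “data source checking” need not be elaborate. The sketch below is an assumption about how such a checker might start, not an existing tool. It asks each listed URL for its status and flags entries that fail or that, like Topsy, quietly redirect to a different host. The check_source and fetch_status names, the example URLs and the canned responses are all hypothetical.

```python
from urllib.error import HTTPError
from urllib.parse import urlparse
from urllib.request import Request, urlopen

def fetch_status(url, timeout=10):
    """Return (HTTP status, final URL), or (None, None) if unreachable."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as resp:
            return resp.status, resp.geturl()
    except HTTPError as err:
        return err.code, url
    except OSError:  # DNS failures, refused connections, timeouts
        return None, None

def check_source(name, url, fetch=fetch_status):
    """Classify one listed data source by what its URL does today."""
    status, final_url = fetch(url)
    if status is None:
        verdict = "unreachable"
    elif status >= 400:
        verdict = "error"
    elif urlparse(final_url).netloc != urlparse(url).netloc:
        verdict = "redirected elsewhere"
    else:
        verdict = "ok"
    return {"name": name, "url": url, "verdict": verdict}

# Offline demonstration with a stub fetcher and made-up responses.
def fake_fetch(url):
    canned = {
        "http://tweets.example/search": (200, "http://support.example/help"),
        "http://census.example/data": (200, "http://census.example/data"),
        "http://gone.example/files": (404, "http://gone.example/files"),
    }
    return canned.get(url, (None, None))

SOURCES = [
    ("tweet archive", "http://tweets.example/search"),
    ("census data", "http://census.example/data"),
    ("old repository", "http://gone.example/files"),
    ("dead host", "http://nowhere.example/"),
]
report = [check_source(n, u, fetch=fake_fetch) for n, u in SOURCES]
verdicts = [entry["verdict"] for entry in report]
```

Checking the host after redirects is the part that would have caught Topsy: the page still answers, it just isn’t the data source anymore. Breaking sources down to individual data sets, as suggested above, would take a good deal more than this.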

That would be the “go to” place for big data sets and, as much as I hate advertising, a high-traffic site where advertising could make it cost effective, if not profitable.

If You Can’t See ROI, Don’t Invest

Wednesday, December 9th, 2015

Simple enough: If you can’t identify and quantify an ROI from an investment, don’t invest.

That applies to buying raw materials, physical machinery and plant, advertising and….big data processing.

Larisa Bedgood writes in Why 96% of Companies Fail With Marketing Data Insights:

At a time in our history when there is more data than ever before, the overwhelming majority of companies have yet to see the full potential of better data insights. PwC and Iron Mountain recently released a survey on how well companies are gaining value from information. The results showed a huge disconnect in the information that is available to companies and the actual insights being derived from it.

According to survey findings:

  • Only 4% of businesses can extract full value from the information they hold
  • 43% obtain very little benefit from their data
  • 23% derive no benefit whatsoever
  • 22% don’t apply any type of data analytics to the information they have

The potential of utilizing data can equate into very big wins and even greater revenue. Take a look at this statistic based on research by McKinsey:

Unlike most big data vendor literature, Larisa captures the #1 thing you should do before investing in big or small data management:

1. Establish an ROI


Establishing a strong return on investment (ROI) will help get new data projects off the ground. Begin by documenting any problems caused by incorrect data, including missed opportunities, and wasted marketing spend. This doesn’t have to be a time intensive project, but gather as much supporting documentation as possible to justify the investment. (emphasis added)

An added advantage of establishing an ROI prior to investment is you will have the basis for judging the success of a data management project. Did the additional capabilities of data analysis/management in fact lead to the expected ROI?

To put it another way, a big data project may be “successful” in the sense that it was completed on time, on budget and it performs exactly as specified, but if it isn’t meeting your ROI projections, the project overall is a failure.

From a profit making business perspective, there is no other measure of success or failure than meeting or failing to meet an expected ROI goal.
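The test is simple enough to write down. Here is a sketch using the usual definition of ROI as net gain divided by cost. The figures are hypothetical and the function names are mine.

```python
def roi(gain, cost):
    """Return on investment as a fraction: net gain divided by cost."""
    return (gain - cost) / cost

def met_projection(actual_gain, cost, projected_roi):
    """A project succeeds only if its actual ROI reaches the projection."""
    return roi(actual_gain, cost) >= projected_roi

# Hypothetical project: on time, on budget, performs exactly as specified.
COST = 200_000
PROJECTED_ROI = 0.25    # the business case promised a 25% return
ACTUAL_GAIN = 230_000   # what the project actually brought in

actual_roi = roi(ACTUAL_GAIN, COST)
success = met_projection(ACTUAL_GAIN, COST, PROJECTED_ROI)
```

A 15% return against a projected 25% means the project failed, however smoothly it was delivered.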

Everyone else may be using X or Y technology, but if there is no ROI for you, why bother?

You can see my take on the PwC and Iron Mountain report at: Avoiding Big Data: More Business Intelligence Than You Would Think.

Racist algorithms: how Big Data makes bias seem objective

Sunday, December 6th, 2015

Racist algorithms: how Big Data makes bias seem objective by Cory Doctorow.

From the post:

The Ford Foundation’s Michael Brennan discusses the many studies showing how algorithms can magnify bias — like the prevalence of police background check ads shown against searches for black names.

What’s worse is the way that machine learning magnifies these problems. If an employer only hires young applicants, a machine learning algorithm will learn to screen out all older applicants without anyone having to tell it to do so.

Worst of all is that the use of algorithms to accomplish this discrimination provides a veneer of objective respectability to racism, sexism and other forms of discrimination.

Cory has a good example of “hidden” bias in data analysis and has suggestions for possible improvement.

Although I applaud the notion of “algorithmic transparency,” the issue of bias in algorithms may be more subtle than you think.

Lauren J. Young reports in Computer Scientists Find Bias in Algorithms that the bias problem can be especially acute with self-improving algorithms. Algorithms, like users, have experiences and those experiences can lead to bias.

Lauren’s article is a good introduction to the concept of bias in algorithms, but for the full monty, see: Certifying and removing disparate impact by Michael Feldman, et al.


What does it mean for an algorithm to be biased? In U.S. law, unintentional bias is encoded via disparate impact, which occurs when a selection process has widely different outcomes for different groups, even as it appears to be neutral. This legal determination hinges on a definition of a protected class (ethnicity, gender, religious practice) and an explicit description of the process.

When the process is implemented using computers, determining disparate impact (and hence bias) is harder. It might not be possible to disclose the process. In addition, even if the process is open, it might be hard to elucidate in a legal setting how the algorithm makes its decisions. Instead of requiring access to the algorithm, we propose making inferences based on the data the algorithm uses.

We make four contributions to this problem. First, we link the legal notion of disparate impact to a measure of classification accuracy that while known, has received relatively little attention. Second, we propose a test for disparate impact based on analyzing the information leakage of the protected class from the other data attributes. Third, we describe methods by which data might be made unbiased. Finally, we present empirical evidence supporting the effectiveness of our test for disparate impact and our approach for both masking bias and preserving relevant information in the data. Interestingly, our approach resembles some actual selection practices that have recently received legal scrutiny.
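As a rough illustration of “widely different outcomes for different groups,” here is a sketch that works from outcome data alone. It compares selection rates between two groups and applies the four-fifths (80%) rule of thumb from U.S. employment guidelines. That is a deliberately simple stand-in and not the test the paper proposes. The outcome lists are hypothetical.

```python
def selection_rate(outcomes):
    """Fraction of a group that received the positive outcome (True)."""
    return sum(outcomes) / len(outcomes)

def disparate_impact_ratio(protected, reference):
    """Selection rate of the protected group relative to the reference group."""
    return selection_rate(protected) / selection_rate(reference)

# Hypothetical hiring outcomes, True meaning "selected".
protected_group = [True] * 3 + [False] * 7   # 30% selected
reference_group = [True] * 6 + [False] * 4   # 60% selected

ratio = disparate_impact_ratio(protected_group, reference_group)
flagged = ratio < 0.8  # four-fifths rule of thumb
```

Nothing here needs access to the selection algorithm itself, only to its results. That is the appeal of reasoning from the data when the process cannot be disclosed.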

Bear in mind that disparate impact is only one form of bias for a selected set of categories. And that bias can be introduced prior to formal data analysis.

Rather than say data or algorithms can be made unbiased, say rather that known biases can be reduced to acceptable levels, for some definition of acceptable.

Big Data Ethics?

Saturday, December 5th, 2015

Ethics are a popular topic in big data and related areas, as I was reminded by Sam Ransbotham’s The Ethics of Wielding an Analytical Hammer.

Here’s a big data ethics problem.

In order to select individuals based on some set of characteristics, habits, etc., we first must define the selection criteria.

Unfortunately, we don’t have a viable profile for terrorists, which explains in part why they can travel under their actual names, with their own identification and not be stopped by the authorities.

So, here’s the ethical question: Is it ethical for contractors and data scientists to offer data mining services to detect terrorists when there is no viable profile for a terrorist?

For all the hand wringing about ethics, basic honesty seems to be in short supply when talking about big data and the search for terrorists.


4 Tips to Learn More About ACS Data [$400 Billion Market, 3X Big Data]

Saturday, November 14th, 2015

4 Tips to Learn More About ACS Data by Ari Lamstein.

From the post:

One of the highlights of my recent east coast trip was meeting Ezra Haber Glenn, the author of the acs package in R. The acs package is my primary tool for accessing census data in R, and I was grateful to spend time with its author. My goal was to learn how to “take the next step” in working with the census bureau’s American Community Survey (ACS) dataset. I learned quite a bit during our meeting, and I hope to share what I learned over the coming weeks on my blog.

Today I’ll share 4 tips to help you get started in learning more. Before doing that, though, here is some interesting trivia: did you know that the ACS impacts how over $400 billion is allocated each year?

If the $400 billion got your attention, follow the tips in Ari’s post first, look for more posts in that series second, then visit the American Community Survey (ACS) website.

For comparison purposes, keep in mind that Forbes projects the Big Data Analytics market in 2015 to be a paltry $125 Billion.

The ACS data market is over 3 times larger ($400 Billion (ACS) versus $125 Billion (BigData)) for 2015.

Suddenly, ACS data and R look quite attractive.

NSF: BD Spokes (pronounced “hoax”) initiative

Thursday, November 5th, 2015

Big Announcements in Big Data by Tom Kalil, Jim Kurose, and Fen Zhao.

From the webpage:

As a part of the Administration’s Big Data Research and Development Initiative and to accelerate the emerging field of data science, NSF announced four awards this week, totaling more than $5 million, to establish four Big Data Regional Innovation Hubs (BD Hubs) across the nation.

Covering all 50 states and including commitments from more than 250 organizations—from universities and cities to foundations and Fortune 500 corporations—the BD Hubs constitute a “big data brain trust” that will conceive, plan, and support big data partnerships and activities to address regional and national challenges.

The “BD Hubs” are: Georgia Institute of Technology, University of North Carolina, Columbia University, University of Illinois at Urbana-Champaign, University of California, San Diego, University of California, Berkeley, and University of Washington.

Let’s see, out of $5 million, that is almost $715,000 for each “BD Hub.” Given administrative overhead, I don’t think you are going to see much:

…improve[ment] our ability to extract knowledge and insights from large and complex collections of data, but also help accelerate the pace of discovery in science and engineering, strengthen our national security, and fuel the growth and development of Smart Cities in America

Perhaps from the BD Spokes (pronounced “hoax”) initiative which covers particular subject areas for each region?

If you can stomach reading Big Data Regional Innovation Hubs: Establishing Spokes to Advance Big Data Applications (BD Spokes), you will discover that the funding for the “spokes” consists of 9 grants of up to $1,000,000.00 (over 3 years) and 10 planning grants of up to $100,000 (for one year).

Total price tag: $10 million.

BTW, the funding summary includes this helpful note:

All proposals to this solicitation must include a letter of collaboration from a BD Hub coordinating institution. Any proposals not including a letter of collaboration from a BD Hub coordinating institution will be returned without review. No exceptions will be made. (emphasis in original)

Would you care to wager on the odds that “a letter of collaboration from a BD Hub coordinating institution” isn’t going to be free?

For comparison purposes and to explain why I suggest you pronounce “Spokes” as “hoax,” consider that in 2014, Google spent $11 billion, Microsoft $5.3 billion, Amazon $4.9 billion and Facebook $1.8 billion, on data center construction.

If the top four big data players are spending billions (that’s with a “b”) on data center construction alone, how does a paltry $15 million (hoax plus the centers):

…improve our ability to extract knowledge and insights from large and complex collections of data, but also help accelerate the pace of discovery in science and engineering, strengthen our national security, and fuel the growth and development of Smart Cities in America
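If you want to check my arithmetic, here it is as a short script. It uses only the figures quoted in this post, treats “more than $5 million” as $5 million and assumes every spoke grant is awarded at its maximum.

```python
# All amounts in US dollars unless noted; figures are those quoted in the post.
HUB_AWARDS_TOTAL = 5_000_000
HUB_INSTITUTIONS = 7          # institutions listed as "BD Hubs"
per_hub = HUB_AWARDS_TOTAL / HUB_INSTITUTIONS

# Spokes: 9 grants of up to $1,000,000 plus 10 planning grants of up to $100,000.
spokes_total = 9 * 1_000_000 + 10 * 100_000

combined = HUB_AWARDS_TOTAL + spokes_total

# 2014 data center construction, in millions of dollars:
# Google, Microsoft, Amazon, Facebook.
DATA_CENTER_SPEND_MILLIONS = [11_000, 5_300, 4_900, 1_800]
industry_total_millions = sum(DATA_CENTER_SPEND_MILLIONS)

# How many times larger the industry spend is than hubs plus spokes.
ratio = industry_total_millions / (combined / 1_000_000)
```

Four companies spent $23 billion on data center construction in a single year, roughly 1,500 times the combined hubs and spokes funding.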


Reminds me of the EC [WPA] Brain Project. The report for year two is summarized:

As the second year of its Ramp-Up Phase draws to a close, the HBP is well-placed to continue its investigations into neuroscience, medicine, and computing. With the Framework Partnership Agreement in place, and preparations underway for the first Specific Grant Agreement, the coordination of the Project is shifting into a higher gear.

Two years into a ten year project and “coordination of the Project is shifting into a higher gear.” (no comment seems snide enough)

My counter-proposal would be that the government buy $10 million (or more) worth of time on Azure/AWS and hold an open lottery for $100,000 increments, with the only requirement that all code and data be under an Apache license and accessible to all on the respective cloud service.

That would buy far more progress on big data issues than the BD Spokes (pronounced “hoax”) initiative.

Avoiding Big Data: More Business Intelligence Than You Would Think

Monday, October 26th, 2015

Observing that boosters of “big data” are in a near panic about the slow adoption of “big data” technologies requires no reference.

A recent report from Iron Mountain and PwC may shed some light on the reasons for slow adoption of “big data”:


If you are in the 66% that extracts little or no value from your data, it makes no business sense to buy into “big data” when you can’t derive value from the data you already have.

Does anyone seriously disagree with that statement? Other than people marketing services whether the client benefits or not.

The numbers get even worse:

From the executive summary:

We have identified a large ‘misguided majority’ – three in four businesses (76%) that are either constrained by legacy, culture, regulatory data issues or simply lack any understanding of the potential value held by their information. They have little comprehension of the commercial benefits to be gained and have therefore not made the investment required to obtain the information advantage.

Now we are up to 3/4 of the market that could not benefit from “big data” tools if they dropped from the sky tomorrow.

To entice you to download Seizing the Information Advantage (the full report):

Typical attributes and behaviours of the mis-guided majority

  • Information and exploitation of value from information is not a priority for senior leadership
  • An information governance oversight body, if it exists, is dominated by IT
  • Limited appreciation of how to exploit their information or the business benefits of doing so
  • Progress is allowed to be held back by legacy issues, regulatory issues and resources
  • Where resources are deployed to exploit information, this is often IT led, and is not linked to the overall business strategy
  • Limited ability to identify, manage and merge large amounts of data sources
  • Analytical capability may exist in the business but is not focused on information value
  • Excessive use of Excel spreadsheets with limited capacity to extract insight

Hmmm, 8 attributes and behaviours of the mis-guided majority (76%) and how many of those issues are addressed by big data technology?

Err, one. Yes?

Limited ability to identify, manage and merge large amounts of data sources

The other seven (7) attributes or behaviours that impede business from deriving value from data have little or no connection to big data technology.

Those are management, resources and social issues that no big data technology can address.

Avoidance of adoption of big data technology reveals a surprising degree of “business intelligence” among those surveyed.

A number of big data technologies will be vital to business growth, but if and only if the management and human issues are addressed that will enable their effective use.

Put differently, investment in big data technologies without addressing related management and human issues is a waste of resources. (full stop)

The report wasn’t all that easy to track down on the Iron Mountain site so here are some useful links:

Executive Summary

Seizing the Information Advantage (“free” but you have to give up your contact information)

Infographic Summary

I first saw this at: 96% of Businesses Fail to Unlock Data’s Full Value by Bob Violino. Bob did not include a link to the report or sufficient detail to be useful.

46-billion-pixel Photo is Largest Astronomical Image of All Time

Monday, October 26th, 2015

46-billion-pixel Photo is Largest Astronomical Image of All Time by Suzanne Tracy.

From the post:

With 46 billion pixels, a 194 gigabyte file size and numerous stars, a massive new Milky Way photo has been assembled from astronomical observation data gathered over a five-year period.

Astronomers headed by Prof. Dr. Rolf Chini have been monitoring our Galaxy in a search for objects with variable brightness. The researchers explain that these phenomena may, for example, include stars in front of which a planet is passing, or may include multiple systems where stars orbit each other and where the systems may obscure each other at times. The researchers are analyzing how the brightness of stars changes over long stretches of time.

Now, using an online tool, any interested person can

  • view the complete ribbon of the Milky Way at a glance
  • zoom in and inspect specific areas
  • use an input window, which provides the position of the displayed image section, to search for specific objects. (i.e. if the user types in “Eta Carinae,” the tool moves to the respective star; entering the search term “M8” leads to the lagoon nebula.)

You can view the entire Milky Way photo at and read more on the search for variable objects at

Great project and a fun read for anyone interested in astronomy!

For big data types, confirmation that astronomy remains in the lead with regard to making big data and the power to process that big data freely available to all comers.

I first saw this in a tweet by Kirk Borne.

“Big data are not about data,” Djorgovski says. “It’s all about discovery.” [Not re-discovery]

Thursday, October 8th, 2015

I first saw this quote in a tweet by Kirk Borne. It is the concluding line from George Djorgovski looks for knowledge hidden in data by Rebecca Fairley Raney.

From the post:

When you sit down to talk with an astronomer, you might expect to learn about galaxies, gravity, quasars or spectroscopy. George Djorgovski could certainly talk about all those topics.

But Djorgovski, a professor of astronomy at the California Institute of Technology, would prefer to talk about data.

The AAAS Fellow has spent more than three decades watching scientists struggle to find needles in massive digital haystacks. Now, he is director of the Center for Data-Driven Discovery at Caltech, where staff scientists are developing advanced data analysis techniques and applying them to fields as disparate as plant biology, disaster response, genetics and neurobiology.

The descriptions of the projects at the center are filled with esoteric phrases like “hyper-dimensional data spaces” and “datascape geometry.”

Astronomy was “always advanced as a digital field,” Djorgovski says, and in recent decades, important discoveries in the field have been driven by novel uses of data.

Take the discovery of quasars.

In the early 20th century, astronomers using radio telescopes thought quasars were stars. But by merging data from different types of observations, they discovered that quasars were rare objects that are powered by gas that spirals into black holes in the center of galaxies.

Quasars were discovered not by a single observation, but by a fusion of data.

Djorgovski and his readers assume that future researchers won’t have to start from scratch when researching quasars. They can, but don’t have to, re-mine all the data that supported the original discovery of quasars or their association with black holes.

Can you say the same for discoveries you make in your data? Are those discoveries preserved for others or just tossed back into the sea of big data?

Contemporary searching is a form of catch-n-release. You start with your question and whether it takes a few minutes or an hour, you find something resembling an answer to your question.

The data is then tossed back to await the next searcher who has the same or similar question.

How are you capturing your search results to benefit the next searcher?

Are You Deep Mining Shallow Data?

Monday, September 21st, 2015

Do you remember this verse of Simple Simon?

Simple Simon went a-fishing,

For to catch a whale;

All the water he had got,

Was in his mother’s pail.


Shallow data?

To illustrate, fill in the following statement:

My mom makes the best _____.

Before completing that statement, you resolved the common noun, “mom,” differently than I did.

The string carries no clue as to the resolution of “mom” by any reader.

The string also gives no clues as to how it would be written in another language.

With a string, all you get is the string, or in other words:

All strings are shallow.

That applies to the strings we use to add depth to strings but we will reach that issue shortly.

One of the few things that RDF got right was:

…RDF puts the information in a formal way that a machine can understand. The purpose of RDF is to provide an encoding and interpretation mechanism so that resources can be described in a way that particular software can understand it; in other words, so that software can access and use information that it otherwise couldn’t use. (quote from Wikipedia on RDF)

In addition to the string, RDF posits an identifier in the form of a URI which you can follow to discover more information about that portion of string.
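A minimal sketch of that idea, in plain Python rather than any RDF library. The class name and the example.org URIs are my inventions for illustration; nothing here is RDF syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdentifiedString:
    """A string paired with an identifier for the subject it refers to."""
    text: str        # the shallow string itself
    identifier: str  # hypothetical URI naming which subject is meant


# Two readers complete "My mom makes the best _____." with different moms in mind.
mine = IdentifiedString("mom", "http://example.org/people/reader-a/mother")
yours = IdentifiedString("mom", "http://example.org/people/reader-b/mother")

# Compared as strings, the two are indistinguishable.
same_string = mine.text == yours.text

# Compared by identifier, they are different subjects.
same_subject = mine.identifier == yours.identifier

print(same_string, same_subject)
```

The strings match while the identifiers do not, which is the whole point: the depth lives in the identifier, not in the string.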

Unfortunately RDF was burdened by the need for all new identifiers to replace those already in place, an inability to easily distinguish identifier URIs from URIs that lead to subjects of conversation, and encoding requirements that reduced the population of potential RDF authors to a righteous remnant.

Despite its limitations and architectural flaws, RDF is evidence that strings are indeed shallow. Not to mention that if we could give strings depth, their usefulness would be greatly increased.

One method for imputing more depth to strings is natural language processing (NLP). Modern NLP techniques are based on statistical analysis of large data sets and are most accurate for very common cases. The statistical nature of NLP makes it problematic to apply those techniques to very small amounts of text or to text with unusual styles of usage.

The limits of statistical techniques aren’t a criticism of NLP but rather an observation that, depending on the level of accuracy desired and your data, such techniques may or may not be useful.

What is acceptable for imputing depth to strings in movie reviews is unlikely to be thought so when deciphering a manual for disassembling an atomic weapon. The question isn’t whether NLP can impute depth to strings but whether that imputation is sufficiently accurate for your use case.

Of course, RDF and NLP aren’t the only two means for imputing depth to strings.

We will take up another method for giving strings depth tomorrow.

Value of Big Data Depends on Identities in Big Data

Tuesday, September 15th, 2015

Intel Exec: Extracting Value From Big Data Remains Elusive by George Leopold.

From the post:

Intel Corp. is convinced it can sell a lot of server and storage silicon as big data takes off in the datacenter. Still, the chipmaker finds that major barriers to big data adoption remain, most especially what to do with all those zettabytes of data.

“The dirty little secret about big data is no one actually knows what to do with it,” Jason Waxman, general manager of Intel’s Cloud Platforms Group, asserted during a recent company datacenter event. Early adopters “think they know what to do with it, and they know they have to collect it because you have to have a big data strategy, of course. But when it comes to actually deriving the insight, it’s a little harder to go do.”

Put another way, industry analysts rate the difficulty of determining the value of big data as far outweighing considerations like technological complexity, integration, scaling and other infrastructure issues. Nearly two-thirds of respondents to a Gartner survey last year cited by Intel stressed they are still struggling to determine the value of big data.

“Increased investment has not led to an associated increase in organizations reporting deployed big data projects,” Gartner noted in its September 2014 big data survey. “Much of the work today revolves around strategy development and the creation of pilots and experimental projects.”


It may just be me, but “determining value,” “risk and governance,” and “integrating multiple data sources,” the top three barriers to use of big data, all depend on knowing the identities represented in big data.

The trivial data integration demos that share “customer-ID” fields don’t inspire a lot of confidence about data integration when “customer-ID” may be identified in as many ways as there are data sources. And that is a minor example.

It would be very hard to determine the value you can extract from data when you don’t know what the data represents, its accuracy (risk and governance), and what may be necessary to integrate it with other data sources.
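To make the “customer-ID” point concrete, here is a rough sketch in Python. The three sources, their field names, and the records are all hypothetical, and it assumes the identifier values themselves agree across sources, which real data rarely grants you.

```python
# Hypothetical records from three sources; each names the customer identifier differently.
crm = [{"customer-ID": "C-1001", "name": "Ada Example"}]
billing = [{"cust_no": "C-1001", "balance": 42.50}]
support = [{"CustomerNumber": "C-1001", "open_tickets": 2}]

# The captured identity: which field in each source represents "the customer".
# Nothing in the data itself says these three fields mean the same thing.
CUSTOMER_KEY = {
    "crm": "customer-ID",
    "billing": "cust_no",
    "support": "CustomerNumber",
}


def integrate(sources):
    """Merge records from all sources that refer to the same customer."""
    merged = {}
    for source_name, records in sources.items():
        key_field = CUSTOMER_KEY[source_name]
        for record in records:
            customer = merged.setdefault(record[key_field], {})
            # Copy every field except the source-specific identifier field.
            customer.update({k: v for k, v in record.items() if k != key_field})
    return merged


result = integrate({"crm": crm, "billing": billing, "support": support})
print(result)
```

The merge itself is a few lines. The hard and expensive part is the CUSTOMER_KEY table, that is, someone knowing and recording that those three fields identify the same subject.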

More processing power from Intel is always welcome but churning poorly understood big data faster isn’t going to create value. Quite the contrary, investment in more powerful hardware isn’t going to be favorably reflected on the bottom line.

Investment in capturing the diverse identities in big data will empower easier valuation of big data, evaluation of its risks and uncovering how to integrate diverse data sources.

Capturing diverse identities won’t be easy, cheap or quick. But not capturing them will leave the value of Big Data unknown, its risks uncertain and integration a crap shoot whenever it is attempted.

Your call.

Posts from 140 #DataScience Blogs

Sunday, September 13th, 2015

Kirk Borne posted a link to:, referring to it as:

Recent posts from 150+ #DataScience Blogs worldwide, curated by @dsguidebiz #BigData #Analytics

By count of the sources listed on, the number of sources is 140, as of September 13, 2015.

A wealth of posts and videos!

Everyone who takes advantage of this listing, however, will have to go through the same lists of posts by category.

That repetition, even with searching, seems like a giant time sink to me.