Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

November 12, 2014

Open Sourcing Cubert: A High Performance Computation Engine for Complex Big Data Analytics

Filed under: Analytics,BigData,Cubert — Patrick Durusau @ 4:01 pm

Open Sourcing Cubert: A High Performance Computation Engine for Complex Big Data Analytics by Maneesh Varshney and Srinivas Vemuri.

From the post:

Cubert was built with the primary focus on better algorithms that can maximize map-side aggregations, minimize intermediate data, partition work in balanced chunks based on cost-functions, and ensure that the operators scan data that is resident in memory. Cubert has introduced a new paradigm of computation that:

  • organizes data in a format that is ideally suited for scalable execution of subsequent query processing operators
  • provides a suite of specialized operators (such as MeshJoin, Cube, Pivot) using algorithms that exploit the organization to provide significantly improved CPU and resource utilization

Cubert was shown to outperform other engines by a factor of 5-60X even when the data set sizes extend into 10s of TB and cannot fit into main memory.

The Cubert operators and algorithms were developed to specifically address real-life big data analytics needs:

  • Complex Joins and aggregations frequently arise in the context of analytics on various user level metrics which are gathered on a daily basis from a user facing website. Cubert provides the unique MeshJoin algorithm that can process data sets running into terabytes over large time windows.
  • Reporting workflows are distinct from ad-hoc queries by virtue of the fact that the computation pattern is regular and repetitive, allowing for efficiency gains from partial result caching and incremental processing, a feature exploited by the Cubert runtime for significantly improved efficiency and resource footprint.
  • Cubert provides the new power-horse CUBE operator that can efficiently (CPU and memory) compute additive, non-additive (e.g. Count Distinct) and exact percentile rank (e.g. Median) statistics; can roll up inner dimensions on-the-fly and compute multiple measures within a single job.
  • Cubert provides novel algorithms for graph traversal and aggregations for large-scale graph analytics.

Finally, Cubert Script is a developer-friendly language that takes out the hints, guesswork and surprises when running the script. The script provides the developers complete control over the execution plan (without resorting to low-level programming!), and is extremely extensible by adding new functions, aggregators and even operators.

and the source/documentation:

Cubert source code and documentation

The source code is open sourced under Apache v2 License and is available at https://github.com/linkedin/Cubert

The documentation, user guide and javadoc are available at http://linkedin.github.io/Cubert

The abstractions for data organization and calculations were present in the following paper:
“Execution Primitives for Scalable Joins and Aggregations in Map Reduce”, Srinivas Vemuri, Maneesh Varshney, Krishna Puttaswamy, Rui Liu. 40th International Conference on Very Large Data Bases (VLDB), Hangzhou, China, Sept 2014. (PDF)

Another advance in the processing of big data!

Now if we could just see a similar advance in the identification of entities/subjects/concepts/relationships in big data.

Nothing wrong with faster processing but a PB of poorly understood data is a PB of poorly understood data no matter how fast you process it.
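For readers new to the map-side aggregation idea the post leads with, here is a minimal combiner-style sketch in plain Python (the data and keys are made up; Cubert itself runs on Hadoop and is not written in Python):

```python
from collections import defaultdict

# Toy illustration of map-side (combiner-style) aggregation: partial sums
# are computed inside each "mapper" before anything is shuffled, so the
# intermediate data shrinks from one record per input row to one record
# per (partition, key).

def map_side_aggregate(partition):
    """Aggregate clicks per member inside a single partition (hypothetical schema)."""
    partial = defaultdict(int)
    for member_id, clicks in partition:
        partial[member_id] += clicks
    return list(partial.items())          # small intermediate output

def reduce_side_merge(partials):
    """Merge the per-partition partial aggregates into final totals."""
    totals = defaultdict(int)
    for partial in partials:
        for member_id, clicks in partial:
            totals[member_id] += clicks
    return dict(totals)

partitions = [
    [("u1", 3), ("u2", 1), ("u1", 2)],
    [("u2", 4), ("u3", 5)],
]
print(reduce_side_merge(map_side_aggregate(p) for p in partitions))
# -> {'u1': 5, 'u2': 5, 'u3': 5}
```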

November 8, 2014

Big Data Driving Data Integration at the NIH

Filed under: BigData,Data Integration,Funding,NIH — Patrick Durusau @ 5:17 pm

Big Data Driving Data Integration at the NIH by David Linthicum.

From the post:

The National Institutes of Health announced new grants to develop big data technologies and strategies.

“The NIH multi-institute awards constitute an initial investment of nearly $32 million in fiscal year 2014 by NIH’s Big Data to Knowledge (BD2K) initiative and will support development of new software, tools and training to improve access to these data and the ability to make new discoveries using them, NIH said in its announcement of the funding.”

The grants will address issues around Big Data adoption, including:

  • Locating data and the appropriate software tools to access and analyze the information.
  • Lack of data standards, or low adoption of standards across the research community.
  • Insufficient polices to facilitate data sharing while protecting privacy.
  • Unwillingness to collaborate that limits the data’s usefulness in the research community.

Among the tasks funded is the creation of a “Perturbation Data Coordination and Integration Center.” The center will provide support for data science research that focuses on interpreting and integrating data from different data types and databases. In other words, it will make sure the data moves to where it should move, in order to provide access to information that’s needed by the research scientist. Fundamentally, it’s data integration practices and technologies.

This is very interesting from the standpoint that the movement into big data systems often drives the reevaluation, or even new interest in data integration. As the data becomes strategically important, the need to provide core integration services becomes even more important.

The NIH announcement, NIH invests almost $32 million to increase utility of biomedical research data, reads in part:

Wide-ranging National Institutes of Health grants announced today will develop new strategies to analyze and leverage the explosion of increasingly complex biomedical data sets, often referred to as Big Data. These NIH multi-institute awards constitute an initial investment of nearly $32 million in fiscal year 2014 by NIH’s Big Data to Knowledge (BD2K) initiative, which is projected to have a total investment of nearly $656 million through 2020, pending available funds.

With the advent of transformative technologies for biomedical research, such as DNA sequencing and imaging, biomedical data generation is exceeding researchers’ ability to capitalize on the data. The BD2K awards will support the development of new approaches, software, tools, and training programs to improve access to these data and the ability to make new discoveries using them. Investigators hope to explore novel analytics to mine large amounts of data, while protecting privacy, for eventual application to improving human health. Examples include an improved ability to predict who is at increased risk for breast cancer, heart attack and other diseases and conditions, and better ways to treat and prevent them.

And of particular interest:

BD2K Data Discovery Index Coordination Consortium (DDICC). This program will create a consortium to begin a community-based development of a biomedical data discovery index that will enable discovery, access and citation of biomedical research data sets.

Big data driving data integration. Who knew? 😉

The more big data the greater the pressure for robust data integration.

Sounds like they are playing the topic maps tune.

Terms of Service

Filed under: BigData,Cybersecurity,Privacy,Security,WWW — Patrick Durusau @ 11:53 am

Terms of Service: understanding our role in the world of Big Data by Michael Keller and Josh Neufeld.

Caution: Readers of Terms of Service will discover they are products and only incidentally consumers of digital services. Surprise, dismay, depression, and despair are common symptoms post-reading. You have been warned.

Al Jazeera uses a comic book format to effectively communicate privacy issues raised by Big Data, the Internet of Things, the Internet, and “free” services.

The story begins with privacy concerns over scanning of Gmail content (remember that?) and takes the reader up to present and likely future privacy concerns.

I quibble with the example of someone being denied a loan because they failed to exercise regularly. The authors innocently assume that banks make loans with the intention of being repaid. That’s the story in high school economics but a long way from how lending works in practice.

The recent mortgage crisis in the United States was caused by banks inducing borrowers to overstate their incomes, financing a home loan and its down payment, etc. Banks don’t keep such loans but package them as securities which they then foist off onto others. Construction companies make money building the houses, local governments gain tax revenue, etc. Basically a form of churn.

But the authors are right that in some theoretical economy loans could be denied because of failure to exercise. Except that would exclude such a large market segment in the United States. Did you know they are about to change the words “…the land of the free…” to “…the land of the obese…?”

That is a minor quibble about what is overall a great piece of work. In only forty-six (46) pages it brings privacy issues into a sharper focus than many longer and more turgid works.

Do you know of any comparable exposition on privacy and Big Data/Internet?

Suggest it for conference swag/holiday present. Write to Terms-of-Service.

I first saw this in a tweet by Gregory Piatetsky.

November 4, 2014

What Is Big Data?

Filed under: BigData — Patrick Durusau @ 8:10 pm

What Is Big Data? by Jenna Dutcher.

From the post:

“Big data.” It seems like the phrase is everywhere. The term was added to the Oxford English Dictionary in 2013 and appeared in Merriam-Webster’s Collegiate Dictionary in 2014. Now, Gartner’s just-released 2014 Hype Cycle shows “big data” passing the “peak of inflated expectations” and moving on its way down into the “trough of disillusionment.” Big data is all the rage. But what does it actually mean?

A commonly repeated definition cites the three Vs: volume, velocity, and variety. But others argue that it’s not the size of data that counts, but the tools being used or the insights that can be drawn from a dataset.

Jenna collected forty (40) different responses to the question: “What is Big Data?”

If you don’t see one you agree with at Jenna’s post, feel free to craft your own as a comment to this post.

If a large number of people mean almost but not quite the same thing by “big data,” does that give you a clue as to a persistent problem in IT? And in relationships between IT and other departments?

I first saw this in a tweet by Lutz Maicher.

Tessera

Filed under: BigData,Hadoop,R,RHIPE,Tessera — Patrick Durusau @ 7:20 pm

Tessera

From the webpage:

The Tessera computational environment is powered by a statistical approach, Divide and Recombine. At the front end, the analyst programs in R. At the back end is a distributed parallel computational environment such as Hadoop. In between are three Tessera packages: datadr, Trelliscope, and RHIPE. These packages enable the data scientist to communicate with the back end with simple R commands.

Divide and Recombine (D&R)

Tessera is powered by Divide and Recombine. In D&R, we seek meaningful ways to divide the data into subsets, apply statistical methods to each subset independently, and recombine the results of those computations in a statistically valid way. This enables us to use the existing vast library of methods available in R – no need to write scalable versions.

DATADR

The datadr R package provides a simple interface to D&R operations. The interface is back end agnostic, so that as new distributed computing technology comes along, datadr will be able to harness it. Datadr currently supports in-memory, local disk / multicore, and Hadoop back ends, with experimental support for Apache Spark. Regardless of the back end, coding is done entirely in R and data is represented as R objects.

TRELLISCOPE

Trelliscope is a D&R visualization tool based on Trellis Display that enables scalable, flexible, detailed visualization of data. Trellis Display has repeatedly proven itself as an effective approach to visualizing complex data. Trelliscope, backed by datadr, scales Trellis Display, allowing the analyst to break potentially very large data sets into many subsets, apply a visualization method to each subset, and then interactively sample, sort, and filter the panels of the display on various quantities of interest.

RHIPE

RHIPE is the R and Hadoop Integrated Programming Environment. RHIPE allows an analyst to run Hadoop MapReduce jobs wholly from within R. RHIPE is used by datadr when the back end for datadr is Hadoop. You can also perform D&R operations directly through RHIPE, although in this case you are programming at a lower level.

Quite an impressive package for R and “big data.”
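To make the divide/apply/recombine cycle concrete, here is a minimal sketch in Python (Tessera’s actual front end is R and its back ends are Hadoop and friends; the data and the statistic here are made up):

```python
import statistics

# Divide: split the data into subsets.
def divide(records, n_subsets):
    return [records[i::n_subsets] for i in range(n_subsets)]

# Apply: run a statistical method on each subset independently
# (this is the step a back end like Hadoop would parallelize).
def apply_method(subset):
    return {"n": len(subset), "mean": statistics.fmean(subset)}

# Recombine: combine the subset results in a statistically valid way
# (here a weighted mean; real D&R recombinations can be more elaborate).
def recombine(results):
    total = sum(r["n"] for r in results)
    return sum(r["mean"] * r["n"] for r in results) / total

data = [4.2, 5.1, 3.9, 6.0, 5.5, 4.8, 5.2, 4.4]
subset_results = [apply_method(s) for s in divide(data, 3)]
print(recombine(subset_results))   # equals statistics.fmean(data)
```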

I first saw this in a tweet by Christophe Lalanne.

November 3, 2014

Using Apache Spark and Neo4j for Big Data Graph Analytics

Filed under: BigData,Graphs,Hadoop,HDFS,Spark — Patrick Durusau @ 8:29 pm

Using Apache Spark and Neo4j for Big Data Graph Analytics by Kenny Bastani.

From the post:


Fast and scalable analysis of big data has become a critical competitive advantage for companies. There are open source tools like Apache Hadoop and Apache Spark that are providing opportunities for companies to solve these big data problems in a scalable way. Platforms like these have become the foundation of the big data analysis movement.

Still, where does all that data come from? Where does it go when the analysis is done?

Graph databases

I’ve been working with graph database technologies for the last few years and I have yet to become jaded by its powerful ability to combine both the transformation of data with analysis. Graph databases like Neo4j are solving problems that relational databases cannot.

Graph processing at scale from a graph database like Neo4j is a tremendously valuable power.

But if you wanted to run PageRank on a dump of Wikipedia articles in less than 2 hours on a laptop, you’d be hard pressed to be successful. More so, what if you wanted the power of a high-performance transactional database that seamlessly handled graph analysis at this scale?

Mazerunner for Neo4j

Mazerunner is a Neo4j unmanaged extension and distributed graph processing platform that extends Neo4j to do big data graph processing jobs while persisting the results back to Neo4j.

Mazerunner uses a message broker to distribute graph processing jobs to Apache Spark’s GraphX module. When an agent job is dispatched, a subgraph is exported from Neo4j and written to Apache Hadoop HDFS.

Mazerunner is an alpha release with page rank as its only algorithm.

It has a great deal of potential so worth your time to investigate further.
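Since page rank is currently Mazerunner’s only algorithm, a bare-bones power-iteration sketch may help readers new to it (plain Python over a toy edge list, not the GraphX job Mazerunner actually dispatches):

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Toy power-iteration PageRank over an edge list of (src, dst) pairs."""
    nodes = {n for edge in edges for n in edge}
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src, targets in out_links.items():
            share = rank[src] / (len(targets) or len(nodes))
            receivers = targets or nodes      # dangling nodes spread evenly
            for dst in receivers:
                new_rank[dst] += damping * share
        rank = new_rank
    return rank

edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
print(sorted(pagerank(edges).items(), key=lambda kv: -kv[1]))
```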

October 30, 2014

Pinned Tabs: myNoSQL

Filed under: BigData,NoSQL — Patrick Durusau @ 1:53 pm

Alex Popescu & Ana-Maria Bacalu have added a new feature at myNoSQL called “Pinned Tabs.”

The feature started on 28 Oct. 2014 and consists of very short (2-3 sentence) descriptions with links on NoSQL, BigData, and related topics.

Today’s “pinned tabs” included:

03: If you don’t test for the possible failures, you might be in for a surprise. Stripe has tried a more organized chaos monkey attack and discovered a scenario in which their Redis cluster is losing all the data. They’ll move to Amazon RDS PostgreSQL. From an in-memory smart key-value engine to a relational database.

Game Day Exercises at Stripe: Learning from kill -9

04: How a distributed database should really behave in front of massive failures. Netflix recounts their recent experience of having 218 Cassandra nodes rebooted without losing availability. At all.

How Netflix Handled the Reboot of 218 Cassandra Nodes

Curated news saves time and attention span!

Enjoy!

October 29, 2014

AsterixDB: Better than Hadoop? Interview with Mike Carey

Filed under: AsterixDB,BigData,Hadoop — Patrick Durusau @ 3:24 pm

AsterixDB: Better than Hadoop? Interview with Mike Carey by Roberto V. Zicari.

The first two questions should be enough incentive to read the full interview and get your blood pumping in the middle of the week:

Q1. Why build a new Big Data Management System?

Mike Carey: When we started this project in 2009, we were looking at a “split universe” – there were your traditional parallel data warehouses, based on expensive proprietary relational DBMSs, and then there was the emerging Hadoop platform, which was free but low-function in comparison and wasn’t based on the many lessons known to the database community about how to build platforms to efficiently query large volumes of data. We wanted to bridge those worlds, and handle “modern data” while we were at it, by taking into account the key lessons from both sides.

To distinguish AsterixDB from current Big Data analytics platforms – which query but don’t store or manage Big Data – we like to classify AsterixDB as being a “Big Data Management System” (BDMS, with an emphasis on the “M”).
We felt that the Big Data world, once the initial Hadoop furor started to fade a little, would benefit from having a platform that could offer things like:

  • a flexible data model that could handle data scenarios ranging from “schema first” to “schema never”;
  • a full query language with at least the expressive power of SQL;
  • support for data storage, data management, and automatic indexing;
  • support for a wide range of query sizes, with query processing cost being proportional to the given query;
  • support for continuous data ingestion, hence the accumulation of Big Data;
  • the ability to scale up gracefully to manage and query very large volumes of data using commodity clusters; and,
  • built-in support for today’s common “Big Data data types”, such as textual, temporal, and simple spatial data.

So that’s what we set out to do.

Q2. What was wrong with the current Open Source Big Data Stack?

Mike Carey: First, we should mention that some reviewers back in 2009 thought we were crazy or stupid (or both) to not just be jumping on the Hadoop bandwagon – but we felt it was important, as academic researchers, to look beyond Hadoop and be asking the question “okay, but after Hadoop, then what?”

We recognized that MapReduce was great for enabling developers to write massively parallel jobs against large volumes of data without having to “think parallel” – just focusing on one piece of data (map) or one key-sharing group of data (reduce) at a time. As a platform for “parallel programming for dummies”, it was (and still is) very enabling! It also made sense, for expedience, that people were starting to offer declarative languages like Pig and Hive, compiling them down into Hadoop MapReduce jobs to improve programmer productivity – raising the level much like what the database community did in moving to the relational model and query languages like SQL in the 70’s and 80’s.

One thing that we felt was wrong for sure in 2009 was that higher-level languages were being compiled into an assembly language with just two instructions, map and reduce. We knew from Ted Codd and relational history that more instructions – like the relational algebra’s operators – were important – and recognized that the data sorting that Hadoop always does between map and reduce wasn’t always needed.

Trying to simulate everything with just map and reduce on Hadoop made “get something better working fast” sense, but not longer-term technical sense. As for HDFS, what seemed “wrong” about it under Pig and Hive was its being based on giant byte stream files and not on “data objects”, which basically meant file scans for all queries and lack of indexing. We decided to ask “okay, suppose we’d known that Big Data analysts were going to mostly want higher-level languages – what would a Big Data platform look like if it were built ‘on purpose’ for such use, instead of having incrementally evolved from HDFS and Hadoop?”

Again, our idea was to try and bring together the best ideas from both the database world and the distributed systems world. (I guess you could say that we wanted to build a Big Data Reese’s Cup… 🙂)

I knew words would fail me if I tried to describe the AsterixDB logo so I simply reproduce the logo:

[AsterixDB logo]

Read the interview in full and then grab a copy of AsterixDB.

The latest beta release is 0.8.6. The software appears under the Apache License 2.0.

October 27, 2014

Think Big Challenge 2014 [Census Data – Anonymized]

Filed under: BigData,Census Data — Patrick Durusau @ 2:02 pm

Think Big Challenge 2014 [Census Data – Anonymized]

The Think Big Challenge 2014 closed October 19, 2014, but the data sets for that challenge remain available.

From the data download page:

This subdirectory contains a small extract of the data set (1,000 records). There are two data sets provided:

A complete set of records from after the year 1820 is available for download from Amazon S3 at https://s3.amazonaws.com/think.big.challenge/AncestryPost1820Data.gz as a 127MB gzip file.

A sample of records pre-1820 for use in the data science “Learning of Common Ancestors” challenge. This can be downloaded at https://s3.amazonaws.com/think.big.challenge/AncestryPre1820Sample.gz as a 4MB gzip file.

The records have been pre-processed:

The contest data set includes both publicly availabl[e] records (e.g., census data) and user-contributed submissions on Ancestry.com. To preserve user privacy, all surnames present in the data have been obscured with a hash function. The hash is constructed such that all occurrences of the same string will result in the same hash code.

Reader exercise: You can find multiple ancestors of yours in these records with different surnames and compare those against the hash function results. How many will you need to reverse the hash function and recover all the surnames? Use other ancestors of yours to check your results.
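The point of the exercise is that an unsalted, deterministic hash can be attacked with a dictionary of candidate surnames. A minimal sketch of that idea (the hash used here is a stand-in; the contest does not disclose which hash function was actually applied):

```python
import hashlib

def hash_surname(surname):
    # Stand-in for the contest's undisclosed hash: same input -> same output.
    return hashlib.sha256(surname.lower().encode("utf-8")).hexdigest()

# Candidate surnames you already know: your own ancestors, census name lists, ...
candidates = ["Smith", "Jones", "Durusau", "Garcia", "Nguyen"]
lookup = {hash_surname(name): name for name in candidates}

# "Reversing" the hash then reduces to a dictionary lookup.
observed = hash_surname("Durusau")               # a hashed value seen in the data set
print(lookup.get(observed, "<not recovered>"))   # -> Durusau
```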

Take a look at the original contest tasks for inspiration. What other online records would you want to merge with these? Thinking of local newspapers? What about law reporters?

Enjoy!

I first saw this mentioned on Danny Bickson’s blog as: Interesting dataset from Ancestry.com.


Update: I meant to mention Risks of Not Understanding a One-Way Function by Bruce Schneier, to get you started on the deanonymization task. Apologies for the omission.

If you are interested in cryptography issues, following Bruce Schneier’s blog should be on your regular reading list.

October 23, 2014

Avoiding “Hive” Confusion

Filed under: BigData,Bioinformatics,Genomics,Hive — Patrick Durusau @ 6:46 pm

Depending on your community, when you hear “Hive,” you think “Apache Hive:”

The Apache Hive ™ data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

But, there is another “Hive,” which handles large datasets:

High-performance Integrated Virtual Environment (HIVE) is a specialized platform being developed/implemented by Dr. Simonyan’s group at FDA and Dr. Mazumder’s group at GWU where the storage library and computational powerhouse are linked seamlessly. This environment provides web access for authorized users to deposit, retrieve, annotate and compute on HTS data and analyze the outcomes using web-interface visual environments appropriately built in collaboration with research scientists and regulatory personnel.

I ran across this potential source of confusion earlier today and haven’t run it completely to ground but wanted to share some of what I have found so far.

Inside the HIVE, the FDA’s Multi-Omics Compute Architecture by Aaron Krol.

From the post:

“HIVE is not just a conventional virtual cloud environment,” says Simonyan. “It’s a different system that virtualizes the services.” Most cloud systems store data on multiple servers or compute units until users want to run a specific application. At that point, the relevant data is moved to a server that acts as a node for that computation. By contrast, HIVE recognizes which storage nodes contain data selected for analysis, then transfers executable code to those nodes, a relatively small task that allows computation to be performed wherever the data is stored. “We make the computations on exactly the machines where the data is,” says Simonyan. “So we’re not moving the data to the computational unit, we are moving computation to the data.”

When working with very large packets of data, cloud computing environments can sometimes spend more time on data transfer than on running code, making this “virtualized services” model much more efficient. To function, however, it relies on granular and readily-accessed metadata, so that searching for and collecting together relevant data doesn’t consume large quantities of compute time.

HIVE’s solution is the honeycomb data model, which stores raw NGS data and metadata together on the same network. The metadata — information like the sample, experiment, and run conditions that produced a set of NGS reads — is stored in its own tables that can be extended with as many values as users need to record. “The honeycomb data model allows you to put the entire database schema, regardless of how complex it is, into a single table,” says Simonyan. The metadata can then be searched through an object-oriented API that treats all data, regardless of type, the same way when executing search queries. The aim of the honeycomb model is to make it easy for users to add new data types and metadata fields, without compromising search and retrieval.

Popular consumption piece so next you may want to visit the HIVE site proper.

From the webpage:

HIVE is a cloud-based environment optimized for the storage and analysis of extra-large data, like Next Generation Sequencing data, Mass Spectroscopy files, Confocal Microscopy Images and others.

HIVE uses a variety of advanced scientific and computational visualization graphics; to get the MOST from your HIVE experience you must use a supported browser. These include Internet Explorer 8.0 or higher (Internet Explorer 9.0 is recommended), Google Chrome, Mozilla Firefox and Safari.

A few exemplary analytical outputs are displayed below for your enjoyment. But before you can take advantage of all that HIVE has to offer and create these objects for yourself, you’ll need to register.

With A framework for organizing cancer-related variations from existing databases, publications and NGS data using a High-performance Integrated Virtual Environment (HIVE) by Tsung-Jung Wu, et al., you are starting to approach the computational issues of interest for data integration.

From the article:

The forementioned cooperation is difficult because genomics data are large, varied, heterogeneous and widely distributed. Extracting and converting these data into relevant information and comparing results across studies have become an impediment for personalized genomics (11). Additionally, because of the various computational bottlenecks associated with the size and complexity of NGS data, there is an urgent need in the industry for methods to store, analyze, compute and curate genomics data. There is also a need to integrate analysis results from large projects and individual publications with small-scale studies, so that one can compare and contrast results from various studies to evaluate claims about biomarkers.

See also: High-performance Integrated Virtual Environment (Wikipedia) for more leads to the literature.

Heterogeneous data is still at large and people are building solutions. Rather than either/or, what do you think topic maps could bring as a value-add to this project?
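One toy way to picture the honeycomb idea of putting an entire, extensible schema into a single table is an entity–attribute–value layout. The sketch below uses SQLite purely for illustration; it is not HIVE’s actual storage engine or schema, and the fields are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE metadata (object_id TEXT, field TEXT, value TEXT)")

# Arbitrary metadata fields per object; adding a new field needs no schema change.
rows = [
    ("run42", "sample",     "patient_007"),
    ("run42", "instrument", "sequencer_A"),
    ("run42", "read_len",   "100"),
    ("run43", "sample",     "patient_012"),
    ("run43", "tissue",     "thyroid"),     # a field run42 never declared
]
con.executemany("INSERT INTO metadata VALUES (?, ?, ?)", rows)

# One generic query pattern serves every "schema".
for object_id, field, value in con.execute(
        "SELECT * FROM metadata WHERE field = ? AND value = ?",
        ("sample", "patient_007")):
    print(object_id, field, value)          # -> run42 sample patient_007
```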

I first saw this in a tweet by ChemConnector.

October 21, 2014

Big Data: 20 Free Big Data Sources Everyone Should Know

Filed under: BigData — Patrick Durusau @ 10:07 am

Big Data: 20 Free Big Data Sources Everyone Should Know by Bernard Marr.

From the post:

I always make the point that data is everywhere – and that a lot of it is free. Companies don’t necessarily have to build their own massive data repositories before starting with big data analytics. The moves by companies and governments to put large amounts of information into the public domain have made large volumes of data accessible to everyone.

Any company, from big blue chip corporations to the tiniest start-up can now leverage more data than ever before. Many of my clients ask me for the top data sources they could use in their big data endeavour and here’s my rundown of some of the best free big data sources available today.

I didn’t see anything startling but it is a good top 20 list for a starting point. Would make a great start on a one to two page big data cheat sheet. Will have to give some thought to that idea.

October 13, 2014

The Dirty Little Secret of Cancer Research

Filed under: BigData,Bioinformatics,Genomics,Medical Informatics — Patrick Durusau @ 8:06 pm

The Dirty Little Secret of Cancer Research by Jill Neimark.

From the post:


Across different fields of cancer research, up to a third of all cell lines have been identified as imposters. Yet this fact is widely ignored, and the lines continue to be used under their false identities. As recently as 2013, one of Ain’s contaminated lines was used in a paper on thyroid cancer published in the journal Oncogene.

“There are about 10,000 citations every year on false lines—new publications that refer to or rely on papers based on imposter (human cancer) cell lines,” says geneticist Christopher Korch, former director of the University of Colorado’s DNA Sequencing Analysis & Core Facility. “It’s like a huge pyramid of toothpicks precariously and deceptively held together.”

For all the worry about “big data,” where is the concern over “big bad data?”

Or is “big data” too big for correctness of the data to matter?

Once you discover that a paper is based on “imposter (human cancer) cell lines,” how do you pass that information along to anyone who attempts to cite the article?

In other words, where do you write down that data about the paper, where the paper is the subject in question?

And how do you propagate that data across a universe of citations?

The post ends on a high note of current improvements but it is far from settled how to prevent reliance on compromised research.

I first saw this in a tweet by Dan Graur.

October 11, 2014

Spark Breaks Previous Large-Scale Sort Record

Filed under: BigData,Hadoop,Spark — Patrick Durusau @ 10:28 am

Spark Breaks Previous Large-Scale Sort Record by Reynold Xin.

From the post:

Apache Spark has seen phenomenal adoption, being widely slated as the successor to Hadoop MapReduce, and being deployed in clusters from a handful to thousands of nodes. While it was clear to everybody that Spark is more efficient than MapReduce for data that fits in memory, we heard that some organizations were having trouble pushing it to large scale datasets that could not fit in memory. Therefore, since the inception of Databricks, we have devoted much effort, together with the Spark community, to improve the stability, scalability, and performance of Spark. Spark works well for gigabytes or terabytes of data, and it should also work well for petabytes.

To evaluate these improvements, we decided to participate in the Sort Benchmark. With help from Amazon Web Services, we participated in the Daytona Gray category, an industry benchmark on how fast a system can sort 100 TB of data (1 trillion records). The previous world record was 72 minutes, set by Yahoo using a Hadoop MapReduce cluster of 2100 nodes. Using Spark on 206 EC2 nodes, we completed the benchmark in 23 minutes. This means that Spark sorted the same data 3X faster using 10X fewer machines. All the sorting took place on disk (HDFS), without using Spark’s in-memory cache.

Additionally, while no official petabyte (PB) sort competition exists, we pushed Spark further to also sort 1 PB of data (10 trillion records) on 190 machines in under 4 hours. This PB time beats previously reported results based on Hadoop MapReduce (16 hours on 3800 machines). To the best of our knowledge, this is the first petabyte-scale sort ever done in a public cloud.

Bottom line: Sorted 100 TB of data in 23 minutes, beat old record of 72 minutes, on fewer machines.
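Put differently, “3X faster using 10X fewer machines” is roughly a 30X improvement in per-node throughput. A quick back-of-the-envelope check using the figures from the post:

```python
# Back-of-the-envelope per-node throughput, using the figures from the post.
tb_sorted = 100.0

hadoop_per_node = tb_sorted / 72 / 2100    # TB per minute per node (old record)
spark_per_node  = tb_sorted / 23 / 206     # TB per minute per node (Spark run)

print(round(hadoop_per_node, 5), round(spark_per_node, 5))
print(round(spark_per_node / hadoop_per_node, 1))   # ~31.9x per node
```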

Read Reynold’s post and then get thee to Apache Spark!

I first saw this in a tweet by paco nathan.

October 3, 2014

Open Challenges for Data Stream Mining Research

Filed under: BigData,Data Mining,Data Streams,Text Mining — Patrick Durusau @ 4:58 pm

Open Challenges for Data Stream Mining Research, SIGKDD Explorations, Volume 16, Number 1, June 2014.

Abstract:

Every day, huge volumes of sensory, transactional, and web data are continuously generated as streams, which need to be analyzed online as they arrive. Streaming data can be considered as one of the main sources of what is called big data. While predictive modeling for data streams and big data have received a lot of attention over the last decade, many research approaches are typically designed for well-behaved controlled problem settings, over-looking important challenges imposed by real-world applications. This article presents a discussion on eight open challenges for data stream mining. Our goal is to identify gaps between current research and meaningful applications, highlight open problems, and define new application-relevant research directions for data stream mining. The identified challenges cover the full cycle of knowledge discovery and involve such problems as: protecting data privacy, dealing with legacy systems, handling incomplete and delayed information, analysis of complex data, and evaluation of stream algorithms. The resulting analysis is illustrated by practical applications and provides general suggestions concerning lines of future research in data stream mining.

Under entity stream mining, the authors describe the challenge of aggregation:

The first challenge of entity stream mining task concerns information summarization: how to aggregate into each entity e at each time point t the information available on it from the other streams? What information should be stored for each entity? How to deal with differences in the speeds of individual streams? How to learn over the streams efficiently? Answering those questions in a seamless way would allow us to deploy conventional stream mining methods for entity stream mining after aggregation.

Sounds remarkably like an issue for topic maps doesn’t it? Well, not topic maps in the sense that every entity has an IRI subjectIdentifier but in the sense that merging rules define the basis on which two or more entities are considered to represent the same subject.
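A toy sketch of the aggregation step the authors describe — folding information about an entity arriving on several streams, at different speeds, into one evolving record (the keys, fields, and merge rule here are invented for illustration):

```python
from collections import defaultdict

# Each entity keeps one evolving record, merged by key from whatever
# streams have reported on it so far.
state = defaultdict(dict)

def merge(entity_id, timestamp, fields):
    record = state[entity_id]
    record["last_seen"] = max(timestamp, record.get("last_seen", timestamp))
    record.update(fields)        # later information overrides earlier values

# Two streams, at different speeds, reporting on the same entities.
click_stream   = [("e1", 10, {"clicks": 3}), ("e2", 11, {"clicks": 1}), ("e1", 15, {"clicks": 7})]
profile_stream = [("e1", 12, {"country": "DE"})]

for entity_id, ts, fields in click_stream + profile_stream:
    merge(entity_id, ts, fields)

print(dict(state))
# -> {'e1': {'last_seen': 15, 'clicks': 7, 'country': 'DE'}, 'e2': {'last_seen': 11, 'clicks': 1}}
```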

The entire issue is on “big data” and if you are looking for research “gaps,” it is a great starting point. Table of Contents: SIGKDD explorations, Volume 16, Number 1, June 2014.

I included the TOC link because for reasons only known to staff at the ACM, the articles in this issue don’t show up in the library index. One of the many “features” of the ACM Digital Library.

That is in addition to the committee which oversees the Digital Library being undisclosed to members and reachable only through staff.

October 1, 2014

Continuum Analytics Releases Anaconda 2.1

Filed under: Anaconda,BigData,Python — Patrick Durusau @ 4:18 pm

Continuum Analytics Releases Anaconda 2.1 by Corinna Bahr.

From the post:

Continuum Analytics, the premier provider of Python-based data analytics solutions and services, announced today the release of the latest version of Anaconda, its free, enterprise-ready collection of libraries for Python.

Anaconda enables big data management, analysis, and cross-platform visualization for business intelligence, scientific analysis, engineering, machine learning, and more. The latest release, version 2.1, adds a new version of the Anaconda Launcher and PyOpenSSL, as well as updates NumPy, Blaze, Bokeh, Numba, and 50 other packages.

Available on Windows, Mac OS X and Linux, Anaconda includes more than 195 of the most popular numerical and scientific Python libraries used by scientists, engineers and data analysts, with a single integrated and flexible installer. It also allows for the mixing and matching of different versions of Python (2.6, 2.7, 3.3, 3.4), NumPy, SciPy, etc., and the ability to easily switch between these environments.

See the post for more details, check the change log, or, what the hell, download the most recent version of Anaconda.

Remember, it’s open source so you can see “…where it keeps its brain.” Be wary of results based on software that operates behind a curtain.

BTW, check out the commercial services and products from Continuum Analytics if you need even more firepower for your data processing.

September 30, 2014

IPython Cookbook released

Filed under: BigData,Programming,Python — Patrick Durusau @ 3:58 pm

IPython Cookbook released by Cyrille Rossant.

From the post:

My new book, IPython Interactive Computing and Visualization Cookbook, has just been released! A sequel to my previous beginner-level book on Python for data analysis, this new 500-page book is a complete advanced-level guide to Python for data science. The 100+ recipes cover not only interactive and high-performance computing topics, but also data science methods in statistics, data mining, machine learning, signal processing, image processing, network analysis, and mathematical modeling.

Here is a glimpse of the topics addressed in this book:

  • IPython notebook, interactive widgets in IPython 2+
  • Best practices in interactive computing: version control, workflows with IPython, testing, debugging, continuous integration…
  • Data analysis with pandas, NumPy/SciPy, and matplotlib
  • Advanced data visualization with seaborn, Bokeh, mpld3, d3.js, Vispy
  • Code profiling and optimization
  • High-performance computing with Numba, Cython, GPGPU with CUDA/OpenCL, MPI, HDF5, Julia
  • Statistical data analysis with SciPy, PyMC, R
  • Machine learning with scikit-learn
  • Signal processing with SciPy, image processing with scikit-image and OpenCV
  • Analysis of graphs and social networks with NetworkX
  • Geographic Information Systems in Python
  • Mathematical modeling: dynamical systems, symbolic mathematics with SymPy

All of the code is freely available as IPython notebooks on the book’s GitHub repository. This repository is also the place where you can signal errata or propose improvements to any part of the book.

It’s never too early to work on your “wish list” for the holidays! 😉

Or to be the person who tweaks the code (or data).

September 28, 2014

Big Data – A curated list of big data frameworks, resources and tools

Filed under: BigData,Curation,Software — Patrick Durusau @ 4:28 pm

Big Data – A curated list of big data frameworks, resources and tools by Andrea Mostosi.

From the post:

“Big-data” is one of the most inflated buzzwords of the last few years. Technologies born to handle huge datasets and overcome limits of previous products are gaining popularity outside the research environment. The following list would be a reference of this world. It’s still incomplete and always will be.

Four hundred and eighty-four (484) resources by my count.

An impressive collection but HyperGraphDB is missing from this list.

Others that you can name off hand?

I don’t think the solution to the many partial “Big Data” lists of software, techniques and other resources is to create yet another list of the same. That would be a duplicated (and doomed) effort.

You?

Suggestions?

August 17, 2014

Bizarre Big Data Correlations

Filed under: BigData,Correlation,Humor,Statistics — Patrick Durusau @ 3:16 pm

Chance News 99 reported the following story:

The online lender ZestFinance Inc. found that people who fill out their loan applications using all capital letters default more often than people who use all lowercase letters, and more often still than people who use uppercase and lowercase letters correctly.

ZestFinance Chief Executive Douglas Merrill says the company looks at tens of thousands of signals when making a loan, and it doesn’t consider the capital-letter factor as significant as some other factors—such as income when linked with expenses and the local cost of living.

So while it may take capital letters into consideration when evaluating an application, it hasn’t held a loan up because of it.

Submitted by Paul Alper

If it weren’t an “online lender,” ZestFinance could take into account applications signed in crayon. 😉

Chance News collects stories with a statistical or probability angle. Some of them can be quite amusing.

August 15, 2014

John Chambers: Interfaces, Efficiency and Big Data

Filed under: BigData,Interface Research/Design,R — Patrick Durusau @ 10:07 am

John Chambers: Interfaces, Efficiency and Big Data

From the description:

At useR! 2014, John Chambers was generous enough to provide us with insight into the very early stages of user-centric interactive data exploration. He explains, step by step, how his insight to provide an interface into algorithms, putting the user first has placed us on the fruitful path which analysts, statisticians, and data scientists enjoy to this day. In his talk, John Chambers also does a fantastic job of highlighting a number of active projects, new and vibrant in the R ecosystem, which are helping to continue this legacy of “a software interface into the best algorithms.” The future is bright, and new and dynamic ideas are building off these thoughtful, well measured, solid foundations of the past.

To understand why this past is so important, I’d like to provide a brief view of the historical context that underpins these breakthroughs. In 1976, John Chambers was concerned with making software supported interactive numerical analysis a reality. Let’s talk about what other advances were happening in 1976 in the field of software and computing:

You should read the rest of the back story before watching the keynote by Chambers.

Will your next interface build upon the collective experience with interfaces or will it repeat some earlier experience?

I first saw this in John Chambers: Interfaces, Efficiency and Big Data by David Smith.

August 14, 2014

Model building with the iris data set for Big Data

Filed under: BigData,Data — Patrick Durusau @ 7:09 pm

Model building with the iris data set for Big Data by Joseph Rickert.

From the post:

For similar reasons, the airlines data set used in the 2009 ASA Sections on Statistical Computing and Statistical Graphics Data expo has gained a prominent place in the machine learning world and is well on its way to becoming the “iris data set for big data”. It shows up in all kinds of places. (In addition to this blog, it made its way into the RHIPE documentation and figures in several college course modeling efforts.)

Some key features of the airlines data set are:

  • It is big enough to exceed the memory of most desktop machines. (The version of the airlines data set used for the competition contained just over 123 million records with twenty-nine variables.)
  • The data set contains several different types of variables. (Some of the categorical variables have hundreds of levels.)
  • There are interesting things to learn from the data set. (This exercise from Kane and Emerson for example)
  • The data set is tidy, but not clean, making it an attractive tool to practice big data wrangling. (The AirTime variable ranges from -3,818 minutes to 3,508 minutes)

Joseph reviews what may become the iris data set of “big data,” airline data.

Its variables:

Name Description
1 Year 1987-2008
2 Month 1-12
3 DayofMonth 1-31
4 DayOfWeek 1 (Monday) – 7 (Sunday)
5 DepTime actual departure time (local, hhmm)
6 CRSDepTime scheduled departure time (local, hhmm)
7 ArrTime actual arrival time (local, hhmm)
8 CRSArrTime scheduled arrival time (local, hhmm)
9 UniqueCarrier unique carrier code
10 FlightNum flight number
11 TailNum plane tail number
12 ActualElapsedTime in minutes
13 CRSElapsedTime in minutes
14 AirTime in minutes
15 ArrDelay arrival delay, in minutes
16 DepDelay departure delay, in minutes
17 Origin origin IATA airport code
18 Dest destination IATA airport code
19 Distance in miles
20 TaxiIn taxi in time, in minutes
21 TaxiOut taxi out time in minutes
22 Cancelled was the flight cancelled?
23 CancellationCode reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
24 Diverted 1 = yes, 0 = no
25 CarrierDelay in minutes
26 WeatherDelay in minutes
27 NASDelay in minutes
28 SecurityDelay in minutes
29 LateAircraftDelay in minutes

Source: http://stat-computing.org/dataexpo/2009/the-data.html

Waiting for the data set to download. Lots of questions suggest themselves. For example, variation or lack thereof in the use of fields 25-29.
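While the download runs, here is the kind of sanity check those questions suggest, sketched with pandas (the file name is hypothetical; column names follow the table above):

```python
import pandas as pd

cols = ["AirTime", "ArrDelay", "CarrierDelay", "WeatherDelay",
        "NASDelay", "SecurityDelay", "LateAircraftDelay"]
df = pd.read_csv("2008.csv", usecols=cols)     # hypothetical local file name

# "Tidy but not clean": negative and implausibly large AirTime values.
print(df["AirTime"].describe())
print((df["AirTime"] < 0).sum(), "rows with negative AirTime")

# How often are the delay-cause fields (variables 25-29) actually populated?
delay_cols = ["CarrierDelay", "WeatherDelay", "NASDelay",
              "SecurityDelay", "LateAircraftDelay"]
print(df[delay_cols].notna().mean().round(3))  # fraction of rows with a value
```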

Enjoy!

I first saw this in a tweet by David Smith.

August 9, 2014

VLDB – Volume 7, 2013-2014

Filed under: BigData,Database — Patrick Durusau @ 8:47 pm

Proceedings of the Very Large Data Bases, Volume 7, 2013-2014.

You are likely already aware of the VLDB proceedings but after seeing the basis for Summingbird:… [VLDB 2014], I was reminded that I should have a tickler to check updates on the VLDB proceedings every month. August of 2014 (Volume 7, No. 12) landed a few days ago and it looks quite good.

Two tidbits to tease you into visiting:

Akash Das Sarma, Yeye He, Surajit Chaudhuri: ClusterJoin: A Similarity Joins Framework using Map-Reduce. 1059 – 1070.

Norases Vesdapunt, Kedar Bellare, Nilesh Dalvi: Crowdsourcing Algorithms for Entity Resolution. 1071 – 1082.

I count twenty-six (26) articles in issue 12 and eighty (80) in issue 13.

Just in case you have run out of summer reading material. 😉

Supercomputing frontiers and innovations

Filed under: BigData,HPC,Parallel Programming,Supercomputing — Patrick Durusau @ 7:29 pm

Supercomputing frontiers and innovations (New Journal)

From the homepage:

Parallel scientific computing has entered a new era. Multicore processors on desktop computers make parallel computing a fundamental skill required by all computer scientists. High-end systems have surpassed the Petaflop barrier, and significant efforts are devoted to the development of the next generation of hardware and software technologies towards Exascale systems. This is an exciting time for computing as we begin the journey on the road to exascale computing. ‘Going to the exascale’ will mean radical changes in computing architecture, software, and algorithms – basically, vastly increasing the levels of parallelism to the point of billions of threads working in tandem – which will force radical changes in how hardware is designed and how we go about solving problems. There are many computational and technical challenges ahead that must be overcome. The challenges are great, different than the current set of challenges, and exciting research problems await us.

This journal, Supercomputing Frontiers and Innovations, gives an introduction to the area of innovative supercomputing technologies, prospective architectures, scalable and highly parallel algorithms, languages, data analytics, issues related to computational co-design, and cross-cutting HPC issues as well as papers on supercomputing education and massively parallel computing applications in science and industry.

This journal provides immediate open access to its content on the principle that making research freely available to the public supports a greater global exchange of knowledge. We hope you find this journal timely, interesting, and informative. We welcome your contributions, suggestions, and improvements to this new journal. Please join us in making this exciting new venture a success. We hope you will find Supercomputing Frontiers and Innovations an ideal venue for the publication of your team’s next exciting results.

Becoming “massively parallel” isn’t going to free “computing applications in science and industry” from semantics. If anything, the more complex applications become, the easier it will be to mislay semantics, to the user’s peril.

Semantic efforts that did not scale for applications in the last decade face even dimmer prospects in the face of “big data” and massively parallel applications.

I suggest we move the declaration of semantics closer to or at the authors of content/data. At least as a starting point for discussion/research.

Current issue.

August 8, 2014

Juju Charm (HPCC Systems)

Filed under: BigData,HPCC,Unicode — Patrick Durusau @ 1:44 pm

HPCC Systems from LexisNexis Celebrates Third Open-Source Anniversary, And Releases 5.0 Version

From the post:

LexisNexis® Risk Solutions today announced the third anniversary of HPCC Systems®, its open-source, enterprise-proven platform for big data analysis and processing for large volumes of data in 24/7 environments. HPCC Systems also announced the upcoming availability of version 5.0 with enhancements to provide additional support for international users, visualization capabilities and new functionality such as a Juju charm that makes the platform easier to use.

“We decided to open-source HPCC Systems three years ago to drive innovation for our leading technology that had only been available internally and allow other companies and developers to experience its benefits to solve their unique business challenges,” said Flavio Villanustre, Vice President, Products and Infrastructure, HPCC Systems, LexisNexis.

….

5.0 Enhancements
With community contributions from developers and analysts across the globe, HPCC Systems is offering translations and localization in its version 5.0 for languages including Chinese, Spanish, Hungarian, Serbian and Brazilian Portuguese with other languages to come in the future.
Additional enhancements include:
• Visualizations
• Linux Ubuntu Juju Charm Support
• Embedded language features
• Apache Kafka Integration
• New Regression Suite
• External Database Support (MySQL)
• Web Services-SQL

The HPCC Systems source code can be found here: https://github.com/hpcc-systems
The HPCC Systems platform can be found here: http://hpccsystems.com/download/free-community-edition

Just in time for the Fall upgrade season! 😉

While reading the documentation I stumbled across: Unicode Indexing in ECL, last updated January 09, 2014.

From the page:

ECL’s default indexing logic works great for strings and numbers, but can encounter problems when indexing Unicode data. In some cases, unicode indexes don’t return all matching records for a query. For example, if you have a Unicode field “ufield” in a dataset and select dataset(ufield BETWEEN u’ma’ AND u’me’), it would bring back records for ‘mai’, ‘Mai’ and ‘may’. However a query on the index for that dataset, idx(ufield BETWEEN u’ma’ AND u’me’), only brings back a record for ‘mai’.

This is a result of the way unicode fields are sorted for indexing. Sorting compares the values of two fields byte by byte to see if a field matches or is less than or greater than another value. Integers are stored in bigendian format, and signed numbers have an offset added to create an absolute value range.

Unicode fields are different. When compared/sorted in datasets, the comparisons are performed using the ICU locale sensitive comparisons to ensure correct ordering. However, index lookup operations need to be fast and therefore the lookup operations perform binary comparisons on fixed length blocks of data. Equality checks will return data correctly, but queries involving between, > or < may fail.

If you are considering HPCC, be sure to check your indexing requirements with regard to Unicode.
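The underlying issue is easy to reproduce outside ECL: byte-by-byte ordering and locale-aware ordering disagree about where capitalized strings fall, so a range lookup built on binary comparisons can miss records. A simplified Python illustration (casefold stands in for a real ICU collation, and ECL’s fixed-length block comparison is not reproduced exactly):

```python
names = ["mai", "Mai", "may"]

# Byte-by-byte ordering, as a binary index comparison sees it: 'M' (0x4D)
# sorts before 'm' (0x6D), so 'Mai' falls outside the 'ma'..'me' range.
print(sorted(names))                                      # ['Mai', 'mai', 'may']
print([n for n in names if "ma" <= n <= "me"])            # ['mai', 'may'] -- 'Mai' missed

# Locale-aware ordering (approximated here with casefold; ICU collation
# does this properly) keeps all three inside the range.
print(sorted(names, key=str.casefold))                    # ['mai', 'Mai', 'may']
print([n for n in names if "ma" <= n.casefold() <= "me"]) # ['mai', 'Mai', 'may']
```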

August 3, 2014

550 talks related to big data [GPU Technology Conference]

Filed under: BigData,GPU — Patrick Durusau @ 6:55 pm

550 talks related to big data by Amy.

Amy has forged links to forty-four (44) topic areas at the GPU Technology Conference 2014.

Definitely a post to bookmark!

You may remember that GPUs were what Bryan Thompson and others are using to achieve 3 Billion Traversed Edges Per Second (TEPS) for graph work. Non-kiddie graph work.

Enjoy!

July 23, 2014

Awesome Big Data

Filed under: BigData,Computer Science — Patrick Durusau @ 4:00 pm

Awesome Big Data by Onur Akpolat.

From the webpage:

A curated list of awesome big data frameworks, ressources and other awesomeness. Inspired by awesome-php, awesome-python, awesome-ruby, hadoopecosystemtable & big-data.

Your contributions are always welcome!

#awesome-bigdata

Great list of projects.

Curious to see if it develops enough community support to sustain the curation of the listing.

Finding resource collections like this one is so haphazard on the WWW that often times authors are duplicating the work of others. Not intentionally, just unaware of a similar resource.

Similar to the repeated questions that appear on newsgroups and email lists about basic commands or flaws in programs. The answer probably already exists in an archive or FAQ, but how is a new user to find it?

The social aspects of search and knowledge sharing are likely as important, if not more so, than the technologies we use to implement them.

Suggestions for reading on the social aspects of search and knowledge sharing?

June 24, 2014

DAMEWARE:…

Filed under: Astroinformatics,BigData,Data Mining — Patrick Durusau @ 6:16 pm

DAMEWARE: A web cyberinfrastructure for astrophysical data mining by Massimo Brescia, et al.

Abstract:

Astronomy is undergoing through a methodological revolution triggered by an unprecedented wealth of complex and accurate data. The new panchromatic, synoptic sky surveys require advanced tools for discovering patterns and trends hidden behind data which are both complex and of high dimensionality. We present DAMEWARE (DAta Mining & Exploration Web Application REsource): a general purpose, web-based, distributed data mining environment developed for the exploration of large datasets, and finely tuned for astronomical applications. By means of graphical user interfaces, it allows the user to perform classification, regression or clustering tasks with machine learning methods. Salient features of DAMEWARE include its capability to work on large datasets with minimal human intervention, and to deal with a wide variety of real problems such as the classification of globular clusters in the galaxy NGC1399, the evaluation of photometric redshifts and, finally, the identification of candidate Active Galactic Nuclei in multiband photometric surveys. In all these applications, DAMEWARE allowed to achieve better results than those attained with more traditional methods. With the aim of providing potential users with all needed information, in this paper we briefly describe the technological background of DAMEWARE, give a short introduction to some relevant aspects of data mining, followed by a summary of some science cases and, finally, we provide a detailed description of a template use case.

Despite the progress made in the creation of DAMEWARE, the authors conclude in part:

The harder problem for the future will be heterogeneity of platforms, data and applications, rather than simply the scale of the deployed resources. The goal should be to allow scientists to explore the data easily, with sufficient processing power for any desired algorithm to efficiently process it. Most existing ML methods scale badly with both increasing number of records and/or of dimensionality (i.e., input variables or features). In other words, the very richness of astronomical data sets makes them difficult to analyze….

The size of data sets is an issue, but heterogeneity issues with platforms, data and applications are several orders of magnitude more complex.

I remain curious when that is going to dawn on the average “big data” advocate.

June 12, 2014

Feds and Big Data

Filed under: BigData,Government,Government Data — Patrick Durusau @ 8:25 am

Federal Agencies and the Opportunities and Challenges of Big Data by Nicole Wong.


June 19, 2014
1:00 pm – 5:00 pm
Webcast: http://www.ustream.tv/GeorgetownLive

From the post:

On June 19, the Obama Administration will continue the conversation on big data as we co-host our fourth big data conference, this time with the Georgetown University McCourt School of Public Policy’s Massive Data Institute.  The conference, “Improving Government Performance in the Era of Big Data; Opportunities and Challenges for Federal Agencies”,  will build on prior workshops at MIT, NYU, and Berkeley, and continue to engage both subject matter experts and the public in a national discussion about the future of data innovation and policy.

Drawing from the recent White House working group report, Big Data: Seizing Opportunities, Preserving Values, this event will focus on the opportunities and challenges posed by Federal agencies’ use of data, best practices for sharing data within and between agencies and other partners, and measures the government may use to ensure the protection of privacy and civil liberties in a big data environment.

You can find more information about the workshop and the webcast here.

We hope you will join us!

Nicole Wong is U.S. Deputy Chief Technology Officer at the White House Office of Science & Technology Policy.

From approximately 1:30 to 2:25 p.m., Panel One: Open Data and Information Sharing. Moderator: Nick Sinai, Deputy U.S. Chief Technology Officer.

There could be some useful intelligence here on how data sharing is viewed now. Perhaps you could throttle back a topic map to be just a little ahead of where agencies are now, so it would not look like such a big step.

Yes?

May 28, 2014

Microsoft Research’s Naiad Project

Filed under: BigData,Microsoft,Naiad — Patrick Durusau @ 3:28 pm

Solve the Big Data Problems of the Future: Join Microsoft Research’s Naiad Project by Tara Grumm.

From the post:

Over the past decade, general-purpose big data platforms like Hadoop have brought distributed computing into the mainstream. As people have become accustomed to processing their data in the cloud, they have become more ambitious, wanting to do things like graph analysis, machine learning, and real-time stream processing on their huge data sources.

Naiad is designed to solve this more challenging class of problems: it adds support for a few key primitives – maintaining state, executing loops, and reacting to incoming data – and provides high-performance infrastructure for running them in a scalable distributed system.

The result is the best of both worlds. Naiad runs simple programs just as fast as existing general-purpose platforms, and complex programs as fast as specialized systems for graph analysis, machine learning, and stream processing. Moreover, as a general-purpose system, Naiad lets you compose these different applications together, enabling mashups (such as computing a graph algorithm over a real-time sliding window of a social media firehose) that weren’t possible before.

Who should use Naiad?

We’ve designed Naiad to be accessible to a variety of different users. You can get started right away with Naiad by writing programs using familiar declarative operators based on SQL and LINQ.

For power users, we’ve created low-level interfaces to make it possible to extend Naiad without sacrificing any performance. You can plug in optimized data structures and algorithms, and build new domain-specific languages on top of Naiad. For example, we wrote a graph processing layer on top of Naiad that has performance comparable with (and often better than) specialized systems designed only to process graphs.
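Naiad itself is a C#/.NET system driven by LINQ-style declarative operators, so the sketch below is only a conceptual illustration, in Python, of the sliding-window mashup described above: maintaining state incrementally as records arrive and expire, rather than recomputing the aggregate from scratch. The hashtag-counting example and the window size are my own assumptions, not Naiad code.

# Conceptual sketch only -- not Naiad code. It illustrates two of the
# primitives the post highlights (maintaining state, reacting to incoming
# data) with an incrementally maintained sliding-window hashtag count.
from collections import Counter, deque

class SlidingWindowCounts:
    """Maintain hashtag counts over the last `window` records incrementally."""

    def __init__(self, window):
        self.window = window
        self.buffer = deque()
        self.counts = Counter()

    def push(self, tags):
        # React to the incoming record: add its contribution to the state...
        self.buffer.append(tags)
        self.counts.update(tags)
        # ...and retract the contribution of the record that left the window.
        if len(self.buffer) > self.window:
            expired = self.buffer.popleft()
            self.counts.subtract(expired)
            self.counts += Counter()  # drop zero and negative entries

    def top(self, k=3):
        return self.counts.most_common(k)

stream = [["bigdata"], ["naiad", "bigdata"], ["graphs"], ["bigdata"], ["naiad"]]
window = SlidingWindowCounts(window=3)
for tags in stream:
    window.push(tags)
    print(window.top())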

Big data geeks and open source supporters should take a serious look at the Naiad Project.

It will take a while, but the real question in the future will be how well you can build on a continuous data substrate.

Or as Harvey Logan says in Butch Cassidy and the Sundance Kid,

Rules? In a knife fight? No rules!

I would prepare accordingly.

May 26, 2014

Ethics and Big Data

Filed under: BigData,Ethics,Tweets — Patrick Durusau @ 6:52 pm

Ethical research standards in a world of big data by Caitlin M. Rivers and Bryan L. Lewis.

Abstract:

In 2009 Ginsberg et al. reported using Google search query volume to estimate influenza activity in advance of traditional methodologies. It was a groundbreaking example of digital disease detection, and it still remains illustrative of the power of gathering data from the internet for important research. In recent years, the methodologies have been extended to include new topics and data sources; Twitter in particular has been used for surveillance of influenza-like-illnesses, political sentiments, and even behavioral risk factors like sentiments about childhood vaccination programs. As the research landscape continuously changes, the protection of human subjects in online research needs to keep pace. Here we propose a number of guidelines for ensuring that the work done by digital researchers is supported by ethical-use principles. Our proposed guidelines include: 1) Study designs using Twitter-derived data should be transparent and readily available to the public. 2) The context in which a tweet is sent should be respected by researchers. 3) All data that could be used to identify tweet authors, including geolocations, should be secured. 4) No information collected from Twitter should be used to procure more data about tweet authors from other sources. 5) Study designs that require data collection from a few individuals rather than aggregate analysis require Institutional Review Board (IRB) approval. 6) Researchers should adhere to a user’s attempt to control his or her data by respecting privacy settings. As researchers, we believe that a discourse within the research community is needed to ensure protection of research subjects. These guidelines are offered to help start this discourse and to lay the foundations for the ethical use of Twitter data.
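As a concrete illustration of guideline 3 (securing anything that could identify tweet authors, including geolocations), here is a minimal sketch that strips identifying fields from a tweet before it is stored for aggregate analysis. The field names follow the rough shape of Twitter's JSON, but exactly which fields to drop is my assumption, not a prescription from the paper.

# Minimal sketch of guideline 3: remove fields that could identify tweet
# authors (ids, user data, geolocation) before storing data for analysis.
# The set of fields to drop is an assumption, not part of the paper.
IDENTIFYING_FIELDS = {"id", "id_str", "user", "coordinates", "geo", "place"}

def scrub(tweet: dict) -> dict:
    """Return a copy of the tweet with identifying fields removed."""
    return {k: v for k, v in tweet.items() if k not in IDENTIFYING_FIELDS}

raw = {
    "id_str": "123456789",
    "text": "Feeling feverish today #flu",
    "coordinates": {"type": "Point", "coordinates": [-77.03, 38.90]},
    "user": {"screen_name": "example_user"},
    "lang": "en",
}

print(scrub(raw))  # keeps only the text and language for aggregate analysis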

I am curious: who is going to follow this suggested code of ethics?

Without long consideration, obviously not the NSA, FBI, CIA, DoD, or any employee of the United States government.

Ditto for the security services in any country plus their governments.

Industry players are well known for their near-perfect recidivism rate on corporate crime, so I am not expecting big data ethics there.

Drug cartels? Anyone shipping cocaine in multi-kilogram lots is unlikely to be interested in Big Data ethics.

That rather narrows the pool of prospective users of a code of ethics for big data doesn’t it?

I first saw this in a tweet by Ed Yong.

May 23, 2014

Data Analytics Handbook

Filed under: Analytics,BigData,Data Analysis — Patrick Durusau @ 3:58 pm

Data Analytics Handbook

The “handbook” appears in three parts: the first you can download directly, while links to parts 2 and 3 are emailed to you after you complete a short survey. The survey collects your name, email address, educational background (STEM or not), and whether you are interested in a new resource being created to teach data analysis.

Let’s be clear up front that this is NOT a technical handbook.

Rather all three parts are interviews with:

Part 1: Data Analysts + Data Scientists

Part 2: CEO’s + Managers

Part 3: Researchers + Academics

Technical handbooks abound, but this is one of the few (only?) books that covers the “soft” side of data analytics. By the “soft” side I mean the people and personal relationships that make up the data analytics industry. Technical knowledge is a must, but being able to work well with others is just as important, if not more so.

The interviews are wide-ranging and don’t attempt to provide cut-and-dried answers. Readers will need to be inspired by, and adapt, the reported experiences to their own circumstances.

Of all the features of the books, I suspect I liked the “Top 5 Take Aways” the best.

In the interest of full disclosure, that may be because part 1 reported:

2. The biggest challenge for a data analyst isn’t modeling, it’s cleaning and collecting

Data analysts spend most of their time collecting and cleaning the data required for analysis. Answering questions like “where do you collect the data?”, “how do you collect the data?”, and “how should you clean the data?” requires much more time than the actual analysis itself.
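As a small illustration of why cleaning dominates the schedule, here is a hedged pandas sketch of the mundane work involved. The file name, column names and specific fixes are hypothetical; every real data set needs its own version of this.

# A hedged sketch of routine "cleaning and collecting" work in pandas.
# The file, columns and fixes are hypothetical examples, nothing more.
import pandas as pd

df = pd.read_csv("survey_responses.csv")        # hypothetical raw export

# Normalize column names that arrive with inconsistent casing and spacing.
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

# Parse dates exported as strings, coercing unparseable values to NaT.
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")

# Collapse free-text variants of the same category.
df["country"] = df["country"].str.strip().str.title().replace(
    {"Usa": "United States", "U.S.": "United States"}
)

# Drop exact duplicates and rows missing fields the analysis needs.
df = df.drop_duplicates().dropna(subset=["signup_date", "country"])

print(df.describe(include="all"))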

Well, when someone puts your favorite hobby horse at #2, see how you react. 😉

I first saw this in a tweet by Marin Dimitrov.
