300 Data journalism blogs [1 Feedly OPML File]

February 27th, 2015

Data journalism blogs by Winny De Jong.

From the post:

At the News Impact Summit in Brussels I presented my workflow for getting ideas. Elsewhere on the blog there’s a recap including interesting links. The RSS reader Feedly is a big part of my setup: together with Pocket it’s my most used app. Both are true lifesavers when reading is your default.

Since a lot of people in the News Summit audience use Feedly as well, I made this page to share my Feedly OPML file. If you’re not sure what an OPML file is, read this page at Feedly.com.

Download my Feedly OPML export containing 300+ data journalism related sites here

Now that is a great way to start the weekend!

With a file of three hundred (300) data blogs!
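
If you want to peek inside the export before importing it, an OPML file is just XML: each feed is an <outline> element carrying the feed URL. A minimal Python sketch (the filename is my assumption) that lists every feed in the file:

    import xml.etree.ElementTree as ET

    # Each feed in an OPML file is an <outline> element whose xmlUrl
    # attribute holds the feed address. Filename assumed for illustration.
    tree = ET.parse("feedly.opml")
    for outline in tree.iter("outline"):
        url = outline.get("xmlUrl")
        if url:
            print(outline.get("title"), "->", url)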

Enjoy!

Comparing supervised learning algorithms

February 27th, 2015

Comparing supervised learning algorithms by Kevin Markham.

From the post:

In the data science course that I instruct, we cover most of the data science pipeline but focus especially on machine learning. Besides teaching model evaluation procedures and metrics, we obviously teach the algorithms themselves, primarily for supervised learning.

Near the end of this 11-week course, we spend a few hours reviewing the material that has been covered throughout the course, with the hope that students will start to construct mental connections between all of the different things they have learned. One of the skills that I want students to be able to take away from this course is the ability to intelligently choose between supervised learning algorithms when working a machine learning problem. Although there is some value in the “brute force” approach (try everything and see what works best), there is a lot more value in being able to understand the trade-offs you’re making when choosing one algorithm over another.

I decided to create a game for the students, in which I gave them a blank table listing the supervised learning algorithms we covered and asked them to compare the algorithms across a dozen different dimensions. I couldn’t find a table like this on the Internet, so I decided to construct one myself! Here’s what I came up with:

Eight (8) algorithms compared across a dozen (12) dimensions.
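
As a quick taste of the “brute force” approach Kevin contrasts with his table, here is a minimal scikit-learn sketch (the dataset and model choices are mine, not his) that cross-validates several of the usual suspects on one dataset. The table’s value is explaining why these numbers differ:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    # "Brute force": try several supervised learners, compare CV accuracy.
    X, y = load_breast_cancer(return_X_y=True)
    models = {
        "logistic regression": LogisticRegression(max_iter=5000),
        "k-nearest neighbors": KNeighborsClassifier(),
        "decision tree": DecisionTreeClassifier(random_state=0),
        "random forest": RandomForestClassifier(random_state=0),
    }
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name:>20}: mean accuracy {scores.mean():.3f}")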

What algorithms would you add? Comments to add or take away?

Looks like the start of a very useful community resource.

Po’ Boy MapReduce

February 27th, 2015

[Image: Po’ Boy MapReduce]

Posted by Mirko Krivanek as What Is MapReduce?, credit @Tgrall

Have You Tried DRAKON Comrade? (Russian Space Program Specification Language)

February 27th, 2015

DRAKON

From the webpage:

DRAKON is a visual language for specifications from the Russian space program. DRAKON is used for capturing requirements and building software that controls spacecraft.

The rules of DRAKON are optimized to ensure easy understanding by human beings.

DRAKON is gaining popularity in other areas beyond software, such as medical textbooks. The purpose of DRAKON is to represent any knowledge that explains how to accomplish a goal.


DRAKON Editor is a free tool for authoring DRAKON flowcharts. It also supports sequence diagrams, entity-relationship and class diagrams.

With DRAKON Editor, you can quickly draw diagrams for:

  • software requirements and specifications;
  • documenting existing software systems;
  • business processes;
  • procedures and rules;
  • any other information that tells “how to do something”.

DRAKON Editor runs on Windows, Mac and Linux.

The user interface of DRAKON Editor is extremely simple and straightforward.

Software developers can build real programs with DRAKON Editor. Source code can be generated in several programming languages, including Java, Processing.org, D, C#, C/C++ (with Qt support), Python, Tcl, Javascript, Lua, Erlang, AutoHotkey and Verilog.

I note with amusement that the DRAKON editor has no “save” button. Rest easy! It saves all input automatically, removing the need for one. About time!

Download DRAKON editor.

I am in the middle of an upgrade so look for sample images next week.

Banning p < .05 In Psychology [Null Hypothesis Significance Testing Procedure (NHSTP)]

February 27th, 2015

The recent banning of the Null Hypothesis Significance Testing Procedure (NHSTP) in psychology should be a warning to would-be data scientists that even “well established” statistical procedures may be deeply flawed.

Sorry, you may not have seen the news. In Basic and Applied Social Psychology (BASP), Banning Null Hypothesis Significance Testing Procedure (NHSTP) (2015), David Trafimow and Michael Marks write:

The Basic and Applied Social Psychology (BASP) 2014 Editorial emphasized that the null hypothesis significance testing procedure (NHSTP) is invalid, and thus authors would be not required to perform it (Trafimow, 2014). However, to allow authors a grace period, the Editorial stopped short of actually banning the NHSTP. The purpose of the present Editorial is to announce that the grace period is over. From now on, BASP is banning the NHSTP.

You may be more familiar with seeing p < .05 rather than Null Hypothesis Significance Testing Procedure (NHSTP).

In the 2014 editorial warning about NHSTP, David Trafimow cites his earlier work, Hypothesis Testing and Theory Evaluation at the Boundaries: Surprising Insights From Bayes’s Theorem (2003), as justifying the non-use and the later ban of NHSTP.

His argument is summarized in the introduction:

Despite a variety of different criticisms, the standard null-hypothesis significance-testing procedure (NHSTP) has dominated psychology over the latter half of the past century. Although NHSTP has its defenders when used “properly” (e.g., Abelson, 1997; Chow, 1998; Hagen, 1997; Mulaik, Raju, & Harshman, 1997), it has also been subjected to virulent attacks (Bakan, 1966; Cohen, 1994; Rozeboom, 1960; Schmidt, 1996). For example, Schmidt and Hunter (1997) argue that NHSTP is “logically indefensible and retards the research enterprise by making it difficult to develop cumulative knowledge” (p. 38). According to Rozeboom (1997), “Null-hypothesis significance testing is surely the most bone-headedly misguided procedure ever institutionalized in the rote training of science students” (p. 336). The most important reason for these criticisms is that although one can calculate the probability of obtaining a finding given that the null hypothesis is true, this is not equivalent to calculating the probability that the null hypothesis is true given that one has obtained a finding. Thus, researchers are in the position of rejecting the null hypothesis even though they do not know its probability of being true (Cohen, 1994). One way around this problem is to use Bayes’s theorem to calculate the probability of the null hypothesis given that one has obtained a finding, but using Bayes’s theorem carries with it some problems of its own, including a lack of information necessary to make full use of the theorem. Nevertheless, by treating the unknown values as variables, it is possible to conduct some analyses that produce some interesting conclusions regarding NHSTP. These analyses clarify the relations between NHSTP and Bayesian theory and quantify exactly why the standard practice of rejecting the null hypothesis is, at times, a highly questionable procedure. In addition, some surprising findings come out of the analyses that bear on issues pertaining not only to hypothesis testing but also to the amount of information gained from findings and theory evaluation. It is possible that the implications of the following analyses for information gain and theory evaluation are as important as the NHSTP debate.

The most important lines for someone who was trained with the null hypothesis as an undergraduate many years ago:

The most important reason for these criticisms is that although one can calculate the probability of obtaining a finding given that the null hypothesis is true, this is not equivalent to calculating the probability that the null hypothesis is true given that one has obtained a finding. Thus, researchers are in the position of rejecting the null hypothesis even though they do not know its probability of being true (Cohen, 1994).

If you don’t know the probability of the null hypothesis, any conclusion you draw is on very shaky ground.
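
A toy Bayes’s theorem calculation makes the gap concrete. All three inputs below are invented, which is exactly the problem Trafimow highlights, but they show how a “significant” finding can leave the null hypothesis more likely than not:

    # P(data | H0) is what NHSTP gives you; P(H0 | data) is what you want.
    # All three inputs are invented for illustration.
    p_data_given_h0 = 0.05   # the "significant" finding under the null
    p_data_given_h1 = 0.30   # the same finding under the alternative
    prior_h0 = 0.9           # a null hypothesis that is a priori plausible

    posterior_h0 = (p_data_given_h0 * prior_h0) / (
        p_data_given_h0 * prior_h0 + p_data_given_h1 * (1 - prior_h0)
    )
    print(f"P(H0 | data) = {posterior_h0:.2f}")   # 0.60, nowhere near 0.05

Change the prior and the likelihood under the alternative and watch the posterior swing; the p-value alone never tells you where it lands.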

Do you think any of the big data “shake-n-bake” mining/processing services are going to call that problem to your attention? True enough, such services may “empower” users but if “empowerment” means producing meaningless results, no thanks.

Trafimow cites Jacob Cohen’s The Earth is Round (p < .05) (1994) in his 2003 work. Cohen is angry and in full voice as only a senior scholar can afford to be.

Take the time to read both Trafimow and Cohen. Many errors are lurking outside your door, and reading them will help you recognize this one.

Making Master Data Management Fun with Neo4j – Part 1, 2

February 27th, 2015

Making Master Data Management Fun with Neo4j – Part 1 by Brian Underwood.

From Part 1:

Joining multiple disparate data-sources, commonly dubbed Master Data Management (MDM), is usually not a fun exercise. I would like to show you how to use a graph database (Neo4j) and an interesting dataset (developer-oriented collaboration sites) to make MDM an enjoyable experience. This approach will allow you to quickly and sensibly merge data from different sources into a consistent picture and query across the data efficiently to answer your most pressing questions.

To start I’ll just be importing one data source: StackOverflow questions tagged with neo4j and their answers. In future blog posts I will discuss how to integrate other data sources into a single graph database to provide a richer view of the world of Neo4j developers’ online social interactions.

I’ve created a GraphGist to explore questions about the imported data, but in this post I’d like to briefly discuss the process of getting data from StackOverflow into Neo4j.

Part 1 imports data from Stack Overflow into Neo4j.
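
For a sense of what such an import involves, here is a minimal sketch (mine, not Brian’s code; the connection details and graph model are assumptions) that pulls neo4j-tagged questions from the StackExchange API and MERGEs them into Neo4j:

    import requests
    from neo4j import GraphDatabase

    # Hypothetical connection details; adjust for your own instance.
    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))

    # Pull one page of neo4j-tagged questions from the StackExchange API.
    resp = requests.get(
        "https://api.stackexchange.com/2.2/questions",
        params={"tagged": "neo4j", "site": "stackoverflow", "pagesize": 100},
    )
    questions = resp.json()["items"]

    # MERGE keeps the import idempotent: re-running updates, not duplicates.
    with driver.session() as session:
        for q in questions:
            session.run(
                """
                MERGE (u:User {id: $user_id}) SET u.name = $user_name
                MERGE (q:Question {id: $q_id}) SET q.title = $title
                MERGE (u)-[:ASKED]->(q)
                """,
                user_id=q["owner"].get("user_id", -1),
                user_name=q["owner"]["display_name"],
                q_id=q["question_id"],
                title=q["title"],
            )
    driver.close()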

Making Master Data Management Fun with Neo4j – Part 2 imports Github data:

All together I was able to import:

  • 6,337 repositories
  • 6,232 users
  • 11,011 issues
  • 474 commits
  • 22,676 comments

In my next post I’ll show the process of how I linked the original StackOverflow data with the new GitHub data. Stay tuned for that, but in the meantime I’d also like to share the more technical details of what I did for those who are interested.

Definitely looking forward to seeing the reconciliation of data between StackOverflow and GitHub.

Data journalism: How to find stories in numbers

February 27th, 2015

Data journalism: How to find stories in numbers by Sandra Crucianelli.

From the post:

Colleagues often ask me what data journalism is. They’re confused by why it needs its own name — don’t all journalists use data?

The term is shorthand for ‘database journalism’ or ‘data-driven journalism’, where journalists find stories, or angles for stories, within large volumes of data.

It overlaps with investigative journalism in requiring lots of research, sometimes against people’s wishes. It can also overlap with data visualisation, as it requires close collaboration between journalists and digital specialists to find the best ways of presenting data.

So why get involved with spreadsheets and visualisation tools? At its most basic, adding data can give a story a new, factual dimension. But delving into datasets can also reveal new stories, or new aspects to them, that may not have otherwise surfaced.

Data journalism can also sometimes tell complicated stories more easily or clearly than relying on words alone — so it’s particularly useful for science journalists.

It can seem daunting if you’re trained in print or broadcast media. But I’ll introduce you to some new skills, and show you some excellent digital tools, so you too can soon find your feet as a data journalist.

Sandra gives as good an introduction to data journalism as you are likely to find. Her post covers everything from finding story ideas and researching relevant data to data processing and, of course, presenting your findings persuasively.

A must-read for beginning journalists, but also for anyone needing an introduction to looking at data that supports a story (or not).

Gregor Aisch – Information Visualization, Data Journalism and Interactive Graphics

February 26th, 2015

Gregor has two sites that I wanted to bring to your attention on information visualization, data journalism and interactive graphics.

The first, driven-by-data.net, collects graphics from New York Times stories created by Gregor and others. Impressive graphics. If you are looking for visualization ideas, it’s not a bad place to stop.

The second, Vis4.net, is a blog that features Gregor’s work. But it is more than a blog; choose the navigation links at the top of the page:

Color – Posts on color.

Code – Posts focused on code.

Cartography – Posts on cartography.

Advice – Advice (not for the lovelorn).

Archive – Archive of his posts.

Rather than a long list of categories (ahem), Gregor has divided his material into easy-to-recognize, easy-to-use divisions.

Always nice when you see a professional at work!

Enjoy!

Data Visualization with JavaScript

February 26th, 2015

Data Visualization with JavaScript by Stephen A. Thomas.

From the webpage:

It’s getting hard to ignore the importance of data in our lives. Data is critical to the largest social organizations in human history. It can affect even the least consequential of our everyday decisions. And its collection has widespread geopolitical implications. Yet it also seems to be getting easier to ignore the data itself. One estimate suggests that 99.5% of the data our systems collect goes to waste. No one ever analyzes it effectively.

Data visualization is a tool that addresses this gap.

Effective visualizations clarify; they transform collections of abstract artifacts (otherwise known as numbers) into shapes and forms that viewers quickly grasp and understand. The best visualizations, in fact, impart this understanding intuitively. Viewers comprehend the data immediately—without thinking. Such presentations free the viewer to more fully consider the implications of the data: the stories it tells, the insights it reveals, or even the warnings it offers. That, of course, defines the best kind of communication.

If you’re developing web sites or web applications today, there’s a good chance you have data to communicate, and that data may be begging for a good visualization. But how do you know what kind of visualization is appropriate? And, even more importantly, how do you actually create one? Answers to those very questions are the core of this book. In the chapters that follow, we explore dozens of different visualizations, techniques, and tool kits. Each example discusses the appropriateness of the visualization (and suggests possible alternatives) and provides step-by-step instructions for including the visualization in your own web pages.

With a publication date of March 2015, it’s hard to get more current information on data visualization and JavaScript!

You can view the text online or buy a proper ebook/hard copy.

Enjoy!

Structure and Interpretation of Computer Programs (LFE Edition)

February 26th, 2015

Structure and Interpretation of Computer Programs (LFE Edition)

From the webpage:

This Gitbook (available here) is a work in progress, converting the MIT classic Structure and Interpretation of Computer Programs to Lisp Flavored Erlang. We are forever indebted to Harold Abelson, Gerald Jay Sussman, and Julie Sussman for their labor of love and intelligence. Needless to say, our gratitude also extends to the MIT press for their generosity in licensing this work as Creative Commons.

Contributing

This is a huge project, and we can use your help! Got an idea? Found a bug? Let us know!

Writing, or re-writing if you are transposing a CS classic into another language, is far harder than most people imagine. It is probably even more difficult than writing the original, because your range of creativity is bound by the organization and themes of the underlying text.

I may have some cycles to donate to proofreading. Anyone else?

Making A Mouse Seem Like A Dragon

February 26th, 2015

Ishaan Tharoor writes of a new edition of ‘Mein Kampf’ in What George Orwell said about Hitler’s ‘Mein Kampf’, saying in part:

But, in my view, the most poignant section of Orwell’s article dwells less on the underpinnings of Nazism and more on Hitler’s dictatorial style. Orwell gazes at the portrait of Hitler published in the edition he’s reviewing:

It is a pathetic, dog-like face, the face of a man suffering under intolerable wrongs. In a rather more manly way it reproduces the expression of innumerable pictures of Christ crucified, and there is little doubt that that is how Hitler sees himself. The initial, personal cause of his grievance against the universe can only be guessed at; but at any rate the grievance is here. He is the martyr, the victim, Prometheus chained to the rock, the self-sacrificing hero who fights single-handed against impossible odds. If he were killing a mouse he would know how to make it seem like a dragon. One feels, as with Napoleon, that he is fighting against destiny, that he can’t win, and yet that he somehow deserves to.

The line:

If he were killing a mouse he would know how to make it seem like a dragon.

is particularly appropriate in a time of defense budgets at all-time highs, restrictions on travel and social media, “homeland” a/k/a “fatherland” security, torture as an instrument of democratic governments, etc.

Where is this dragon that threatens us so? Multiple smallish bands of people with no country, no national industrial base, no navy, no air force, no armored divisions, no ICBMs, no nuclear weapons, no CBW, who are most skilled with knives and light arms.

How many terrorists? In How Many Terrorists Are There: Not As Many As You Might Think Becky Ackers does the math based on the helpful report from the U.S. Department of State, Country Reports on Terrorism.

Before I give you Becky’s total, which errs on the generous side of rounding up, know that the Department of Homeland Security already has them outnumbered.

Try 184,000.

Yep, just 184,000. Even big, bad “Al-Qa’ida (AQ)” and its three affiliates (“Al-Qa’ida in the Arabian Peninsula”; “Al-Qa’ida in Iraq”; and “Al-Qa’ida in the Islamic Maghreb”) boast only 4000 bad guys combined. (The main Al-Qa’ida’s “strength” is “impossible to estimate,” but the Reports admits that its “core has been seriously degraded” following “the death or arrest of dozens of mid- and senior-level AQ operatives.” “Dozens,” not “hundreds.” Hmmm.)

And remember, 184,000 is a ridiculously inflated figure – both because of our generous accounting and also because governments often expand a word’s meaning well beyond the dictionary’s. You may recall the Feds’ contending with straight faces in 2004 that if “a little old lady in Switzerland gave money to a charity for an Afghan orphanage, and the money was passed to al Qaeda,” she met the definition of “enemy combatant.” Five years later, a federal Fusion Center decreed that “if you’re an anti-abortion activist, or if you display political paraphernalia supporting a third-party candidate or [Ron Paul], if you possess subversive literature, you very well might be a member of a domestic paramilitary group.” No telling how many confused Swiss grandmothers and readers of Techdirt’s subversive articles cluster among those 184,000.

That number grows even more absurd when we compare it with the aforementioned Homeland Security’s 240,000 Warriors on Terror. Meanwhile, something like 780,000 cops stalk us nationwide, whose duties also encompass tilting at terrorism’s windmill. And that’s to say nothing of the scores of other bureaucracies at the national, state, and local levels hunting these same 184,000 guerrillas as well as an additional 1,368,137 troops from the armed forces [click on “Rank/Grade – current month”].

Even if you round the absurd number of terrorists up to 200,000 and round our total down to 2,000,000, at present the United States alone has the terrorists outnumbered 10 to 1. Now add in Europe, China, India, etc. and you get the idea that terrorists really are the mice of the world.

Personally I’m glad they are re-printing ‘Mein Kampf.’

A good opportunity to be reminded that leaders who are making dragons out of the mice of terrorism aren’t planning on sacrificing themselves; they are going to sacrifice us, each and every one.

Category Theory – Reading List

February 26th, 2015

Category Theory – Reading List by Peter Smith.

Notes along with pointers to other materials.

About Peter Smith:

These pages are by me, Peter Smith. I retired in 2011 from the Faculty of Philosophy at Cambridge. It was my greatest good fortune to have secure, decently paid, university posts for forty years in leisurely times, with a very great deal of freedom to follow my interests wherever they led. Like many of my generation, I am sure I didn’t at the time really appreciate just how lucky I and my contemporaries were. Some of the more student-orientated areas of this site, then, such as the Teach Yourself Logic Guide, constitute my small but heartfelt effort to give something back by way of thanks.

There is much to explore at Peter’s site besides his notes on category theory.

The Spy Cables [Videos]

February 26th, 2015

The Spy Cables by AJ+.

As of today, the following nine (9) videos on the Spy Cables are on YouTube:

If you ever have an unkind word for some governments, watch Are You A Terrorist? first.

The Aussies have a checklist for potential terrorists, shared with other spy agencies, that includes “denouncing Western countries and governments, particularly the United States and Israel.” I’ll concede that the United States is a Western country/government, but putting Israel in that category makes me doubt the state of education in Australia.

Oh, and purchasing explosives will get you on a terrorist checklist, which AJ+ concedes, but I don’t think they realize that gasoline and alcohol are both highly flammable materials. Large amounts of both are sold every day in the United States.

Hopefully more Spy Cable videos are on the way and will be posted to this YouTube channel.

Enjoy!

PS: What’s really ironic is that for all the huffing and puffing about secrecy, when secrets do come out, you find that government agencies and leaders are as petty and spiteful as anyone you read about in the Hollywood tabloids. Is that what they are trying to hide in the name of “national security”?

How Rdio Onboards New Users

February 26th, 2015

How Rdio Onboards New Users

User Onboarding does a teardown of Rdio, a highly successful music streaming site.

Highly successful does not equal perfect onboarding!

Interesting exercise to duplicate with your web/application interface.

CVE Details

February 26th, 2015

CVE Details: The Ultimate Security Vulnerability Datasource

From the webpage:

www.cvedetails.com provides an easy to use web interface to CVE vulnerability data. You can browse for vendors, products and versions and view cve entries, vulnerabilities, related to them. You can view statistics about vendors, products and versions of products. CVE details are displayed in a single, easy to use page, see a sample here.

CVE vulnerability data are taken from National Vulnerability Database (NVD) xml feeds provided by the National Institute of Standards and Technology.

Additional data from several sources like exploits from www.exploit-db.com, vendor statements and additional vendor supplied data, Metasploit modules are also published in addition to NVD CVE data. Vulnerabilities are classified by cvedetails.com using keyword matching and cwe numbers if possible, but they are mostly based on keywords.

Unless otherwise stated CVSS scores listed on this site are “CVSS Base Scores” provided in NVD feeds. Vulnerability data are updated daily using NVD feeds. Please visit nvd.nist.gov for more details.

It is hard to say how much data about security issues is kept secret versus how much is made public. What is clear, however, is that organizing the public information leaves a lot to be desired.

Take the CVE advisory on the Superfish issue:

Vulnerability Details : CVE-2015-2077.

In addition to the information on the page you are invited to:

Search Twitter

Search YouTube

Search Google

No peeking! Without checking those links, what search string do you think appears in each one?

  • Komodia Redirector
  • man-in-the-middle attackers
  • Superfish

Would you believe, none of the above?

The actual search string is: “CVE-2015-2077.”

Yep, the identifier assigned by the CVE site is used as the search string.

The same is true for the drop-down menu, External Links, which searches Secunia Advisories, XForce Advisories, Vulnerability Details at NVD, Vulnerability Details at Mitre, and Nessus Plugins (the exception is First CVSS Guide, which links to A Complete Guide to the Common Vulnerability Scoring System Version 2.0).

Don’t get me wrong, CVE Details is a great information resource, but it is bound by the use of its own identifiers. Relying on them alone, you are going to miss blog posts, tweets, and other materials.
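
One way around the single-identifier blind spot is to search on the identifier and the popular names together. A small sketch (the extra terms are my choices) that builds such a query:

    import urllib.parse

    # Combine the CVE identifier with the names people actually use.
    terms = ["CVE-2015-2077", "Superfish", "Komodia Redirector"]
    query = " OR ".join(f'"{t}"' for t in terms)
    print("https://www.google.com/search?"
          + urllib.parse.urlencode({"q": query}))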

BTW, CVE = Common Vulnerabilities and Exposures.

Enjoy!

The World’s ‘Most Secure’ Operating System Adds a Bitcoin Wallet

February 26th, 2015

The World’s ‘Most Secure’ Operating System Adds a Bitcoin Wallet by Ian DeMartino.

From the post:

Tails OS, which experts consider the world’s most secure operating system and that the NSA called a “threat,” has released a new version that includes a Bitcoin wallet option.

In the fight for privacy, bitcoin has been an invaluable tool. While those in the know will be the first to tell you that bitcoin isn’t completely anonymous, the pseudonymous nature of bitcoin gives it far more privacy than credit card transactions, particularly if certain precautions are taken.

However, there is a larger battle in the war for privacy, and that is the battle for privacy of communication. A major advancement in this field is Tails OS, which was famously used by Glenn Greenwald and the other journalists that broke the Edward Snowden leaks. Yesterday, Tails OS announced that they have released version 1.3, and it includes an option for adding an Electrum Bitcoin wallet.

Just in case you are being proactive about your security and collecting funds in Bitcoins, you will find this of interest.

I was surprised to find that Tails comes with LibreOffice and not Emacs as an editor. You can always add it, but I am curious why it isn’t present by default.

Enjoy!

Periodic Table of Machine Learning Libraries

February 26th, 2015

Periodic Table of Machine Learning Libraries

Interesting visual display, but it isn’t apparent how the libraries were associated with particular elements.

That is, why would I look for GATE at 106?

What I would find more interesting would be a listing of all of these machine learning libraries with pointers to additional resources for each one.

Just a thought.

PACKT Publishing – FREE LEARNING – HELP YOURSELF

February 25th, 2015

PACKT Publishing – FREE LEARNING – HELP YOURSELF

I’m not sure when this started but according to the webpage, there will be one free book per day until March 5, 2015.

I will be checking back tomorrow to see if the selection changes day to day.

Worth a trip just to see if there is anything of interest.

Enjoy!

Elon Musk Must Be Wringing His Hands, Again

February 25th, 2015

Google develops computer program capable of learning tasks independently by Hannah Devlin.

From the post:

Google scientists have developed the first computer program capable of learning a wide variety of tasks independently, in what has been hailed as a significant step towards true artificial intelligence.

The same program, or “agent” as its creators call it, learnt to play 49 different retro computer games, and came up with its own strategies for winning. In the future, the same approach could be used to power self-driving cars, personal assistants in smartphones or conduct scientific research in fields from climate change to cosmology.

The research was carried out by DeepMind, the British company bought by Google last year for £400m, whose stated aim is to build “smart machines”.

Demis Hassabis, the company’s founder said: “This is the first significant rung of the ladder towards proving a general learning system can work. It can work on a challenging task that even humans find difficult. It’s the very first baby step towards that grander goal … but an important one.”

Truly a remarkable achievement.

I haven’t found a more detailed description of the strategies developed by the “agent,” but it would be interesting to try those out on retro computer games.

The post is a good one and worth your time to read.

It closes by contrasting Elon Musk’s fears of an AI apocalypse with Google’s assurance that any danger is decades away.

I take a great deal of reassurance from the “agent” being supplied with the retro video games.

The “agent” did not choose to become a master of Asteroids, with the intent of being the despair of all other gamers at the local arcade.

However good an “agent” may become at any task, from video games to surgery, the question is: who chooses the task to be performed? Granted, we probably want to lock out commands like “Make me a suitcase-sized nuclear weapon” and that sort of thing.

Working with Small Files in Hadoop – Part 1, Part 2, Part 3

February 25th, 2015

Working with Small Files in Hadoop – Part 1, Part 2, Part 3 by Chris Deptula.

From the post:

Why do small files occur?

The small file problem is an issue Inquidia Consulting frequently sees on Hadoop projects. There are a variety of reasons why companies may have small files in Hadoop, including:

  • Companies are increasingly hungry for data to be available near real time, causing Hadoop ingestion processes to run every hour/day/week with only, say, 10MB of new data generated per period.
  • The source system generates thousands of small files which are copied directly into Hadoop without modification.
  • The configuration of MapReduce jobs using more than the necessary number of reducers, each outputting its own file. Along the same lines, if there is a skew in the data that causes the majority of the data to go to one reducer, then the remaining reducers will process very little data and produce small output files.

Does it sound like you have small files? If so, this series by Chris is what you are looking for.
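
As a taste of the kind of remedy the series discusses, one common fix is consolidating small files before they reach your jobs. A minimal sketch (the paths are assumptions) driving the standard hadoop fs commands from Python:

    import subprocess

    # Merge a directory of small HDFS files into one local file, then put
    # the consolidated file back into HDFS. `hadoop fs -getmerge`
    # concatenates every file under the source directory.
    subprocess.run(
        ["hadoop", "fs", "-getmerge", "/ingest/2015-02-25", "merged.dat"],
        check=True,
    )
    subprocess.run(
        ["hadoop", "fs", "-put", "merged.dat",
         "/warehouse/2015-02-25/merged.dat"],
        check=True,
    )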

Learning Data Visualization using Processing

February 25th, 2015

Learning Data Visualization using Processing by C.P. O’Neill.

From the post:

Learning data visualization techniques using the Processing programming language has always been a skill that has been on my list of things to learn really well and I finally got around to get started. I’ve used other technologies and methods before for data visualization, most notably R and RStudio, so when I got the opportunity to learn how to take that skill to the next level I jumped at it. Here is a visualization of all the meteor strikes that have been collected around the world. The bigger the circles, the larger the impact. I’m not going to go into a huge analysis since I’m sure it’s been done many times before, but I am excited to get cracking on other data sets in the near future.

GitHub: repo

Skillshare Class: Data Visualization: Designing Maps with Processing and Illustrator

A nice reminder about Processing.
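
If you want to try the same exercise without leaving Python, here is a minimal matplotlib sketch of the idea, with synthetic data standing in for the meteor strike records:

    import matplotlib.pyplot as plt
    import numpy as np

    # Strikes plotted by longitude/latitude, circle area scaled to impact.
    # The data is randomly generated for illustration only.
    rng = np.random.default_rng(42)
    lon = rng.uniform(-180, 180, 200)
    lat = rng.uniform(-60, 75, 200)
    mass = rng.lognormal(mean=3, sigma=2, size=200)

    plt.scatter(lon, lat, s=mass / mass.max() * 400,
                alpha=0.4, edgecolors="k")
    plt.xlabel("Longitude")
    plt.ylabel("Latitude")
    plt.title("Meteor strikes (synthetic data), circle area ~ impact")
    plt.show()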

I have seen the usual visualization of arms exporters (the U.S. is #1, by the way) but wonder about a visualization of the deaths attributable to world leaders during their terms in office (20th/21st century). Some of the counts are iffy, and how do you allocate Russian deaths between Germany and the Allies (for not supporting Russia)? Still, it could be an interesting exercise.

I first saw this in a tweet by Stéphane Fréchette.

DataStax – New for 2015 – Free Online Instructor Led Training

February 25th, 2015

DataStax – New for 2015 – Free Online Instructor Led Training

I count six (6) free online courses in March 2015.

As of today, two of them report being “sold out” and you can join a waiting list.

If you take one or more of these courses, don’t keep your attendance a secret. Provide feedback to DataStax and post your comments about the experience online.

High-quality online training isn’t cheap, and positive feedback will strengthen the hand of those responsible for these free training classes.

Everyone is an IA [Information Architecture]

February 25th, 2015

Everyone is an IA [Information Architecture] by Dan Ramsden.

From the post:

This is a post inspired by my talk from World IA Day. On the day I had 20 minutes to fill – I did a magic trick and talked about an imaginary uncle. This post has the benefit of an edit, but recreates the central argument – everyone makes IA.

Information architecture is everywhere, it’s a part of every project, every design includes it. But I think there’s often a perception that because it requires a level of specialization to do the most complicated types of IA, people are nervous about how and when they engage with it – no-one likes to look out of their depth. And some IA requires a depth of thinking that deserves justification and explanation.

Even when you’ve built up trust with teams of other disciplines or clients, I think one of the most regular questions asked of an IA is probably, ‘Is it really that complicated?’ And if we want to be happier in ourselves, and spread happiness by creating meaningful, beautiful, wonderful things – we need to convince people that complex is different from complicated. We need to share our conviction that IA is a real thing and that thinking like an IA is probably one of the most effective ways of contributing to a more meaningful world.

But we have a challenge, IAs are usually the minority. At the BBC we have a team of about 140 in UX&D, and IAs are the minority – we’re not quite 10%. It’s my job to work out how those less than 1 in 10 can be as effective as possible and have the biggest positive impact on the work we do and the experiences we offer to our audiences. I don’t think this is unique. A lot of the time IAs don’t work together, or there’s not enough IAs to work on every project that could benefit from an IA mindset, which is every project.

This is what troubled me. How could I make sure that it is always designed? My solution to this is simple. We become the majority. And because we can’t do that just by recruiting a legion of IAs we do it another way. We turn everyone in the team into an information architect.

Now this is a bit contentious. There’s legitimate certainty that IA is a specialism and that there are dangers of diluting it. But last year I talked about an IA mindset, a way of approaching any design challenge from an IA perspective. My point then was that the way we tend to think and therefore approach design challenges is usually a bit different from other designers. But I don’t believe we’re that special. I think other people can adopt that mindset and think a little bit more like we do. I think if we work hard enough we can find ways to help designers to adopt that IA mindset more regularly.

And we know the benefits on offer when every design starts from the architecture up. Well-architected things work better. They are more efficient, connected, resilient and meaningful – they’re more useful.

Dan goes on to say that information is everywhere. Much in the same way that I would say that subjects are everywhere.

Just as users must describe information architectures as they experience them, the same is true for users identifying the subjects that are important to them.

There is never a doubt that more IAs and more subjects exist, but the best anyone can do is to tell you about the ones that are important to them and how they have chosen to identify them.

To no small degree, I think terminology has been used to disenfranchise users from discussing subjects as they understand them.

From my own background, I remember a database project where the head of membership services, who ran reports by rote out of R&R, insisted on saying where data needed to reside in tables during a complete re-write of the database. I kept trying, with little success, to get them to describe what they wanted to store and what capabilities they needed.

In retrospect, I should have allowed membership services to use their terminology to describe the database because whether they understood the underlying data architecture or not wasn’t a design goal. The easier course would have been to provide them with a view that accorded with their idea of the database structure and to run their reports. That other “views” of the data existed would have been neither here nor there to them.

As “experts,” we should listen to the description of information architectures and/or identifications of subjects and their relationships as a voyage of discovery. We are discovering the way someone else views the world, not for our correction to the “right” way but so we can enable their view to be more productive and useful to them.

That approach takes more work on the part of “experts” but think of all the things you will learn along the way.

Download the Hive-on-Spark Beta

February 25th, 2015

Download the Hive-on-Spark Beta by Xuefu Zhang.

From the post:

The Hive-on-Spark project (HIVE-7292) is one of the most watched projects in Apache Hive history. It has attracted developers from across the ecosystem, including from organizations such as Intel, MapR, IBM, and Cloudera, and gained critical help from the Spark community.

Many anxious users have inquired about its availability in the last few months. Some users even built Hive-on-Spark from the branch code and tried it in their testing environments, and then provided us valuable feedback. The team is thrilled to see this level of excitement and early adoption, and has been working around the clock to deliver the product at an accelerated pace.

Thanks to this hard work, significant progress has been made in the last six months. (The project is currently incubating in Cloudera Labs.) All major functionality is now in place, including different flavors of joins and integration with Spark, HiveServer2, and YARN, and the team has made initial but important investments in performance optimization, including split generation and grouping, supporting vectorization and cost-based optimization, and more. We are currently focused on running benchmarks, identifying and prototyping optimization areas such as dynamic partition pruning and table caching, and creating a roadmap for further performance enhancements for the near future.

Two months ago, we announced the availability of an Amazon Machine Image (AMI) for a hands-on experience. Today, we even more proudly present you a Hive-on-Spark beta via CDH parcel. You can download that parcel here. (Please note that in this beta release only HDFS, YARN, Apache ZooKeeper, and Hive are supported. Other components, such as Apache Pig, Apache Oozie, and Impala, might not work as expected.) The “Getting Started” guide will help you get your Hive queries up and running on the Spark engine without much trouble.

We welcome your feedback. For assistance, please use user@hive.apache.org or the Cloudera Labs discussion board.

We will update you again when GA is available. Stay tuned!

If you are snowbound this week, this may be what you have been looking for!
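
To make “up and running” concrete, here is a minimal sketch (not from the Getting Started guide; the connection details and table are assumptions) that points a Hive session at the Spark engine from Python via the PyHive client:

    from pyhive import hive  # assumes the PyHive client; any client works

    # Flip hive.execution.engine to spark for this session, then query.
    conn = hive.Connection(host="localhost", port=10000, username="hive")
    cursor = conn.cursor()
    cursor.execute("SET hive.execution.engine=spark")
    cursor.execute("SELECT browser, COUNT(*) FROM weblogs GROUP BY browser")
    for row in cursor.fetchall():
        print(row)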

I have listed this under both Hive and Spark separately but am confident enough of its success that I created Hive-on-Spark as well.

Enjoy!

Typography Teardown of Advertising Age

February 25th, 2015

Typography Teardown of Advertising Age by Jeremiah Shoaf.

From the post:

I’m a huge fan of Samuel Hulick’s user onboarding teardowns so I thought it would be fun to try a new feature on Typewolf where I do a “typography teardown” of a popular website. I’ll review the design from a typographic perspective and discuss what makes the type work and what could potentially have been done better.

In this first edition I’m going to take a deep dive into the type behind the Advertising Age website. But first, a disclaimer.

Disclaimer: The following site was created by designers way more talented than myself. This is simply my opinion on the typography and how, at times, I may have approached things differently. Rules in typography are meant to be broken.

As you already know, I’m at least graphically challenged if not worse. ;-)

Still, that doesn’t prevent me from enjoying graphics and layouts; I just have a hard time originating them. And I keep trying, by reading resources such as this one.

While Jeremiah reviews a website here, the same principles apply to an application interface.

Enjoy!

Google As Censor

February 25th, 2015

Google bans sexually explicit content on Blogger by Lisa Vaas.

From the post:

Google hasn’t changed its policy’s messaging around censorship, stating that “censoring this content is contrary to a service that bases itself on freedom of expression.”

How Google will manage, with Blogger, to increase “the availability of information, [encourage] healthy debate, and [make] possible new connections between people” while still curbing “abuses that threaten our ability to provide this service and the freedom of expression it encourages” remains to be seen.

I wrote an entire post, complete with Supreme Court citations, etc., on the basis that Google was really trying to be a moral censor without saying so. As I neared the end of the post, the penny dropped and the explanation for Google’s banning of “sexually explicit content” became clear.

Read that last part of the Google quote carefully:

“abuses that threaten our ability to provide this service and the freedom of expression it encourages”

Who would have the power to threaten Google’s sponsorship of Blogger and “the freedom of expression it encourages”?

Hmmm, does China come to mind?

China relaxes on pornography but YouTube is still blocked by Malcolm Moore.

Whether China is planning on new restrictions on pornography in general or Google is attempting to sweeten a deal with China by self-policing isn’t clear.

I find that a great deal more plausible than thinking Google has suddenly lost interest in what can be highly lucrative content.

When they see “sexually explicit content,” Google and its offended Chinese censor buddies:

could effectively avoid further bombardment of their sensibilities simply by averting their eyes.

Cohen v. California, 403 U.S. 15 (1971).

Averting your eyes is even easier with a web browser because you have to seek out the offensive content. If material offends you, don’t go there. Problem solved.

Google’s role as censor isn’t going to start with deleting large numbers of books from Google Books and heavy handed censoring of search results.

No, Google will start by censoring IS and other groups unpopular with one government or another. Then, as here, Google will move up to making some content harder to post, again at the behest of some government. By the time Google censorship reaches you, the principle of censorship will be well established and the only question left being where the line is drawn.

PS: Obviously I am speculating that China is behind the censoring of Blogger by Google but let’s first call this action what it is in fact: censorship. I don’t have any cables between China and Google but I feel sure someone does. Perhaps there is a leaky Google employee who can clear up this mystery for us all.

Start of a new era: Apache HBase™ 1.0

February 25th, 2015

Start of a new era: Apache HBase™ 1.0

From the post:

The Apache HBase community has released Apache HBase 1.0.0. Seven years in the making, it marks a major milestone in the Apache HBase project’s development, offers some exciting features and new API’s without sacrificing stability, and is both on-wire and on-disk compatible with HBase 0.98.x.

In this blog, we look at the past, present and future of Apache HBase project. 

The 1.0.0 release has three goals:

1) to lay a stable foundation for future 1.x releases;

2) to stabilize running HBase cluster and its clients; and

3) make versioning and compatibility dimensions explicit 

Seven (7) years is a long time, so kudos to everyone who contributed to getting Apache HBase to this point!
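
If you want to kick the tires from Python, here is a minimal sketch against HBase’s Thrift gateway using the happybase client (the host, table, and column names are assumptions, and the table must already exist):

    import happybase  # a Python client for HBase's Thrift gateway

    # Write one cell, then read the row back.
    connection = happybase.Connection("localhost")  # Thrift, default 9090
    table = connection.table("blog_posts")
    table.put(b"row-1", {b"content:title": b"Apache HBase 1.0"})
    print(table.row(b"row-1"))
    connection.close()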

For those of you who like documentation, see the Apache HBase™ Reference Guide.

Black Site in USA – Location and Details

February 24th, 2015

The disappeared: Chicago police detain Americans at abuse-laden ‘black site’ by Spencer Ackerman.

From the post:

The Chicago police department operates an off-the-books interrogation compound, rendering Americans unable to be found by family or attorneys while locked inside what lawyers say is the domestic equivalent of a CIA black site.

The facility, a nondescript warehouse on Chicago’s west side known as Homan Square, has long been the scene of secretive work by special police units. Interviews with local attorneys and one protester who spent the better part of a day shackled in Homan Square describe operations that deny access to basic constitutional rights.

Alleged police practices at Homan Square, according to those familiar with the facility who spoke out to the Guardian after its investigation into Chicago police abuse, include:

  • Keeping arrestees out of official booking databases.
  • Beating by police, resulting in head wounds.
  • Shackling for prolonged periods.
  • Denying attorneys access to the “secure” facility.
  • Holding people without legal counsel for between 12 and 24 hours, including people as young as 15.

At least one man was found unresponsive in a Homan Square “interview room” and later pronounced dead.

And it gets worse, far worse.

It is a detailed post but merits a slow read, particularly the statement by Jim Trainum, a former DC homicide detective:


“I’ve never known any kind of organized, secret place where they go and just hold somebody before booking for hours and hours and hours. That scares the hell out of me that that even exists or might exist,” said Trainum, who now studies national policing issues, to include interrogations, for the Innocence Project and the Constitution Project.

If a detective who lived with death and violence on a day-to-day basis is frightened of police black sites, what should our reaction be?

MILJS : Brand New JavaScript Libraries for Matrix Calculation and Machine Learning

February 24th, 2015

MILJS : Brand New JavaScript Libraries for Matrix Calculation and Machine Learning by Ken Miura, et al.

Abstract:

MILJS is a collection of state-of-the-art, platform-independent, scalable, fast JavaScript libraries for matrix calculation and machine learning. Our core library offering a matrix calculation is called Sushi, which exhibits far better performance than any other leading machine learning libraries written in JavaScript. Especially, our matrix multiplication is 177 times faster than the fastest JavaScript benchmark. Based on Sushi, a machine learning library called Tempura is provided, which supports various algorithms widely used in machine learning research. We also provide Soba as a visualization library. The implementations of our libraries are clearly written, properly documented and thus are easy to get started with, as long as there is a web browser. These libraries are available from this http URL under the MIT license.

Where “this http URL” = http://mil-tokyo.github.io/. It’s a hyperlink with that text in the original so I didn’t want to change the surface text.

The paper is a brief introduction to the JavaScript Libraries and ends with several short demos.

On this one, yes, run and get the code: http://mil-tokyo.github.io/.

Happy coding!

Using NLP to measure democracy

February 24th, 2015

Using NLP to measure democracy by Thiago Marzagão.

Abstract:

This paper uses natural language processing to create the first machine-coded democracy index, which I call Automated Democracy Scores (ADS). The ADS are based on 42 million news articles from 6,043 different sources and cover all independent countries in the 1993-2012 period. Unlike the democracy indices we have today the ADS are replicable and have standard errors small enough to actually distinguish between cases.

The ADS are produced with supervised learning. Three approaches are tried: a) a combination of Latent Semantic Analysis and tree-based regression methods; b) a combination of Latent Dirichlet Allocation and tree-based regression methods; and c) the Wordscores algorithm. The Wordscores algorithm outperforms the alternatives, so it is the one on which the ADS are based.

There is a web application where anyone can change the training set and see how the results change: democracy-scores.org
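
Since the ADS are built on Wordscores, a toy version of that algorithm shows what is happening under the hood. In this sketch (the counts and scores are invented), reference texts with known scores pass those scores to words, and the words pass them on to unscored “virgin” texts:

    import numpy as np

    # Word-by-document counts for two reference texts.
    F_ref = np.array([[10, 0],    # word appears only in reference 0
                      [ 5, 5],    # word appears equally in both
                      [ 0, 8]])   # word appears only in reference 1
    ref_scores = np.array([-1.0, 1.0])  # e.g., least to most democratic

    # P(ref | word): each word's relative frequency across references.
    p_ref_given_word = F_ref / F_ref.sum(axis=1, keepdims=True)
    word_scores = p_ref_given_word @ ref_scores   # [-1.0, 0.0, 1.0]

    # Score a virgin text as the frequency-weighted mean of word scores.
    F_virgin = np.array([2, 4, 6])
    virgin_score = (F_virgin / F_virgin.sum()) @ word_scores
    print(f"virgin text score: {virgin_score:+.2f}")   # +0.33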

The Automated Democracy Scores are part of the PhD work of Thiago Marzagão. An online interface allows you to change democracy scores by year and country and run the analysis against 200 billion data points on an Amazon cluster.

Quite remarkable, although I suspect this level of PhD work, and public access to it, will grow rapidly in the near future.

Do read the paper and don’t jump straight to the data. ;-) Take a minute to see what results Thiago has reached thus far.

Personally I was expecting the United States and China to be running neck and neck. Mostly because the wealthy choose candidates for public office in the United States and in China the Party chooses them. Not all that different, perhaps a bit more formalized and less chaotic in China. Certainly less in the way of campaign costs. (humor)

I was seriously surprised to find that democracy was lowest in Africa and the Middle East. Evaluated on a national basis that may be correct, but Western definitions aren’t easy to apply to Africa and the Middle East. See Nation, Tribe and Ethnic Group in Africa and Democracy and Consensus in African Traditional Politics for one tip of the iceberg on decision making in Africa.