Archive for December, 2015

16 Journalism Tools & Resources to Explore in 2016

Thursday, December 31st, 2015

16 Journalism Tools & Resources to Explore in 2016

From the post:

Every year we start with a fresh and very personal selection of tools and resources that offer a glimpse of the future of journalism. Mobile, virtual, highly visual, allowing to share and verify news in new ways, putting the audience first,…

Enjoy. Let 2016 be a great year to discover and tell great stories!

JournalismTools lists its project last, but in all honesty, it should have been first.

If that is the only link you follow, you will have gotten a lot out of their listing. Follow the others too, but follow theirs first.

No, I have no connection with the project or any of its members, but I can recognize dedication to fact finding when I see it.

We may differ on what the “facts” are or what they may or may not mean, but they are the starting point of having something to say.

Something to be captured by a topic map for instance.


XQilla-2.3.2 – Tooling up for 2016 (Part 1) (XQuery)

Thursday, December 31st, 2015

Along with other end of the year tasks, I’m installing several different XQuery tools. Not all tools support all extensions and so a variety of tools can be a useful thing.

The README for XQilla-2.3.2 comes close to winning a prize for being terse:

1. Download a source distribution of Xerces-C 3.1.2

2. Build Xerces-C

cd xerces-c-3.1.2/

4. Build XQilla

cd xqilla/
./configure --with-xerces=`pwd`/../xerces-c-3.1.2/

A few notes that may help:

Obtain Xerces-c-3.1.2 from its homepage.

Xerces project homepage. Home of Apache Xerces C++, Apache Xerces2 Java, Apache Xerces Perl, and, Apache XML Commons.

On configuring the make file for XQilla:

./configure --with-xerces=`pwd`/../xerces-c-3.1.2/

The README presumes you built xerces-c-3.1.2 in a directory alongside the XQilla source (the `..` in the path points at the parent directory). You could; just out of habit, I built xerces-c-3.1.2 in a separate directory.

The configuration file for XQilla reads in part:

--with-xerces=DIR Path of Xerces. DIR="/usr/local"

So you could build XQilla with an existing install of xerces-c-3.1.2 if you are so-minded. But if you are that far along, you don’t need these notes. 😉

Strictly for my system (your paths will be different), after building xerces-c-3.1.2, I changed directories to XQilla-2.3.2 and typed:

./configure --with-xerces=/home/patrick/working/xerces-c-3.1.2

No error messages so I am now back at the command prompt and enter make.

Welllll, that was supposed to work!

Here is the error I got:

libtool: link: g++ -O2 -ftemplate-depth-50 -o .libs/xqilla 
.libs/ ./.libs/ -lnsl -lpthread -Wl,-rpath 
/usr/bin/ld: warning:, needed by 
   not found (try using -rpath or -rpath-link)
   undefined reference to `uset_close_55'
   undefined reference to `ucnv_fromUnicode_55'
...[omitted numerous undefined references]...
collect2: error: ld returned 1 exit status
make[1]: *** [xqilla] Error 1
make[1]: Leaving directory `/home/patrick/working/XQilla-2.3.2'
make: *** [all-recursive] Error 1

To help you avoid surfing the web to track down this issue, realize that Ubuntu doesn’t use the latest releases. Of anything as far as I can tell.

The bottom line being that Ubuntu 14.04 doesn’t have

If I manually upgrade libraries, I might create an inconsistency package management tools can’t fix. 🙁 And break working tools. Bad joss!

Fear Not! There is a solution, which I will cover in my next XQilla-2.3.2 post!

PS: I didn’t get back to the sorting post in time to finish it today. Not to mention that I encountered another nasty list in Most Vulnerable Software of 2015! (Perils of Interpretation!, Advice for 2016).

I say “nasty,” you should see some of the lists you can find at Valid XML I’ll concede but not as useful as they could be.

Improving online lists, combining them with other data, etc., are some of the things I want to cover this coming year.

Most Vulnerable Software of 2015! (Perils of Interpretation!, Advice for 2016)

Thursday, December 31st, 2015

Software with the most vulnerabilities in 2015: Mac OS X, iOS, and Flash by Emil Protalinski.

From the post:

Which software had the most publicly disclosed vulnerabilities this year? The winner is none other than Apple’s Mac OS X, with 384 vulnerabilities. The runner-up? Apple’s iOS, with 375 vulnerabilities.

Rounding out the top five are Adobe’s Flash Player, with 314 vulnerabilities; Adobe’s AIR SDK, with 246 vulnerabilities; and Adobe AIR itself, also with 246 vulnerabilities. For comparison, last year the top five (in order) were: Microsoft’s Internet Explorer, Apple’s Mac OS X, the Linux Kernel, Google’s Chrome, and Apple’s iOS.

For “comparison” purposes, also consider:

Most vulnerable operating systems and applications in 2014 by Cristian Florian.

And a cautionary post by Emmanuel Carabott, The Pitfalls of Interpreting Vulnerability Data.

Amazing isn’t it?

How the vagaries of data come to the fore if you disagree with its interpretation?

Instead of containing “actionable insights” waiting for the plucking, data is suddenly mixed, insufficient, complicated and subject to interpretation.

You should remember that in the next Big Data/Graph/Deep Learning presentation that promises certainty/profit/insight is just a license and/or support agreement away.

Anything is possible but I would prefer to articulate, with your assistance and data, a certainty, business ROI, or insight of interest to you.

Isn’t that what really matters?

“Every” innovative firm may be investing in n-dimensional printing software but if you have an aging HP-4000 (like I do), an investment on your part won’t have any ROI.

My advice for 2016 is to not allow a vendor’s problem (need to make a sale) become your problem (now what do I do with X?).

History of Apache Storm and lessons learned

Thursday, December 31st, 2015

History of Apache Storm and lessons learned by Nathan Marz.

From the post:

Apache Storm recently became a top-level project, marking a huge milestone for the project and for me personally. It’s crazy to think that four years ago Storm was nothing more than an idea in my head, and now it’s a thriving project with a large community used by a ton of companies. In this post I want to look back at how Storm got to this point and the lessons I learned along the way.

The topics I will cover through Storm’s history naturally follow whatever key challenges I had to deal with at those points in time. The first 25% of this post is about how Storm was conceived and initially created, so the main topics covered there are the technical issues I had to figure out to enable the project to exist. The rest of the post is about releasing Storm and establishing it as a widely used project with active user and developer communities. The main topics discussed there are marketing, communication, and community development.

Any successful project requires two things:

  1. It solves a useful problem
  2. You are able to convince a significant number of people that your project is the best solution to their problem

What I think many developers fail to understand is that achieving that second condition is as hard and as interesting as building the project itself. I hope this becomes apparent as you read through Storm’s history.

All projects are different but the requirements for success:

  1. It solves a useful problem
  2. You are able to convince a significant number of people that your project is the best solution to their problem

sound universal to me!

To clarify point #2, “people” means “other people.”

Preaching to a mirror or choir isn’t going to lead to success.

Nor will focusing on “your problem” as opposed to “their problem.”

PS: New Year’s Eve advice – Don’t download large files. 😉 Slower than you want to think. Suspect people on my subnet are streaming football games and/or porno videos, perhaps both (screen within screen).

I first saw this in a tweet by Bob DuCharme.

A Greater Threat to the U.S. Than the Islamic State

Thursday, December 31st, 2015

Those Demanding Free Speech Limits to Fight ISIS Pose a Greater Threat to U.S. Than ISIS by Glenn Greenwald.

From the post:

In 2006 — years before ISIS replaced al Qaeda as the New and Unprecedentedly Evil Villain — Newt Gingrich gave a speech in New Hampshire in which, as he put it afterward, he “called for a serious debate about the First Amendment and how terrorists are abusing our rights — using them as they once used passenger jets — to threaten and kill Americans.” In that speech, Gingrich argued:

Either before we lose a city, or, if we are truly stupid, after we lose a city, we will adopt rules of engagement that use every technology we can find to break up (terrorists’) capacity to use the internet, to break up their capacity to use free speech [protections] and to go after people who want to kill us — to stop them from recruiting people before they get to reach out and convince young people to destroy their lives while destroying us.

In a follow-up article titled “The First Amendment is Not a Suicide Pact,” Gingrich went even further, arguing that terrorists should be “subject to a totally different set of rules,” and called for an international convention to decide “on what activities will not be protected by free speech claims.”

Greenwald writes that limits on freedom of speech are not a historical nutty-idea from the past but are being raised by Cass Sunstein (Obama adviser) and Eric Posner (law professor).

Even the advocates of limits on free speech concede the legal system won’t, yet, accept limits on freedom of speech. That could change.

Imagine telling parents in the 1990’s that, post-2010, allowing strangers to fondle their genitals and those of their children would be a prerequisite to air travel.

Who would have said then they would meekly line up like sheep to be intimately touched by strangers?

Or allow their children to be groped by strangers?

But both of those have come to pass. With nary a flicker of opposition from Congress.

Read Greenwald’s post in full and know that limits on freedom of speech, like restrictions on your right to travel (rejection of state driver licenses as identification), violation of your personal space (groping at airports), are not very far away at all.

Sorting Slightly Soiled Data (Or The Danger of Perfect Example Data) – XQuery

Wednesday, December 30th, 2015

Continuing with the data from my post: Great R packages for data import, wrangling & visualization [+ XQuery], I have discovered the dangers of perfect example data!

The XQuery examples on sorting that I have read either enclose strings in quotes and/or have strings with no whitespaces.

How often do you see strings with no whitespaces? Outside of highly constrained environments?

Why is that a problem?

Well, take a look at my results from sorting on the short description and displaying the short description first and the package name second:

package development, package installation devtools
misc installr
data import readxl
data import, data export googlesheets
data import RMySQL
data import readr
data import, data export rio
data analysis psych
data wrangling, data analysis sqldf
data import, data wrangling jsonlite
data import, data wrangling XML
data import, data visualization, data analysis quantmod
data import, web scraping rvest
data wrangling, data analysis dplyr
data wrangling plyr
data wrangling reshape2
data wrangling tidyr
data wrangling, data analysis data.table
data wrangling stringr
data wrangling lubridate
data wrangling, data analysis zoo
data display editR
data display knitr
data display, data wrangling listviewer
data display DT
data visualization ggplot2
data visualization dygraphs
data visualization googleVis
data visualization metricsgraphics
data visualization RColorBrewer
data visualization plotly
mapping leaflet
mapping choroplethr
mapping tmap
misc fitbitScraper
Web analytics rga
Web analytics RSiteCatalyst
package development roxygen2
data visualization shiny
misc openxlsx
data wrangling, data analysis gmodels
data wrangling car
data visualization rcdimple
data wrangling foreach
data acquisition downloader
data wrangling scales
data visualization plotly

Err, that’s not right!

The XQuery from yesterday:

xquery version "1.0";
<html>
<table>{
for $row in doc("/home/patrick/working/favorite-R-packages.xml")/table/tr
order by lower-case(string($row/td[1]/a))
return <tr>{$row/td[1]} {$row/td[2]}</tr>
}</table>
</html>

XQuery from today (the changes are on the order by and return lines, where td[1] and td[2] swap places):

xquery version "1.0";
<html>
<table>{
for $row in doc("/home/patrick/working/favorite-R-packages.xml")/table/tr
order by lower-case(string($row/td[2]/a))
return <tr>{$row/td[2]} {$row/td[1]}</tr>
}</table>
</html>

First, how do you explain the failure? Looks like no sort order at all.

Truthfully it does have a sort order, just not the one you expected. The results appear in document sort order, as they appeared in the document.

Here’s a snippet of that document:

<td><a href="" target="_new">devtools</a></td>
<td>package development, package installation</td>
<td>While devtools is aimed at helping you create your own R packages, it's also 
essential if you want to easily install other packages from GitHub. Install it! 
Requires <a href="" target="_new">
Rtools</a> on Windows and <a href="" 
target="_new">XCode</a> on a Mac. On CRAN.</td>
<td>Hadley Wickham & others</td>
<td><a href="" target="_new">installr</a>
<td>Windows only: Update your installed version of R from within R. On CRAN.</td>
<td>Tal Galili & others</td>
<td><a href="" target="_new">readxl</a>
</td><td>data import</td>
<td>Fast way to read Excel files in R, without dependencies such as Java. CRAN.</td>
<td>read_excel("my-spreadsheet.xls", sheet = 1)</td>
<td>Hadley Wickham</td>

I haven’t run the problem entirely to ground but as you can see from the output:

data import, data wrangling jsonlite
data import, data wrangling XML
data import, data visualization, data analysis quantmod

Most of the descriptions have spaces, not to mention “,” separating categories.
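A guess worth checking, offered as a possibility rather than a confirmed diagnosis: in the snippet above, the second cell holds plain text with no a child. If that is true of every row, then $row/td[2]/a selects nothing, string() turns that into an empty string, and every row gets the same sort key, which would leave the rows in document order. The sketch below counts rows with and without an a in the second cell. The document path is the one used in the queries above; the check, with-a and without-a element names are made up for the example.

```xquery
xquery version "1.0";
(: Hypothetical check: does the second cell of each row contain an <a> child? :)
let $rows := doc("/home/patrick/working/favorite-R-packages.xml")/table/tr
return
  <check>
    <with-a>{count($rows[td[2]/a])}</with-a>
    <without-a>{count($rows[not(td[2]/a)])}</without-a>
  </check>
```

If with-a comes back as 0, the sort key is empty for every row and the whitespace in the descriptions is not the culprit.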

It is always possible to clean up the data but I want to avoid that if at all possible.

Cleaning data involves the risk I may change the data and once changed, I may not be able to go back to the original.

I can think of at least two (2) ways to fix this problem but want to sleep on it first and pick one that can be easily adapted to the next soiled data that comes through the door.
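One possible fix, sketched under an assumption and not necessarily either of the two the post has in mind: assuming the short description sits directly in the second td as text (as the snippet suggests), the sort key can be the string value of the cell itself instead of a child a element. The normalize-space() call is an extra precaution against stray line breaks and leading spaces in the cell; it changes only the sort key, not the data returned.

```xquery
xquery version "1.0";
<html>
<table>{
for $row in doc("/home/patrick/working/favorite-R-packages.xml")/table/tr
(: Assumption: td[2] holds the description as plain text, with no <a> child. :)
(: Sort on the whole cell, lower-cased, with whitespace trimmed and collapsed. :)
order by lower-case(normalize-space(string($row/td[2])))
return <tr>{$row/td[2]} {$row/td[1]}</tr>
}</table>
</html>
```

Because the original cells are returned untouched, nothing in the source data needs cleaning or rewriting.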

PS: Neither Saxon (9.7), nor BaseX (8.3) gave any error messages at the console for the failure of the sort request.

You could say that document order is about as large an error message as can be given. 😉

Playboy Exposed [Complete Archive]

Wednesday, December 30th, 2015

Playboy Exposed by Univision’s Data Visualization Unit.

From the post:

The first time Pamela Anderson got naked for a Playboy cover, with a straw hat covering her inner thighs, she was barely 22 years old. It was 1989 and the magazine was starting to favor displaying young blondes on its covers.

On Friday, December 11, 2015, a quarter century later, the popular American model, now 48, graced the historical last nude edition of the magazine, which lost the battle for undress and decided to cover up its women in order to survive.

Univision Noticias analyzed all the covers published in the US, starting with Playboy’s first issue in December 1953, to study the cover models’ physical attributes: hair and skin color, height, age and body measurements. With these statistics, a model of the prototype woman for each decade emerged. It can be viewed in this interactive special.

I’ve heard people say they bought Playboy magazine for the short stories but this is my first time to hear of someone just looking at the covers. 😉

The possibilities for analysis of Playboy and its contents are nearly endless.

Consider the history of “party jokes” or “Playboy Advisor,” not to mention the cartoons in every issue.

I did check the Playboy Store but wasn’t able to find a DVD set with all the issues.

You can subscribe to Playboy Archive for $8.00 a month and access every issue from the first issue to the current one.

I don’t have a subscription so I am not sure how you would do the OCR to capture the jokes.

That Rascally Vowpal Wabbit (2015)

Wednesday, December 30th, 2015

The Rascally Vowpal Wabbit (2015) by Kai-Wei Chang, et al. (pdf of slides)

MLWave tweeted:

Latest Vowpal Wabbit Tutorial from NIPS 2015 (Learning to search + active learning + C# library + decision service)

Not the best organized slide deck but let me give you some performance numbers on Vowpal Wabbit (page 26 in the pdf):

vw: 6 lines of code 10 seconds to train
CRFsgd: 1068 lines 6 minutes
CRF++: 777 lines hours

Named entity recognition (200 thousand words)

vw: 30 lines of code 5 seconds to train
CRFsgd: 1 minute (subopt accuracy)
CRF++: 10 minutes (subopt accuracy)
SVMstr: 876 lines 30 minutes (subopt accuracy)

Interested now?


Bloggers! Help Defend The Public Domain – Prepare To Host/Repost “Baby Blue”

Wednesday, December 30th, 2015

Harvard Law Review Freaks Out, Sends Christmas Eve Threat Over Public Domain Citation Guide by Mike Masnick.

From the post:

In the fall of 2014, we wrote about a plan by public documents guru Carl Malamud and law professor Chris Sprigman, to create a public domain book for legal citations (stay with me, this isn’t as boring as it sounds!). For decades, the “standard” for legal citations has been “the Bluebook” put out by Harvard Law Review, and technically owned by four top law schools. Harvard Law Review insists that this standard of how people can cite stuff in legal documents is covered by copyright. This seems nuts for a variety of reasons. A citation standard is just an method for how to cite stuff. That shouldn’t be copyrightable. But the issue has created ridiculous flare-ups over the years, with the fight between the Bluebook and the open source citation tool Zotero representing just one ridiculous example.

In looking over all of this, Sprigman and Malamud realized that the folks behind the Bluebook had failed to renew the copyright properly on the 10th edition of the book, which was published in 1958, meaning that that version of the book was in the public domain. The current version is the 19th edition, but there is plenty of overlap from that earlier version. Given that, Malamud and Sprigman announced plans to make an alternative to the Bluebook called Baby Blue, which would make use of the public domain material from 1958 (and, I’d assume, some of their own updates — including, perhaps, citations that it appears the Bluebook copied from others).

As soon as “Baby Blue” drops, one expects the Harvard Law Review with its hired thugs Ropes & Gray to swing into action against Carl Malamud and Jon Sprigman.

What if the world of bloggers even those odds just a bit?

What if as soon as Baby Blue hits the streets, law bloggers, law librarian bloggers, free speech bloggers, open access bloggers, and any other bloggers all post Baby Blue to their sites and post it to file repositories?

I’m game.

Are you?

PS: If you think this sounds risky, ask yourself how much racial change would have happened in the South in the 1960’s if Martin Luther King had marched alone?

Leave Your Passport At Home – Push Back At The TSA

Wednesday, December 30th, 2015

TSA threatens to stop accepting driver’s licenses from nine states as of Jan 10 by Cory Doctorow.

Cory reports that extensions to the 2005 Real ID act are due to expire on January 10, 2016. States/territories facing consequences include “Alaska, California, Illinois, Missouri, New Jersey, New Mexico, South Carolina, and Washington (as well as Puerto Rico, Guam, and the US Virgin Islands).”

At issue is whether the TSA must accept your state driver’s license as legitimate identification.

Just checking the passenger traffic numbers for California’s two largest airports and one in Illinois, I found:

Los Angeles International – 68,491,451 passengers (Jan-Nov. 2015)

San Francisco International – 41,906,798 passengers (Jan-Oct. 2015)

Chicago O’Hare – 70,823,493 passengers (Jan-Nov. 2015)

I’m not an air travel analyst but 181,221,742 customers (roughly 181 million) must represent a substantial amount of airline revenue.
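The total is simply the three airport figures added together. A quick XQuery sketch of the arithmetic, with the counts copied from the list above (note that the reporting periods are not identical):

```xquery
xquery version "1.0";
(: Passenger counts as quoted in the post: LAX (Jan-Nov), SFO (Jan-Oct), O'Hare (Jan-Nov). :)
sum((68491451, 41906798, 70823493))
```

This returns 181221742, or a little over 181 million passengers.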

At these three airports alone, the TSA is conspiring to inconvenience, delay and harass that group of 181,221,742 paying customers.

If I had that many customers, threatened by the Not-1-Terrorist-Caught TSA, I would be using face time with House/Senate members to head off this PR nightmare.

If I had a driver’s license from any of these states, that is all that I would take to the airport.

Remember the Stasi fell because people simply stopped obeying.

Maybe this overreaching by the TSA will be its downfall. Ending literally $Billions in lost productive time, groping of women and children, and passengers being humiliated in numerous ways.

It’s time to call the TSA to heel. Leave your passport at home.

It’s a “best practice” for citizens who want to live in a free country.

Windows 10 covertly sends your disk-encryption keys to Microsoft

Wednesday, December 30th, 2015

Windows 10 covertly sends your disk-encryption keys to Microsoft by Cory Doctorow.

Cory gives a harrowing list of “unprecedented anti-user features” in Windows 10.

It is a must read for anyone trying to build support for a move to an open source OS.

Given the public reception of the Snowden revelations, are the “unprecedented anti-user features” a deliberate strategy by Microsoft to escape the clutches of both US and other governments demanding invasion of user privacy?

There has to be a sufficient market for MS to transition to application and OS support for enterprise level open source software and weaning enterprises off of Windows 10 would be one way to establish that market.

After all, GM isn’t going to call your local IT shop for support, even with an open source OS. Much more likely to call Microsoft, which has the staff and historical expertise to manage enterprise systems.

Sure, MS may lose the thin-margin projects at the bottom if it becomes entirely an open source organization but imagine the impact it will have on big data startups.

The high end/high profit markets in software will remain whether the income is from licensing or support/customization services.

That would certainly explain the recent trend towards open source projects at MS. And driving customers away from Windows 10 is probably easier than spiking the Windows/Office teams embedded at MS.

Corporate politics, don’t you just love it? 😉

If management talks about switching to Windows 10, you know the sign to give your co-workers from Helix:


For non-Helix fans: RUN LIKE HELL!

Man Bites Dog – News!

Tuesday, December 29th, 2015

Raspberry Pi declines bribe to pre-install malware by Robert Abel.

Robert reports that the Raspberry Pi Foundation was offered a bribe to install malware on its product and refused!

I wonder how many US manufacturers could make the same claim for their hardware or software?

Of course, in the United States, the request would have been accompanied by a National Security Letter or some other offense against freedom of speech.

FYI, no oppressive government has ever been overthrown or reformed by people who meekly follow its arbitrary dictates. Just saying.

Nominations by the U.S. President

Tuesday, December 29th, 2015

Nominations by the U.S. President

A faceted listing of all nominations from 1981 to date.

Facets include Congress, Nomination Type (Civilian, Military, Select Only), Status of Nomination, Senate Committee, Nominees with US State or Territory Indicated.

I haven’t spent a lot of time with this resource but it appears to be unnecessarily difficult to use.

For example:

Let’s look up the nomination of Sonia Sotomayor to the United States Supreme Court:


Did you notice the absence of any hyperlinks to the three days of hearings, July 13-15, 2009? Or the absence of links to the Senate debate on August 4-5, 2009? Or the absence of links to any of the other documents or agreements?

I remember the WWW being around in 2009 and I am damned sure it is available now!

So, what’s with the lack of hyperlinks?

Do you think they are lurking beneath the surface, waiting to be turned on?

Afraid not. Here is a sample of the underlying content for that page:

<td class="date">07/16/2009</td><td class="actions">
   Committee on the Judiciary. Hearings held and completed. Hearings printed: S.Hrg. 111-503.</td>
<td class="date">07/15/2009</td><td class="actions">
   Committee on the Judiciary. Hearings held.</td>
<td class="date">07/14/2009</td><td class="actions">
   Committee on the Judiciary. Hearings held.</td>
<td class="date">07/13/2009</td><td class="actions">
   Committee on the Judiciary. Hearings held.</td>

I don’t see any hooks for hyperlinking later on. Do you?

Another data hook that is missing is linking historical campaign donations to nominees for offices, particularly in the State Department.

Surely you didn’t think ambassadors were appointed from the professional ranks of the Foreign Service? People who actually speak the languages of the host country and know its customs and habits. What an odd view of American (or any other) government you have.

Some of the larger ambassadorships do require some experience but out of 270 embassies around the world, there are ones that go to mega-donors.

I don’t know the going rate on ambassadorships but linking nominations to donation records could yield a target minimum for donors to shoot for.

Linking nominations to donations would be a non-trivial exercise but certainly doable.

Other suggestions for these webpages? They respond well to suggestions. Not to say they always agree but they do respond. More than I can say for some government groups.

The most contested real estate on Earth? [Noble Sanctuary/Temple Mount]

Tuesday, December 29th, 2015

The most contested real estate on Earth? (PDF)

I won’t try to reproduce a smaller version of this image because it would simply befoul a rather remarkable work.

From the image (top right):

Muslims call it the Noble Sanctuary. Jews and Christians call it the Temple Mount. Built atop Mount Moriah in Jerusalem, this 36-acre site is the place where seminal events in Islam, Judaism and Christianity are said to have taken place, and it has been a flash point of conflict for millenniums. Many aspects of its meaning and history are still disputed by religious and political leaders, scholars, and even archaeologists. Several cycles of building and destruction have shaped what is on this hilltop today.

Great as far as it goes but the lower left bottom gives the impression that Hezekiah expanded the temple mount after Ahaz (his predecessor) plundered it. So legend holds but that leaves the reader with the false impression that the Jewish temple came to the Noble Sanctuary/Temple Mount first.

If you recall your Sunday School lessons, David conquers Jerusalem (Jebus), as told in 1 Chronicles 11:4-9.

Jerusalem was a sacred site long before David or the Israelites appear in the historical record. How long? Several thousand years at least but the type of excavation required to detail that part of the city’s history won’t happen any time soon.

Do enjoy the map, it is very impressive.

RTLSDR-Airband v2 released [Tooling Up for 2016, the Year of Surveillance]

Tuesday, December 29th, 2015

RTLSDR-Airband v2 released

From the post:

Back in June of 2014 we posted about the released of a new program called RTLSDR-Airband. RTLSDR-Airband is a Windows and Linux compatible command line tool that allows you to simultaneously monitor multiple AM channels per dongle within the same chunk of bandwidth. It is great for monitoring aircraft voice communications and can be used to feed websites like

Since our post the development of the software has been taken over by a new developer szpajder, who wrote in to us to let us know that he has now updated RTLSDR-Airband to version 2.0.0. The new versions improves performance and support for small embedded platforms such as the Raspberry Pi 2, but the Windows port is now not actively maintained and probably does not work.

Depending on your surveillance needs, the RTLSDR-Airband v2 + hardware should be on your list.

Governments around the world are continuing at a breakneck pace to eliminate privacy on scales large and small.

Citizens must demonstrate to governments that fishbowl environments are more troubling to the governing than the governed.

The data vacuum of the NSA can suck up the Internet backbone indefinitely. But, dedicated citizens can collect relevant data, untroubled by fraud, waste, inefficiency, and sheer incompetence.

Think of it as the difference between carpet bombing square mile after square mile versus a single sniper round. The former is a military-industrial complex response, the latter is available to all players.

As I have mentioned before, there are far more citizen-observers than government agents.

Make their “see something, say something” mantra your own.

See government activity, report government activity to other citizens.

I have no idea what victory will look like versus a surveillance state. But being a passive goldfish is a sure recipe for defeat.

Great R packages for data import, wrangling & visualization [+ XQuery]

Tuesday, December 29th, 2015

Great R packages for data import, wrangling & visualization by Sharon Machlis.

From the post:

One of the great things about R is the thousands of packages users have written to solve specific problems in various disciplines — analyzing everything from weather or financial data to the human genome — not to mention analyzing computer security-breach data.

Some tasks are common to almost all users, though, regardless of subject area: data import, data wrangling and data visualization. The table below show my favorite go-to packages for one of these three tasks (plus a few miscellaneous ones tossed in). The package names in the table are clickable if you want more information. To find out more about a package once you’ve installed it, type help(package = "packagename") in your R console (of course substituting the actual package name ).

Forty-seven (47) “favorites” sounds a bit on the high side but some people have more than one “favorite” ice cream, or obsession. 😉

You know how I feel about sort-order and I could not detect an obvious one in Sharon’s listing.

So, I extracted the package links/name plus the short description into a new table:

car data wrangling
choroplethr mapping
data.table data wrangling, data analysis
devtools package development, package installation
downloader data acquisition
dplyr data wrangling, data analysis
DT data display
dygraphs data visualization
editR data display
fitbitScraper misc
foreach data wrangling
ggplot2 data visualization
gmodels data wrangling, data analysis
googlesheets data import, data export
googleVis data visualization
installr misc
jsonlite data import, data wrangling
knitr data display
leaflet mapping
listviewer data display, data wrangling
lubridate data wrangling
metricsgraphics data visualization
openxlsx misc
plotly data visualization
plotly data visualization
plyr data wrangling
psych data analysis
quantmod data import, data visualization, data analysis
rcdimple data visualization
RColorBrewer data visualization
readr data import
readxl data import
reshape2 data wrangling
rga Web analytics
rio data import, data export
RMySQL data import
roxygen2 package development
RSiteCatalyst Web analytics
rvest data import, web scraping
scales data wrangling
shiny data visualization
sqldf data wrangling, data analysis
stringr data wrangling
tidyr data wrangling
tmap mapping
XML data import, data wrangling
zoo data wrangling, data analysis


I want to use XQuery at least once a day in 2016 on my blog. To keep myself honest, I will be posting any XQuery I use.

To sort and extract two of the columns from Sharon’s table, I copied the table to a separate file and ran this XQuery:

  1. xquery version "1.0";
  2. <html>
  3. <table>{
  4. for $row in doc("/home/patrick/working/favorite-R-packages.xml")/table/tr
  5. order by lower-case(string($row/td[1]/a))
  6. return <tr>{$row/td[1]} {$row/td[2]}</tr>
  7. }</table>
  8. </html>

One of the nifty aspects of XQuery is that you can sort, as on line 5, in all lower-case on the first <td> element, while returning the same element as written in the original table. Which gives better (IMHO) sort order than UPPERCASE followed by lowercase.
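If you don’t have an XQuery processor handy, the same idea can be sketched in Python. This is only an illustration under assumptions: the sample rows are a hand-picked subset of the table above, the markup shape (a link inside the first cell) is inferred from the query, and the names SAMPLE and sort_rows are made up for this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a few rows in the shape the XQuery above expects
# (package link in the first cell, description in the second).
SAMPLE = """<table>
<tr><td><a href="#">dplyr</a></td><td>data wrangling, data analysis</td></tr>
<tr><td><a href="#">DT</a></td><td>data display</td></tr>
<tr><td><a href="#">XML</a></td><td>data import, data wrangling</td></tr>
<tr><td><a href="#">car</a></td><td>data wrangling</td></tr>
</table>"""


def sort_rows(table_xml):
    """Return (name, description) pairs sorted case-insensitively by name."""
    table = ET.fromstring(table_xml)
    rows = []
    for tr in table.findall("tr"):
        cells = tr.findall("td")
        name = "".join(cells[0].itertext()).strip()
        desc = "".join(cells[1].itertext()).strip()
        rows.append((name, desc))
    # Compare on the lower-cased name, but return the name as written.
    return sorted(rows, key=lambda row: row[0].lower())


for name, desc in sort_rows(SAMPLE):
    print(name, desc)
```

As in the XQuery, the lower-casing happens only in the sort key, so “DT” and “XML” keep their original spelling in the output.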

This same technique should make you the master of any simple tables you encounter on the web.

PS: You should always acknowledge the source of your data and the original author.

I first saw Sharon’s list in a tweet by Christophe Lalanne.

Going Viral in 2016

Tuesday, December 29th, 2015

How To Go Viral: Lessons From The Most Shared Content of 2015 by Steve Rayson.

I offer this as at least as amusing as it may be useful.

The topic element of a viral post is said to include:

Trending topic (e.g. Zombies), Health & fitness, Cats & Dogs, Babies, Long Life, Love

Hard to get any of those in with a technical blog but I could try:

TM’s produce healthy and fit ED-free 90 year-old bi-sexuals with dogs & cats as pets who love all non-Zombies.

That’s 115 characters if you are counting.

Produce random variations on that until I find one that goes viral. 😉

But, I have never cared for click-bait or false advertising. Personally I find it insulting when marketers falsify research.

I may have to document some of those cases in 2016. There is no shortage of them.

None of my tweets may go viral in 2016 but Steve’s post will make it more likely they will be re-tweeted.

Feel free to re-use my suggested tweet as I am fairly certain that “…healthy and fit ED-free 90 year-old bi-sexuals…” is in the public domain.

Mocking “public access” – Media Silence – Vichy Records

Tuesday, December 29th, 2015

News accounts are blaring France Makes Wartime Vichy Government Archive Available To The Public (NPR) or words to that effect.

NPR catches the surface facts:

The French government is making available for the first time more than 200,000 documents on the Vichy government, which collaborated with the Nazis during World War II.

The documents, which were previously only partially accessible to researchers, will make “information such as the activities of the special police, who hunted resistants, communists and Jews accessible to the public, as long as they have been cleared by defence and security chiefs,” French radio station RFI reported. These archives also “show the extra-legal prosecution of members of the French Resistance, as well as proceedings against French Jews,” says the Associated Press.

Of the fifteen sources I checked:

  1. ANSAmed (English)
  2. Arutz Sheva
  3. BBC News
  4. European Jewish Press
  5. France 24
  6. The Guardian
  7. Haaretz
  8. The Jerusalem Post
  9. National Public Radio (NPR)
  10. New York Times
  11. RFI (English)
  12. Smithsonian
  13. The Sun
  14. Washington Post
  15. Ynetnews

only the New York Times mentions where the Vichy records are held, at the “Police Museum in Paris,” which links to the entry on the Police Museum in Paris at the Paris Official website of the Convention and Visitors Bureau.

A more useful link takes you to the Police Museum in Paris website.

“Public access” as used in these stories means for me:

Does that sound like “public access” to you?

That may have qualified as “public access” in the 1970’s or even the 1980’s, but in 2015?

Not one of the fifteen media sources I checked even mentions the lack of meaningful “public access” to the Vichy records.

Clearly “public access” means something different to these fifteen news organizations than it does to the average Net citizen.

A notion of “public access” so different that denying all the citizens of the Net access doesn’t even come up as a question.

How useful are news organizations that can’t recognize “public access” issues to government information?

If you are dissatisfied with second-hand reports without references to source documents, see: Decree of 24 December 2015 opening of archives pertaining to World War II, French Official Gazette No. 0300 of 27 December 2015 Page 24116, which authorized the release of these documents. Apologies for using the English translation but I wanted to quickly confirm that reports, such as the one in Ynetnews, that the records were to be online were false.

Researchers have been granted broader access to request documents. No mean step but falls far short of “public access.”

PS: All statements about the contents of stories on other sites are as of today, 29 December 2015, at 14:25 EST. Those stories may change with or without notice.

Voter Record Privacy? WTF?

Monday, December 28th, 2015

Leaky database tramples privacy of 191 million American voters by Dell Cameron.

From the post:

The voter information of more than 191 million Americans—including full names, dates of birth, home addresses, and more—was exposed online for anyone who knew the right IP address.

The misconfigured database, which was reportedly shut down at around 7pm ET Monday night, was discovered by security researcher Chris Vickery. Less than two weeks ago, Vickery also exposed a flaw in MacKeeper’s database, similarly exposing 13 million customer records.

What amazes me about this “leak” is the outrage is focused on the 191+ million records being online.


What about the six or seven organizations who denied being the owners of the IP address in question?

I take it none of them denied having possession of the same or essentially the same data, just that they didn’t “leak” it.

Quick question: Was voter privacy breached when these six or seven organizations got the same data or when it went online?

I would say when the Gang of Six or Seven got the same data.

You don’t have any meaningful voter privacy, aside from your actual ballot, and with your credit record (also for sale), your voting behavior can be nailed too.

You don’t have privacy but the Gang of Six or Seven do.

Attempting to protect lost privacy is pointless.

Making corporate overlords lose their privacy as well has promise.

PS: Torrents of corporate overlord data? Much more interesting than voter data.

Awesome Deep Learning – Value-Add Curation?

Monday, December 28th, 2015

Awesome Deep Learning by Christos Christofidis.

Tweeted by Gregory Piatetsky as:

Awesome Curated #DeepLearning resources on #GitHub: books, courses, lectures, researchers…

What will you find there? (As of 28 December 2015):

  • Courses – 15
  • Datasets – 114
  • Free Online Books – 8
  • Frameworks – 35
  • Miscellaneous – 26
  • Papers – 32
  • Researchers – 96
  • Tutorials – 13
  • Videos and Lectures – 16
  • Websites – 24

By my count, that’s 359 resources, although the category counts listed above add up to 379.

We know from detailed analysis of PubMed search logs that 80% of searchers choose a link from the first twenty “hits” returned for a search.

You could assume that out of “23 million user sessions and more than 58 million user queries” PubMed searchers and/or PubMed itself or both transcend the accuracy of searching observed in other contexts. That seems rather unlikely.

The authors note:

Two interesting phenomena are observed: first, the number of clicks for the documents in the later pages degrades exponentially (Figure 8). Second, PubMed users are more likely to click the first and last returned citation of each result page (Figure 9). This suggests that rather than simply following the retrieval order of PubMed, users are influenced by the results page format when selecting returned citations.

Result page format seems like a poor basis for choosing search results, as does simply appearing in the top twenty (20) results.

Eliminating all the cruft from search results to give you 359 resources is a value-add, but what value-add should be added to this list of resources?

What are the top five (5) value-adds on your list?

Serious question because we have tools far beyond what were available to curators in the 1960’s but there is little (if any) curation to match that of the Reader’s Guide to Periodical Literature.

There are sample pages from the 2014 Reader’s Guide to Periodical Literature online.

Here is a screen-shot of some of its contents:


If you can, tell me what search you would use to return that sort of result for “abortion” as a subject.

Nothing comes to mind?

Just to get you started, would pointing to algorithms across these 359 resources be helpful? Would you want to know more than algorithm N occurs in resource Y? Some of the more popular ones may occur in every resource. How helpful is that?

So I repeat my earlier question:

What are the top five (5) value-adds on your list?

Please forward, repost, reblog, tweet. Thanks!

China Pulls Alongside US in Race to No Privacy

Sunday, December 27th, 2015

China passes law requiring tech firms to hand over encryption keys by Mark Wilson.

From the post:

Apple may have said that it opposes the idea of weakening encryption and providing governments with backdoors into products, but things are rather different in China. The Chinese parliament has just passed a law that requires technology companies to comply with government requests for information, including handing over encryption keys.

Mark doesn’t provide a link to the text of the new law and I don’t read Chinese in any event. I will look for an English translation to pass on to you.

Reading from Mark’s summary, I assume “handing over encryption keys” puts China alongside the United States as far as breaking into iPhones.

Apple doesn’t have the encryption keys for later models of iPhones and therefore possesses nothing to be surrendered.

Now that China is even with the United States, who will take the lead in diminishing privacy is a toss-up. Not to be forgotten is France, with its ongoing “state of emergency.” Will that become a permanent state of emergency in 2016?

Five reasons why we must NOT censor ISIS propaganda [news]

Sunday, December 27th, 2015

Five reasons why we must NOT censor ISIS propaganda by Dr. Azeem Ibrahim.

From the post:

First of all, censoring ISIS in this way is simply not feasible. We can very well demand that mainstream newspapers and TV news stations limit their coverage of these issues, but that would leave the entire field of discussion to the unregulated areas of the internet, the “blogosphere” and social media. ISIS would still dominate in these areas, except now we will have removed from the discourse those outlets that would be most capable to hold the ISIS narrative to scrutiny.

All of Dr. Ibrahim’s points are well taken but the ability to “…hold the ISIS narrative to scrutiny” is the most telling one.

In holding the Islamic State narrative to scrutiny, the West will learn some of that narrative is true.

Simon Cottee writes in Why It’s So Hard to Stop ISIS Propaganda:

The more immediate, but no less intractable, challenge is to change the reality on the ground in Syria and Iraq, so that ISIS’s narrative of Sunni Muslim persecution at the hands of the Assad regime and Iranian-backed Shiite militias commands less resonance among Sunnis. One problem in countering that narrative is that some of it happens to be true: Sunni Muslims are being persecuted in Syria and Iraq. This blunt empirical fact, just as much as ISIS’s success on the battlefield, and the rhetorical amplification and global dissemination of that success via ISIS propaganda, helps explain why ISIS has been so effective in recruiting so many foreign fighters to its cause.

A first step towards scrutiny of all narratives in the conflict with the Islamic State would be to stop referring to reports and/or news from the Islamic State as “propaganda.” It isn’t any more or less propaganda than the numerous direct and indirect reports placed at the direction of the United States government.

Yet, even traditionally skeptical news organizations, such as the New York Times, repeat government reports of the danger the United States faces from the Islamic State without question.

At best, the Islamic State may have 35,000 fighters in Syria/Iraq. Should a nuke-armed hyper-power with a military budget equal to the next nine (9) biggest spenders, more than a third of all military spending, be fearful of this ragged band of fighters?

To read the serious tone with which the New York Times reports the hand wringing and posturing from both Washington and the presidential campaign trail, you would think so. Instead of analysis and well-deserved mockery of those fearful positions, the Times reports them as “news.”

Censoring the narratives of the Islamic State and failing to question those of the United States deprives the public, including young people, of an opportunity to reach their own evaluation of those narratives.

Small wonder they are all so mis-informed.

HOBBIT – Holistic Benchmarking of Big Linked Data

Saturday, December 26th, 2015

HOBBIT – Holistic Benchmarking of Big Linked Data

From the “about” page:

HOBBIT is driven by the needs of the European industry. Thus, the project objectives were derived from the needs of the European industry (represented by our industrial partners) in combination with the results of prior and ongoing efforts including BIG, BigDataEurope, LDBC Council and many more. The main objectives of HOBBIT are:

  1. Building a family of industry-relevant benchmarks,
  2. Implementing a generic evaluation platform for the Big Linked Data value chain,
  3. Providing periodic benchmarking results including diagnostics to further the improvement of BLD processing tools,
  4. (Co-)Organizing challenges and events to gather benchmarking results as well as industry-relevant KPIs and datasets,
  5. Supporting companies and academics during the creation of new challenges or the evaluation of tools.

As we found in Avoiding Big Data: More Business Intelligence Than You Would Think, 3/4 of businesses cannot extract value from data they already possess, making any investment in “big data” a sure loser for them.

Which makes me wonder about what “big data” the HOBBIT project intends to use for benchmarking “Big Linked Data?”

Then I saw on the homepage:

The HOBBIT partners such as TomTom, USU, AGT and others will provide more than 25 trillions of sensor data to be benchmarked within the HOBBIT project.

“…25 trillions of sensor data….?” sounds odd until you realize that TomTom is:

TomTom founded in 1991 is a world leader of products for in-car location and navigation products.

OK, so the “Big Linked Data” in question isn’t random “linked data,” but a specialized kind of “linked data.”

That’s less risky than building a human brain with no clear idea of where to start, but it addresses a narrow window on linked data.

The HOBBIT Kickoff meeting Luxembourg 18-19 January 2016 announcement still lacks a detailed agenda.

Verifying Russian Airstrikes… vs. Verifying Casualties

Saturday, December 26th, 2015

Verifying Russian airstrikes in Syria with Silk, two months on by Eliot Higgins.

From the post:

As British forces join a growing list of countries conducting bombing campaigns across Syria, tracking who exactly is bombing where and why is becoming increasingly difficult. Just yesterday, differing groups of activists reported strikes had killed 32 fighters in ISIS-controlled territory, but there were conflicting reports as to who had launched them.

Some governments have been open in releasing footage of strikes or posting videos to YouTube, making them verifiable by independent investigators. But not all have been accurate in their description.

On October 5th, Bellingcat launched a crowdsourced effort to identify the locations shown in Russian Ministry of Defense airstrike videos, using the Checkdesk platform to identify the locations of the airstrikes and adding the data generated to a publicly available Silk database.

Readers of the Bellingcat website examined videos of Russian airstrikes in Syria posted to YouTube by the Russian Ministry of Defence, and scoured satellite imagery of Syria to match locations in the video with publicly available maps to verify if the claimed targets were all they purported to be.

As the database of claims and videos grew, Bellingcat team members double-checked any matches and updated the status of videos to either “False” or “Verified”. The details of the videos were then added to the Silk database, and updated as more videos were posted online by the Russian Ministry of Defence.

The verification of Russian airstrikes project is important because:

…it showed that with free tools, volunteers, and a bit of effort, it is possible to challenge the narratives presented by governments and militaries using their own evidence, in a way that is transparent and open to all.

True for challenging some government narratives, but not all such narratives.

Consider the secrecy-shrouded “investigations” into civilian deaths by the U.S. military as reported in Civilian deaths claimed in 71 US-led airstrikes on Isis by Alice Ross.

From Alice’s post:

The US-led coalition’s bombing of Islamic State in Iraq and Syria, which has been described as the “most precise ever”, faces allegations that civilians have been killed in 71 separate air raids.

A spokesman for US central command (Centcom) disclosed the claims to the Guardian. Many of the claims have been dismissed, but he said 10 incidents were the subject of fuller, formal investigations. Five investigations have been concluded, although only one has been published.

To date, the coalition acknowledges civilian deaths in a single strike: in November 2014 a US strike on Syria killed two children, a Centcom investigation published in May found. Centcom said it will only publish investigations where a “preponderance of evidence” suggests civilians have died.

Monitoring groups questioned how thorough the investigations were.

The international coalition has carried out more than 6,500 strikes since last August. Lt Gen John Hesterman, the US commander who leads the international air campaign against Isis in Iraq and Syria, described the campaign in June as “the most precise and disciplined in the history of aerial warfare”.

Centcom outlined details of the reports of fatalities in response to questions about one of its internal documents on the investigations being obtained by journalist Joseph Trevithick of the blog War is Boring, which gives details of 45 strikes alleged to have caused fatalities.

None of the participants in the war against the Islamic State are being “transparent” in any meaningful sense of the word.

‘Picard and Dathon at El-Adrel’

Saturday, December 26th, 2015

Machines, Lost In Translation: The Dream Of Universal Understanding by Anne Li.

From the post:

It was early 1954 when computer scientists, for the first time, publicly revealed a machine that could translate between human languages. It became known as the Georgetown-IBM experiment: an “electronic brain” that translated sentences from Russian into English.

The scientists believed a universal translator, once developed, would not only give Americans a security edge over the Soviets but also promote world peace by eliminating language barriers.

They also believed this kind of progress was just around the corner: Leon Dostert, the Georgetown language scholar who initiated the collaboration with IBM founder Thomas Watson, suggested that people might be able to use electronic translators to bridge several languages within five years, or even less.

The process proved far slower. (So slow, in fact, that about a decade later, funders of the research launched an investigation into its lack of progress.) And more than 60 years later, a true real-time universal translator — a la C-3PO from Star Wars or the Babel Fish from The Hitchhiker’s Guide to the Galaxy — is still the stuff of science fiction.

How far are we from one, really? Expert opinions vary. As with so many other areas of machine learning, it depends on how quickly computers can be trained to emulate human thinking.

The Star Trek Next Generation episode Darmok was set during a five-year mission that began in 2364, some 349 years in our future. Faster than light travel, teleportation, etc. are day to day realities. One expects machine translation to have improved at least as much.

As Li reports, exciting progress is being made with neural networks for translation, but transposing words from one language to another, as illustrated in Darmok, isn’t a guarantee of “universal understanding.”

In fact, the transposition may be as opaque as the statement in its original language. A statement such as “Darmok and Jalad at Tanagra” leaves the hearer to wonder what happened at Tanagra, what the relationship between Darmok and Jalad was, etc.

In the early lines of The Story of the Shipwrecked Sailor, a Middle Kingdom (Egypt, 2000 BCE – 1700 BCE) story, there is a line that describes the sailor returning home and words to the effect “…we struck….” Then the next sentence picks up.

The words necessary to complete that statement don’t occur in the text. You have to know that mooring boats on the Nile did not involve piers, etc. but simply banking your boat and then driving a post (the unstated subject of “we struck”) to secure the vessel.

Transposition from Middle Egyptian to English leaves you without a clue as to the meaning of that passage.

To be sure, neural networks may clear away some of the rote work of transposition between languages but that is a far cry from “universal understanding.”

Both now and likely to continue into the 24th century.

Fun with facets in ggplot2 2.0

Saturday, December 26th, 2015

Fun with facets in ggplot2 2.0 by Bob Rudis.

From the post:

ggplot2 2.0 provides new facet labeling options that make it possible to create beautiful small multiple plots or panel charts without resorting to icky grob manipulation.

Very appropriate for this year in Georgia (US) at any rate. Facets are used to display temperature by year and temperature versus kWh by year.

The high today, 26th of December, 2015, is projected to be 77°F.

Sigh, that’s just not December weather.

Five Key Phases of Software Development – Ambiguity

Saturday, December 26th, 2015


It isn’t clear to me if the answer is wrong because:

  • Failure to follow instructions: No description followed the five (5) stages.
  • Five stages as listed were incorrect?

A popular alternative answer to the same question:


I have heard rumors and exhortations about requirements and documentation/testing but their incidence in practice is said to be low to non-existent.

As far as “designing” the program, isn’t bargaining what “agile programming” is all about? Showing users the latest mis-understanding of their desires and arguing it is in fact better than their original requests? Sounds like bargaining to me.

Anger may be a bit brief for “code the program” but after having lost arguments with users and being told to build the UI in a particular, less than best way, isn’t anger a fair description?

Acceptance is a no-brainer for “operate and maintain the system.” If no one is actively trying to change the system, what other name would you have for that state?

On the whole, it was failure to follow instructions and supply a description of each stage that led to the answer being marked as incorrect. 😉

However, should you ever take the same exam, may I suggest that you give the popular alternative, although mythic, answer to such a question.

Like everyone else, software professionals don’t appreciate their myths being questioned or disputed.

I first saw the test results in a tweet by Elena Williams.

Virtual Kalimba

Saturday, December 26th, 2015

Virtual Kalimba


Visit the site for keyboard shortcuts, tips & tricks, and interactive production of sound!

The website is an experiment in Web Audio by Middle Ear Media.

The Web Audio Tutorials page at Middle Ear Media has eight (8) tutorials on Web Audio.

Demo apps:

Apps are everywhere. While native mobile apps get a lot of attention, web apps have become much more powerful in recent years. Hopefully you can find something here that will stimulate you or improve the quality of your life in some way.

Web Audio Loop Mixer

Web Audio Loop Mixer is a Web Audio experiment created with HTML5, CSS3, JavaScript, and the Web Audio API. This web app is a stand alone loop mixer with effects. It allows up to four audio loops to be boosted, attenuated, equalized, panned, muted, and effected by delay or distortion in the browser.

Virtual Kalimba

Virtual Kalimba is a Web Audio experiment created with HTML5, CSS3, and JavaScript. It uses the Web Audio API to recreate a Kalimba, also known as an Mbira or Thumb Piano. This is a traditional African instrument that belongs to the Lamellophone family of musical instruments.

Virtual Hang

Virtual Hang is a Web Audio experiment created with HTML5, CSS3, and JavaScript. It uses the Web Audio API to recreate a Hang, a steel hand pan instrument. The Hang is an amazing musical instrument developed by Felix Rohner and Sabina Schärer in Bern, Switzerland.

War Machine

War Machine is a Web Audio experiment created with HTML5, CSS3, and JavaScript. The App uses the Web Audio API to create a sample pad interface reminiscent of an Akai MPC. The purpose of War Machine is not to promote violence, but rather to create a safe (victimless) environment for the release of excess aggression.

Channel Strip

Channel Strip is a Web Audio experiment created with HTML5, CSS3, JavaScript, and the Web Audio API. This web app is a stand alone audio channel strip that allows an audio signal to be boosted, attenuated, equalized, panned, compressed and muted in the browser. The audio source is derived from user media via file select input.

Task Management

A fast loading Web App for managing tasks online. This App offers functions such as editable list items, removable list items, and it uses localStorage to save your information in your own browser.

On War Machine, the top row, third pad from the left comes the closest to an actual gunshot sound.

Works real well with the chorus from Anders Osborne’s Five Bullets:

Boom, boom, boom, that American sound
Teenage kids on a naked ground
Boom, boom, boom, that American sound
Five bullets in Pigeon Town

For more details on Anders Osborne, including lyrics and tour dates, see: Ya Ya Nation.

I first saw this in a tweet by Chris Ford.

The Social-Network Illusion That Tricks Your Mind – (Terrorism As Majority Illusion)

Friday, December 25th, 2015

The Social-Network Illusion That Tricks Your Mind

From the post:

One of the curious things about social networks is the way that some messages, pictures, or ideas can spread like wildfire while others that seem just as catchy or interesting barely register at all. The content itself cannot be the source of this difference. Instead, there must be some property of the network that changes to allow some ideas to spread but not others.

Today, we get an insight into why this happens thanks to the work of Kristina Lerman and pals at the University of Southern California. These people have discovered an extraordinary illusion associated with social networks which can play tricks on the mind and explain everything from why some ideas become popular quickly to how risky or antisocial behavior can spread so easily.

Network scientists have known about the paradoxical nature of social networks for some time. The most famous example is the friendship paradox: on average your friends will have more friends than you do.

This comes about because the distribution of friends on social networks follows a power law. So while most people will have a small number of friends, a few individuals have huge numbers of friends. And these people skew the average.

Here’s an analogy. If you measure the height of all your male friends, you’ll find that the average is about 170 centimeters. If you are male, on average, your friends will be about the same height as you are. Indeed, the mathematical notion of “average” is a good way to capture the nature of this data.

But imagine that one of your friends was much taller than you—say, one kilometer or 10 kilometers tall. This person would dramatically skew the average, which would make your friends taller than you, on average. In this case, the “average” is a poor way to capture this data set.
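The arithmetic behind that analogy, and behind the friendship paradox itself, is easy to try out. The sketch below is an illustration only: the heights and the tiny star-shaped network are invented for the purpose and are not taken from the post or the paper.

```python
from statistics import mean, median

# Hypothetical heights in centimeters; the last "friend" is one kilometer tall.
heights = [168, 170, 172, 170, 100_000]

typical = heights[:-1]
print("without the giant:", mean(typical), median(typical))
print("with the giant:   ", mean(heights), median(heights))

# Made-up star network: one hub linked to five people who know only the hub.
friends = {
    "hub": ["a", "b", "c", "d", "e"],
    "a": ["hub"], "b": ["hub"], "c": ["hub"], "d": ["hub"], "e": ["hub"],
}


def friend_count(person):
    """Number of friends a person has."""
    return len(friends[person])


def mean_friend_count_of_friends(person):
    """Average number of friends held by this person's friends."""
    return mean(friend_count(f) for f in friends[person])


# Average number of friends per person, versus the average number of
# friends that a person's friends have.
own_average = mean(friend_count(p) for p in friends)
friends_average = mean(mean_friend_count_of_friends(p) for p in friends)
print("own average:", own_average, "friends' average:", friends_average)
```

One outlier drags the mean far from anything typical while the median barely moves, and in the star network the single hub is enough to make the friends’ average exceed the average person’s own count.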

If that has you interested, see:

The Majority Illusion in Social Networks by Kristina Lerman, Xiaoran Yan, Xin-Zeng Wu.


Social behaviors are often contagious, spreading through a population as individuals imitate the decisions and choices of others. A variety of global phenomena, from innovation adoption to the emergence of social norms and political movements, arise as a result of people following a simple local rule, such as copy what others are doing. However, individuals often lack global knowledge of the behaviors of others and must estimate them from the observations of their friends’ behaviors. In some cases, the structure of the underlying social network can dramatically skew an individual’s local observations, making a behavior appear far more common locally than it is globally. We trace the origins of this phenomenon, which we call “the majority illusion,” to the friendship paradox in social networks. As a result of this paradox, a behavior that is globally rare may be systematically overrepresented in the local neighborhoods of many people, i.e., among their friends. Thus, the “majority illusion” may facilitate the spread of social contagions in networks and also explain why systematic biases in social perceptions, for example, of risky behavior, arise. Using synthetic and real-world networks, we explore how the “majority illusion” depends on network structure and develop a statistical model to calculate its magnitude in a network.
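To make the abstract concrete, here is a toy version of the effect. It is a sketch under stated assumptions, not the authors’ statistical model: the network is invented, and counting a person as “fooled” when more than half of their friends show the behavior is a threshold chosen for this illustration.

```python
# Made-up network for illustration: one well-connected hub and seven people
# who are each linked only to the hub.
network = {"hub": ["p1", "p2", "p3", "p4", "p5", "p6", "p7"]}
network.update({person: ["hub"] for person in network["hub"]})

# Assumption: only the hub exhibits the behavior in question.
active = {"hub"}


def local_fraction(node):
    """Share of a node's friends who exhibit the behavior."""
    neighbors = network[node]
    return sum(n in active for n in neighbors) / len(neighbors)


# How common the behavior really is across the whole network.
global_fraction = len(active) / len(network)

# Nodes for whom the behavior looks like the majority choice locally.
fooled = [node for node in network if local_fraction(node) > 0.5]

print("globally active:", global_fraction)
print("nodes seeing a local majority:", len(fooled), "of", len(network))
```

Only one person in eight is active, yet seven of the eight see the behavior in all of their friends. Real networks are messier, which is why the paper explores how the size of the effect depends on network structure.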

Research has not reached the stage of enabling the manipulation of public opinion to reflect the true rarity of terrorist activity in the West.

That being the case, being factually correct that Western fear of terrorism is a majority illusion isn’t as profitable as product tying to that illusion.

Apache Ignite – In-Memory Data Fabric – With No Semantics

Friday, December 25th, 2015

I saw a tweet from the Apache Ignite project pointing to its contributors page: Start Contributing.

The documentation describes Apache Ignite™ as:

Apache Ignite™ In-Memory Data Fabric is a high-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real-time, orders of magnitude faster than possible with traditional disk-based or flash-based technologies.

If you think that is impressive, here’s a block representation of Ignite:


Or a more textual view:

You can view Ignite as a collection of independent, well-integrated, in-memory components geared to improve performance and scalability of your application. Some of these components include:

Imagine my surprise when a search on “semantics” said:

“No Results Found.”

Even without data, whose semantics could be documented, there should be hooks for documenting the semantics of future data.

I’m not advocating Apache Ignite jury-rig some means of documenting the semantics of data and Ignite processes.

The need for semantic documentation varies: what is sufficient for one case will be wholly inadequate for another. Not to mention that documentation and semantics often require different skills than those possessed by most developers.

What semantics do you need documented with your Apache Ignite installation?