Archive for December, 2013

GALEX Unique Source Catalogs (Seibert et al.)

Tuesday, December 31st, 2013

GALEX Unique Source Catalogs (Seibert et al.)

From the webpage:

GALEX has been undertaking a number of surveys covering large areas of sky at a variety of depths. However, making use of this large data set can be difficult because the standard GALEX database contains all of the detected sources, which include many duplicate observations of the same sources, as well as numerous spurious low signal-to-noise sources. At the same time, the sky footprint associated with GALEX observations has not been well defined or presented in an easily usable format.

In order to remedy these problems, Seibert et al. have constructed three catalogs of GALEX measurements; namely the GALEX All-Sky Survey Source Catalog (GASC), the GALEX Medium Imaging Survey Catalog (GMSC), and the Kepler GCAT. Our intention is that these catalogs will provide the primary reference catalog useful for matching GALEX measurements with other large surveys of the sky at other wavelengths.

sky survey
All sky orthographic projection in Galactic coordinates of the NUV sky background in the GASC derived from the GR6 data release. The North Galactic cap is on the left while the South Galactic Cap is shown on the right.

Once astronomers move away from locating objects in the sky, they are no more immune to semantic ambiguity, synonymy and/or polysemy than any other profession.

In case you are looking for a new hobby for 2014, may I suggest amateur astroinformatics?

Tips on Long Term Emacs Productivity

Tuesday, December 31st, 2013

Tips on Long Term Emacs Productivity by Xah Lee.

There are seven (7) tips:

  1. Everything is a Command
  2. Master Window Splitting
  3. Master Dired
  4. Master Buffer Switching
  5. Remap Most Frequently Used Keys
  6. Master Find/Replace and Emacs Regex
  7. Get A Good Keyboard

The post on drawing suggested warmup exercises. Might not be a bad idea for Emacs as well.

A pandas cookbook

Tuesday, December 31st, 2013

A pandas cookbook by Julia Evans.

From the post:

A few people have told me recently that they find the slides for my talks really helpful for getting started with pandas, a Python library for manipulating data. But then they get out of date, and it’s tough to support slides for a talk that I gave a year ago.

So I was procrastinating packing to leave New York yesterday, and I started writing up some examples, with explanations! A lot of them are taken from talks I’ve given, but I also want to give some new examples, like

  • how to deal with timestamps
  • what is a pivot table and why would you ever want one?
  • how to deal with “big” data

I’ve put it in a GitHub repository called pandas-cookbook. It’s along the same lines as the pandas talks I’ve given – take a real dataset or three, play around with it, and learn how to use pandas along the way.

From what I have seen recently, “cookbooks” are going to be a big item in 2014!

PyPi interactive dependency graph

Tuesday, December 31st, 2013

PyPi interactive dependency graph

The graph takes a moment or two to load but is well worth the wait.

Mouse-over for popup labels.

The code is available on GitHub.

I don’t know the use case for displaying all the dependencies (or rather all the identified dependencies) in PyPi.

Or to put it another way, being able to hide some common dependencies by package or even class could prove to be helpful.

Seeing data in its aggregate isn’t as useful as discovering important data in the process of hiding common aggregate data.
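To make the "hide common dependencies" idea concrete, here is a minimal sketch. It assumes the dependency data is a plain dict mapping each package to a list of its dependencies; that layout, the function name `hide_common_dependencies` and the toy package graph are all my own inventions for illustration, not the format used by the graph's code.

```python
from collections import Counter


def hide_common_dependencies(graph, max_dependents):
    """Drop dependencies that more than `max_dependents` packages rely on.

    `graph` maps a package name to a list of its dependencies
    (an assumed layout, not the one used by the PyPI graph project).
    Returns the filtered graph and the set of hidden dependencies.
    """
    # Count how many distinct packages depend on each dependency.
    dependents = Counter(dep for deps in graph.values() for dep in set(deps))
    common = {dep for dep, count in dependents.items() if count > max_dependents}
    filtered = {
        pkg: [dep for dep in deps if dep not in common]
        for pkg, deps in graph.items()
    }
    return filtered, common


# Hypothetical toy data, not taken from PyPI.
toy_graph = {
    "app-a": ["requests", "six"],
    "app-b": ["six", "numpy"],
    "app-c": ["six", "requests"],
    "app-d": ["six"],
}
filtered, hidden = hide_common_dependencies(toy_graph, max_dependents=2)
```

With the ubiquitous dependency hidden, the edges that remain are the ones that distinguish one package from another, which is the point of the exercise.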

NSA Catalog

Tuesday, December 31st, 2013

NSA’s ANT Division Catalog of Exploits for Nearly Every Major Software/Hardware/Firmware

Just in case you missed the news, the NSA has a catalog of hardware and software hacks.

Two points to bear in mind:

First, the catalog dates from 2008, which makes me wonder: has it not been updated, or is there a later version of the catalog that will be leaked later?

If shopping from a five-year-old catalog is any indication, small wonder the NSA is collecting lots of information to no avail.

Second, when you get to the catalog pages, note the parts that are blacked out.

Either a copy of the catalog was stolen along with the blackouts already in place or the news agency is censoring the information it distributes.

If it is the latter, I find that real curious.

The U.S. government hides information from us and when the press obtains that information, the press hides information as well.

I suppose I should feel lucky that we get any information at all.

Curated Dataset Lists

Tuesday, December 31st, 2013

6 dataset lists curated by data scientists by Scott Haylon.

From the post:

Since we do a lot of experimenting with data, we’re always excited to find new datasets to use with Mortar. We’re saving bookmarks and sharing datasets with our team on a nearly-daily basis.

There are tons of resources throughout the web, but given our love for the data scientist community, we thought we’d pick out a few of the best dataset lists curated by data scientists.

Below is a collection of six great dataset lists from both famous data scientists and those who aren’t well-known:

Here you will find lists of datasets by:

  • Peter Skomoroch
  • Hilary Mason
  • Kevin Chai
  • Jeff Hammerbacher
  • Jerry Smith
  • Gregory Piatetsky-Shapiro

Great lists of datasets, unfortunately, not deduped nor ranked by the # of collections in which they appear.
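Deduping and ranking is not hard to sketch, at least for the easy cases. The following assumes each curator's list has already been scraped into a list of dataset names. The curator keys, the dataset names and the crude `normalize` rule are hypothetical placeholders, not the contents of the six lists.

```python
from collections import Counter


def normalize(name):
    """Crude normalization: lowercase and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def rank_datasets(lists_by_curator):
    """Rank datasets by the number of curated lists in which they appear."""
    counts = Counter()
    for datasets in lists_by_curator.values():
        # A set ensures a dataset counts at most once per list.
        counts.update({normalize(d) for d in datasets})
    # Most widely listed first; ties broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical placeholder entries, not the actual lists.
example = {
    "curator-1": ["Dataset A", "Dataset B", "dataset a"],
    "curator-2": ["dataset  A", "Dataset C"],
    "curator-3": ["Dataset B", "Dataset A"],
}
ranking = rank_datasets(example)
```

String normalization only catches trivial variations. The same dataset listed under two different names, or at two different URLs, is a subject identity problem that this sketch does not attempt to solve.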


Tuesday, December 31st, 2013

Augur: a Modeling Language for Data-Parallel Probabilistic Inference by Jean-Baptiste Tristan,


It is time-consuming and error-prone to implement inference procedures for each new probabilistic model. Probabilistic programming addresses this problem by allowing a user to specify the model and having a compiler automatically generate an inference procedure for it. For this approach to be practical, it is important to generate inference code that has reasonable performance. In this paper, we present a probabilistic programming language and compiler for Bayesian networks designed to make effective use of data-parallel architectures such as GPUs. Our language is fully integrated within the Scala programming language and benefits from tools such as IDE support, type-checking, and code completion. We show that the compiler can generate data-parallel inference code scalable to thousands of GPU cores by making use of the conditional independence relationships in the Bayesian network.

A very good paper but the authors should highlight the caveat in the introduction:

We claim that many MCMC inference algorithms are highly data-parallel (Hillis & Steele, 1986; Blelloch, 1996) if we take advantage of the conditional independence relationships of the input model (e.g. the assumption of i.i.d. data makes the likelihood independent across data points).

(Where i.i.d. = Independent and identically distributed random variables.)

That assumption does allow for parallel processing, but users should be cautious about accepting assumptions about data.

The algorithms will still work, even if your assumptions about the data are incorrect.

But the answer you get may not be as useful as you would like.
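A small sketch of why the i.i.d. assumption matters for parallelism: if the data points are independent, the joint log-likelihood is a plain sum of per-point terms, so the data can be split into chunks and the partial sums combined afterwards. The Gaussian model and the sample values below are my own assumptions for illustration and have nothing to do with Augur's generated code. The chunks here are processed sequentially, where a real system would hand each one to a separate worker.

```python
import math


def gaussian_log_pdf(x, mean, std):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * std**2) - (x - mean) ** 2 / (2 * std**2)


def log_likelihood(data, mean, std):
    """Under i.i.d., the joint log-likelihood is a sum over data points."""
    return sum(gaussian_log_pdf(x, mean, std) for x in data)


def chunked_log_likelihood(data, mean, std, n_chunks):
    """Sum per-chunk partial results; each chunk is independent of the rest."""
    size = math.ceil(len(data) / n_chunks)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    return sum(log_likelihood(chunk, mean, std) for chunk in chunks)


# Hypothetical sample values.
data = [0.5, -1.2, 0.3, 2.1, -0.7, 1.4, 0.0, -0.4]
full = log_likelihood(data, mean=0.0, std=1.0)
split = chunked_log_likelihood(data, mean=0.0, std=1.0, n_chunks=4)
```

The two results agree up to floating-point rounding. If the data points were in fact dependent, the code would still run and still produce a number, which is exactly the caution above.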

I first saw this in a tweet by Stefano Bertolo.

Efficient Large-Scale Graph Processing…

Tuesday, December 31st, 2013

Efficient Large-Scale Graph Processing on Hybrid CPU and GPU Systems by Abdullah Gharaibeh, Elizeu Santos-Neto, Lauro Beltrao Costa, and Matei Ripeanu.


The increasing scale and wealth of inter-connected data, such as those accrued by social network applications, demand the design of new techniques and platforms to efficiently derive actionable knowledge from large-scale graphs. However, real-world graphs are famously difficult to process efficiently. Not only they have a large memory footprint, but also most graph algorithms entail memory access patterns with poor locality, data-dependent parallelism and a low compute-to-memory access ratio. Moreover, most real-world graphs have a highly heterogeneous node degree distribution, hence partitioning these graphs for parallel processing and simultaneously achieving access locality and load-balancing is difficult.

This work starts from the hypothesis that hybrid platforms (e.g., GPU-accelerated systems) have both the potential to cope with the heterogeneous structure of real graphs and to offer a cost-effective platform for high-performance graph processing. This work assesses this hypothesis and presents an extensive exploration of the opportunity to harness hybrid systems to process large-scale graphs efficiently. In particular, (i) we present a performance model that estimates the achievable performance on hybrid platforms; (ii) informed by the performance model, we design and develop TOTEM – a processing engine that provides a convenient environment to implement graph algorithms on hybrid platforms; (iii) we show that further performance gains can be extracted using partitioning strategies that aim to produce partitions that each matches the strengths of the processing element it is allocated to, finally, (iv) we demonstrate the performance advantages of the hybrid system through a comprehensive evaluation that uses real and synthetic workloads (as large as 16 billion edges), multiple graph algorithms that stress the system in various ways, and a variety of hardware configurations.

Graph processing that avoids the problems with clusters by using a single node.

Yes, a single node. Best to avoid this solution if you are a DoD contractor. 😉

If you are not a DoD (or NSA) contractor, the Totem project (the subject of this paper) describes itself this way:

The goal of this project is to understand the challenges in supporting graph algorithms on commodity, hybrid platforms; platforms that consist of processors optimized for sequential processing and accelerators optimized for massively-parallel processing.

This will fill the gap between current graph processing platforms that are either expensive (e.g., supercomputers) or inefficient (e.g., commodity clusters). Our hypothesis is that hybrid platforms (e.g., GPU-supported large-memory nodes and GPU supported clusters) can bridge the performance-cost chasm, and offer an attractive graph-processing solution for many graph-based applications such as social networks and web analysis.

If you are facing performance-cost issues with graph processing, this is definitely research you need to be watching.

Totem software is available for downloading.

I first saw this in a tweet by Stefano Bertolo.

Provable Algorithms for Machine Learning Problems

Tuesday, December 31st, 2013

Provable Algorithms for Machine Learning Problems by Rong Ge.


Modern machine learning algorithms can extract useful information from text, images and videos. All these applications involve solving NP-hard problems in average case using heuristics. What properties of the input allow it to be solved efficiently? Theoretically analyzing the heuristics is very challenging. Few results were known.

This thesis takes a different approach: we identify natural properties of the input, then design new algorithms that provably works assuming the input has these properties. We are able to give new, provable and sometimes practical algorithms for learning tasks related to text corpus, images and social networks.

The first part of the thesis presents new algorithms for learning thematic structure in documents. We show under a reasonable assumption, it is possible to provably learn many topic models, including the famous Latent Dirichlet Allocation. Our algorithm is the first provable algorithms for topic modeling. An implementation runs 50 times faster than latest MCMC implementation and produces comparable results.

The second part of the thesis provides ideas for provably learning deep, sparse representations. We start with sparse linear representations, and give the first algorithm for dictionary learning problem with provable guarantees. Then we apply similar ideas to deep learning: under reasonable assumptions our algorithms can learn a deep network built by denoising autoencoders.

The final part of the thesis develops a framework for learning latent variable models. We demonstrate how various latent variable models can be reduced to orthogonal tensor decomposition, and then be solved using tensor power method. We give a tight sample complexity analysis for tensor power method, which reduces the number of sample required for learning many latent variable models.

In theory, the assumptions in this thesis help us understand why intractable problems in machine learning can often be solved; in practice, the results suggest inherently new approaches for machine learning. We hope the assumptions and algorithms inspire new research problems and learning algorithms.

Admittedly an odd notion, starting with the data rather than starting with an answer and working back towards the data, but it does happen. 😉

Given the performance improvements for LDA (50X), I anticipate this approach being applied to algorithms for “big data.”

I first saw this in a tweet by Chris Deihl.

NSA Cloud On The “Open Internet”

Tuesday, December 31st, 2013

The FCC defines the “Open Internet” as:

The “Open Internet” is the Internet as we know it. It’s open because it uses free, publicly available standards that anyone can access and build to, and it treats all traffic that flows across the network in roughly the same way. The principle of the Open Internet is sometimes referred to as “net neutrality.” Under this principle, consumers can make their own choices about what applications and services to use and are free to decide what lawful content they want to access, create, or share with others. This openness promotes competition and enables investment and innovation.

The Open Internet also makes it possible for anyone, anywhere to easily launch innovative applications and services, revolutionizing the way people communicate, participate, create, and do business—think of email, blogs, voice and video conferencing, streaming video, and online shopping. Once you’re online, you don’t have to ask permission or pay tolls to broadband providers to reach others on the network. If you develop an innovative new website, you don’t have to get permission to share it with the world.

Pay particular attention to the line:

This openness promotes competition and enables investment and innovation.

The National Security Agency (NSA) and other state-sponsored cyber-criminals are dark clouds on that “openness.”

For years, many of us have seen:

MS error report

But as the Spiegel staff report in: Inside TAO: Documents Reveal Top NSA Hacking Unit

NSA staff capture such reports and mock Microsoft with slides such as:

NSA image

(Both of the images are from the Spiegel story.)

It doesn’t require a lot of imagination to realize that Microsoft will have to rework its error reporting systems to encrypt such reports, resulting in more overhead for users, the Internet and Microsoft.

Other software vendors and services will be following suit, adding more cost and complexity to services on the Internet, rather than making services more innovative and useful.

The NSA and other state-sponsored cyber-criminals are a very dark cloud over the very idea of an “open Internet.”

What investments will be made to spur competition and innovation on the Internet in the future is unknown. What we do know is that left unchecked, the NSA and other state-sponsored cyber-criminals are going to make security, not innovation, the first priority in investment.

State-sponsored cyber-criminals are far more dangerous than state-sponsored terrorists. Terrorists harm a few people today. Cyber-criminals are stealing the future from everyone.

PS: The Spiegel story is in three parts: Part 1: Documents Reveal Top NSA Hacking Unit, Part 2: Targeting Mexico, Part 3: The NSA’s Shadow Network. Highly recommended for your reading.

Ready to learn Hadoop?

Monday, December 30th, 2013

Ready to learn Hadoop?

From the webpage:

Sign up for the challenge of learning the basics of Hadoop in two weeks! You will get one email every day for the next 14 days.

  • Hello World: Overview of Hadoop
  • Data Processing Using Apache Hadoop
  • Setting up ODBC Connections
  • Connecting to Enterprise Applications
  • Data Integration and ETL
  • Data Analytics
  • Data Visualization
  • Hadoop Use Cases: Web
  • Hadoop Use Cases: Business
  • Recap

You could do this entirely on your own but the daily email may help.

If nothing else, it will be a reminder that something fun is waiting for you after work.


IRI-DIM 2014…

Monday, December 30th, 2013

IRI-DIM 2014 : The Third IEEE International Workshop on Data Integration and Mining

April 4, 2014 Regular Paper submission deadline (Midnight PST)
May 4, 2014 Acceptance Notification
May 14, 2014 Camera-ready paper due
May 14, 2014 Conference author registration due
Aug. 13-15, 2014 Conference (San Francisco)

From the call for papers:

Given the emerging global Information-centric IT landscape that has tremendous social and economic implications, effectively processing and integrating humungous volumes of information from diverse sources to enable effective decision making and knowledge generation have become one of the most significant challenges of current times. Information Reuse and Integration (IRI) seeks to maximize the reuse of information by creating simple, rich, and reusable knowledge representations and consequently explores strategies for integrating this knowledge into systems and applications. IRI plays a pivotal role in the capture, representation, maintenance, integration, validation, and extrapolation of information; and applies both information and knowledge for enhancing decision-making in various application domains.

This conference explores three major tracks: information reuse, information integration, and reusable systems. Information reuse explores theory and practice of optimizing representation; information integration focuses on innovative strategies and algorithms for applying integration approaches in novel domains; and reusable systems focus on developing and deploying models and corresponding processes that enable Information Reuse and Integration to play a pivotal role in enhancing decision-making processes in various application domains.

Looks like I need to pull up the prior IRI proceedings. 😉

Name all the technologies you know that can address data structures as subjects? With properties and the ability to declare synonyms for components of data structures?

Did you say something other than topic maps?

Use owl:sameAs as an example. How would you represent properties of owl:sameAs?

This sounds very much like a topic maps conference!

Pattern recognition toolbox

Monday, December 30th, 2013

Pattern recognition toolbox by Thomas W. Rauber.

From the webpage:

TOOLDIAG is a collection of methods for statistical pattern recognition. The main area of application is classification. The application area is limited to multidimensional continuous features, without any missing values. No symbolic features (attributes) are allowed. The program is implemented in the ‘C’ programming language and was tested in several computing environments. The user interface is simple, command-line oriented, but the methods behind it are efficient and fast. You can customize your own methods on the application programming level with relatively little effort. If you wish a presentation of the theory behind the program at your university, feel free to contact me.

Command line classification. A higher learning curve than some, but expect greater flexibility as well.

I thought the requirement of “no missing values” was curious.

If you have a data set with some legitimately missing values, how are you going to replace them in a neutral way?
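To illustrate why "neutral" replacement is harder than it sounds, here is a small sketch of mean imputation, one common way of filling in missing values. The feature values are hypothetical, and mean imputation is my choice of example, not something TOOLDIAG does or recommends.

```python
import statistics


def mean_impute(values):
    """Replace None with the mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical feature column with two missing measurements.
feature = [2.0, 4.0, None, 6.0, None, 8.0]
observed = [v for v in feature if v is not None]
imputed = mean_impute(feature)

# Sample variance before and after filling in the gaps.
var_observed = statistics.variance(observed)
var_imputed = statistics.variance(imputed)
```

The mean is unchanged, but the variance of the imputed column is smaller than the variance of the observed values. Any classifier that relies on the spread of a feature now sees data that looks tighter than it really is, so the replacement was not neutral after all.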

Is Link Rot Destroying Stare Decisis…

Monday, December 30th, 2013

Is Link Rot Destroying Stare Decisis as We Know It? The Internet-Citation Practice of the Texas Appellate Courts by Arturo Torres (Journal of Appellate Practice and Process, Vol 13, No. 2, Fall 2012 )


In 1995 the first Internet-based citation was used in a federal court opinion. In 1996, a state appellate court followed suit; one month later, a member of the United States Supreme Court cited to the Internet; finally, in 1998 a Texas appellate court cited to the Internet in one of its opinions. In less than twenty years, it has become common to find appellate courts citing to Internet-based resources in opinions. Because of the current extent of Internet-citation practice varies by courts across jurisdictions, this paper will examine the Internet-citation practice of the Texas Appellate courts since 1998. Specifically, this study surveys the 1998 to 2011 published opinions of the Texas appellate courts and describes their Internet-citation practice.

A study that confirms what was found in …Link and Reference Rot in Legal Citations for the Harvard Law Review and the U.S. Supreme Court.

Curious that West Key Numbers remain viable after more than a century of use (manual or electronic resolution) whereas Internet citations expire over the course of a few years.

What do you think is the difference in those citations, West Key Numbers versus URLs, that accounts for one being viable and the other only ephemerally so?

The Big Data story told through amusing pictures

Monday, December 30th, 2013

The Big Data story told through amusing pictures

Extremely funny!

My personal favorite:

Big Data Is Like Teenage Sex;
everyone talks about it,
nobody really knows how to do it,
everyone thinks everyone else is
doing it, so everyone claims they
are doing it….

There is far more truth in that quote than most vendors would care to admit.


6000 Companies Hiring Data Scientists

Monday, December 30th, 2013

6000 Companies Hiring Data Scientists by Vincent Granville.

From the post:

Search engines (Google, Microsoft), social networks (Twitter, Facebook, LinkedIn), financial institutions, Amazon, Apple, eBay, the health care industry, engineering companies (Boeing, Intel, Oil industry), retail analytics, mobile analytics, marketing agencies, data science vendors (for instance, Pivotal, Teradata, Tableau, SAS, Alpine Labs), environment, utilities government and defense routinely hire data scientists, though the job title is sometimes different. Traditional companies (manufacturing) tend to call them operations research analysts.

The comprehensive list of > 6,000 companies isn’t as helpful as you might imagine.

An Excel spreadsheet with two (2) columns. The first one is the company name and the second is the number of connections on LinkedIn.

I was thinking the list might be useful both for employment and for marketing data services.

In its present form, not useful at all.

But data scientists or wannabe data scientists should not accept less than useful data as a given.

What would you want to see added to the data? How would you make that a reality?

Recalling that we want maximum accuracy with a minimum amount of manual effort.

I’m going to be thinking along those lines. Suggestions welcome!

175 Analytic and Data Science Web Sites

Monday, December 30th, 2013

175 Analytic and Data Science Web Sites by Vincent Granville.

From the post:

Following is a list (in alphabetical order) of top domains related to analytics, data science or big data, based on input from Data Science central members. These top domains were cited by at least 4 members. Some of them are pure data science web sites, while others are more general (but still tech-oriented) with strong emphasis on data issues at large, or regular data science content.

I created 175-DataSites-2013.txt from Vincent’s listing formatted as a Nutch seed text.

I would delete some of the entries prior to crawling.

For example,

Lots of interesting content, but if you are looking for data-centric resources, I would be more specific.
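For anyone who wants to build and prune a seed list of their own, here is a rough sketch. A Nutch seed list is a plain text file with one URL per line. The function name, the `http://` scheme and the placeholder domains below are my assumptions for illustration; they are not entries from Vincent's list and this is not the script I used.

```python
def to_seed_lines(domains, exclude=()):
    """Turn bare domain names into seed URLs, skipping blanks,
    duplicates and anything in `exclude`."""
    excluded = {d.strip().lower() for d in exclude}
    seen = set()
    lines = []
    for raw in domains:
        domain = raw.strip().lower()
        if not domain or domain in excluded or domain in seen:
            continue
        seen.add(domain)
        # Assumption: plain http:// is an acceptable scheme for every site.
        lines.append("http://" + domain + "/")
    return lines


# Hypothetical placeholder domains, not entries from the actual list.
domains = ["example.com", "Example.org ", "", "example.com", "example.net"]
seeds = to_seed_lines(domains, exclude=["example.net"])

# One URL per line, ready to be written to a seed file.
seed_text = "\n".join(seeds) + "\n"
```

Keeping the exclusions as a separate list means the pruning decisions are recorded rather than lost in a hand-edited file.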

Linear algebra explained in four pages

Monday, December 30th, 2013

Linear algebra explained in four pages by Ivan Savov.

From the introduction:

This document will review the fundamental ideas of linear algebra. We will learn about matrices, matrix operations, linear transformations and discuss both the theoretical and computational aspects of linear algebra. The tools of linear algebra open the gateway to the study of more advanced mathematics. A lot of knowledge buzz awaits you if you choose to follow the path of understanding, instead of trying to memorize a bunch of formulas.

A rather dense four pages. 😉

It is based on the No Bullshit: Guide to Math and Physics.

The general tone of comments on the No Bullshit Guide… is positive, mostly very positive, but I haven’t seen any professional reviews of it.

If you know of any professional reviews, please drop me a note.

I first saw this in Christophe Lalanne’s A bag of tweets / December 2013.

Scala as a platform…

Monday, December 30th, 2013

Scala as a platform for statistical computing and data science by Darren Wilkinson

From the post:

There has been a lot of discussion on-line recently about languages for data analysis, statistical computing, and data science more generally. I don’t really want to go into the detail of why I believe that all of the common choices are fundamentally and unfixably flawed – language wars are so unseemly. Instead I want to explain why I’ve been using the Scala programming language recently and why, despite being far from perfect, I personally consider it to be a good language to form a platform for efficient and scalable statistical computing. Obviously, language choice is to some extent a personal preference, implicitly taking into account subjective trade-offs between features different individuals consider to be important. So I’ll start by listing some language/library/ecosystem features that I think are important, and then explain why.

A feature wish list

It should:

  • be a general purpose language with a sizable user community and an array of general purpose libraries, including good GUI libraries, networking and web frameworks
  • be free, open-source and platform independent
  • be fast and efficient
  • have a good, well-designed library for scientific computing, including non-uniform random number generation and linear algebra
  • have a strong type system, and be statically typed with good compile-time type checking and type safety
  • have reasonable type inference
  • have a REPL for interactive use
  • have good tool support (including build tools, doc tools, testing tools, and an intelligent IDE)
  • have excellent support for functional programming, including support for immutability and immutable data structures and “monadic” design
  • allow imperative programming for those (rare) occasions where it makes sense
  • be designed with concurrency and parallelism in mind, having excellent language and library support for building really scalable concurrent and parallel applications

The not-very-surprising punch-line is that Scala ticks all of those boxes and that I don’t know of any other languages that do. But before expanding on the above, it is worth noting a couple of (perhaps surprising) omissions. For example:

  • have excellent data viz capability built-in
  • have vast numbers of statistical routines in the standard library

Darren reviews Scala on each of these points.

Although he still uses R and Python, Darren has hopes for future development of Scala into a full featured data mining platform.

Perhaps his checklist will contribute the requirements needed to make that one of the futures of Scala.

I first saw this in Christophe Lalanne’s A bag of tweets / December 2013.

One Page R:…

Monday, December 30th, 2013

One Page R: A Survival Guide to Data Science with R by Graham Williams.

From the webpage:

Welcome to One Page R. This compendium of modules weaves together a collection of tools for the data miner, data scientist, and decision scientist. The tools are all part of the R Statistical Software Suite.

Each module begins with a Lecture on the topic, followed by a OnePageR oriented introduction to the topic. The OnePageR consists of multiple, one page recipes that cover many aspects of the topic. It can be worked through by the student and then used as a reference guide. Each page aims to be a bite sized chunk for hands-on learning, building on what has gone before. A laboratory session then follows, where the student is expected to complete the tasks and submit material for assessment.

The R code sitting behind each OnePageR module is also provided and can be run standalone to replicate the material presented in the module.

By the author of: Data mining with Rattle and R : the art of excavating data for knowledge discovery (Springer, 2011).

Some of the topics are missing exercises and some topics remain to be done.

Don’t be shy!

BTW, Rattle resources.

I first saw this in Christophe Lalanne’s A bag of tweets / December 2013.

Part D Fraud

Sunday, December 29th, 2013

‘Let the Crime Spree Begin’: How Fraud Flourishes in Medicare’s Drug Plan by Tracy Weber and Charles Ornstein.

From the post:

With just a handful of prescriptions to his name, psychiatrist Ernest Bagner III was barely a blip in Medicare’s vast drug program in 2009.

But the next year he began churning them out at a furious rate. Not just the psych drugs expected in his specialty, but expensive pills for asthma and high cholesterol, heartburn and blood clots.

By the end of 2010, Medicare had paid $3.8 million for Bagner’s drugs — one of the highest tallies in the country. His prescriptions cost the program another $2.6 million the following year, records analyzed by ProPublica show.

Bagner, 46, says there’s just one problem with this accounting: The prescriptions aren’t his. “All of that stuff you have is false,” he said.

By his telling, someone stole his identity while he worked at a strip-mall clinic in Hollywood, Calif., then forged his signature on prescriptions for hundreds of Medicare patients he’d never seen. Whoever did it, he’s been told, likely pilfered those drugs and resold them.

“These people make more money off my name than I do,” said Bagner, who now works as a disability evaluator and says he no longer prescribes medications.

Today, credit card companies routinely scan their records for fraud, flagging or blocking suspicious charges as they happen. Yet Medicare’s massive drug program has a process so convoluted and poorly managed that fraud flourishes, giving rise to elaborate schemes that quickly siphon away millions of dollars.

Frustrated investigators for law enforcement, insurers and pharmacy chains say they don’t see evidence that Medicare officials are doing much to stop it.

“It’s kind of a black hole,” said Alanna Lavelle, director of investigations for WellPoint Inc., which provides drug coverage to about 1.4 million people in the program, known as Part D.

One of the problems that enables so much fraud is:

Part D is vulnerable because it requires insurance companies to pay for prescriptions issued by any licensed prescriber and filled by any willing pharmacy within 14 days. Insurers generally must cover even suspicious claims before investigating, an approach called “pay and chase.” By comparison, these same insurers have more time to review questionable medication claims for patients in their non-Medicare plans.

I wonder if the government would pay on a percentage of fraud reduction for a case like this?

Setting up the data streams from pharmacies would be the hardest part.

But once that was in place, it would be a matter of getting some good average prescription data and crunching the numbers.

There would still be some minor fraud but nothing in the totals that are discussed in this article.
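As a rough idea of what "crunching the numbers" could look like, here is a minimal sketch that flags prescribers whose total drug cost is far above the median for their specialty. The record layout (prescriber, specialty, cost), the ten-times-median threshold and all of the figures are hypothetical. None of it reflects Medicare's actual data or any method described in the article.

```python
import statistics
from collections import defaultdict


def flag_outliers(claims, ratio=10.0):
    """Flag prescribers whose total cost exceeds `ratio` times the
    median total for their specialty.

    `claims` is an iterable of (prescriber, specialty, cost) tuples,
    an assumed layout for illustration only.
    """
    totals = defaultdict(float)
    specialty_of = {}
    for prescriber, specialty, cost in claims:
        totals[prescriber] += cost
        specialty_of[prescriber] = specialty

    # Group per-prescriber totals by specialty to get a baseline.
    by_specialty = defaultdict(list)
    for prescriber, total in totals.items():
        by_specialty[specialty_of[prescriber]].append(total)
    medians = {s: statistics.median(t) for s, t in by_specialty.items()}

    return sorted(
        p for p, total in totals.items()
        if total > ratio * medians[specialty_of[p]]
    )


# Hypothetical figures, invented for this example.
claims = [
    ("dr-a", "psychiatry", 1200.0),
    ("dr-a", "psychiatry", 800.0),
    ("dr-b", "psychiatry", 2500.0),
    ("dr-c", "psychiatry", 1500.0),
    ("dr-d", "psychiatry", 90000.0),
    ("dr-e", "cardiology", 4000.0),
    ("dr-f", "cardiology", 5000.0),
]
flagged = flag_outliers(claims)
```

The median is used rather than the mean so that one enormous total does not drag the baseline up and hide itself. A fuller version would also look at which drugs are prescribed, since drugs far outside a prescriber's specialty were part of the pattern in the story above.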

A topic map would be useful for some of the more sophisticated fraud schemes.

I make that sound easy and it would not be. There are financial/economic and social interests being served by the current Part D structures. And questions such as: How much fraud will you tolerate in order to get senior citizens their drugs? will need good answers.

Still, even routine data science tools and reporting should be able to lessen the financial hemorrhaging under Part D.

If the NSA can’t connect two dots….

Sunday, December 29th, 2013

Judge on NSA Case Cites 9/11 Report, But It Doesn’t Actually Support His Ruling by Justin Elliott.

From the post:

In a new decision in support of the NSA’s phone metadata surveillance program, U.S. district court Judge William Pauley cites an intelligence failure involving the agency in the lead-up to the 9/11 attacks. But the judge’s cited source, the 9/11 Commission Report, doesn’t actually include the account he gives in the ruling. What’s more, experts say the NSA could have avoided the pre-9/11 failure even without the metadata surveillance program.

We previously explored the key incident in question, involving calls made by hijacker Khalid al-Mihdhar from California to Yemen, in a story we did over the summer, which you can read below.

In his decision, Pauley writes: “The NSA intercepted those calls using overseas signals intelligence capabilities that could not capture al-Mihdhar’s telephone number identifier. Without that identifier, NSA analysts concluded mistakenly that al-Mihdhar was overseas and not in the United States.”

As his source, the judge writes in a footnote, “See generally, The 9/11 Commission Report.” In fact, the 9/11 Commission report does not detail the NSA’s intercepts of calls between al-Mihdhar and Yemen. As the executive director of the commission told us over the summer, “We could not, because the information was so highly classified publicly detail the nature of or limits on NSA monitoring of telephone or email communications.”

To this day, some details related to the incident and the NSA’s eavesdropping have never been aired publicly. And some experts told us that even before 9/11 — and before the creation of the metadata surveillance program — the NSA did have the ability to track the origins of the phone calls, but simply failed to do so.

Prior to 9/11, the NSA was monitoring a phone number in Yemen that it could have traced, under the law as it existed at the time, to a location with terrorists in San Diego.

If the NSA can’t connect two dots, Mihdhar to Yemen in 2000, what reason is there to think they can usefully connect hundreds of millions, if not billions of dots?

Political Corruption Baseline

Sunday, December 29th, 2013

By the numbers: a 2013 money-in-politics index by Michael Beckel.

From the post:

Number of bills passed by Congress this year that have been signed into law: 58

Number of bills passed in 1948, the year President Harry Truman* assailed the “Do-Nothing Congress”: 511

Number of minutes Sen. Ted Cruz, R-Texas, spent reading Dr. Seuss’s “Green Eggs and Ham” during a 21-hour talk-a-thon in September: 5 ½

Number of hours per day the Democratic Congressional Campaign Committee recommends embattled freshmen spend fundraising: 4

Amount of campaign cash all members of Congress have reported raising so far in 2013: $403,952,012

If you are writing topic maps about political corruption outside the United States, you need to have some objective guidelines for what constitutes corruption.

May I suggest the U.S. Congress?

While Congress has carefully defined bribery to be a quid pro quo arrangement, “you vote for bill X for $$$,” it is clear that members of Congress vote based upon donations made to them, separate from discussions of particular legislation.

Without crunching the numbers, I would say the corruption rate easily exceeds 95% in both houses of Congress.

That will leave you with three (3) large categories: countries more corrupt than the United States (maybe Somalia?), countries about as corrupt as the United States (insert your candidates here), and countries less corrupt than the United States (too numerous to list).

Something to keep in mind the next time the U.S. starts lecturing others about corruption.

I first saw this at: By the numbers: a 2013 money-in-politics index (Full Text Reports).

Sanity Checks

Sunday, December 29th, 2013

Being paranoid about data accuracy! by Kunal Jain.

Kunal knew a long meeting was developing after this exchange at its beginning:

Kunal: How many rows do you have in the data set?

Analyst 1: (After going through the data set) X rows

Kunal: How many rows do you expect?

Analyst 1 & 2: Blank look at their faces

Kunal: How many events / data points do you expect in the period / every month?

Analyst 1 & 2: …. (None of them had a clue)

The number of rows in the data set looked higher to me. The analysts had missed it clearly, because they did not benchmark it against business expectation (or did not have it in the first place). On digging deeper, we found that some events had multiple rows in the data sets and hence the higher number of rows.

You have probably seen them before but Kunal has seven (7) sanity check rules that should be applied to every data set.

Unless, of course, the inability to answer simple questions about your data sets* is tolerated by your employer.

*Data sets become “yours” when you are asked to analyze them. Better to spot and report problems before they become evident in your results.
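
The row count question from Kunal's meeting is easy to automate. What follows is a minimal sketch, not one of Kunal's seven rules verbatim: it compares the actual row count against the count you expected and reports any event key that appears on more than one row. The function name, the `event_id` column and the 10% tolerance are all assumptions made for the example.

```python
from collections import Counter


def sanity_check(rows, key, expected_rows, tolerance=0.10):
    """Compare the row count to expectation and list duplicated keys.

    rows:          list of dicts, one per row (hypothetical data layout)
    key:           name of the column that should identify one event
    expected_rows: the row count the business expects
    tolerance:     allowed relative deviation from expected_rows
    """
    actual = len(rows)
    key_counts = Counter(row[key] for row in rows)
    duplicates = {k: n for k, n in key_counts.items() if n > 1}
    return {
        "actual_rows": actual,
        "expected_rows": expected_rows,
        "row_count_ok": abs(actual - expected_rows) <= tolerance * expected_rows,
        "duplicate_keys": duplicates,
    }


# Invented sample: event 2 appears twice, inflating the row count.
sample = [{"event_id": 1}, {"event_id": 2}, {"event_id": 2}, {"event_id": 3}]
print(sanity_check(sample, key="event_id", expected_rows=3))
```

The point of passing `expected_rows` as an argument is that someone has to write the expectation down before looking at the data.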

Data Analytic Recidivism Tool (DART) [DAFT?]

Sunday, December 29th, 2013

Data Analytic Recidivism Tool (DART)

From the website:

The Data Analytic Recidivism Tool (DART) helps answer questions about recidivism in New York City.

  • Are people that commit a certain type of crime more likely to be re-arrested?
  • What about people in a certain age group or those with prior convictions?

DART lets users look at recidivism rates for selected groups defined by characteristics of defendants and their cases.

A direct link to the DART homepage.

After looking at the interface, which groups recidivists in groups of 250, I’m not sure DART is all that useful.

It did spark an idea that might help with the federal government’s acquisition problems.

Why not create the equivalent of DART but call it:

Data Analytic Failure Tool (DAFT).

And in DAFT track federal contractors, their principals, contracts, and the program officers who play any role in those contracts.

So that when contractors fail, as so many of them do, it will be easy to track the individuals involved on both sides of the failure.

And every contract will have a preamble that recites any prior history of failure and the people involved in that failure, on all sides.

Such that any subsequent supervisor has to sign off with full knowledge of the prior lack of performance.

If criminal recidivism is to be avoided, shouldn’t failure recidivism be avoided as well?
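
To make the DAFT idea a little more concrete, here is a minimal sketch of the record keeping it would need: contracts that name the contractor, its principals and the program officers, plus a function that builds the failure preamble for a new contract. DAFT does not exist, so every class, field and name below is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contract:
    """One contract record in the hypothetical DAFT store."""
    contract_id: str
    contractor: str
    principals: frozenset
    program_officers: frozenset
    failed: bool = False

    def parties(self):
        # Everyone on either side of the contract.
        return {self.contractor} | self.principals | self.program_officers


def failure_preamble(new_contract, history):
    """List prior failed contracts that share any party with the new contract."""
    lines = []
    for past in history:
        shared = past.parties() & new_contract.parties()
        if past.failed and shared:
            names = ", ".join(sorted(shared))
            lines.append(f"{past.contract_id}: prior failure involving {names}")
    return lines


# Invented records for illustration only.
history = [
    Contract("C-1", "Acme", frozenset({"P. Smith"}), frozenset({"O. Jones"}), failed=True),
    Contract("C-2", "Beta", frozenset({"Q. Lee"}), frozenset({"R. Kim"}), failed=True),
    Contract("C-3", "Acme", frozenset({"P. Smith"}), frozenset({"O. Jones"}), failed=False),
]
proposed = Contract("C-9", "Gamma", frozenset({"P. Smith"}), frozenset({"O. Jones"}))

for line in failure_preamble(proposed, history):
    print(line)
```

Note that the new contractor name ("Gamma") does not hide the history, because the match is on the people as well as the company. Deciding when two records name the same person is the hard part, and this sketch simply assumes identical strings.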

How semantic search is killing the keyword

Sunday, December 29th, 2013

How semantic search is killing the keyword by Rich Benci.

From the post:

Keyword-driven results have dominated search engine results pages (SERPs) for years, and keyword-specific phrases have long been the standard used by marketers and SEO professionals alike to tailor their campaigns. However, Google’s major new algorithm update, affectionately known as Hummingbird because it is “precise and fast,” is quietly triggering a wholesale shift towards “semantic search,” which focuses on user intent (the purpose of a query) instead of individual search terms (the keywords in a query).

Attempts have been made (in the relatively short history of search engines) to explore the value of semantic results, which address the meaning of a query, rather than traditional results, which rely on strict keyword adherence. Most of these efforts have ended in failure. However, Google’s recent steps have had quite an impact in the internet marketing world. Google began emphasizing the importance of semantic search by showcasing its Knowledge Graph, a clear sign that search engines today (especially Google) care a lot more about displaying predictive, relevant, and more meaningful sites and web pages than ever before. This “graph” is a massive mapping system that connects real-world people, places, and things that are related to each other and that bring richer, more relevant results to users. The Knowledge Graph, like Hummingbird, is an example of how Google is increasingly focused on answering questions directly and producing results that match the meaning of the query, rather than matching just a few words.

“Hummingbird” takes flight

Google’s search chief, Amit Singhal, says that the Hummingbird update is “the first time since 2001 that a Google algorithm has been so dramatically rewritten.” This is how Danny Sullivan of Search Engine Land explains it: “Hummingbird pays more attention to each word in a query, ensuring that the whole query — the whole sentence or conversation or meaning — is taken into account, rather than particular words.”

The point of this new approach is to filter out less-relevant, less-desirable results, making for a more satisfying, more accurate answer that includes rich supporting information and easier navigation. Google’s Knowledge Graph, with its “connect the dots” type of approach, is important because users stick around longer as they discover more about related people, events, and topics. The results of a simple search for Hillary Clinton, for instance, include her birthday, her hometown, her family members, the books she’s written, a wide variety of images, and links to “similar” people, like Barack Obama, John McCain, and Joe Biden.

The key to making your website more amenable to “semantic search” is the use of the microformat you will find at

That is to say, Google has pre-fabricated information in its Knowledge Graph that it can match up with information specified using markup.

Sounds remarkably like a topic map, doesn’t it?

Useful if you are looking for “popular” people, places and things. Not so hot with intra-enterprise search results. Unless of course your enterprise is driven by “pop” culture.

Impressive if you want coarse semantic searching sufficient to sell advertising. (See Type Hierarchy at for all available types.)

I say coarse semantic searching because my count of the types at, as of today, is seven hundred and nineteen (719). Is that what you get?

I ask because in scanning “InterAction,” I don’t see SexAction or any of its sub-categories. Under “ConsumeAction” I don’t see SmokeAction or SmokeCrackAction or SmokeWeedAction or any of the other sub-categories of “ConsumeAction.” Under “LocalBusiness” I did not see WhoreHouse, DrugDealer, S/MShop, etc.

I felt like I had fallen into BradyBunchville. 😉

Seriously, if they left out those mainstream activities, what are the chances they included what you need for your enterprise?

Not so good. That’s what I thought.

A topic map when paired with a search engine and your annotated content can take your enterprise beyond keyword search.

…if not incomprehensible to most citizens

Saturday, December 28th, 2013

A Right to Access Implies A Right to Know: An Open Online Platform for Research on the Readability of Law by Michael Curtotti and Eric McCreath. (Journal of Open Access to Law, Vol. 1, No. 1)


The widespread availability of legal materials online has opened the law to a new and greatly expanded readership. These new readers need the law to be readable by them when they encounter it. However, the available empirical research supports a conclusion that legislation is difficult to read if not incomprehensible to most citizens. We review approaches that have been used to measure the readability of text including readability metrics, cloze testing and application of machine learning. We report the creation and testing of an open online platform for readability research. This platform is made available to researchers interested in undertaking research on the readability of legal materials. To demonstrate the capabilities of the platform, we report its initial application to a corpus of legislation. Linguistic characteristics are extracted using the platform and then used as input features for machine learning using the Weka package. Wide divergences are found between sentences in a corpus of legislation and those in a corpus of graded reading material or in the Brown corpus (a balanced corpus of English written genres). Readability metrics are found to be of little value in classifying sentences by grade reading level (noting that such metrics were not designed to be used with isolated sentences).
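
For readers who have not met a readability metric before, here is a minimal sketch of one common example, the Flesch-Kincaid grade level. It is my choice of illustration, not necessarily a metric the authors' platform uses, and the syllable counter is a crude vowel-group heuristic I am assuming for the sake of a short example.

```python
import re


def count_syllables(word):
    """Crude estimate: one syllable per run of vowels (an assumption, not a dictionary lookup)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


# Two invented samples: a primer-style text and a statute-style sentence.
simple = "The cat sat. The dog ran."
legal = ("Notwithstanding any other provision of this subchapter, "
         "the Secretary shall prescribe regulations.")

print("simple:", round(flesch_kincaid_grade(simple), 2))
print("legal: ", round(flesch_kincaid_grade(legal), 2))
```

The metric only sees sentence length and word length, which is one reason, as the abstract notes, such metrics were not designed for isolated sentences.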

What I found troubling about this paper is its conjuring of a right to have the law (the text of the law) be “reasonably accessible” to individuals:

Leaving aside the theoretical justifications that might be advanced to support this view, the axiomatic position taken by this paper is that all individuals subject to law are entitled to know its content and therefore to have it written in a way which is reasonably accessible to them. (pp. 6-7)

I don’t dispute that the law should be freely available to everyone; it is difficult to obey what isn’t at least potentially available.

But, the authors’ “reasonably accessible” argument fails in two ways.

First, the authors fail to define a level of readability that supports “reasonably accessible.” How much change is necessary to achieve “reasonably accessible?” The authors, at least, don’t know.

Second, the amount of necessary change must be known in order to judge the feasibility of any revisions to make the law “reasonably accessible.”

The U.S. Internal Revenue Code (herein IRC) is a complex body of work that is based on prior court decisions, rulings by the I.R.S. and a commonly understood vocabulary among tax experts. And it is legislation that touches many other laws and regulations, both at a federal and state level. All of which are interwoven with complex meanings established by years of law, regulation and custom.

Even creating a vulgar version of important legislation would depend upon identification of a complex of subjects and relationships that are explicit only to an expert reader. Doable, but it would never have the force of law.

I first saw this at: Curtotti and McCreath: An Open Online Platform for Research on the Readability of Law.

Data Mining 22 Months of Kepler Data…

Saturday, December 28th, 2013

Data Mining 22 Months of Kepler Data Produces 472 New Potential Exoplanet Candidates by Will Baird.

Will’s report on:

Planetary Candidates Observed by Kepler IV: Planet Sample From Q1-Q8 (22 Months)


We provide updates to the Kepler planet candidate sample based upon nearly two years of high-precision photometry (i.e., Q1-Q8). From an initial list of nearly 13,400 Threshold Crossing Events (TCEs), 480 new host stars are identified from their flux time series as consistent with hosting transiting planets. Potential transit signals are subjected to further analysis using the pixel-level data, which allows background eclipsing binaries to be identified through small image position shifts during transit. We also re-evaluate Kepler Objects of Interest (KOI) 1-1609, which were identified early in the mission, using substantially more data to test for background false positives and to find additional multiple systems. Combining the new and previous KOI samples, we provide updated parameters for 2,738 Kepler planet candidates distributed across 2,017 host stars. From the combined Kepler planet candidates, 472 are new from the Q1-Q8 data examined in this study. The new Kepler planet candidates represent ~40% of the sample with Rp~1 Rearth and represent ~40% of the low equilibrium temperature (Teq less than 300 K) sample. We review the known biases in the current sample of Kepler planet candidates relevant to evaluating planet population statistics with the current Kepler planet candidate sample.

If you are interested in the Kepler data, you can visit the Kepler Data Archives or the Kepler Mission site.

Unlike some scientific “research,” with astronomy you don’t have to go hounding scientists for copies of their privately held data.

free-programming-books (878 books)

Saturday, December 28th, 2013


A much larger collection of free books than I pointed to at Free Programming Books in October of 2011.

I count eight hundred and seventy-eight (878) entries along with twenty-two (22) pointers to other lists of free programming books.

If you were disappointed by the computer books you got for Christmas and/or didn’t get any computer books at all, you can find solace here. 😉

Visualization [Harvard/Python/D3]

Saturday, December 28th, 2013

Visualization [Harvard/Python/D3]

From the webpage:

The amount and complexity of information produced in science, engineering, business, and everyday human activity is increasing at staggering rates. The goal of this course is to expose you to visual representation methods and techniques that increase the understanding of complex data. Good visualizations not only present a visual interpretation of data, but do so by improving comprehension, communication, and decision making.

In this course you will learn how the human visual system processes and perceives images, good design practices for visualization, tools for visualization of data from a variety of fields, collecting data from web sites with Python, and programming of interactive web-based visualizations using D3.

Twenty-two (22) lectures, nine (9) labs (for some unknown reason, “lab” becomes “section”) and three (3) bonus videos.

Just as a sample, I tried Lab 3 Sketching Workshop I.

I don’t know that I will learn how to draw a straight line but if I don’t, it won’t be the fault of the instructor!

This looks very good.

I first saw this in a tweet by Christophe Viau.